# Mindcraft Evaluation System - Developer Guide
This guide provides technical documentation for developers working with the Mindcraft evaluation system.
## Architecture Overview
The evaluation module is designed to be modular and extensible. The core components are:

* **`evaluation_script.py`**: The main entry point for running experiments. It handles setting up the environment, launching servers and agents, and collecting results.
* **`evaluation.py`**: The core logic for analyzing and evaluating task outcomes. It defines the data structures for representing results and provides functions for extracting and aggregating them.
* **`analyse_results.py`**: A script for post-experiment analysis. It can download results from S3, process them using the `evaluation.py` module, and generate detailed reports.

The data flow is as follows:

1. [`evaluation_script.py`](../tasks/evaluation_script.py:1) runs the experiments and generates raw JSON log files for each agent in an experiment folder.
2. During or after the experiment, [`evaluation_script.py`](../tasks/evaluation_script.py:1) or [`analyse_results.py`](../tasks/analyse_results.py:1) is used to process these logs.
3. For each task folder, [`extract_task_outcome()`](../tasks/evaluation.py:113) is called.
4. [`extract_task_outcome()`](../tasks/evaluation.py:113) calls [`analyze_agent_log()`](../tasks/evaluation.py:47) for each agent's log file to get an [`AgentOutcome`](../tasks/evaluation.py:21).
5. The individual [`AgentOutcome`](../tasks/evaluation.py:21) objects are aggregated into a single [`TaskRunOutcome`](../tasks/evaluation.py:31).
6. Finally, all [`TaskRunOutcome`](../tasks/evaluation.py:31) objects are converted into a Pandas DataFrame by [`aggregate_results_to_dataframe()`](../tasks/evaluation.py:170) for easy analysis and reporting, as sketched below.
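
As a concrete illustration of this flow, the following minimal sketch processes an experiment folder end to end. The folder layout, the task-definition source, and the lookup-by-folder-name convention are assumptions for illustration; `evaluation_script.py` and `analyse_results.py` handle these details in practice.

```python
import json
from pathlib import Path

from tasks.evaluation import aggregate_results_to_dataframe, extract_task_outcome

experiment_dir = Path("experiments/exp_01")  # hypothetical experiment folder
# Hypothetical task-definition source; the real scripts load this for you.
task_definitions = json.loads(Path("tasks/example_tasks.json").read_text())

outcomes = []
for task_folder in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
    # Look up this task's definition by folder name (assumed convention).
    task_def = task_definitions.get(task_folder.name, {})
    outcomes.append(extract_task_outcome(str(task_folder), task_def))

# Flatten everything into a single DataFrame for analysis and reporting.
df = aggregate_results_to_dataframe(outcomes)
df.to_csv("detailed_results.csv", index=False)
```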
## API Documentation for `tasks/evaluation.py`
The [`tasks/evaluation.py`](../tasks/evaluation.py:1) module provides the core functions for evaluating task results.
### `analyze_agent_log(file_path: str) -> AgentOutcome`
* **Description**: Analyzes a single agent's JSON log file. It extracts the score, timeout status, and final system message.
* **Arguments**:
    * `file_path` (str): The path to the agent's log file.
* **Returns**: An [`AgentOutcome`](#agentoutcome) data class containing the results for a single agent.
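
A minimal usage sketch (the log file path is a placeholder):

```python
from tasks.evaluation import analyze_agent_log

# Analyze one agent's JSON log.
outcome = analyze_agent_log("experiments/exp_01/task_1/agent_0.json")

print(outcome.raw_score, outcome.completion_status, outcome.timed_out)
if not outcome.agent_log_processed:
    print("Parsing errors:", outcome.parsing_errors)
```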
### `extract_task_outcome(folder_path: str, task_definition: Dict[str, Any]) -> TaskRunOutcome`
* **Description**: Orchestrates the analysis of a single task run folder. It finds all agent logs, calls `analyze_agent_log` for each, and aggregates the results.
* **Arguments**:
    * `folder_path` (str): The path to the folder containing the agent logs for a single task run.
    * `task_definition` (dict): The definition of the task, used to enrich the results with metadata.
* **Returns**: A [`TaskRunOutcome`](#taskrunoutcome) data class containing the aggregated results for the task run.
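
A minimal usage sketch (the folder path and the contents of the task definition are placeholders):

```python
from tasks.evaluation import extract_task_outcome

# Hypothetical task definition; real definitions come from the task definition files.
task_def = {"task_id": "cooking_task_1", "type": "cooking", "agent_count": 2}

outcome = extract_task_outcome("experiments/exp_01/cooking_task_1", task_def)
print(outcome.overall_raw_score, outcome.overall_completion_status)
print(f"{outcome.total_agent_logs_found} agent logs processed")
```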
### `aggregate_results_to_dataframe(task_outcomes: List[TaskRunOutcome]) -> pd.DataFrame`
* **Description**: Converts a list of `TaskRunOutcome` objects into a Pandas DataFrame, which is used for all further analysis and reporting.
* **Arguments**:
    * `task_outcomes` (list): A list of `TaskRunOutcome` objects.
* **Returns**: A `pd.DataFrame` with the flattened and aggregated results.
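
A minimal usage sketch, reusing `extract_task_outcome()` to build the input list (paths and task definitions are placeholders):

```python
from tasks.evaluation import aggregate_results_to_dataframe, extract_task_outcome

# Two task runs from a hypothetical experiment folder.
outcomes = [
    extract_task_outcome("experiments/exp_01/cooking_task_1", {"task_id": "cooking_task_1"}),
    extract_task_outcome("experiments/exp_01/crafting_task_1", {"task_id": "crafting_task_1"}),
]

df = aggregate_results_to_dataframe(outcomes)
print(df.shape)
print(df.columns.tolist())  # flattened TaskRunOutcome fields
```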
## Data Structure Specifications
The evaluation system uses two primary data classes, plus a `CompletionStatus` enum, to structure the results:
### `AgentOutcome`
Defined in [`tasks/evaluation.py`](../tasks/evaluation.py:21), this data class holds the results for a single agent's participation in a task.

| Field | Type | Description |
| --------------------- | --------------------------------------- | ------------------------------------------------------ |
| `raw_score` | `float` | The numerical score achieved by the agent. |
| `completion_status` | [`CompletionStatus`](#completionstatus) | The granular status of the agent's task attempt. |
| `final_system_message` | `str` | The final system message from the log. |
| `agent_log_processed` | `bool` | Whether the agent's log was successfully processed. |
| `parsing_errors` | `List[str]` | A list of any errors encountered during parsing. |
| `timed_out` | `bool` | `True` if the agent timed out. |
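
As a rough reference, the fields above correspond to a dataclass along these lines (field order and defaults are assumptions; the authoritative definition is at `tasks/evaluation.py:21`):

```python
from dataclasses import dataclass, field
from typing import List

from tasks.evaluation import CompletionStatus

@dataclass
class AgentOutcome:
    # Sketch only; see tasks/evaluation.py for the real definition.
    raw_score: float
    completion_status: CompletionStatus
    final_system_message: str
    agent_log_processed: bool
    parsing_errors: List[str] = field(default_factory=list)
    timed_out: bool = False
```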
### `TaskRunOutcome`
Defined in [`tasks/evaluation.py`](../tasks/evaluation.py:31), this data class aggregates the outcomes from all agents involved in a single task run.

| Field | Type | Description |
| ----------------------------- | --------------------------------------- | ------------------------------------------------------------ |
| `task_id` | `str` | The unique identifier for the task. |
| `model_name` | `str` | The name of the model used. |
| `agent_count` | `int` | The number of agents that participated in the task. |
| `task_type` | `str` | The type of the task (e.g., `cooking`, `crafting`). |
| `overall_raw_score` | `float` | The highest score achieved among all agents. |
| `overall_is_successful` | `bool` | `True` if the task was successfully completed by any agent. |
| `overall_completion_status` | [`CompletionStatus`](#completionstatus) | The aggregated completion status for the entire task. |
| `total_agent_logs_found` | `int` | The number of agent log files found and processed. |
| `agent_outcomes` | `List[AgentOutcome]` | A list of `AgentOutcome` objects, one per agent. |
| `task_definition_metrics` | `Dict[str, Any]` | A dictionary of metrics from the task definition file. |
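
The table describes the aggregation rules: the overall score is the highest agent score, and the run counts as successful if any agent succeeded. A simplified sketch of that logic, assuming per-agent success is signalled by `CompletionStatus.SUCCESS` (the helper name and the empty-list default are illustrative; the real logic lives inside `extract_task_outcome()`):

```python
from typing import List, Tuple

from tasks.evaluation import AgentOutcome, CompletionStatus

def summarize_agents(agent_outcomes: List[AgentOutcome]) -> Tuple[float, bool]:
    """Aggregate per-agent results as described in the TaskRunOutcome table."""
    overall_raw_score = max((a.raw_score for a in agent_outcomes), default=0.0)
    overall_is_successful = any(
        a.completion_status == CompletionStatus.SUCCESS for a in agent_outcomes
    )
    return overall_raw_score, overall_is_successful
```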
### `CompletionStatus`
This `Enum`, defined in [`tasks/evaluation.py`](../tasks/evaluation.py:11), provides a standardized set of outcomes for a task:

* `SUCCESS`
* `FAILED_SCORE_ZERO`
* `FAILED_PARTIAL_SCORE`
* `TIMED_OUT`
* `NO_SCORE_LOGGED`
* `LOG_FILE_ERROR`
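
For reference, a sketch of the enum's shape (member values are assumptions; see `tasks/evaluation.py:11` for the actual definition):

```python
from enum import Enum

class CompletionStatus(Enum):
    # Member values are illustrative; the real definition may differ.
    SUCCESS = "SUCCESS"
    FAILED_SCORE_ZERO = "FAILED_SCORE_ZERO"
    FAILED_PARTIAL_SCORE = "FAILED_PARTIAL_SCORE"
    TIMED_OUT = "TIMED_OUT"
    NO_SCORE_LOGGED = "NO_SCORE_LOGGED"
    LOG_FILE_ERROR = "LOG_FILE_ERROR"
```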
## Extension Points for Custom Analysis
The system is designed to be easily extended. The primary extension point is the final DataFrame generated by [`aggregate_results_to_dataframe()`](../tasks/evaluation.py:170).

Since all the detailed results are available in a structured DataFrame, you can perform custom analysis using the full power of the Pandas library (see the example script below). You can write your own scripts to:

* Load the `detailed_results.csv` file.
* Perform custom aggregations, filtering, and statistical analysis.
* Generate new plots and visualizations.
* Correlate evaluation results with other data sources.
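
A minimal example of such a script, assuming the CSV exposes the flattened `TaskRunOutcome` fields as columns:

```python
import pandas as pd

# Load the per-task results produced by the evaluation pipeline.
df = pd.read_csv("detailed_results.csv")

# Success rate per model and task type (column names assume the flattened
# TaskRunOutcome fields; adjust to match the actual CSV header).
success_rates = (
    df.groupby(["model_name", "task_type"])["overall_is_successful"]
    .mean()
    .sort_values(ascending=False)
)
print(success_rates)
```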