# Mindcraft Evaluation System - Developer Guide
This guide provides technical documentation for developers working with the Mindcraft evaluation system.
## Architecture Overview
The evaluation system is modular and extensible by design. The core components are:
* **`evaluation_script.py`**: The main entry point for running experiments. It handles setting up the environment, launching servers and agents, and collecting results.
* **`evaluation.py`**: This module contains the core logic for analyzing and evaluating task outcomes. It defines the data structures for representing results and provides functions for extracting and aggregating them.
* **`analyse_results.py`**: A script for post-experiment analysis. It can download results from S3, process them using the `evaluation.py` module, and generate detailed reports.
The data flow is as follows:
1. [`evaluation_script.py`](../tasks/evaluation_script.py:1) runs the experiments and generates raw JSON log files for each agent in an experiment folder.
2. During or after the experiment, [`evaluation_script.py`](../tasks/evaluation_script.py:1) or [`analyse_results.py`](../tasks/analyse_results.py:1) is used to process these logs.
3. For each task folder, [`extract_task_outcome()`](../tasks/evaluation.py:113) is called.
4. [`extract_task_outcome()`](../tasks/evaluation.py:113) calls [`analyze_agent_log()`](../tasks/evaluation.py:47) for each agent's log file to get an [`AgentOutcome`](../tasks/evaluation.py:21).
5. The individual [`AgentOutcome`](../tasks/evaluation.py:21) objects are aggregated into a single [`TaskRunOutcome`](../tasks/evaluation.py:31).
6. Finally, all [`TaskRunOutcome`](../tasks/evaluation.py:31) objects are converted into a Pandas DataFrame by [`aggregate_results_to_dataframe()`](../tasks/evaluation.py:170) for easy analysis and reporting.
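
The sketch below wires these steps together end to end. The experiment folder layout, the `task_definition.json` filename, and the assumption that the repository root is on `PYTHONPATH` are all illustrative; only the `evaluation.py` calls themselves come from this guide:
```python
# Illustrative pipeline sketch (folder layout and file names are assumptions).
import json
from pathlib import Path

from tasks.evaluation import extract_task_outcome, aggregate_results_to_dataframe

experiment_dir = Path("experiments/my_experiment")  # hypothetical experiment folder

outcomes = []
for task_folder in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
    # Hypothetical: one task-definition JSON stored alongside the agent logs.
    task_definition = json.loads((task_folder / "task_definition.json").read_text())
    outcomes.append(extract_task_outcome(str(task_folder), task_definition))

df = aggregate_results_to_dataframe(outcomes)
print(df.head())
```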
## API Documentation for `tasks/evaluation.py`
The [`tasks/evaluation.py`](../tasks/evaluation.py:1) module provides the core functions for evaluating task results.
### `analyze_agent_log(file_path: str) -> AgentOutcome`
* **Description**: Analyzes a single agent's JSON log file. It extracts the score, timeout status, and final system message.
* **Arguments**:
* `file_path` (str): The path to the agent's log file.
* **Returns**: An [`AgentOutcome`](#agentoutcome) data class containing the results for a single agent.
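
A minimal usage sketch, assuming the repository root is on `PYTHONPATH` and using a hypothetical log path:
```python
from tasks.evaluation import analyze_agent_log, CompletionStatus

outcome = analyze_agent_log("experiments/my_experiment/task_1/agent_0.json")  # hypothetical path
if not outcome.agent_log_processed:
    print("Log could not be parsed:", outcome.parsing_errors)
elif outcome.completion_status == CompletionStatus.TIMED_OUT:
    print("Agent timed out; raw score:", outcome.raw_score)
else:
    print("Final system message:", outcome.final_system_message)
```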
### `extract_task_outcome(folder_path: str, task_definition: Dict[str, Any]) -> TaskRunOutcome`
* **Description**: Orchestrates the analysis of a single task run folder. It finds all agent logs, calls `analyze_agent_log` for each, and aggregates the results.
* **Arguments**:
* `folder_path` (str): The path to the folder containing the agent logs for a single task run.
* `task_definition` (dict): The definition of the task, used to enrich the results with metadata.
* **Returns**: A [`TaskRunOutcome`](#taskrunoutcome) data class containing the aggregated results for the task run.
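
For example, a single task folder might be analyzed like this. The `task_definition` keys shown are hypothetical; the real schema comes from your task definition files:
```python
from tasks.evaluation import extract_task_outcome

task_definition = {  # hypothetical example values
    "task_id": "multiagent_cooking_1",
    "type": "cooking",
    "agent_count": 2,
}
result = extract_task_outcome("experiments/my_experiment/multiagent_cooking_1", task_definition)
print(result.overall_completion_status, result.overall_raw_score)
```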
### `aggregate_results_to_dataframe(task_outcomes: List[TaskRunOutcome]) -> pd.DataFrame`
* **Description**: Converts a list of `TaskRunOutcome` objects into a Pandas DataFrame, which is used for all further analysis and reporting.
* **Arguments**:
* `task_outcomes` (list): A list of `TaskRunOutcome` objects.
* **Returns**: A `pd.DataFrame` with the flattened and aggregated results.
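
A short sketch of the final aggregation step. Writing the DataFrame to `detailed_results.csv` (the file consumed in the extension section below) is an assumption about how that file is produced:
```python
from typing import List

from tasks.evaluation import TaskRunOutcome, aggregate_results_to_dataframe

outcomes: List[TaskRunOutcome] = []  # populated via extract_task_outcome(), as above
df = aggregate_results_to_dataframe(outcomes)
df.to_csv("detailed_results.csv", index=False)
```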
## Data Structure Specifications
The evaluation system uses two primary data classes to structure the results:
### `AgentOutcome`
Defined in [`tasks/evaluation.py`](../tasks/evaluation.py:21), this data class holds the results for a single agent's participation in a task.
| Field | Type | Description |
| --------------------- | ------------------------ | ------------------------------------------------------ |
| `raw_score` | `float` | The numerical score achieved by the agent. |
| `completion_status` | [`CompletionStatus`](#completionstatus) | The granular status of the agent's task attempt. |
| `final_system_message`| `str` | The final system message from the log. |
| `agent_log_processed` | `bool` | Whether the agent's log was successfully processed. |
| `parsing_errors` | `List[str]` | A list of any errors encountered during parsing. |
| `timed_out` | `bool` | `True` if the agent timed out. |
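
Based on the table above, the class likely looks roughly like the following sketch (field order and defaults are assumptions):
```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentOutcome:
    raw_score: float
    completion_status: CompletionStatus  # enum documented below
    final_system_message: str
    agent_log_processed: bool
    parsing_errors: List[str] = field(default_factory=list)  # default is an assumption
    timed_out: bool = False                                  # default is an assumption
```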
### `TaskRunOutcome`
Defined in [`tasks/evaluation.py`](../tasks/evaluation.py:31), this data class aggregates the outcomes from all agents involved in a single task run.
| Field | Type | Description |
| ----------------------------- | --------------------- | ------------------------------------------------------------ |
| `task_id` | `str` | The unique identifier for the task. |
| `model_name` | `str` | The name of the model used. |
| `agent_count` | `int` | The number of agents that participated in the task. |
| `task_type` | `str` | The type of the task (e.g., `cooking`, `crafting`). |
| `overall_raw_score` | `float` | The highest score achieved among all agents. |
| `overall_is_successful` | `bool` | `True` if the task was successfully completed by any agent. |
| `overall_completion_status` | [`CompletionStatus`](#completionstatus) | The aggregated completion status for the entire task. |
| `total_agent_logs_found` | `int` | The number of agent log files found and processed. |
| `agent_outcomes` | `List[AgentOutcome]` | A list of `AgentOutcome` objects for each agent. |
| `task_definition_metrics` | `Dict[str, Any]` | A dictionary of metrics from the task definition file. |
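
A matching sketch of `TaskRunOutcome`, again reconstructed from the field table (defaults are assumptions):
```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class TaskRunOutcome:
    task_id: str
    model_name: str
    agent_count: int
    task_type: str
    overall_raw_score: float
    overall_is_successful: bool
    overall_completion_status: CompletionStatus
    total_agent_logs_found: int
    agent_outcomes: List[AgentOutcome] = field(default_factory=list)       # default is an assumption
    task_definition_metrics: Dict[str, Any] = field(default_factory=dict)  # default is an assumption
```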
### `CompletionStatus`
This `Enum`, defined in [`tasks/evaluation.py`](../tasks/evaluation.py:11), provides a standardized set of outcomes for a task.
* `SUCCESS`
* `FAILED_SCORE_ZERO`
* `FAILED_PARTIAL_SCORE`
* `TIMED_OUT`
* `NO_SCORE_LOGGED`
* `LOG_FILE_ERROR`
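
In code, the enum presumably looks like this sketch (only the member names are documented above; the string values are assumptions):
```python
from enum import Enum

class CompletionStatus(Enum):
    SUCCESS = "SUCCESS"
    FAILED_SCORE_ZERO = "FAILED_SCORE_ZERO"
    FAILED_PARTIAL_SCORE = "FAILED_PARTIAL_SCORE"
    TIMED_OUT = "TIMED_OUT"
    NO_SCORE_LOGGED = "NO_SCORE_LOGGED"
    LOG_FILE_ERROR = "LOG_FILE_ERROR"
```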
## Extension Points for Custom Analysis
The system is designed to be easily extended. The primary extension point is the final DataFrame generated by [`aggregate_results_to_dataframe()`](../tasks/evaluation.py:170).
Because all detailed results are available in a structured DataFrame, you can perform custom analysis with standard Pandas operations. You can write your own scripts to:
* Load the `detailed_results.csv` file.
* Perform custom aggregations, filtering, and statistical analysis.
* Generate new plots and visualizations.
* Correlate evaluation results with other data sources.
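
For example, a custom script might compute per-model success rates. The column names here are assumed to mirror the flattened `TaskRunOutcome` fields documented above:
```python
import pandas as pd

df = pd.read_csv("detailed_results.csv")

# Success rate per model and task type (columns assumed from TaskRunOutcome).
success_rates = (
    df.groupby(["model_name", "task_type"])["overall_is_successful"]
      .mean()
      .unstack(fill_value=0.0)
)
print(success_rates)
```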