# Mindcraft Evaluation System - Developer Guide
This guide provides technical documentation for developers working with the Mindcraft evaluation system.
## Architecture Overview
The evaluation system is modular and extensible by design. The core components are:
* **`evaluation_script.py`**: The main entry point for running experiments. It handles setting up the environment, launching servers and agents, and collecting results.
* **`evaluation.py`**: This module contains the core logic for analyzing and evaluating task outcomes. It defines the data structures for representing results and provides functions for extracting and aggregating them.
* **`analyse_results.py`**: A script for post-experiment analysis. It can download results from S3, process them using the `evaluation.py` module, and generate detailed reports.
The data flow is as follows:
1. [`evaluation_script.py`](../tasks/evaluation_script.py:1) runs the experiments and generates raw JSON log files for each agent in an experiment folder.
2. During or after the experiment, [`evaluation_script.py`](../tasks/evaluation_script.py:1) or [`analyse_results.py`](../tasks/analyse_results.py:1) is used to process these logs.
3. For each task folder, [`extract_task_outcome()`](../tasks/evaluation.py:113) is called.
4. [`extract_task_outcome()`](../tasks/evaluation.py:113) calls [`analyze_agent_log()`](../tasks/evaluation.py:47) for each agent's log file to get an [`AgentOutcome`](../tasks/evaluation.py:21).
5. The individual [`AgentOutcome`](../tasks/evaluation.py:21) objects are aggregated into a single [`TaskRunOutcome`](../tasks/evaluation.py:31).
6. Finally, all [`TaskRunOutcome`](../tasks/evaluation.py:31) objects are converted into a Pandas DataFrame by [`aggregate_results_to_dataframe()`](../tasks/evaluation.py:170) for easy analysis and reporting.
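
The sketch below wires these steps together end to end. The experiment folder layout, the `task_definition.json` filename, and the assumption that the repository root is on `PYTHONPATH` are all illustrative; only the `evaluation.py` calls themselves come from this guide:
```python
# Illustrative pipeline sketch (folder layout and file names are assumptions).
import json
from pathlib import Path

from tasks.evaluation import extract_task_outcome, aggregate_results_to_dataframe

experiment_dir = Path("experiments/my_experiment")  # hypothetical experiment folder

outcomes = []
for task_folder in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
    # Hypothetical: one task-definition JSON stored alongside the agent logs.
    task_definition = json.loads((task_folder / "task_definition.json").read_text())
    outcomes.append(extract_task_outcome(str(task_folder), task_definition))

df = aggregate_results_to_dataframe(outcomes)
print(df.head())
```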
## API Documentation for `tasks/evaluation.py`
The [`tasks/evaluation.py`](../tasks/evaluation.py:1) module provides the core functions for evaluating task results.
### `analyze_agent_log(file_path: str) -> AgentOutcome`
* **Description**: Analyzes a single agent's JSON log file. It extracts the score, timeout status, and final system message.
* **Arguments**:
* `file_path` (str): The path to the agent's log file.
* **Returns**: An [`AgentOutcome`](#agentoutcome) data class containing the results for a single agent.
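
A minimal usage sketch, assuming the repository root is on `PYTHONPATH` and using a hypothetical log path:
```python
from tasks.evaluation import analyze_agent_log, CompletionStatus

outcome = analyze_agent_log("experiments/my_experiment/task_1/agent_0.json")  # hypothetical path
if not outcome.agent_log_processed:
    print("Log could not be parsed:", outcome.parsing_errors)
elif outcome.completion_status == CompletionStatus.TIMED_OUT:
    print("Agent timed out; raw score:", outcome.raw_score)
else:
    print("Final system message:", outcome.final_system_message)
```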
### `extract_task_outcome(folder_path: str, task_definition: Dict[str, Any]) -> TaskRunOutcome`
* **Description**: Orchestrates the analysis of a single task run folder. It finds all agent logs, calls `analyze_agent_log` for each, and aggregates the results.
* **Arguments**:
* `folder_path` (str): The path to the folder containing the agent logs for a single task run.
* `task_definition` (dict): The definition of the task, used to enrich the results with metadata.
* **Returns**: A [`TaskRunOutcome`](#taskrunoutcome) data class containing the aggregated results for the task run.
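
For example, a single task folder might be analyzed like this. The `task_definition` keys shown are hypothetical; the real schema comes from your task definition files:
```python
from tasks.evaluation import extract_task_outcome

task_definition = {  # hypothetical example values
    "task_id": "multiagent_cooking_1",
    "type": "cooking",
    "agent_count": 2,
}
result = extract_task_outcome("experiments/my_experiment/multiagent_cooking_1", task_definition)
print(result.overall_completion_status, result.overall_raw_score)
```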
### `aggregate_results_to_dataframe(task_outcomes: List[TaskRunOutcome]) -> pd.DataFrame`
* **Description**: Converts a list of `TaskRunOutcome` objects into a Pandas DataFrame, which is used for all further analysis and reporting.
* **Arguments**:
* `task_outcomes` (list): A list of `TaskRunOutcome` objects.
* **Returns**: A `pd.DataFrame` with the flattened and aggregated results.
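
A short sketch of the final aggregation step. Writing the DataFrame to `detailed_results.csv` (the file consumed in the extension section below) is an assumption about how that file is produced:
```python
from typing import List

from tasks.evaluation import TaskRunOutcome, aggregate_results_to_dataframe

outcomes: List[TaskRunOutcome] = []  # populated via extract_task_outcome(), as above
df = aggregate_results_to_dataframe(outcomes)
df.to_csv("detailed_results.csv", index=False)
```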
## Data Structure Specifications
The evaluation system uses two primary data classes to structure the results:
### `AgentOutcome`
Defined in [`tasks/evaluation.py`](../tasks/evaluation.py:21), this data class holds the results for a single agent's participation in a task.
| Field | Type | Description |
| --------------------- | ------------------------ | ------------------------------------------------------ |
| `raw_score` | `float` | The numerical score achieved by the agent. |
| `completion_status` | [`CompletionStatus`](#completionstatus) | The granular status of the agent's task attempt. |
| `final_system_message`| `str` | The final system message from the log. |
| `agent_log_processed` | `bool` | Whether the agent's log was successfully processed. |
| `parsing_errors` | `List[str]` | A list of any errors encountered during parsing. |
| `timed_out` | `bool` | `True` if the agent timed out. |
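
Based on the table above, the class likely looks roughly like the following sketch (field order and defaults are assumptions):
```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentOutcome:
    raw_score: float
    completion_status: CompletionStatus  # enum documented below
    final_system_message: str
    agent_log_processed: bool
    parsing_errors: List[str] = field(default_factory=list)  # default is an assumption
    timed_out: bool = False                                  # default is an assumption
```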
### `TaskRunOutcome`
Defined in [`tasks/evaluation.py`](../tasks/evaluation.py:31), this data class aggregates the outcomes from all agents involved in a single task run.
| Field | Type | Description |
| ----------------------------- | --------------------- | ------------------------------------------------------------ |
| `task_id` | `str` | The unique identifier for the task. |
| `model_name` | `str` | The name of the model used. |
| `agent_count` | `int` | The number of agents that participated in the task. |
| `task_type` | `str` | The type of the task (e.g., `cooking`, `crafting`). |
| `overall_raw_score` | `float` | The highest score achieved among all agents. |
| `overall_is_successful` | `bool` | `True` if the task was successfully completed by any agent. |
| `overall_completion_status` | [`CompletionStatus`](#completionstatus) | The aggregated completion status for the entire task. |
| `total_agent_logs_found` | `int` | The number of agent log files found and processed. |
| `agent_outcomes` | `List[AgentOutcome]` | A list of `AgentOutcome` objects for each agent. |
| `task_definition_metrics` | `Dict[str, Any]` | A dictionary of metrics from the task definition file. |
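
A matching sketch of `TaskRunOutcome`, again reconstructed from the field table (defaults are assumptions):
```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class TaskRunOutcome:
    task_id: str
    model_name: str
    agent_count: int
    task_type: str
    overall_raw_score: float
    overall_is_successful: bool
    overall_completion_status: CompletionStatus
    total_agent_logs_found: int
    agent_outcomes: List[AgentOutcome] = field(default_factory=list)       # default is an assumption
    task_definition_metrics: Dict[str, Any] = field(default_factory=dict)  # default is an assumption
```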
### `CompletionStatus`
This `Enum`, defined in [`tasks/evaluation.py`](../tasks/evaluation.py:11), provides a standardized set of outcomes for a task.
* `SUCCESS`
* `FAILED_SCORE_ZERO`
* `FAILED_PARTIAL_SCORE`
* `TIMED_OUT`
* `NO_SCORE_LOGGED`
* `LOG_FILE_ERROR`
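
In code, the enum presumably looks like this sketch (only the member names are documented above; the string values are assumptions):
```python
from enum import Enum

class CompletionStatus(Enum):
    SUCCESS = "SUCCESS"
    FAILED_SCORE_ZERO = "FAILED_SCORE_ZERO"
    FAILED_PARTIAL_SCORE = "FAILED_PARTIAL_SCORE"
    TIMED_OUT = "TIMED_OUT"
    NO_SCORE_LOGGED = "NO_SCORE_LOGGED"
    LOG_FILE_ERROR = "LOG_FILE_ERROR"
```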
## Extension Points for Custom Analysis
The system is designed to be easily extended. The primary extension point is the final DataFrame generated by [`aggregate_results_to_dataframe()`](../tasks/evaluation.py:170).
Because all detailed results are available in a structured DataFrame, you can perform custom analysis with standard Pandas operations. You can write your own scripts to:
* Load the `detailed_results.csv` file.
* Perform custom aggregations, filtering, and statistical analysis.
* Generate new plots and visualizations.
* Correlate evaluation results with other data sources.
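
For example, a custom script might compute per-model success rates. The column names here are assumed to mirror the flattened `TaskRunOutcome` fields documented above:
```python
import pandas as pd

df = pd.read_csv("detailed_results.csv")

# Success rate per model and task type (columns assumed from TaskRunOutcome).
success_rates = (
    df.groupby(["model_name", "task_type"])["overall_is_successful"]
      .mean()
      .unstack(fill_value=0.0)
)
print(success_rates)
```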