# Mindcraft Evaluation System - User Guide

This guide provides instructions on how to use the updated evaluation system for Mindcraft tasks.

## Running an Evaluation with `evaluation_script.py`

[`evaluation_script.py`](../tasks/evaluation_script.py:1) is the primary script for running task evaluations. It launches the necessary Minecraft servers and agents to perform the tasks defined in a given task file.

### Key Features

* **Parallel Execution**: Run multiple experiments in parallel to speed up evaluation.
* **Flexible Configuration**: Configure agent models, APIs, and other parameters through command-line arguments.
* **Automatic Results Aggregation**: The script continuously monitors and aggregates results as experiments run.

### Usage

The script is run from the command line:

```bash
python tasks/evaluation_script.py [OPTIONS]
```

### Common Arguments

* `--task_path`: Path to the JSON file containing task definitions (e.g., `tasks/multiagent_crafting_tasks.json`).
* `--num_agents`: The number of agents to use for each task.
* `--num_exp`: The number of times to repeat each task.
* `--num_parallel`: The number of parallel servers to run for the evaluation.
* `--exp_name`: A descriptive name for your experiment run.
* `--model`: The model to use for the agents (e.g., `gpt-4o-mini`).
* `--api`: The API to use (e.g., `openai`).
* `--check`: Path to an existing experiment folder to re-evaluate results without running new experiments.

### Example

To run an experiment named `crafting_test` with 2 agents on the crafting tasks, using 4 parallel servers:

```bash
python tasks/evaluation_script.py \
    --task_path tasks/multiagent_crafting_tasks.json \
    --exp_name crafting_test \
    --num_agents 2 \
    --num_parallel 4
```
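If you need to run several configurations back to back, the same flags can also be driven programmatically. The wrapper below is only a minimal sketch that shells out to the CLI documented above, assuming it is run from the repository root; adjust paths and values to your setup.

```python
import subprocess

# Sketch: sweep the documented --num_agents flag across a few values,
# reusing the other arguments from the example above.
for num_agents in (1, 2):
    cmd = [
        "python", "tasks/evaluation_script.py",
        "--task_path", "tasks/multiagent_crafting_tasks.json",
        "--exp_name", f"crafting_test_{num_agents}_agents",
        "--num_agents", str(num_agents),
        "--num_parallel", "4",
        "--model", "gpt-4o-mini",
        "--api", "openai",
    ]
    subprocess.run(cmd, check=True)  # stop the sweep if a run fails
```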
## Analyzing Results with `analyse_results.py`

Once an experiment is complete, you can use [`analyse_results.py`](../tasks/analyse_results.py:1) to perform a detailed analysis of the results.

### Features

* **S3 Integration**: Download experiment results directly from an S3 bucket.
* **Local Analysis**: Analyze results from a local directory.
* **Detailed Reports**: Generate a CSV file with detailed metrics for each task run.

### Usage

```bash
python tasks/analyse_results.py [OPTIONS]
```

### Arguments

* `--local_dir`: The local directory containing the experiment folders to analyze.
* `--task_file_path`: Path to the original task definition file used for the experiment.
* `--s3_download`: A flag to enable downloading results from S3.
* `--aws_bucket_name`: The name of the S3 bucket.
* `--s3_folder_prefix`: The folder prefix in the S3 bucket where results are stored.

### Example

To analyze the results from a local experiment folder:

```bash
python tasks/analyse_results.py \
    --local_dir experiments/crafting_test_06-15_21-38 \
    --task_file_path tasks/multiagent_crafting_tasks.json
```
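If your results are stored in S3, the `--s3_download`, `--aws_bucket_name`, and `--s3_folder_prefix` arguments listed above can be combined with a local analysis run. The snippet below is a hypothetical sketch that simply shells out to the same CLI; the bucket name and prefix are placeholders, and the exact flag combination should be verified against `analyse_results.py` itself.

```python
import subprocess

# Hypothetical S3-backed analysis run; the bucket name and prefix are placeholders.
cmd = [
    "python", "tasks/analyse_results.py",
    "--s3_download",                                      # flag documented above
    "--aws_bucket_name", "my-results-bucket",             # placeholder
    "--s3_folder_prefix", "experiments/crafting_test/",   # placeholder
    "--local_dir", "experiments/crafting_test_06-15_21-38",
    "--task_file_path", "tasks/multiagent_crafting_tasks.json",
]
subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```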
## Understanding the Rich Output Format

The evaluation system produces two main output files in your experiment folder:

1. `results.json`: A high-level summary of the experiment.
2. `detailed_results.csv`: A detailed, row-per-task breakdown of the results.

### Key Columns in `detailed_results.csv`

The most important columns are listed below; a short sketch showing how to load and summarize these files follows the list.

* **`task_id`**: The unique identifier for the task.
* **`overall_is_successful`**: A boolean (`True`/`False`) indicating whether the task was completed successfully.
* **`overall_completion_status`**: A more granular status of the task outcome. See [`CompletionStatus`](../tasks/evaluation.py:11) for possible values:
    * `SUCCESS`: The task was completed successfully.
    * `FAILED_SCORE_ZERO`: The task failed with a score of 0.
    * `FAILED_PARTIAL_SCORE`: The task failed but achieved a partial score.
    * `TIMED_OUT`: The task failed due to a timeout.
    * `NO_SCORE_LOGGED`: No score was recorded for the task.
    * `LOG_FILE_ERROR`: An error occurred while processing the agent's log file.
* **`overall_raw_score`**: The highest score achieved by any agent for the task.
* **`metric_*`**: A set of columns prefixed with `metric_` that contain difficulty metrics from the task definition file.
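Both files are plain JSON/CSV, so they can be inspected with standard tooling. The snippet below is a minimal sketch, assuming `pandas` is installed and using the experiment folder from the earlier example; it relies only on the column names documented above.

```python
import json

import pandas as pd

exp_dir = "experiments/crafting_test_06-15_21-38"  # folder from the earlier example

# High-level summary: load it and inspect the top-level entries rather than
# assuming a particular schema.
with open(f"{exp_dir}/results.json") as f:
    summary = json.load(f)
print("results.json entries:", list(summary))

# Row-per-task breakdown, using the documented column names.
df = pd.read_csv(f"{exp_dir}/detailed_results.csv")
print("Success rate:", df["overall_is_successful"].mean())  # fraction of successful runs
print(df["overall_completion_status"].value_counts())       # breakdown by outcome

# Difficulty metrics are spread across the metric_* columns.
metric_cols = [c for c in df.columns if c.startswith("metric_")]
print("Difficulty metric columns:", metric_cols)
```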
## Migration Guide

Migrating from the old evaluation system to the new one is straightforward:

1. **Use the new scripts**: Use [`evaluation_script.py`](../tasks/evaluation_script.py:1) to run experiments and [`analyse_results.py`](../tasks/analyse_results.py:1) for analysis.
2. **Familiarize yourself with the new output**: The primary output is now the `detailed_results.csv` file. The analysis logic that was previously scattered in various scripts is now centralized and produces this single, comprehensive report.
3. **Leverage the new features**: Take advantage of parallel execution and simplified configuration to run your evaluations more efficiently.