# Mindcraft Evaluation System - User Guide

This guide provides instructions on how to use the updated evaluation system for Mindcraft tasks.

## Running an Evaluation with `evaluation_script.py`

[`evaluation_script.py`](../tasks/evaluation_script.py:1) is the primary script for running task evaluations. It launches the necessary Minecraft servers and agents to perform the tasks defined in a given task file.

### Key Features

* **Parallel Execution**: Run multiple experiments in parallel to speed up evaluation.
* **Flexible Configuration**: Configure agent models, APIs, and other parameters through command-line arguments.
* **Automatic Results Aggregation**: The script continuously monitors and aggregates results as experiments run.

### Usage

The script is run from the command line:

```bash
python tasks/evaluation_script.py [OPTIONS]
```

### Common Arguments

* `--task_path`: Path to the JSON file containing task definitions (e.g., `tasks/multiagent_crafting_tasks.json`).
* `--num_agents`: The number of agents to use for each task.
* `--num_exp`: The number of times to repeat each task.
* `--num_parallel`: The number of parallel servers to run for the evaluation.
* `--exp_name`: A descriptive name for your experiment run.
* `--model`: The model to use for the agents (e.g., `gpt-4o-mini`).
* `--api`: The API to use (e.g., `openai`).
* `--check`: Path to an existing experiment folder to re-evaluate results without running new experiments.

### Example

To run an experiment named `crafting_test` with 2 agents on the crafting tasks, using 4 parallel servers:

```bash
python tasks/evaluation_script.py \
    --task_path tasks/multiagent_crafting_tasks.json \
    --exp_name crafting_test \
    --num_agents 2 \
    --num_parallel 4
```
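If you need to run several configurations back to back, the same flags can also be driven programmatically. The wrapper below is only a minimal sketch that shells out to the CLI documented above, assuming it is run from the repository root; adjust paths and values to your setup.

```python
import subprocess

# Sketch: sweep the documented --num_agents flag across a few values,
# reusing the other arguments from the example above.
for num_agents in (1, 2):
    cmd = [
        "python", "tasks/evaluation_script.py",
        "--task_path", "tasks/multiagent_crafting_tasks.json",
        "--exp_name", f"crafting_test_{num_agents}_agents",
        "--num_agents", str(num_agents),
        "--num_parallel", "4",
        "--model", "gpt-4o-mini",
        "--api", "openai",
    ]
    subprocess.run(cmd, check=True)  # stop the sweep if a run fails
```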
## Analyzing Results with `analyse_results.py`

Once an experiment is complete, you can use [`analyse_results.py`](../tasks/analyse_results.py:1) to perform a detailed analysis of the results.

### Features

* **S3 Integration**: Download experiment results directly from an S3 bucket.
* **Local Analysis**: Analyze results from a local directory.
* **Detailed Reports**: Generate a CSV file with detailed metrics for each task run.

### Usage

```bash
python tasks/analyse_results.py [OPTIONS]
```

### Arguments

* `--local_dir`: The local directory containing the experiment folders to analyze.
* `--task_file_path`: Path to the original task definition file used for the experiment.
* `--s3_download`: A flag to enable downloading results from S3.
* `--aws_bucket_name`: The name of the S3 bucket.
* `--s3_folder_prefix`: The folder prefix in the S3 bucket where results are stored.

### Example

To analyze the results from a local experiment folder:

```bash
python tasks/analyse_results.py \
    --local_dir experiments/crafting_test_06-15_21-38 \
    --task_file_path tasks/multiagent_crafting_tasks.json
```
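If your results are stored in S3, the `--s3_download`, `--aws_bucket_name`, and `--s3_folder_prefix` arguments listed above can be combined with a local analysis run. The snippet below is a hypothetical sketch that simply shells out to the same CLI; the bucket name and prefix are placeholders, and the exact flag combination should be verified against `analyse_results.py` itself.

```python
import subprocess

# Hypothetical S3-backed analysis run; the bucket name and prefix are placeholders.
cmd = [
    "python", "tasks/analyse_results.py",
    "--s3_download",                                      # flag documented above
    "--aws_bucket_name", "my-results-bucket",             # placeholder
    "--s3_folder_prefix", "experiments/crafting_test/",   # placeholder
    "--local_dir", "experiments/crafting_test_06-15_21-38",
    "--task_file_path", "tasks/multiagent_crafting_tasks.json",
]
subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```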
## Understanding the Rich Output Format

The evaluation system produces two main output files in your experiment folder:

1. `results.json`: A high-level summary of the experiment.
2. `detailed_results.csv`: A detailed, row-per-task breakdown of the results.

### Key Columns in `detailed_results.csv`

The most important columns are listed below; a short sketch showing how to load and summarize these files follows the list.

* **`task_id`**: The unique identifier for the task.
* **`overall_is_successful`**: A boolean (`True`/`False`) indicating whether the task was completed successfully.
* **`overall_completion_status`**: A more granular status of the task outcome. See [`CompletionStatus`](../tasks/evaluation.py:11) for possible values:
    * `SUCCESS`: The task was completed successfully.
    * `FAILED_SCORE_ZERO`: The task failed with a score of 0.
    * `FAILED_PARTIAL_SCORE`: The task failed but achieved a partial score.
    * `TIMED_OUT`: The task failed due to a timeout.
    * `NO_SCORE_LOGGED`: No score was recorded for the task.
    * `LOG_FILE_ERROR`: An error occurred while processing the agent's log file.
* **`overall_raw_score`**: The highest score achieved by any agent for the task.
* **`metric_*`**: A set of columns prefixed with `metric_` that contain difficulty metrics from the task definition file.
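Both files are plain JSON/CSV, so they can be inspected with standard tooling. The snippet below is a minimal sketch, assuming `pandas` is installed and using the experiment folder from the earlier example; it relies only on the column names documented above.

```python
import json

import pandas as pd

exp_dir = "experiments/crafting_test_06-15_21-38"  # folder from the earlier example

# High-level summary: load it and inspect the top-level entries rather than
# assuming a particular schema.
with open(f"{exp_dir}/results.json") as f:
    summary = json.load(f)
print("results.json entries:", list(summary))

# Row-per-task breakdown, using the documented column names.
df = pd.read_csv(f"{exp_dir}/detailed_results.csv")
print("Success rate:", df["overall_is_successful"].mean())  # fraction of successful runs
print(df["overall_completion_status"].value_counts())       # breakdown by outcome

# Difficulty metrics are spread across the metric_* columns.
metric_cols = [c for c in df.columns if c.startswith("metric_")]
print("Difficulty metric columns:", metric_cols)
```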
## Migration Guide

Migrating from the old evaluation system to the new one is straightforward:

1. **Use the new scripts**: Use [`evaluation_script.py`](../tasks/evaluation_script.py:1) to run experiments and [`analyse_results.py`](../tasks/analyse_results.py:1) for analysis.
2. **Familiarize yourself with the new output**: The primary output is now the `detailed_results.csv` file. The analysis logic that was previously scattered in various scripts is now centralized and produces this single, comprehensive report.
3. **Leverage the new features**: Take advantage of parallel execution and simplified configuration to run your evaluations more efficiently.