# Mindcraft Evaluation System - User Guide
This guide provides instructions on how to use the updated evaluation system for Mindcraft tasks.
## Running an Evaluation with `evaluation_script.py`

`evaluation_script.py` is the primary script for running task evaluations. It launches the necessary Minecraft servers and agents to perform the tasks defined in a given task file.
### Key Features

- **Parallel Execution**: Run multiple experiments in parallel to speed up evaluation.
- **Flexible Configuration**: Easily configure agent models, APIs, and other parameters through command-line arguments.
- **Automatic Results Aggregation**: The script continuously monitors and aggregates results as experiments run.
### Usage

The script is run from the command line:

```bash
python tasks/evaluation_script.py [OPTIONS]
```
### Common Arguments

- `--task_path`: Path to the JSON file containing task definitions (e.g., `tasks/multiagent_crafting_tasks.json`).
- `--num_agents`: The number of agents to use for each task.
- `--num_exp`: The number of times to repeat each task.
- `--num_parallel`: The number of parallel servers to run for the evaluation.
- `--exp_name`: A descriptive name for your experiment run.
- `--model`: The model to use for the agents (e.g., `gpt-4o-mini`).
- `--api`: The API to use (e.g., `openai`).
- `--check`: Path to an existing experiment folder to re-evaluate results without running new experiments.
### Example

To run an experiment named `crafting_test` with 2 agents on the crafting tasks, using 4 parallel servers:

```bash
python tasks/evaluation_script.py \
    --task_path tasks/multiagent_crafting_tasks.json \
    --exp_name crafting_test \
    --num_agents 2 \
    --num_parallel 4
```
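
If you have already run an experiment and only want to recompute its results, point the script at the existing experiment folder with `--check`. A minimal sketch, reusing the illustrative folder name from the analysis example below; depending on your setup, you may also need to pass the original task file:

```bash
python tasks/evaluation_script.py \
    --check experiments/crafting_test_06-15_21-38 \
    --task_path tasks/multiagent_crafting_tasks.json
```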
## Analyzing Results with `analyse_results.py`

Once an experiment is complete, you can use `analyse_results.py` to perform a detailed analysis of the results.
### Features

- **S3 Integration**: Download experiment results directly from an S3 bucket.
- **Local Analysis**: Analyze results from a local directory.
- **Detailed Reports**: Generates a CSV file with detailed metrics for each task run.
### Usage

```bash
python tasks/analyse_results.py [OPTIONS]
```
### Arguments

- `--local_dir`: The local directory containing the experiment folders to analyze.
- `--task_file_path`: Path to the original task definition file used for the experiment.
- `--s3_download`: A flag to enable downloading results from S3.
- `--aws_bucket_name`: The name of the S3 bucket.
- `--s3_folder_prefix`: The folder prefix in the S3 bucket where results are stored.
### Example

To analyze the results from a local experiment folder:

```bash
python tasks/analyse_results.py \
    --local_dir experiments/crafting_test_06-15_21-38 \
    --task_file_path tasks/multiagent_crafting_tasks.json
```
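
To pull results down from S3 before analyzing them, combine the S3 flags described above. A sketch with placeholder values for the bucket name and folder prefix; substitute the ones used by your experiments:

```bash
python tasks/analyse_results.py \
    --s3_download \
    --aws_bucket_name my-mindcraft-results \
    --s3_folder_prefix experiments/crafting_test_06-15_21-38 \
    --task_file_path tasks/multiagent_crafting_tasks.json
```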
## Understanding the Rich Output Format

The evaluation system produces two main output files in your experiment folder:

- `results.json`: A high-level summary of the experiment.
- `detailed_results.csv`: A detailed, row-per-task breakdown of the results.
### Key Columns in `detailed_results.csv`

- `task_id`: The unique identifier for the task.
- `overall_is_successful`: A boolean (`True`/`False`) indicating whether the task was completed successfully.
- `overall_completion_status`: A more granular status of the task outcome. See `CompletionStatus` for possible values:
  - `SUCCESS`: The task was completed successfully.
  - `FAILED_SCORE_ZERO`: The task failed with a score of 0.
  - `FAILED_PARTIAL_SCORE`: The task failed but achieved a partial score.
  - `TIMED_OUT`: The task failed due to a timeout.
  - `NO_SCORE_LOGGED`: No score was recorded for the task.
  - `LOG_FILE_ERROR`: An error occurred while processing the agent's log file.
- `overall_raw_score`: The highest score achieved by any agent for the task.
- `metric_*`: A set of columns prefixed with `metric_` that contain difficulty metrics from the task definition file.
## Migration Guide

Migrating from the old evaluation system to the new one is straightforward:

- **Use the new scripts**: Use `evaluation_script.py` to run experiments and `analyse_results.py` for analysis.
- **Familiarize yourself with the new output**: The primary output is now the `detailed_results.csv` file. The analysis logic that was previously scattered in various scripts is now centralized and produces this single, comprehensive report.
- **Leverage the new features**: Take advantage of parallel execution and simplified configuration to run your evaluations more efficiently.