Mindcraft Evaluation System - User Guide

This guide provides instructions on how to use the updated evaluation system for Mindcraft tasks.

Running an Evaluation with evaluation_script.py

evaluation_script.py is the primary script for running task evaluations. It launches the Minecraft servers and agents needed to perform the tasks defined in a given task file.

Key Features

  • Parallel Execution: Run multiple experiments in parallel to speed up evaluation.
  • Flexible Configuration: Easily configure agent models, APIs, and other parameters through command-line arguments.
  • Automatic Results Aggregation: The script continuously monitors and aggregates results as experiments run.

Usage

The script is run from the command line:

python tasks/evaluation_script.py [OPTIONS]

Common Arguments

  • --task_path: Path to the JSON file containing task definitions (e.g., tasks/multiagent_crafting_tasks.json).
  • --num_agents: The number of agents to use for each task.
  • --num_exp: The number of times to repeat each task.
  • --num_parallel: The number of parallel servers to run for the evaluation.
  • --exp_name: A descriptive name for your experiment run.
  • --model: The model to use for the agents (e.g., gpt-4o-mini).
  • --api: The API to use (e.g., openai).
  • --check: Path to an existing experiment folder to re-evaluate results without running new experiments (see the second example below).

Example

To run an experiment named crafting_test with 2 agents on the crafting tasks, using 4 parallel servers:

python tasks/evaluation_script.py \
    --task_path tasks/multiagent_crafting_tasks.json \
    --exp_name crafting_test \
    --num_agents 2 \
    --num_parallel 4
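
To re-evaluate an existing experiment folder without launching new servers, use the --check argument. This is an illustrative invocation only: the folder name follows the experiments/<exp_name>_<timestamp> pattern used in the analysis example further below, and your setup may require additional arguments alongside --check.

python tasks/evaluation_script.py \
    --check experiments/crafting_test_06-15_21-38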

Analyzing Results with analyse_results.py

Once an experiment is complete, you can use analyse_results.py to perform a detailed analysis of the results.

Features

  • S3 Integration: Download experiment results directly from an S3 bucket.
  • Local Analysis: Analyze results from a local directory.
  • Detailed Reports: Generates a CSV file with detailed metrics for each task run.

Usage

python tasks/analyse_results.py [OPTIONS]

Arguments

  • --local_dir: The local directory containing the experiment folders to analyze.
  • --task_file_path: Path to the original task definition file used for the experiment.
  • --s3_download: A flag to enable downloading results from S3 (see the S3 example below).
  • --aws_bucket_name: The name of the S3 bucket.
  • --s3_folder_prefix: The folder prefix in the S3 bucket where results are stored.

Example

To analyze the results from a local experiment folder:

python tasks/analyse_results.py \
    --local_dir experiments/crafting_test_06-15_21-38 \
    --task_file_path tasks/multiagent_crafting_tasks.json
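
If your results are stored in S3 rather than locally, the same script can download them first using the S3 flags described above. The bucket name and folder prefix below are placeholders, not real values; substitute the ones used by your experiment.

python tasks/analyse_results.py \
    --s3_download \
    --aws_bucket_name my-results-bucket \
    --s3_folder_prefix experiments/crafting_test_06-15_21-38 \
    --task_file_path tasks/multiagent_crafting_tasks.json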

Understanding the Rich Output Format

The evaluation system produces two main output files in your experiment folder:

  1. results.json: A high-level summary of the experiment.
  2. detailed_results.csv: A detailed, row-per-task breakdown of the results.

Key Columns in detailed_results.csv

  • task_id: The unique identifier for the task.
  • overall_is_successful: A boolean (True/False) indicating if the task was completed successfully.
  • overall_completion_status: A more granular status of the task outcome. See CompletionStatus for possible values:
    • SUCCESS: The task was completed successfully.
    • FAILED_SCORE_ZERO: The task failed with a score of 0.
    • FAILED_PARTIAL_SCORE: The task failed but achieved a partial score.
    • TIMED_OUT: The task failed due to a timeout.
    • NO_SCORE_LOGGED: No score was recorded for the task.
    • LOG_FILE_ERROR: An error occurred while processing the agent's log file.
  • overall_raw_score: The highest score achieved by any agent for the task.
  • metric_*: A set of columns prefixed with metric_ that contain difficulty metrics from the task definition file.
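
Because both files use standard formats, they are easy to inspect programmatically. The following is a minimal Python sketch, not part of the toolkit itself, that pretty-prints results.json and summarizes detailed_results.csv with pandas; the experiment folder matches the earlier example and should be replaced with your own.

    import json
    import pandas as pd

    # Experiment folder from the earlier example; point this at your own run.
    exp_dir = "experiments/crafting_test_06-15_21-38"

    # results.json: high-level summary. No particular schema is assumed here;
    # the file is simply loaded and pretty-printed.
    with open(f"{exp_dir}/results.json") as f:
        print(json.dumps(json.load(f), indent=2))

    # detailed_results.csv: one row per task run.
    df = pd.read_csv(f"{exp_dir}/detailed_results.csv")

    # How many runs landed in each completion status (SUCCESS, TIMED_OUT, ...).
    print(df["overall_completion_status"].value_counts())

    # Average raw score per completion status.
    print(df.groupby("overall_completion_status")["overall_raw_score"].mean())

    # Difficulty metric columns carried over from the task definition file.
    print([c for c in df.columns if c.startswith("metric_")])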

Migration Guide

Migrating from the old evaluation system to the new one is straightforward:

  1. Use the new scripts: Use evaluation_script.py to run experiments and analyse_results.py for analysis.
  2. Familiarize yourself with the new output: The primary output is now the detailed_results.csv file. The analysis logic that was previously scattered in various scripts is now centralized and produces this single, comprehensive report.
  3. Leverage the new features: Take advantage of parallel execution and simplified configuration to run your evaluations more efficiently.