Mindcraft Evaluation System - User Guide

This guide provides instructions on how to use the updated evaluation system for Mindcraft tasks.

Running an Evaluation with evaluation_script.py

evaluation_script.py is the primary script for running task evaluations. It launches the Minecraft servers and agents needed to perform the tasks defined in a given task file.

Key Features

  • Parallel Execution: Run multiple experiments in parallel to speed up evaluation.
  • Flexible Configuration: Easily configure agent models, APIs, and other parameters through command-line arguments.
  • Automatic Results Aggregation: The script continuously monitors and aggregates results as experiments run.

Usage

The script is run from the command line:

python tasks/evaluation_script.py [OPTIONS]

Common Arguments

  • --task_path: Path to the JSON file containing task definitions (e.g., tasks/multiagent_crafting_tasks.json).
  • --num_agents: The number of agents to use for each task.
  • --num_exp: The number of times to repeat each task.
  • --num_parallel: The number of parallel servers to run for the evaluation.
  • --exp_name: A descriptive name for your experiment run.
  • --model: The model to use for the agents (e.g., gpt-4o-mini).
  • --api: The API to use (e.g., openai).
  • --check: Path to an existing experiment folder to re-evaluate results without running new experiments (see the second example below).

Example

To run an experiment named crafting_test with 2 agents on the crafting tasks, using 4 parallel servers:

python tasks/evaluation_script.py \
    --task_path tasks/multiagent_crafting_tasks.json \
    --exp_name crafting_test \
    --num_agents 2 \
    --num_parallel 4
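
To re-evaluate an existing experiment folder without launching new servers, use the --check argument. This is an illustrative invocation only: the folder name follows the experiments/<exp_name>_<timestamp> pattern used in the analysis example further below, and your setup may require additional arguments alongside --check.

python tasks/evaluation_script.py \
    --check experiments/crafting_test_06-15_21-38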

Analyzing Results with analyse_results.py

Once an experiment is complete, you can use analyse_results.py to perform a detailed analysis of the results.

Features

  • S3 Integration: Download experiment results directly from an S3 bucket.
  • Local Analysis: Analyze results from a local directory.
  • Detailed Reports: Generates a CSV file with detailed metrics for each task run.

Usage

python tasks/analyse_results.py [OPTIONS]

Arguments

  • --local_dir: The local directory containing the experiment folders to analyze.
  • --task_file_path: Path to the original task definition file used for the experiment.
  • --s3_download: A flag to enable downloading results from S3 (see the S3 example below).
  • --aws_bucket_name: The name of the S3 bucket.
  • --s3_folder_prefix: The folder prefix in the S3 bucket where results are stored.

Example

To analyze the results from a local experiment folder:

python tasks/analyse_results.py \
    --local_dir experiments/crafting_test_06-15_21-38 \
    --task_file_path tasks/multiagent_crafting_tasks.json
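
If your results are stored in S3 rather than locally, the same script can download them first using the S3 flags described above. The bucket name and folder prefix below are placeholders, not real values; substitute the ones used by your experiment.

python tasks/analyse_results.py \
    --s3_download \
    --aws_bucket_name my-results-bucket \
    --s3_folder_prefix experiments/crafting_test_06-15_21-38 \
    --task_file_path tasks/multiagent_crafting_tasks.json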

Understanding the Rich Output Format

The evaluation system produces two main output files in your experiment folder:

  1. results.json: A high-level summary of the experiment.
  2. detailed_results.csv: A detailed, row-per-task breakdown of the results.

Key Columns in detailed_results.csv

  • task_id: The unique identifier for the task.
  • overall_is_successful: A boolean (True/False) indicating if the task was completed successfully.
  • overall_completion_status: A more granular status of the task outcome. See CompletionStatus for possible values:
    • SUCCESS: The task was completed successfully.
    • FAILED_SCORE_ZERO: The task failed with a score of 0.
    • FAILED_PARTIAL_SCORE: The task failed but achieved a partial score.
    • TIMED_OUT: The task failed due to a timeout.
    • NO_SCORE_LOGGED: No score was recorded for the task.
    • LOG_FILE_ERROR: An error occurred while processing the agent's log file.
  • overall_raw_score: The highest score achieved by any agent for the task.
  • metric_*: A set of columns prefixed with metric_ that contain difficulty metrics from the task definition file.
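
Because both files use standard formats, they are easy to inspect programmatically. The following is a minimal Python sketch, not part of the toolkit itself, that pretty-prints results.json and summarizes detailed_results.csv with pandas; the experiment folder matches the earlier example and should be replaced with your own.

    import json
    import pandas as pd

    # Experiment folder from the earlier example; point this at your own run.
    exp_dir = "experiments/crafting_test_06-15_21-38"

    # results.json: high-level summary. No particular schema is assumed here;
    # the file is simply loaded and pretty-printed.
    with open(f"{exp_dir}/results.json") as f:
        print(json.dumps(json.load(f), indent=2))

    # detailed_results.csv: one row per task run.
    df = pd.read_csv(f"{exp_dir}/detailed_results.csv")

    # How many runs landed in each completion status (SUCCESS, TIMED_OUT, ...).
    print(df["overall_completion_status"].value_counts())

    # Average raw score per completion status.
    print(df.groupby("overall_completion_status")["overall_raw_score"].mean())

    # Difficulty metric columns carried over from the task definition file.
    print([c for c in df.columns if c.startswith("metric_")])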

Migration Guide

Migrating from the old evaluation system to the new one is straightforward:

  1. Use the new scripts: Use evaluation_script.py to run experiments and analyse_results.py for analysis.
  2. Familiarize yourself with the new output: The primary output is now the detailed_results.csv file. The analysis logic that was previously scattered in various scripts is now centralized and produces this single, comprehensive report.
  3. Leverage the new features: Take advantage of parallel execution and simplified configuration to run your evaluations more efficiently.