Mindcraft Evaluation System - Developer Guide

This guide provides technical documentation for developers working with the Mindcraft evaluation system.

Architecture Overview

The new evaluation module is designed to be modular and extensible. The core components are:

  • evaluation_script.py: The main entry point for running experiments. It handles setting up the environment, launching servers and agents, and collecting results.
  • evaluation.py: This module contains the core logic for analyzing and evaluating task outcomes. It defines the data structures for representing results and provides functions for extracting and aggregating them.
  • analyse_results.py: A script for post-experiment analysis. It can download results from S3, process them using the evaluation.py module, and generate detailed reports.

The data flow is as follows (a minimal end-to-end sketch follows the list):

  1. evaluation_script.py runs the experiments and generates raw JSON log files for each agent in an experiment folder.
  2. During or after the experiment, evaluation_script.py or analyse_results.py is used to process these logs.
  3. For each task folder, extract_task_outcome() is called.
  4. extract_task_outcome() calls analyze_agent_log() for each agent's log file to get an AgentOutcome.
  5. The individual AgentOutcome objects are aggregated into a single TaskRunOutcome.
  6. Finally, all TaskRunOutcome objects are converted into a Pandas DataFrame by aggregate_results_to_dataframe() for easy analysis and reporting.
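The sketch below strings these steps together. It is illustrative only: the experiment layout, the file names, and the assumption that each task folder name keys into the task definition file are not taken from the actual evaluation_script.py.

```python
import json
from pathlib import Path

from tasks.evaluation import aggregate_results_to_dataframe, extract_task_outcome

# Illustrative layout: one sub-folder per task run, plus a JSON task
# definition file keyed by task_id. Adjust both to your experiment setup.
experiment_dir = Path("experiments/my_experiment")
task_definitions = json.loads(Path("tasks/my_tasks.json").read_text())

outcomes = []
for task_folder in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
    # Assumption: the folder name matches a task_id in the definitions file.
    task_def = task_definitions.get(task_folder.name, {})
    outcomes.append(extract_task_outcome(str(task_folder), task_def))

# Flatten everything into a DataFrame for analysis and reporting.
df = aggregate_results_to_dataframe(outcomes)
print(df.head())
```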

API Documentation for tasks/evaluation.py

The tasks/evaluation.py module provides the core functions for evaluating task results.

analyze_agent_log(file_path: str) -> AgentOutcome

  • Description: Analyzes a single agent's JSON log file, extracting the score, timeout status, and final system message (see the example after this block).
  • Arguments:
    • file_path (str): The path to the agent's log file.
  • Returns: An AgentOutcome data class containing the results for a single agent.
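A minimal call (the log path is illustrative):

```python
from tasks.evaluation import analyze_agent_log

# Illustrative path; point this at a real agent log from an experiment run.
outcome = analyze_agent_log("experiments/my_experiment/cooking_01/agent_0.json")
print(outcome.raw_score, outcome.completion_status, outcome.timed_out)
```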

extract_task_outcome(folder_path: str, task_definition: Dict[str, Any]) -> TaskRunOutcome

  • Description: Orchestrates the analysis of a single task run folder. It finds all agent logs, calls analyze_agent_log for each, and aggregates the results (see the example after this block).
  • Arguments:
    • folder_path (str): The path to the folder containing the agent logs for a single task run.
    • task_definition (dict): The definition of the task, used to enrich the results with metadata.
  • Returns: A TaskRunOutcome data class containing the aggregated results for the task run.
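A minimal call, with an inline task definition for illustration:

```python
from tasks.evaluation import extract_task_outcome

# Inline definition for illustration only; real definitions are loaded
# from the task definition files and may carry more metadata.
task_def = {"task_id": "cooking_01", "type": "cooking"}
result = extract_task_outcome("experiments/my_experiment/cooking_01", task_def)
print(result.overall_completion_status, result.overall_raw_score)
```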

aggregate_results_to_dataframe(task_outcomes: List[TaskRunOutcome]) -> pd.DataFrame

  • Description: Converts a list of TaskRunOutcome objects into a Pandas DataFrame, which is used for all further analysis and reporting (see the example after this block).
  • Arguments:
    • task_outcomes (list): A list of TaskRunOutcome objects.
  • Returns: A pd.DataFrame with the flattened and aggregated results.
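Continuing from a list of outcomes such as the one built in the data-flow sketch above:

```python
from tasks.evaluation import aggregate_results_to_dataframe

# outcomes is a List[TaskRunOutcome], e.g. from the data-flow sketch above.
df = aggregate_results_to_dataframe(outcomes)

# Column names here assume the dataclass fields are flattened one-to-one;
# check df.columns in your environment.
print(df[["task_id", "task_type", "overall_raw_score"]].head())
```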

Data Structure Specifications

The evaluation system uses two primary data classes to structure the results:

AgentOutcome

Defined in tasks/evaluation.py, this data class holds the results for a single agent's participation in a task. A code sketch follows the field table.

| Field | Type | Description |
| --- | --- | --- |
| raw_score | float | The numerical score achieved by the agent. |
| completion_status | CompletionStatus | The granular status of the agent's task attempt. |
| final_system_message | str | The final system message from the log. |
| agent_log_processed | bool | Whether the agent's log was successfully processed. |
| parsing_errors | List[str] | A list of any errors encountered during parsing. |
| timed_out | bool | True if the agent timed out. |
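Reconstructed from the table, the declaration is approximately the following; the defaults and field order are assumptions, so treat tasks/evaluation.py as the source of truth.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentOutcome:
    raw_score: float
    completion_status: "CompletionStatus"
    final_system_message: str
    agent_log_processed: bool
    parsing_errors: List[str] = field(default_factory=list)  # assumed default
    timed_out: bool = False  # assumed default
```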

TaskRunOutcome

Defined in tasks/evaluation.py, this data class aggregates the outcomes from all agents involved in a single task run. A code sketch follows the field table.

| Field | Type | Description |
| --- | --- | --- |
| task_id | str | The unique identifier for the task. |
| model_name | str | The name of the model used. |
| agent_count | int | The number of agents that participated in the task. |
| task_type | str | The type of the task (e.g., cooking, crafting). |
| overall_raw_score | float | The highest score achieved among all agents. |
| overall_is_successful | bool | True if the task was successfully completed by any agent. |
| overall_completion_status | CompletionStatus | The aggregated completion status for the entire task. |
| total_agent_logs_found | int | The number of agent log files found and processed. |
| agent_outcomes | List[AgentOutcome] | A list of AgentOutcome objects for each agent. |
| task_definition_metrics | Dict[str, Any] | A dictionary of metrics from the task definition file. |
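As with AgentOutcome, this is a reconstruction from the table; defaults and field order are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class TaskRunOutcome:
    task_id: str
    model_name: str
    agent_count: int
    task_type: str
    overall_raw_score: float
    overall_is_successful: bool
    overall_completion_status: "CompletionStatus"
    total_agent_logs_found: int
    agent_outcomes: List["AgentOutcome"] = field(default_factory=list)  # assumed default
    task_definition_metrics: Dict[str, Any] = field(default_factory=dict)  # assumed default
```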

CompletionStatus

This Enum, defined in tasks/evaluation.py, provides a standardized set of outcomes for a task; a sketch follows the list of members.

  • SUCCESS
  • FAILED_SCORE_ZERO
  • FAILED_PARTIAL_SCORE
  • TIMED_OUT
  • NO_SCORE_LOGGED
  • LOG_FILE_ERROR
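As a sketch, the Enum looks like the following; only the member names are documented, and the string values are assumed.

```python
from enum import Enum

class CompletionStatus(Enum):
    # Member names match the list above; the string values are assumptions.
    SUCCESS = "SUCCESS"
    FAILED_SCORE_ZERO = "FAILED_SCORE_ZERO"
    FAILED_PARTIAL_SCORE = "FAILED_PARTIAL_SCORE"
    TIMED_OUT = "TIMED_OUT"
    NO_SCORE_LOGGED = "NO_SCORE_LOGGED"
    LOG_FILE_ERROR = "LOG_FILE_ERROR"
```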

Extension Points for Custom Analysis

The new system is designed to be easily extended. The primary extension point is the final DataFrame generated by aggregate_results_to_dataframe().

Since all the detailed results are available in a structured DataFrame, you can perform custom analysis with the full power of the Pandas library. You can write your own scripts to (see the example after this list):

  • Load the detailed_results.csv file.
  • Perform custom aggregations, filtering, and statistical analysis.
  • Generate new plots and visualizations.
  • Correlate evaluation results with other data sources.
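For instance, a custom script might compute per-task-type success rates from the exported CSV. The file name comes from the list above; the column names assume the flattened TaskRunOutcome fields.

```python
import pandas as pd

# Load the per-task results exported by the evaluation tooling.
df = pd.read_csv("detailed_results.csv")

# Success rate by task type. Column names assume the flattened
# TaskRunOutcome fields; check df.columns if your export differs.
success_by_type = (
    df.groupby("task_type")["overall_is_successful"]
    .mean()
    .sort_values(ascending=False)
)
print(success_by_type)
```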