
# Mindcraft Evaluation System - Developer Guide
This guide provides technical documentation for developers working with the Mindcraft evaluation system.
## Architecture Overview
The new evaluation module is designed to be modular and extensible. The core components are:

- `evaluation_script.py`: The main entry point for running experiments. It handles setting up the environment, launching servers and agents, and collecting results.
- `evaluation.py`: Contains the core logic for analyzing and evaluating task outcomes. It defines the data structures for representing results and provides functions for extracting and aggregating them.
- `analyse_results.py`: A script for post-experiment analysis. It can download results from S3, process them using the `evaluation.py` module, and generate detailed reports.
The data flow is as follows (an end-to-end sketch follows the list):

1. `evaluation_script.py` runs the experiments and generates raw JSON log files for each agent in an experiment folder.
2. During or after the experiment, `evaluation_script.py` or `analyse_results.py` is used to process these logs.
3. For each task folder, `extract_task_outcome()` is called.
4. `extract_task_outcome()` calls `analyze_agent_log()` for each agent's log file to get an `AgentOutcome`.
5. The individual `AgentOutcome` objects are aggregated into a single `TaskRunOutcome`.
6. Finally, all `TaskRunOutcome` objects are converted into a Pandas DataFrame by `aggregate_results_to_dataframe()` for easy analysis and reporting.
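
The following sketch drives this flow manually. The experiment folder layout, the task-definition file, and the keying of folders into definitions are illustrative assumptions, not the shipped script:

```python
# Sketch of the documented data flow; paths and the task-definition
# lookup are assumptions for illustration only.
import json
from pathlib import Path

from tasks.evaluation import aggregate_results_to_dataframe, extract_task_outcome

experiment_dir = Path("experiments/my_experiment")  # hypothetical path
task_definitions = json.loads(Path("tasks/example_tasks.json").read_text())  # hypothetical file

outcomes = []
for task_folder in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
    # Assumes the task folder name keys into the task-definition file.
    task_def = task_definitions.get(task_folder.name, {})
    outcomes.append(extract_task_outcome(str(task_folder), task_def))

df = aggregate_results_to_dataframe(outcomes)
df.to_csv(experiment_dir / "detailed_results.csv", index=False)
```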
## API Documentation for `tasks/evaluation.py`
The `tasks/evaluation.py` module provides the core functions for evaluating task results.
### `analyze_agent_log(file_path: str) -> AgentOutcome`

- Description: Analyzes a single agent's JSON log file. It extracts the score, timeout status, and final system message.
- Arguments:
  - `file_path` (str): The path to the agent's log file.
- Returns: An `AgentOutcome` data class containing the results for a single agent.
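
A minimal usage sketch (the log path below is hypothetical):

```python
from tasks.evaluation import analyze_agent_log

# Hypothetical log path; the system writes one JSON log per agent.
outcome = analyze_agent_log("experiments/my_experiment/task_1/agent_0.json")
print(outcome.raw_score, outcome.completion_status, outcome.timed_out)
```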
### `extract_task_outcome(folder_path: str, task_definition: Dict[str, Any]) -> TaskRunOutcome`

- Description: Orchestrates the analysis of a single task-run folder. It finds all agent logs, calls `analyze_agent_log` for each, and aggregates the results.
- Arguments:
  - `folder_path` (str): The path to the folder containing the agent logs for a single task run.
  - `task_definition` (dict): The definition of the task, used to enrich the results with metadata.
- Returns: A `TaskRunOutcome` data class containing the aggregated results for the task run.
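
A usage sketch; the folder path and the task-definition keys are illustrative assumptions (real definitions carry more metadata):

```python
from tasks.evaluation import extract_task_outcome

# Hypothetical folder and a deliberately minimal task definition;
# the key names here are assumptions, not the documented schema.
task_def = {"task_id": "cooking_task_1", "task_type": "cooking"}
run = extract_task_outcome("experiments/my_experiment/cooking_task_1", task_def)
print(run.overall_raw_score, run.overall_is_successful, run.total_agent_logs_found)
```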
### `aggregate_results_to_dataframe(task_outcomes: List[TaskRunOutcome]) -> pd.DataFrame`

- Description: Converts a list of `TaskRunOutcome` objects into a Pandas DataFrame, which is used for all further analysis and reporting.
- Arguments:
  - `task_outcomes` (list): A list of `TaskRunOutcome` objects.
- Returns: A `pd.DataFrame` with the flattened and aggregated results.
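
A sketch of flattening a few runs (the folder paths are hypothetical; the selected column names mirror the `TaskRunOutcome` fields documented below):

```python
from tasks.evaluation import aggregate_results_to_dataframe, extract_task_outcome

# Build outcomes for two hypothetical task-run folders, then flatten.
folders = ["experiments/demo/task_1", "experiments/demo/task_2"]
outcomes = [extract_task_outcome(f, {"task_id": f.rsplit("/", 1)[-1]}) for f in folders]

df = aggregate_results_to_dataframe(outcomes)
print(df[["task_id", "overall_raw_score", "overall_is_successful"]].head())
```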
## Data Structure Specifications
The evaluation system uses two primary data classes to structure the results:
### `AgentOutcome`

Defined in `tasks/evaluation.py`, this data class holds the results for a single agent's participation in a task. A structural sketch follows the table.
| Field | Type | Description |
|---|---|---|
| `raw_score` | `float` | The numerical score achieved by the agent. |
| `completion_status` | `CompletionStatus` | The granular status of the agent's task attempt. |
| `final_system_message` | `str` | The final system message from the log. |
| `agent_log_processed` | `bool` | Whether the agent's log was successfully processed. |
| `parsing_errors` | `List[str]` | A list of any errors encountered during parsing. |
| `timed_out` | `bool` | `True` if the agent timed out. |
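
The table implies a shape along these lines. This is a sketch only; the dataclass form is an assumption, and the authoritative definition lives in `tasks/evaluation.py`:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AgentOutcome:
    raw_score: float
    completion_status: "CompletionStatus"  # Enum, documented below
    final_system_message: str
    agent_log_processed: bool
    parsing_errors: List[str]
    timed_out: bool
```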
### `TaskRunOutcome`

Defined in `tasks/evaluation.py`, this data class aggregates the outcomes from all agents involved in a single task run. A structural sketch follows the table.
| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | The unique identifier for the task. |
| `model_name` | `str` | The name of the model used. |
| `agent_count` | `int` | The number of agents that participated in the task. |
| `task_type` | `str` | The type of the task (e.g., `cooking`, `crafting`). |
| `overall_raw_score` | `float` | The highest score achieved among all agents. |
| `overall_is_successful` | `bool` | `True` if the task was successfully completed by any agent. |
| `overall_completion_status` | `CompletionStatus` | The aggregated completion status for the entire task. |
| `total_agent_logs_found` | `int` | The number of agent log files found and processed. |
| `agent_outcomes` | `List[AgentOutcome]` | A list of `AgentOutcome` objects, one per agent. |
| `task_definition_metrics` | `Dict[str, Any]` | A dictionary of metrics from the task definition file. |
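
Mirrored as a sketch (again, the dataclass form is an assumption; see `tasks/evaluation.py` for the real definition):

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class TaskRunOutcome:
    task_id: str
    model_name: str
    agent_count: int
    task_type: str
    overall_raw_score: float
    overall_is_successful: bool
    overall_completion_status: "CompletionStatus"
    total_agent_logs_found: int
    agent_outcomes: List["AgentOutcome"]
    task_definition_metrics: Dict[str, Any]
```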
### `CompletionStatus`

This `Enum`, defined in `tasks/evaluation.py`, provides a standardized set of outcomes for a task (sketched after the list):

- `SUCCESS`
- `FAILED_SCORE_ZERO`
- `FAILED_PARTIAL_SCORE`
- `TIMED_OUT`
- `NO_SCORE_LOGGED`
- `LOG_FILE_ERROR`
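
As a sketch, with the member names taken from the list above (the string values are assumptions; only the names are documented):

```python
from enum import Enum

class CompletionStatus(Enum):
    # Member names come from the docs; the values are assumptions.
    SUCCESS = "SUCCESS"
    FAILED_SCORE_ZERO = "FAILED_SCORE_ZERO"
    FAILED_PARTIAL_SCORE = "FAILED_PARTIAL_SCORE"
    TIMED_OUT = "TIMED_OUT"
    NO_SCORE_LOGGED = "NO_SCORE_LOGGED"
    LOG_FILE_ERROR = "LOG_FILE_ERROR"
```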
## Extension Points for Custom Analysis
The new system is designed to be easily extended. The primary extension point is the final DataFrame generated by `aggregate_results_to_dataframe()`.
Since all the detailed results are available in a structured DataFrame, you can easily perform custom analysis using the full power of the Pandas library. You can write your own scripts to:
- Load the `detailed_results.csv` file.
- Perform custom aggregations, filtering, and statistical analysis.
- Generate new plots and visualizations.
- Correlate evaluation results with other data sources.
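
A sketch covering the first two points; the CSV path is hypothetical, and the column names follow the `TaskRunOutcome` fields documented above:

```python
import pandas as pd

# Load the flattened results; adjust the path to your experiment layout.
df = pd.read_csv("experiments/my_experiment/detailed_results.csv")

# Success rate, mean score, and run count per task type.
summary = df.groupby("task_type").agg(
    success_rate=("overall_is_successful", "mean"),
    mean_score=("overall_raw_score", "mean"),
    runs=("task_id", "count"),
)
print(summary.sort_values("success_rate", ascending=False))
```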