
# Mindcraft Evaluation System - Developer Guide
This guide provides technical documentation for developers working with the Mindcraft evaluation system.
## Architecture Overview
The new evaluation module is designed to be modular and extensible. The core components are:

- `evaluation_script.py`: The main entry point for running experiments. It handles setting up the environment, launching servers and agents, and collecting results.
- `evaluation.py`: Contains the core logic for analyzing and evaluating task outcomes. It defines the data structures for representing results and provides functions for extracting and aggregating them.
- `analyse_results.py`: A script for post-experiment analysis. It can download results from S3, process them using the `evaluation.py` module, and generate detailed reports.
The data flow is as follows (an end-to-end sketch follows the list):

1. `evaluation_script.py` runs the experiments and generates raw JSON log files for each agent in an experiment folder.
2. During or after the experiment, `evaluation_script.py` or `analyse_results.py` is used to process these logs.
3. For each task folder, `extract_task_outcome()` is called.
4. `extract_task_outcome()` calls `analyze_agent_log()` for each agent's log file to get an `AgentOutcome`.
5. The individual `AgentOutcome` objects are aggregated into a single `TaskRunOutcome`.
6. Finally, all `TaskRunOutcome` objects are converted into a Pandas DataFrame by `aggregate_results_to_dataframe()` for easy analysis and reporting.
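
The following sketch drives this flow manually. The experiment folder layout, the task-definition file, and the keying of folders into definitions are illustrative assumptions, not the shipped script:

```python
# Sketch of the documented data flow; paths and the task-definition
# lookup are assumptions for illustration only.
import json
from pathlib import Path

from tasks.evaluation import aggregate_results_to_dataframe, extract_task_outcome

experiment_dir = Path("experiments/my_experiment")  # hypothetical path
task_definitions = json.loads(Path("tasks/example_tasks.json").read_text())  # hypothetical file

outcomes = []
for task_folder in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
    # Assumes the task folder name keys into the task-definition file.
    task_def = task_definitions.get(task_folder.name, {})
    outcomes.append(extract_task_outcome(str(task_folder), task_def))

df = aggregate_results_to_dataframe(outcomes)
df.to_csv(experiment_dir / "detailed_results.csv", index=False)
```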
## API Documentation for `tasks/evaluation.py`
The `tasks/evaluation.py` module provides the core functions for evaluating task results.
### `analyze_agent_log(file_path: str) -> AgentOutcome`

- Description: Analyzes a single agent's JSON log file. It extracts the score, timeout status, and final system message.
- Arguments:
  - `file_path` (str): The path to the agent's log file.
- Returns: An `AgentOutcome` data class containing the results for a single agent.
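
A minimal usage sketch (the log path below is hypothetical):

```python
from tasks.evaluation import analyze_agent_log

# Hypothetical log path; the system writes one JSON log per agent.
outcome = analyze_agent_log("experiments/my_experiment/task_1/agent_0.json")
print(outcome.raw_score, outcome.completion_status, outcome.timed_out)
```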
### `extract_task_outcome(folder_path: str, task_definition: Dict[str, Any]) -> TaskRunOutcome`

- Description: Orchestrates the analysis of a single task-run folder. It finds all agent logs, calls `analyze_agent_log` for each, and aggregates the results.
- Arguments:
  - `folder_path` (str): The path to the folder containing the agent logs for a single task run.
  - `task_definition` (dict): The definition of the task, used to enrich the results with metadata.
- Returns: A `TaskRunOutcome` data class containing the aggregated results for the task run.
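
A usage sketch; the folder path and the task-definition keys are illustrative assumptions (real definitions carry more metadata):

```python
from tasks.evaluation import extract_task_outcome

# Hypothetical folder and a deliberately minimal task definition;
# the key names here are assumptions, not the documented schema.
task_def = {"task_id": "cooking_task_1", "task_type": "cooking"}
run = extract_task_outcome("experiments/my_experiment/cooking_task_1", task_def)
print(run.overall_raw_score, run.overall_is_successful, run.total_agent_logs_found)
```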
### `aggregate_results_to_dataframe(task_outcomes: List[TaskRunOutcome]) -> pd.DataFrame`

- Description: Converts a list of `TaskRunOutcome` objects into a Pandas DataFrame, which is used for all further analysis and reporting.
- Arguments:
  - `task_outcomes` (list): A list of `TaskRunOutcome` objects.
- Returns: A `pd.DataFrame` with the flattened and aggregated results.
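
A sketch of flattening a few runs (the folder paths are hypothetical; the selected column names mirror the `TaskRunOutcome` fields documented below):

```python
from tasks.evaluation import aggregate_results_to_dataframe, extract_task_outcome

# Build outcomes for two hypothetical task-run folders, then flatten.
folders = ["experiments/demo/task_1", "experiments/demo/task_2"]
outcomes = [extract_task_outcome(f, {"task_id": f.rsplit("/", 1)[-1]}) for f in folders]

df = aggregate_results_to_dataframe(outcomes)
print(df[["task_id", "overall_raw_score", "overall_is_successful"]].head())
```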
## Data Structure Specifications
The evaluation system uses two primary data classes to structure the results:
### `AgentOutcome`

Defined in `tasks/evaluation.py`, this data class holds the results for a single agent's participation in a task. A structural sketch follows the table.
| Field | Type | Description |
|---|---|---|
| `raw_score` | `float` | The numerical score achieved by the agent. |
| `completion_status` | `CompletionStatus` | The granular status of the agent's task attempt. |
| `final_system_message` | `str` | The final system message from the log. |
| `agent_log_processed` | `bool` | Whether the agent's log was successfully processed. |
| `parsing_errors` | `List[str]` | A list of any errors encountered during parsing. |
| `timed_out` | `bool` | `True` if the agent timed out. |
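
The table implies a shape along these lines. This is a sketch only; the dataclass form is an assumption, and the authoritative definition lives in `tasks/evaluation.py`:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AgentOutcome:
    raw_score: float
    completion_status: "CompletionStatus"  # Enum, documented below
    final_system_message: str
    agent_log_processed: bool
    parsing_errors: List[str]
    timed_out: bool
```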
### `TaskRunOutcome`

Defined in `tasks/evaluation.py`, this data class aggregates the outcomes from all agents involved in a single task run. A structural sketch follows the table.
| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | The unique identifier for the task. |
| `model_name` | `str` | The name of the model used. |
| `agent_count` | `int` | The number of agents that participated in the task. |
| `task_type` | `str` | The type of the task (e.g., `cooking`, `crafting`). |
| `overall_raw_score` | `float` | The highest score achieved among all agents. |
| `overall_is_successful` | `bool` | `True` if the task was successfully completed by any agent. |
| `overall_completion_status` | `CompletionStatus` | The aggregated completion status for the entire task. |
| `total_agent_logs_found` | `int` | The number of agent log files found and processed. |
| `agent_outcomes` | `List[AgentOutcome]` | A list of `AgentOutcome` objects, one per agent. |
| `task_definition_metrics` | `Dict[str, Any]` | A dictionary of metrics from the task definition file. |
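
Mirrored as a sketch (again, the dataclass form is an assumption; see `tasks/evaluation.py` for the real definition):

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class TaskRunOutcome:
    task_id: str
    model_name: str
    agent_count: int
    task_type: str
    overall_raw_score: float
    overall_is_successful: bool
    overall_completion_status: "CompletionStatus"
    total_agent_logs_found: int
    agent_outcomes: List["AgentOutcome"]
    task_definition_metrics: Dict[str, Any]
```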
### `CompletionStatus`

This `Enum`, defined in `tasks/evaluation.py`, provides a standardized set of outcomes for a task (sketched after the list):

- `SUCCESS`
- `FAILED_SCORE_ZERO`
- `FAILED_PARTIAL_SCORE`
- `TIMED_OUT`
- `NO_SCORE_LOGGED`
- `LOG_FILE_ERROR`
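
As a sketch, with the member names taken from the list above (the string values are assumptions; only the names are documented):

```python
from enum import Enum

class CompletionStatus(Enum):
    # Member names come from the docs; the values are assumptions.
    SUCCESS = "SUCCESS"
    FAILED_SCORE_ZERO = "FAILED_SCORE_ZERO"
    FAILED_PARTIAL_SCORE = "FAILED_PARTIAL_SCORE"
    TIMED_OUT = "TIMED_OUT"
    NO_SCORE_LOGGED = "NO_SCORE_LOGGED"
    LOG_FILE_ERROR = "LOG_FILE_ERROR"
```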
## Extension Points for Custom Analysis
The new system is designed to be easily extended. The primary extension point is the final DataFrame generated by `aggregate_results_to_dataframe()`.
Since all the detailed results are available in a structured DataFrame, you can easily perform custom analysis using the full power of the Pandas library. You can write your own scripts to:
- Load the `detailed_results.csv` file.
- Perform custom aggregations, filtering, and statistical analysis.
- Generate new plots and visualizations.
- Correlate evaluation results with other data sources.
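
A sketch covering the first two points; the CSV path is hypothetical, and the column names follow the `TaskRunOutcome` fields documented above:

```python
import pandas as pd

# Load the flattened results; adjust the path to your experiment layout.
df = pd.read_csv("experiments/my_experiment/detailed_results.csv")

# Success rate, mean score, and run count per task type.
summary = df.groupby("task_type").agg(
    success_rate=("overall_is_successful", "mean"),
    mean_score=("overall_raw_score", "mean"),
    runs=("task_id", "count"),
)
print(summary.sort_values("success_rate", ascending=False))
```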