### **Evaluation System Architecture**

This document outlines the architecture for the refactored Mindcraft task evaluation system.

#### **1. Guiding Principles**

* **Single Responsibility:** Each function and module will have a single, well-defined purpose.
* **Data-Driven:** Logic will be driven by explicit data from task definitions, not inferred from fragile folder names.
* **Decoupling:** Data extraction, aggregation, and reporting will be decoupled.
* **Extensibility:** The system will be easy to extend with new metrics and task types.
* **Backward Compatibility:** The final success rate calculation will remain consistent with the old method, where a score of `1.0` means success.

#### **2. Core Components & Data Flow**

The new system will be centered around a new `evaluation` module, which will house the core logic. Existing scripts will be refactored to use this module.
```mermaid
graph TD
    subgraph "Entrypoints (Existing Scripts)"
        A["evaluation_script.py"]
        B["analyse_results.py"]
        C["analyze_cooking_tasks.py"]
    end

    subgraph "Core Evaluation Module (evaluation.py)"
        D["analyze_agent_log(file_path)"]
        E["extract_task_outcome(folder_path, task_definition)"]
        F["aggregate_results_to_dataframe(task_outcomes)"]
    end

    subgraph "Data Sources"
        G["Agent Log Files (*.json)"]
        H["Task Definition File (e.g., multiagent_crafting_tasks.json)"]
    end

    subgraph "Output"
        I["Pandas DataFrame (Rich Data)"]
        J["Aggregated Reports (e.g., CSV, JSON)"]
    end

    A -- "Calls" --> E
    B -- "Calls" --> F
    C -- "Calls" --> E

    E -- "Iterates over agent logs, calls" --> D
    D -- "Reads" --> G
    E -- "Uses" --> H

    E -- "Returns list of" --> F
    F -- "Generates" --> I
    I -- "Used to create" --> J
```
#### **3. Data Structures**

The new system introduces two primary data structures to provide rich, detailed outcome reporting.

**3.1. Agent Outcome Dictionary**

Returned by `analyze_agent_log()`. Captures the result from a single agent's log file.
```json
{
  "raw_score": 1.0,
  "completion_status": "SUCCESS",
  "final_system_message": "Task ended with score : 1",
  "agent_log_processed": true,
  "parsing_errors": [],
  "timed_out": false
}
```
* **`completion_status` (Enum):**
    * `SUCCESS`: `raw_score` is 1.0.
    * `FAILED_SCORE_ZERO`: `raw_score` is 0.0.
    * `FAILED_PARTIAL_SCORE`: `raw_score` is > 0 and < 1 (for construction tasks).
    * `TIMED_OUT`: a "Task timeout reached" message is present.
    * `NO_SCORE_LOGGED`: no score message was found.
    * `LOG_FILE_ERROR`: the log file could not be read or parsed.

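
This status logic can be sketched directly from the rules above (a minimal illustration; the helper name and its inputs are assumptions, not the final implementation):

```python
from typing import Optional

def determine_completion_status(raw_score: Optional[float],
                                timed_out: bool,
                                parse_error: bool) -> str:
    """Map facts extracted from one agent log onto the completion_status enum.

    Sketch only: the rules listed above, checked in priority order.
    """
    if parse_error:
        return "LOG_FILE_ERROR"    # file could not be read or parsed
    if timed_out:
        return "TIMED_OUT"         # "Task timeout reached" seen in the log
    if raw_score is None:
        return "NO_SCORE_LOGGED"   # no score message found
    if raw_score == 1.0:
        return "SUCCESS"
    if raw_score == 0.0:
        return "FAILED_SCORE_ZERO"
    return "FAILED_PARTIAL_SCORE"  # 0 < raw_score < 1, construction tasks
```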
**3.2. Task Outcome Dictionary**

Returned by `extract_task_outcome()`. Aggregates outcomes from all agents for a single task run. This is the primary unit of data for analysis.
```json
{
  "task_id": "multiagent_cooking_1_cooked_chicken_1_golden_carrot",
  "model_name": "claude-3-5-sonnet-latest",
  "agent_count": 2,
  "task_type": "cooking",
  "overall_raw_score": 1.0,
  "overall_is_successful": true,
  "overall_completion_status": "SUCCESS",
  "total_agent_logs_found": 2,
  "agent_outcomes": [
    { "... Agent 0 Outcome Dictionary ..." },
    { "... Agent 1 Outcome Dictionary ..." }
  ],
  "task_definition_metrics": {
    "total_recipe_steps": 4,
    "unique_target_items": 2
  }
}
```
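
The document leaves the reduction from per-agent outcomes to the `overall_*` fields to `extract_task_outcome()`. One plausible reduction, shown purely as an assumption for illustration (the task score is shared across agents, so the best-scoring log is authoritative), is:

```python
from typing import Any, Dict, List

def combine_agent_outcomes(agent_outcomes: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Assumption: all agents report the same shared task score, so the maximum
    # raw_score across logs is representative of the run as a whole.
    scores = [a["raw_score"] for a in agent_outcomes if a["raw_score"] is not None]
    overall = max(scores, default=0.0)
    return {
        "overall_raw_score": overall,
        "overall_is_successful": overall == 1.0,
    }
```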
#### **4. Function Signatures and Responsibilities**

A new file, `tasks/evaluation.py`, will be created to house the core logic.

**File: `tasks/evaluation.py`**
```python
import pandas as pd
from typing import List, Dict, Any


def analyze_agent_log(file_path: str) -> Dict[str, Any]:
    """
    Analyzes a single agent's JSON log file.

    - Extracts raw_score, final_system_message, and timeout status.
    - Determines a detailed `completion_status`.
    - Handles file I/O and JSON parsing errors gracefully.
    - Returns an Agent Outcome Dictionary.
    """
    # Implementation as described in todo.md
    pass


def extract_task_outcome(folder_path: str, task_definition: Dict[str, Any]) -> Dict[str, Any]:
    """
    Orchestrates the analysis of a single task run folder.

    - Finds all agent logs (*.json) in the folder.
    - Calls analyze_agent_log() for each log.
    - Aggregates agent outcomes to determine overall_raw_score,
      overall_is_successful, and overall_completion_status.
    - Populates task metadata from the task_definition.
    - Returns a Task Outcome Dictionary.
    """
    # Implementation as described in todo.md
    pass


def aggregate_results_to_dataframe(task_outcomes: List[Dict[str, Any]]) -> pd.DataFrame:
    """
    Converts a list of Task Outcome Dictionaries into a Pandas DataFrame.

    - Flattens nested structures for easy analysis.
    - This DataFrame becomes the foundation for all subsequent reporting
      and analysis.
    """
    # Implementation as described in todo.md
    pass
```
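
For a sense of how small the third function can be, the flattening could lean on `pandas.json_normalize` (a minimal sketch, assuming the dictionaries follow the shapes in Section 3):

```python
import pandas as pd
from typing import Any, Dict, List

def aggregate_results_to_dataframe(task_outcomes: List[Dict[str, Any]]) -> pd.DataFrame:
    # json_normalize flattens nested dicts such as task_definition_metrics into
    # dotted columns (e.g. "task_definition_metrics.total_recipe_steps");
    # list-valued fields like agent_outcomes stay as single object columns.
    return pd.json_normalize(task_outcomes)
```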
#### **5. Integration and Refactoring Plan**
1. **Create `tasks/evaluation.py`:** Implement the three functions defined above.

2. **Refactor `tasks/evaluation_script.py`:**
    * The `aggregate_results` function will be replaced by a loop over experiment folders that loads the corresponding `task_definition`, calls `evaluation.extract_task_outcome()`, and collects the results.
    * After the loop, it will call `evaluation.aggregate_results_to_dataframe()` to get the final DataFrame.
    * All analysis (e.g., calculating the overall success rate) will be done on the resulting DataFrame.

3. **Refactor `tasks/analyse_results.py`:**
    * It will import `aggregate_results` from `evaluation.py`. That name is kept as a backward-compatible re-export of the enhanced aggregation function, which additionally extracts the model name.
    * The complex, name-based categorization (`is_base`, `base_without_plan`) will be replaced entirely by simple Pandas `groupby()` operations on the DataFrame's columns (e.g., `df.groupby('task_type')['overall_is_successful'].mean()`), as shown in the sketch after this list.

4. **Refactor `tasks/analyze_cooking_tasks.py`:**
    * This script will also be refactored to use the new `evaluation` module.
    * Analysis of blocked agents or specific items will be done by filtering the master DataFrame, not with custom parsing logic.

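
A sketch of how steps 2 and 3 fit together (the folder layout and the folder-to-definition mapping are illustrative assumptions):

```python
import os
import pandas as pd
from tasks import evaluation  # the new module from Section 4

def collect_outcomes(experiment_root: str,
                     task_definitions: dict) -> pd.DataFrame:
    """Step 2: walk experiment folders and build the master DataFrame."""
    outcomes = []
    for folder in sorted(os.listdir(experiment_root)):
        task_def = task_definitions.get(folder)  # keyed by task_id, by assumption
        if task_def is None:
            continue  # missing definitions are the caller's responsibility (Section 6)
        outcomes.append(evaluation.extract_task_outcome(
            os.path.join(experiment_root, folder), task_def))
    return evaluation.aggregate_results_to_dataframe(outcomes)

def success_rate_by_task_type(df: pd.DataFrame) -> pd.Series:
    """Step 3: name-based categorization becomes a plain groupby.
    The mean of the boolean success column is exactly the success rate."""
    return df.groupby("task_type")["overall_is_successful"].mean()
```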
#### **6. Error Handling**

* **File/JSON Errors:** `analyze_agent_log` will catch `FileNotFoundError` and `json.JSONDecodeError`, returning a `LOG_FILE_ERROR` status so the task run is not silently ignored (see the sketch after this list).
* **Missing Task Definitions:** The calling script will be responsible for handling cases where a task definition for a given folder cannot be found.
* **No Logs Found:** `extract_task_outcome` will handle cases where a folder contains no `.json` files, reporting a count of 0 and an appropriate status.

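
A minimal sketch of that error path (return values chosen to match the Agent Outcome Dictionary in Section 3.1; the exact fallback values are assumptions):

```python
import json
from typing import Any, Dict

def analyze_agent_log(file_path: str) -> Dict[str, Any]:
    """Error path only: return a LOG_FILE_ERROR outcome instead of raising,
    so one unreadable file never silently drops the whole task run."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            log = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError) as e:
        return {
            "raw_score": 0.0,
            "completion_status": "LOG_FILE_ERROR",
            "final_system_message": "",
            "agent_log_processed": False,
            "parsing_errors": [str(e)],
            "timed_out": False,
        }
    ...  # normal scoring path over `log`, as described in Section 4
```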
This architecture directly addresses the requirements in `todo.md`, creating a centralized, robust, and extensible system for evaluating agent performance.