### **Evaluation System Architecture**

This document outlines the architecture for the refactored Mindcraft task evaluation system.

#### **1. Guiding Principles**

* **Single Responsibility:** Each function and module will have a single, well-defined purpose.
* **Data-Driven:** Logic will be driven by explicit data from task definitions, not inferred from fragile folder names.
* **Decoupling:** Data extraction, aggregation, and reporting will be decoupled.
* **Extensibility:** The system will be easy to extend with new metrics and task types.
* **Backward Compatibility:** The final success rate calculation will remain consistent with the old method, where a score of `1.0` means success.

#### **2. Core Components & Data Flow**

The new system will be centered around a new `evaluation` module, which will house the core logic. Existing scripts will be refactored to use this module.
```mermaid
graph TD
    subgraph "Entrypoints (Existing Scripts)"
        A["evaluation_script.py"]
        B["analyse_results.py"]
        C["analyze_cooking_tasks.py"]
    end

    subgraph "Core Evaluation Module (evaluation.py)"
        D["analyze_agent_log(file_path)"]
        E["extract_task_outcome(folder_path, task_definition)"]
        F["aggregate_results_to_dataframe(task_outcomes)"]
    end

    subgraph "Data Sources"
        G["Agent Log Files (*.json)"]
        H["Task Definition File (e.g., multiagent_crafting_tasks.json)"]
    end

    subgraph "Output"
        I["Pandas DataFrame (Rich Data)"]
        J["Aggregated Reports (e.g., CSV, JSON)"]
    end

    A -- "Calls" --> E
    B -- "Calls" --> F
    C -- "Calls" --> E

    E -- "Iterates over agent logs, calls" --> D
    D -- "Reads" --> G
    E -- "Uses" --> H

    E -- "Returns list of" --> F
    F -- "Generates" --> I
    I -- "Used to create" --> J
```
#### **3. Data Structures**

The new system introduces two primary data structures to provide rich, detailed outcome reporting.

**3.1. Agent Outcome Dictionary**

Returned by `analyze_agent_log()`. Captures the result from a single agent's log file.
```json
{
  "raw_score": 1.0,
  "completion_status": "SUCCESS",
  "final_system_message": "Task ended with score : 1",
  "agent_log_processed": true,
  "parsing_errors": [],
  "timed_out": false
}
```
* **`completion_status` (Enum):**
    * `SUCCESS`: `raw_score` is 1.0.
    * `FAILED_SCORE_ZERO`: `raw_score` is 0.0.
    * `FAILED_PARTIAL_SCORE`: `raw_score` is > 0 and < 1 (for construction tasks).
    * `TIMED_OUT`: a "Task timeout reached" message is present.
    * `NO_SCORE_LOGGED`: no score message was found.
    * `LOG_FILE_ERROR`: the log file could not be read or parsed.

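
This status logic can be sketched directly from the rules above (a minimal illustration; the helper name and its inputs are assumptions, not the final implementation):

```python
from typing import Optional

def determine_completion_status(raw_score: Optional[float],
                                timed_out: bool,
                                parse_error: bool) -> str:
    """Map facts extracted from one agent log onto the completion_status enum.

    Sketch only: the rules listed above, checked in priority order.
    """
    if parse_error:
        return "LOG_FILE_ERROR"    # file could not be read or parsed
    if timed_out:
        return "TIMED_OUT"         # "Task timeout reached" seen in the log
    if raw_score is None:
        return "NO_SCORE_LOGGED"   # no score message found
    if raw_score == 1.0:
        return "SUCCESS"
    if raw_score == 0.0:
        return "FAILED_SCORE_ZERO"
    return "FAILED_PARTIAL_SCORE"  # 0 < raw_score < 1, construction tasks
```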
**3.2. Task Outcome Dictionary**

Returned by `extract_task_outcome()`. Aggregates outcomes from all agents for a single task run. This is the primary unit of data for analysis.
```json
{
  "task_id": "multiagent_cooking_1_cooked_chicken_1_golden_carrot",
  "model_name": "claude-3-5-sonnet-latest",
  "agent_count": 2,
  "task_type": "cooking",
  "overall_raw_score": 1.0,
  "overall_is_successful": true,
  "overall_completion_status": "SUCCESS",
  "total_agent_logs_found": 2,
  "agent_outcomes": [
    { "... Agent 0 Outcome Dictionary ..." },
    { "... Agent 1 Outcome Dictionary ..." }
  ],
  "task_definition_metrics": {
    "total_recipe_steps": 4,
    "unique_target_items": 2
  }
}
```
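
The document leaves the reduction from per-agent outcomes to the `overall_*` fields to `extract_task_outcome()`. One plausible reduction, shown purely as an assumption for illustration (the task score is shared across agents, so the best-scoring log is authoritative), is:

```python
from typing import Any, Dict, List

def combine_agent_outcomes(agent_outcomes: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Assumption: all agents report the same shared task score, so the maximum
    # raw_score across logs is representative of the run as a whole.
    scores = [a["raw_score"] for a in agent_outcomes if a["raw_score"] is not None]
    overall = max(scores, default=0.0)
    return {
        "overall_raw_score": overall,
        "overall_is_successful": overall == 1.0,
    }
```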
#### **4. Function Signatures and Responsibilities**

A new file, `tasks/evaluation.py`, will be created to house the core logic.

**File: `tasks/evaluation.py`**
```python
import pandas as pd
from typing import List, Dict, Any


def analyze_agent_log(file_path: str) -> Dict[str, Any]:
    """
    Analyzes a single agent's JSON log file.

    - Extracts raw_score, final_system_message, and timeout status.
    - Determines a detailed `completion_status`.
    - Handles file I/O and JSON parsing errors gracefully.
    - Returns an Agent Outcome Dictionary.
    """
    # Implementation as described in todo.md
    pass


def extract_task_outcome(folder_path: str, task_definition: Dict[str, Any]) -> Dict[str, Any]:
    """
    Orchestrates the analysis of a single task run folder.

    - Finds all agent logs (*.json) in the folder.
    - Calls analyze_agent_log() for each log.
    - Aggregates agent outcomes to determine overall_raw_score,
      overall_is_successful, and overall_completion_status.
    - Populates task metadata from the task_definition.
    - Returns a Task Outcome Dictionary.
    """
    # Implementation as described in todo.md
    pass


def aggregate_results_to_dataframe(task_outcomes: List[Dict[str, Any]]) -> pd.DataFrame:
    """
    Converts a list of Task Outcome Dictionaries into a Pandas DataFrame.

    - Flattens nested structures for easy analysis.
    - This DataFrame becomes the foundation for all subsequent reporting
      and analysis.
    """
    # Implementation as described in todo.md
    pass
```
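
For a sense of how small the third function can be, the flattening could lean on `pandas.json_normalize` (a minimal sketch, assuming the dictionaries follow the shapes in Section 3):

```python
import pandas as pd
from typing import Any, Dict, List

def aggregate_results_to_dataframe(task_outcomes: List[Dict[str, Any]]) -> pd.DataFrame:
    # json_normalize flattens nested dicts such as task_definition_metrics into
    # dotted columns (e.g. "task_definition_metrics.total_recipe_steps");
    # list-valued fields like agent_outcomes stay as single object columns.
    return pd.json_normalize(task_outcomes)
```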
#### **5. Integration and Refactoring Plan**
1. **Create `tasks/evaluation.py`:** Implement the three functions defined above.

2. **Refactor `tasks/evaluation_script.py`:**
    * The `aggregate_results` function will be replaced by a loop over experiment folders that loads the corresponding `task_definition`, calls `evaluation.extract_task_outcome()`, and collects the results.
    * After the loop, it will call `evaluation.aggregate_results_to_dataframe()` to get the final DataFrame.
    * All analysis (e.g., calculating the overall success rate) will be done on the resulting DataFrame.

3. **Refactor `tasks/analyse_results.py`:**
    * It will import `aggregate_results` from `evaluation.py`. That name is kept as a backward-compatible re-export of the enhanced aggregation function, which additionally extracts the model name.
    * The complex, name-based categorization (`is_base`, `base_without_plan`) will be replaced entirely by simple Pandas `groupby()` operations on the DataFrame's columns (e.g., `df.groupby('task_type')['overall_is_successful'].mean()`), as shown in the sketch after this list.

4. **Refactor `tasks/analyze_cooking_tasks.py`:**
    * This script will also be refactored to use the new `evaluation` module.
    * Analysis of blocked agents or specific items will be done by filtering the master DataFrame, not with custom parsing logic.

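
A sketch of how steps 2 and 3 fit together (the folder layout and the folder-to-definition mapping are illustrative assumptions):

```python
import os
import pandas as pd
from tasks import evaluation  # the new module from Section 4

def collect_outcomes(experiment_root: str,
                     task_definitions: dict) -> pd.DataFrame:
    """Step 2: walk experiment folders and build the master DataFrame."""
    outcomes = []
    for folder in sorted(os.listdir(experiment_root)):
        task_def = task_definitions.get(folder)  # keyed by task_id, by assumption
        if task_def is None:
            continue  # missing definitions are the caller's responsibility (Section 6)
        outcomes.append(evaluation.extract_task_outcome(
            os.path.join(experiment_root, folder), task_def))
    return evaluation.aggregate_results_to_dataframe(outcomes)

def success_rate_by_task_type(df: pd.DataFrame) -> pd.Series:
    """Step 3: name-based categorization becomes a plain groupby.
    The mean of the boolean success column is exactly the success rate."""
    return df.groupby("task_type")["overall_is_successful"].mean()
```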
#### **6. Error Handling**

* **File/JSON Errors:** `analyze_agent_log` will catch `FileNotFoundError` and `json.JSONDecodeError`, returning a `LOG_FILE_ERROR` status so the task run is not silently ignored (see the sketch after this list).
* **Missing Task Definitions:** The calling script will be responsible for handling cases where a task definition for a given folder cannot be found.
* **No Logs Found:** `extract_task_outcome` will handle cases where a folder contains no `.json` files, reporting a count of 0 and an appropriate status.

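
A minimal sketch of that error path (return values chosen to match the Agent Outcome Dictionary in Section 3.1; the exact fallback values are assumptions):

```python
import json
from typing import Any, Dict

def analyze_agent_log(file_path: str) -> Dict[str, Any]:
    """Error path only: return a LOG_FILE_ERROR outcome instead of raising,
    so one unreadable file never silently drops the whole task run."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            log = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError) as e:
        return {
            "raw_score": 0.0,
            "completion_status": "LOG_FILE_ERROR",
            "final_system_message": "",
            "agent_log_processed": False,
            "parsing_errors": [str(e)],
            "timed_out": False,
        }
    ...  # normal scoring path over `log`, as described in Section 4
```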
This architecture directly addresses the requirements in `todo.md`, creating a centralized, robust, and extensible system for evaluating agent performance.