Mirror of https://github.com/kolbytn/mindcraft.git, synced 2025-07-25 17:35:25 +02:00
fix: Add missing __init__.py to make tasks directory a Python package
Resolves the ModuleNotFoundError when running evaluation_script.py. Users can now run the script after installing dependencies:

1. python -m venv venv && source venv/bin/activate
2. pip install -r requirements.txt
3. PYTHONPATH=. python tasks/evaluation_script.py [args]
parent 18eca2f5d9
commit 7c5a7f8df8

2 changed files with 7 additions and 95 deletions
tasks/__init__.py (new file, +7)

@@ -0,0 +1,7 @@
"""
|
||||
Mindcraft Task Evaluation Package
|
||||
|
||||
This package provides utilities for running and evaluating Minecraft AI agent tasks.
|
||||
"""
|
||||
|
||||
__version__ = "1.0.0"
|
todo.md (deleted, -95)

@@ -1,95 +0,0 @@
# Mindcraft Analysis Improvement: Granular Task Outcome Reporting

## 🐛 Issue: Inconsistent and Limited Task Evaluation

The current Python analysis scripts (`tasks/evaluation_script.py`, `tasks/analyse_results.py`) suffer from two main limitations:

1. **Hardcoded Agent Count Assumption:** The `extract_result` function explicitly asserts `len(json_files) == 2`, causing failures when evaluating single-agent tasks or tasks with more than two agents (a sketch of this pattern appears just after this list).
2. **Insufficient Outcome Granularity:** The extracted "success" value is often a simple boolean (0 or 1) or a direct score. This fails to capture crucial details like timeouts, partial progress, or specific error states, which are vital for deeper performance analysis and debugging.
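For concreteness, the pattern described in point 1 looks roughly like the following; the variable names and file globbing are assumptions for illustration, not the repository's exact code:

```python
# Hypothetical sketch of the current brittle extraction (names are illustrative).
import glob
import os

def extract_result(folder_path):
    json_files = glob.glob(os.path.join(folder_path, "*.json"))
    assert len(json_files) == 2  # crashes for single-agent runs and for 3+ agents
    ...
```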
## 🛠️ Immediate Fix: Decouple Agent Count from Log Extraction

The first step is to remove the brittle assumption about the number of agent log files.

**Proposed Change:**

* **In `tasks/evaluation_script.py` (and `tasks/analyse_results.py`):**
  * Modify the `extract_result(folder_path)` function:
    * Remove the line `assert len(json_files) == 2`.
    * Change the logic to iterate through *all* `*.json` files found within `folder_path`.
    * For each `json_file`, call `analyze_json_file()` (or its equivalent in `analyse_results.py`).
    * Treat the task as successful if *any* agent log in that folder indicates a successful outcome (`Task ended with score : 1` for binary tasks, `> 0` for construction tasks).
    * This ensures the script runs without crashing for any number of agents; a minimal sketch of the decoupled function follows this list.
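A minimal sketch of the decoupled extraction, assuming `analyze_json_file()` returns the numeric score found in one agent's log (or `None` when no score was logged); the exact signatures in the repository may differ:

```python
# Sketch only: drops the two-agent assumption and aggregates over all agent logs.
import glob
import os

def extract_result(folder_path):
    json_files = glob.glob(os.path.join(folder_path, "*.json"))
    if not json_files:
        return None  # no agent logs found for this task run

    scores = []
    for json_file in json_files:
        score = analyze_json_file(json_file)  # existing helper, assumed to return a score or None
        if score is not None:
            scores.append(score)

    if not scores:
        return None
    # Binary tasks: any "Task ended with score : 1" counts as success.
    # Construction tasks: report the best partial score across agents.
    return max(scores)
```

Returning the maximum score treats the run as successful if any agent logged success, and preserves the best partial score for construction tasks, without assuming an agent count.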
## ✨ Improvement: Comprehensive Task Outcome Data

Beyond the immediate fix, enhance the analysis by generating a rich, standardized outcome dictionary for each task run. This provides nuanced insights into task completion status, even in failure scenarios.

**Core Idea:**

Transform the output of the per-task analysis from a simple boolean/score into a structured dictionary containing all relevant details about the task execution and its outcome.

**Detailed Steps:**

1. **Refine `analyze_json_file(file_path)`:**
    * **Purpose:** This function becomes responsible for extracting the detailed outcome from a *single agent's log file*.
    * **New Output (for a single agent log):**
      ```python
      {
          "raw_score": 1.0,                # Numeric score (1, 0, or 0.XX for construction)
          "completion_status": "SUCCESS",  # Enum: "SUCCESS", "FAILED_SCORE_ZERO", "FAILED_PARTIAL_SCORE", "TIMED_OUT", "NO_SCORE_LOGGED", "LOG_FILE_ERROR"
          "final_system_message": "Task ended with score : 1",  # The exact system message found
          "agent_log_processed": True,     # Whether the file was parsed successfully
          "parsing_errors": [],            # Any specific parsing errors within this log file
          # ... potentially other agent-specific metrics such as message counts, command counts, etc.
      }
      ```
    * **Logic Changes:**
        * Scan system messages for "Task ended with score : X" to get `raw_score`.
        * Check for a "Task timeout reached" message to set `completion_status` to `"TIMED_OUT"`, overriding other statuses if present.
        * Categorize scores (e.g., `score == 0` for `"FAILED_SCORE_ZERO"`, `0 < score < 1` for `"FAILED_PARTIAL_SCORE"`).
        * Handle `FileNotFoundError`, `json.JSONDecodeError`, etc., by setting `agent_log_processed: False` and recording the specific `parsing_errors`. A sketch implementing this logic follows below.
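    A sketch of how the refined helper could implement this logic; the log layout (a `turns` list of `role`/`content` entries) and the exact message strings are assumptions based on the plan above, not a confirmed schema:

    ```python
    import json

    def analyze_json_file(file_path):
        """Sketch: extract a detailed outcome dict from a single agent's log."""
        outcome = {
            "raw_score": None,
            "completion_status": "NO_SCORE_LOGGED",
            "final_system_message": None,
            "agent_log_processed": True,
            "parsing_errors": [],
        }
        try:
            with open(file_path, "r") as f:
                data = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError) as err:
            outcome["agent_log_processed"] = False
            outcome["parsing_errors"].append(str(err))
            outcome["completion_status"] = "LOG_FILE_ERROR"
            return outcome

        timed_out = False
        # Assumed log layout: a "turns" list whose entries carry "role" and "content".
        for turn in data.get("turns", []):
            if turn.get("role") != "system":
                continue
            content = turn.get("content", "")
            if "Task timeout reached" in content:
                timed_out = True
            if "Task ended with score : " in content:
                outcome["final_system_message"] = content
                score_text = content.split("Task ended with score : ")[-1].strip()
                try:
                    outcome["raw_score"] = float(score_text)
                except ValueError:
                    outcome["parsing_errors"].append(f"unparseable score: {score_text!r}")

        score = outcome["raw_score"]
        if score is not None:
            if score >= 1:
                outcome["completion_status"] = "SUCCESS"
            elif score == 0:
                outcome["completion_status"] = "FAILED_SCORE_ZERO"
            else:
                outcome["completion_status"] = "FAILED_PARTIAL_SCORE"
        if timed_out:
            # Timeout overrides any score-derived status, as specified above.
            outcome["completion_status"] = "TIMED_OUT"
        return outcome
    ```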
2. **Overhaul `extract_result(folder_path, task_definition)`:**
    * **Purpose:** This function will collect the individual agent outcomes and combine them into a single, comprehensive outcome dictionary for the *entire task run*.
    * **New Input:** It will now accept `task_definition` (the parsed JSON entry for this specific task from the main task file, containing `agent_count`, `task_type`, `recipes`, `blueprint`, `difficulty_metrics`, etc.). This eliminates fragile inference from folder names.
    * **New Output (for an entire task run):**
      ```python
      {
          "task_id": "multiagent_cooking_1_cooked_chicken_1_golden_carrot",  # From task_definition
          "model_name": "claude-3-5-sonnet-latest",  # (Populated later by `aggregate_results`)
          "agent_count": 2,                     # From task_definition
          "task_type": "cooking",               # From task_definition
          "overall_raw_score": 1.0,             # The highest/combined score from agent logs
          "overall_is_successful": True,        # Boolean: derived from overall_raw_score
          "overall_completion_status": "SUCCESS",  # Combined status for the task run
          "total_agent_logs_found": 2,          # Count of agent log files found
          "agent_outcomes": [                   # One dictionary from `analyze_json_file` per agent
              # { ... outcome for agent 0 ... },
              # { ... outcome for agent 1 ... }
          ],
          "task_definition_metrics": {          # Relevant metrics copied from the task_definition (e.g., difficulty_metrics, total_recipe_steps)
              "total_recipe_steps": 4,
              "unique_target_items": 2,
              "difficulty_category": "medium"
          }
      }
      ```
    * **Logic Changes:**
        * Iterate through all JSON files in `folder_path`, calling `analyze_json_file` for each.
        * Combine the individual `agent_outcomes` to determine `overall_raw_score` and `overall_is_successful`. For cooking/crafting, if any agent's log indicates success, `overall_raw_score` is 1; for construction, it might be the maximum score among agents.
        * Determine `overall_completion_status`: if any agent timed out, the whole task timed out. Prioritize `"TIMED_OUT"` over `"SUCCESS"` if both are indicated (e.g., a task that completes but also times out), and handle the case where every log reports `LOG_FILE_ERROR`. A sketch of this combination step follows below.
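    A sketch of the combination step, assuming the `analyze_json_file` outcome dictionaries from step 1 and the `task_definition` fields named above; anything beyond those fields is illustrative:

    ```python
    import glob
    import os

    def extract_result(folder_path, task_definition):
        """Sketch: fold per-agent outcomes into one task-run outcome dictionary."""
        json_files = glob.glob(os.path.join(folder_path, "*.json"))
        agent_outcomes = [analyze_json_file(path) for path in json_files]

        scores = [o["raw_score"] for o in agent_outcomes if o["raw_score"] is not None]
        overall_raw_score = max(scores) if scores else 0.0

        statuses = [o["completion_status"] for o in agent_outcomes]
        if "TIMED_OUT" in statuses:
            overall_status = "TIMED_OUT"  # a timeout anywhere marks the whole run
        elif statuses and all(s == "LOG_FILE_ERROR" for s in statuses):
            overall_status = "LOG_FILE_ERROR"
        elif overall_raw_score >= 1:
            overall_status = "SUCCESS"
        elif overall_raw_score > 0:
            overall_status = "FAILED_PARTIAL_SCORE"
        else:
            overall_status = "FAILED_SCORE_ZERO"

        return {
            "task_id": task_definition.get("task_id"),
            "agent_count": task_definition.get("agent_count"),
            "task_type": task_definition.get("task_type"),
            "overall_raw_score": overall_raw_score,
            "overall_is_successful": overall_raw_score >= 1,
            "overall_completion_status": overall_status,
            "total_agent_logs_found": len(json_files),
            "agent_outcomes": agent_outcomes,
            "task_definition_metrics": task_definition.get("difficulty_metrics", {}),
        }
    ```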
3. **Refactor `aggregate_results(local_folders)`:**
    * **Purpose:** Simplify and empower the main aggregation function.
    * **Logic Changes:**
        * Iterate through `local_folders`. For each folder, call the new `extract_result` to get the comprehensive `task_run_outcome` dictionary.
        * Collect all `task_run_outcome` dictionaries into a master list.
        * **Leverage Pandas:** Convert this master list of dictionaries into a Pandas DataFrame.
        * All subsequent aggregations (e.g., "by depth", "by plan availability", "overall success rate") can then be performed cleanly and flexibly with Pandas `groupby()` and aggregation methods on this rich DataFrame, as in the sketch below.
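    A sketch of the refactor, assuming the `extract_result` from step 2 and a `task_definitions` mapping keyed by folder name (how task definitions are looked up is not specified in this plan, so that part is an assumption):

    ```python
    import os

    import pandas as pd

    def aggregate_results(local_folders, task_definitions):
        """Sketch: collect task-run outcomes into a DataFrame and aggregate with pandas."""
        rows = []
        for folder in local_folders:
            # Assumption: the run folder is named after the task_id in the task file.
            task_id = os.path.basename(os.path.normpath(folder))
            rows.append(extract_result(folder, task_definitions.get(task_id, {})))

        df = pd.DataFrame(rows)

        # Any column in the outcome dictionaries can now be grouped on.
        overall_success_rate = df["overall_is_successful"].mean()
        success_by_task_type = df.groupby("task_type")["overall_is_successful"].mean()
        status_counts = df["overall_completion_status"].value_counts()
        return df, overall_success_rate, success_by_task_type, status_counts
    ```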
## 📁 Files Affected

* `tasks/evaluation_script.py`
* `tasks/analyse_results.py` (for consistency, as it likely shares similar `extract_result` logic)
* `tasks/analyze_cooking_tasks.py` (similarly)

This plan moves the evaluation system towards a more robust, data-rich, and extensible state, providing a much clearer picture of agent performance.