Mirror of https://github.com/kolbytn/mindcraft.git (synced 2025-07-25 17:35:25 +02:00)

Merge pull request #10 from Sweaterdog/fix-stt-naudiodon-fallback: Merge stt naudiodon fallback
Commit: c024c0dd1b
21 changed files with 2611 additions and 286 deletions

README.md (81 changed lines)
@@ -14,71 +14,6 @@ Do not connect this bot to public servers with coding enabled. This project allo

- [Node.js Installed](https://nodejs.org/) (at least v14)
- One of these: [OpenAI API Key](https://openai.com/blog/openai-api) | [Gemini API Key](https://aistudio.google.com/app/apikey) | [Anthropic API Key](https://docs.anthropic.com/claude/docs/getting-access-to-claude) | [Replicate API Key](https://replicate.com/) | [Hugging Face API Key](https://huggingface.co/) | [Groq API Key](https://console.groq.com/keys) | [Ollama Installed](https://ollama.com/download) | [Mistral API Key](https://docs.mistral.ai/getting-started/models/models_overview/) | [Qwen API Key [Intl.]](https://www.alibabacloud.com/help/en/model-studio/developer-reference/get-api-key)/[[cn]](https://help.aliyun.com/zh/model-studio/getting-started/first-api-call-to-qwen?) | [Novita AI API Key](https://novita.ai/settings?utm_source=github_mindcraft&utm_medium=github_readme&utm_campaign=link#key-management) |
## Installation Prerequisites

### `naudiodon` for Speech-to-Text (STT)

The STT (Speech-to-Text) functionality in Mindcraft uses the `naudiodon` package for audio input. `naudiodon` is a native Node.js addon and might require additional steps to compile correctly during `npm install`.

**`naudiodon` is an optional dependency.** This means:

* If `naudiodon` fails to install or build, the core Mindcraft application will still run.
* However, the Speech-to-Text (STT) feature will be automatically disabled if `naudiodon` is not available. You will see warnings in the console if it fails to load (a minimal sketch of this loading guard follows this list).
* If you wish to use STT and encounter build issues with `naudiodon`, please ensure you have the necessary build tools and libraries listed below for your operating system.
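To make the "optional" behavior concrete, a native addon like `naudiodon` is typically loaded behind a guard similar to the sketch below. This is only an illustration of the pattern; Mindcraft's actual module and variable names may differ.

```js
// Illustrative pattern only, not Mindcraft's exact code.
let portAudio = null;
try {
    portAudio = require('naudiodon'); // native addon; absent if the optional install/build failed
} catch (err) {
    console.warn('naudiodon not available, STT will be disabled:', err.message);
}

const sttAvailable = portAudio !== null;
// Speech-to-text code would check `sttAvailable` before trying to open an audio input stream.
```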
**General Requirements for Building `naudiodon`:**

* **Node.js:** Ensure Node.js (v14+) is properly installed and added to your system's PATH.
* **Python:** `node-gyp` (the tool used to build native addons like `naudiodon`) requires Python. Recent versions of `node-gyp` are compatible with Python 3.x. Make sure Python is installed and accessible.
* **C++ Compiler Toolchain:** A C++ compiler (like g++ or MSVC) and related build tools (like `make` or MSBuild) are necessary.
* **PortAudio Library:** `naudiodon` specifically requires the PortAudio library.

**Operating System Specifics for `PortAudio` (and `naudiodon` build):**
### Linux

* **Debian/Ubuntu:**
  ```bash
  sudo apt-get update
  sudo apt-get install build-essential libasound2-dev libportaudio-dev
  ```
  (`build-essential` provides g++, make, etc. `libasound2-dev` is for ALSA, and `libportaudio-dev` is crucial for `naudiodon`. On releases where `libportaudio-dev` is not available, the PortAudio development package is named `portaudio19-dev`.)

* **Fedora/RHEL/CentOS:**
  ```bash
  # For newer Fedora (using dnf)
  sudo dnf groupinstall "Development Tools"
  sudo dnf install alsa-lib-devel portaudio-devel

  # For older RHEL/CentOS (using yum)
  sudo yum groupinstall "Development Tools"
  sudo yum install alsa-lib-devel portaudio-devel
  ```
  (`portaudio-devel` is the equivalent of `libportaudio-dev`.)
### Windows

* **Visual Studio C++ Build Tools:** This is the recommended way.
  1. Download the [Visual Studio Installer](https://visualstudio.microsoft.com/downloads/).
  2. Run the installer and select "Desktop development with C++" under the "Workloads" tab. This will install the necessary C++ compiler, MSBuild, and Windows SDKs.
  3. Ensure that Python is correctly configured for `node-gyp`. If you have multiple Python versions, you might need to tell `npm` which one to use (e.g., `npm config set python C:\path\to\python.exe`) or ensure your desired Python version is first in your system's PATH.
* **MSYS2/MinGW:** While possible, this can be more complex. You would need to compile/install PortAudio within the MSYS2 environment and ensure `node-gyp` is configured to use the MinGW toolchain. Using the Visual Studio C++ Build Tools is generally more straightforward for `node-gyp` on Windows.
### macOS

* **Xcode Command Line Tools:**
  ```bash
  xcode-select --install
  ```
  (This installs Clang, make, and other necessary build tools.)
* **PortAudio:**
  ```bash
  brew install portaudio
  ```
  (Homebrew is the easiest way to install PortAudio on macOS.)
* **pkg-config (if needed):**
  ```bash
  brew install pkg-config
  ```
  (Sometimes required for build scripts to find library information.)

If you see warnings or errors related to `naudiodon` during `npm install` and you *do not* intend to use the STT feature, these can typically be ignored. If you *do* want STT, ensure the above prerequisites are met, then run `npm install` again (or `npm rebuild naudiodon`) so the addon gets another chance to build.
## Install and Run

1. Make sure you have the requirements above. If you plan to use the STT (Speech-to-Text) feature, also review the "Installation Prerequisites" section regarding `naudiodon`.

@@ -253,6 +188,22 @@ Supported Embedding APIs: `openai`, `google`, `replicate`, `huggingface`, `novit

If you try to use an unsupported model, it will fall back to a simple word-overlap method. Expect reduced performance; mixing APIs is recommended to ensure embedding support. (A rough sketch of what a word-overlap fallback looks like is shown below.)
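For intuition, the sketch below shows what a simple word-overlap scorer looks like. It is illustrative only and not Mindcraft's actual fallback implementation.

```js
// Illustrative only: rank stored docs against a query by shared words instead of embeddings.
function wordOverlapScore(a, b) {
    const tokenize = (s) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
    const wordsA = tokenize(a);
    const wordsB = tokenize(b);
    let shared = 0;
    for (const w of wordsA) {
        if (wordsB.has(w)) shared++;
    }
    // Normalize by the smaller set so short queries are not penalized.
    return shared / Math.max(1, Math.min(wordsA.size, wordsB.size));
}

const docs = ['collect wood with an axe', 'smelt iron in a furnace'];
const query = 'how do I smelt iron';
const ranked = [...docs].sort((x, y) => wordOverlapScore(query, y) - wordOverlapScore(query, x));
console.log(ranked[0]); // 'smelt iron in a furnace'
```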
## Dataset collection

Mindcraft can collect data while you play with the bots, which can be used to generate training data to fine-tune models such as Andy-4. To do this, enable logging inside of `settings.js`, then navigate to the `logs` folder.

Inside the logs folder, after installing the dependencies (`pip install -r requirements.txt`), you will find a file named `generate_usernames.py`. You need to run this in order to convert your collected data into a usable dataset: it generates a pool of random names that replace your bot's name and your username, both of which improve performance later on.

To run it, run `python generate_usernames.py`. Generating the maximum number of usernames would take up multiple terabytes of data; if for some reason you want to do this, run it with the `--make_all` flag.

Next, set up `convert.py` to include every username that interacted with the bot, as well as the bot's own username. This is done by adding or changing the usernames in the `ORIGINAL_USERNAMES` list.

After this, you are all set up for conversion. Since you might not want to convert all data at once, rename the `.csv` file(s) you want to convert to `Andy_pre1`. If more than one file is wanted for conversion, change `1` to the next number; this value can be as high as you want.

To convert, run `python convert.py`. If you get a dependency error, ensure you are in a virtual Python environment rather than a global one.

For vision datasets, run `convert.py` with the `--vision` flag; this does the same conversion but outputs the data in an image-friendly format.
## Specifying Profiles via Command Line

By default, the program will use the profiles specified in `settings.js`. You can specify one or more agent profiles using the `--profiles` argument: `node main.js --profiles ./profiles/andy.json ./profiles/jill.json`
logger.js (83 changed lines)
@ -1,5 +1,3 @@
|
|||
// --- START OF FILE logger.js ---
|
||||
|
||||
import { writeFileSync, mkdirSync, existsSync, appendFileSync, readFileSync } from 'fs';
|
||||
import { join } from 'path';
|
||||
import settings from './settings.js'; // Import settings
|
||||
|
@ -133,13 +131,61 @@ function cleanReasoningMarkers(input) {
|
|||
return input.replace(/\/think/g, '').replace(/\/no_think/g, '').trim();
|
||||
}
|
||||
|
||||
// Helper function to clean imagePath from messages for text logs
|
||||
function cleanImagePathFromMessages(input) {
|
||||
if (typeof input !== 'string') {
|
||||
return input;
|
||||
}
|
||||
|
||||
try {
|
||||
const parsed = JSON.parse(input);
|
||||
if (Array.isArray(parsed)) {
|
||||
const cleaned = parsed.map(msg => {
|
||||
let cleanedMsg = { ...msg }; // Clone message
|
||||
|
||||
// Remove top-level imagePath
|
||||
if (cleanedMsg.imagePath !== undefined) {
|
||||
delete cleanedMsg.imagePath;
|
||||
}
|
||||
|
||||
// Remove image_url from content array
|
||||
if (Array.isArray(cleanedMsg.content)) {
|
||||
cleanedMsg.content = cleanedMsg.content.filter(part =>
|
||||
part.type !== 'image_url' &&
|
||||
!(part.type === 'image' && part.source) // Also filter Claude-style image parts
|
||||
);
|
||||
|
||||
// If content becomes empty after filtering, remove it or set to empty string
|
||||
if (cleanedMsg.content.length === 0) {
|
||||
cleanedMsg.content = "";
|
||||
} else if (cleanedMsg.content.length === 1 &&
|
||||
cleanedMsg.content[0].type === 'text' &&
|
||||
!cleanedMsg.content[0].text?.trim()) {
|
||||
cleanedMsg.content = "";
|
||||
}
|
||||
}
|
||||
return cleanedMsg;
|
||||
});
|
||||
return JSON.stringify(cleaned);
|
||||
}
|
||||
} catch (e) {
|
||||
// If not valid JSON, return as-is
|
||||
return input;
|
||||
}
|
||||
|
||||
return input;
|
||||
}
|
||||
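// Illustrative usage of cleanImagePathFromMessages above (not part of logger.js; the message
// shape and file path below are assumptions for the example, not values taken from Mindcraft).
const exampleHistory = JSON.stringify([
    {
        role: 'user',
        imagePath: 'bots/andy/screenshots/latest.jpg', // hypothetical path; deleted by the cleaner
        content: [
            { type: 'text', text: 'What block is this?' },
            { type: 'image_url', image_url: { url: 'data:image/jpeg;base64,...' } } // filtered out
        ]
    }
]);
console.log(cleanImagePathFromMessages(exampleHistory));
// -> [{"role":"user","content":[{"type":"text","text":"What block is this?"}]}]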
|
||||
// --- Main Logging Function (for text-based input/output) ---
|
||||
export function log(input, response) {
|
||||
const trimmedInputStr = input ? (typeof input === 'string' ? input.trim() : JSON.stringify(input)) : "";
|
||||
const trimmedResponse = response ? String(response).trim() : ""; // Ensure response is a string
|
||||
|
||||
// Clean reasoning markers from input before logging
|
||||
const cleanedInput = cleanReasoningMarkers(trimmedInputStr);
|
||||
let cleanedInput = cleanReasoningMarkers(trimmedInputStr);
|
||||
|
||||
// Clean imagePath from messages for text logs (normal/reasoning)
|
||||
cleanedInput = cleanImagePathFromMessages(cleanedInput);
|
||||
|
||||
// Basic filtering
|
||||
if (!cleanedInput && !trimmedResponse) {
|
||||
|
@ -248,6 +294,7 @@ export function logVision(conversationHistory, imageBuffer, response, visionMess
|
|||
"Context length exceeded",
|
||||
"Image input modality is not enabled",
|
||||
"An unexpected error occurred",
|
||||
"Image captured for always active vision", // Filter out placeholder responses
|
||||
];
|
||||
|
||||
if (errorMessages.some(err => trimmedResponse.includes(err))) {
|
||||
|
@ -271,31 +318,17 @@ export function logVision(conversationHistory, imageBuffer, response, visionMess
|
|||
writeFileSync(imagePath, imageBuffer);
|
||||
logCounts.vision_images_saved++;
|
||||
|
||||
// Extract the actual message sent with the image
|
||||
// This is typically the vision prompt/instruction
|
||||
let inputMessage = visionMessage;
|
||||
if (!inputMessage && conversationHistory.length > 0) {
|
||||
// Try to get the last user message or system message
|
||||
const lastMessage = conversationHistory[conversationHistory.length - 1];
|
||||
if (typeof lastMessage.content === 'string') {
|
||||
inputMessage = lastMessage.content;
|
||||
} else if (Array.isArray(lastMessage.content)) {
|
||||
// Find text content in the message
|
||||
const textContent = lastMessage.content.find(c => c.type === 'text');
|
||||
inputMessage = textContent ? textContent.text : '';
|
||||
}
|
||||
}
|
||||
|
||||
// Fallback to conversation history if no specific message
|
||||
if (!inputMessage) {
|
||||
inputMessage = formatConversationInput(conversationHistory);
|
||||
}
|
||||
// Clean the conversation history to remove imagePath and image data before logging
|
||||
const cleanedConversationHistory = JSON.parse(cleanImagePathFromMessages(JSON.stringify(conversationHistory)));
|
||||
|
||||
// Format the complete input as JSON (cleaned conversation history)
|
||||
const inputData = JSON.stringify(cleanedConversationHistory);
|
||||
|
||||
// Create metadata entry in JSONL format for HuggingFace
|
||||
const metadataEntry = {
|
||||
file_name: relativeImagePath,
|
||||
text: inputMessage,
|
||||
response: trimmedResponse,
|
||||
input: inputData, // Cleaned JSON conversation history
|
||||
response: trimmedResponse, // Actual model response, not placeholder
|
||||
timestamp: timestamp
|
||||
};
|
||||
|
||||
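// For orientation: one appended metadata.jsonl record has roughly this shape. This is an
// illustration only (all values invented); this hunk mixes removed and added lines, so the
// exact final schema should be read from the metadataEntry built in the new code above.
const exampleMetadataRecord = {
    file_name: "images/vision_1718000000000.jpg", // relative path to the saved screenshot (invented)
    input: "[{\"role\":\"system\",\"content\":\"...\"}]", // cleaned JSON conversation history
    response: "I can see a birch forest and a small pond.", // actual model response (invented)
    timestamp: "2025-06-10T12:13:14.000Z" // whatever format `timestamp` carries
};
// JSON.stringify(exampleMetadataRecord) would be appended as one line of metadata.jsonl.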
|
@ -397,5 +430,3 @@ function countVisionEntries(metadataFile) {
|
|||
|
||||
// Initialize counts at startup
|
||||
initializeCounts();
|
||||
|
||||
// --- END OF FILE logger.js ---
|
logs/convert.py (new file, 964 lines)
@ -0,0 +1,964 @@
|
|||
import csv
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
import os
|
||||
import random
|
||||
from typing import List, Dict
|
||||
import pandas as pd
|
||||
from USERNAMES import Get_Usernames
|
||||
from transformers import AutoTokenizer
|
||||
from tqdm import tqdm
|
||||
import torch
|
||||
from PIL import Image
|
||||
import base64
|
||||
from io import BytesIO
|
||||
|
||||
# Try to import pandas-image-methods for vision data handling
|
||||
try:
|
||||
from pandas_image_methods import PILMethods
|
||||
PANDAS_IMAGE_METHODS_AVAILABLE = True
|
||||
# Enable PIL methods for pandas
|
||||
pd.api.extensions.register_series_accessor("pil")(PILMethods)
|
||||
except ImportError:
|
||||
PANDAS_IMAGE_METHODS_AVAILABLE = False
|
||||
logging.warning("pandas-image-methods not available. Install with: pip install pandas-image-methods")
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Increase CSV field size limit to avoid errors with very large fields.
|
||||
maxInt = sys.maxsize
|
||||
while True:
|
||||
try:
|
||||
csv.field_size_limit(maxInt)
|
||||
break
|
||||
except OverflowError:
|
||||
maxInt = int(maxInt/10)
|
||||
|
||||
# Define the original usernames.
|
||||
ORIGINAL_USERNAMES = [
|
||||
"Your_username", "Andy"
|
||||
]
|
||||
|
||||
# Define outputs that should cause the conversation to be deleted.
|
||||
BAD_OUTPUTS = {
|
||||
"My brain just kinda stopped working. Try again.",
|
||||
"My brain disconnected, try again.",
|
||||
"Vision is only supported",
|
||||
"Context length exceeded",
|
||||
"Image input modality is not enabled",
|
||||
"An unexpected error occurred",
|
||||
}
|
||||
|
||||
MINECRAFT_USERNAMES = list(set(Get_Usernames())) # Remove duplicates
|
||||
duplicate_count = len(Get_Usernames()) - len(MINECRAFT_USERNAMES)
|
||||
|
||||
available_minecraft_usernames = list(MINECRAFT_USERNAMES) # Create a copy for tracking
|
||||
|
||||
global username_replaced_count
|
||||
global reasoning_replaced_count
|
||||
username_replaced_count = 0
|
||||
reasoning_replaced_count = 0
|
||||
|
||||
def replace_reasoning_prompt(text: str) -> str:
|
||||
global reasoning_replaced_count
|
||||
replaced = False
|
||||
# Optionally, replace the reasoning prompt if needed.
|
||||
if replaced:
|
||||
reasoning_replaced_count += 1
|
||||
return text
|
||||
|
||||
def parse_json_safely(text: str) -> List[Dict[str, str]]:
|
||||
try:
|
||||
if text.startswith('[') and '],' in text:
|
||||
parts = text.split('],')
|
||||
text = parts[0] + ']'
|
||||
if text.startswith('"') and text.endswith('"'):
|
||||
text = text[1:-1]
|
||||
text = text.replace('""', '"')
|
||||
data = json.loads(text)
|
||||
if isinstance(data, list) and len(data) > 0 and isinstance(data[0], list):
|
||||
data = data[0]
|
||||
converted_messages = []
|
||||
for msg in data:
|
||||
if isinstance(msg, dict) and 'role' in msg and 'content' in msg:
|
||||
converted_messages.append({
|
||||
"from": "human" if msg['role'] in ("system", "user") else "gpt",
|
||||
"value": msg['content']
|
||||
})
|
||||
return converted_messages
|
||||
except Exception as e:
|
||||
logger.debug(f"Error parsing JSON: {e}") # Suppressed error level
|
||||
return [{
|
||||
"from": "human",
|
||||
"value": text
|
||||
}]
|
||||
|
||||
def create_conversation_thread(row: Dict[str, str]) -> List[Dict[str, str]]:
|
||||
messages = []
|
||||
conversation_replacements = {} # Track username replacements for this conversation ONLY
|
||||
|
||||
def replace_usernames_in_message(text: str) -> str:
|
||||
global username_replaced_count
|
||||
global available_minecraft_usernames
|
||||
replaced = False
|
||||
|
||||
if not MINECRAFT_USERNAMES:
|
||||
return text
|
||||
|
||||
for orig_name in ORIGINAL_USERNAMES:
|
||||
if orig_name in text:
|
||||
if orig_name not in conversation_replacements:
|
||||
# If we've used all available names, reset the list
|
||||
if not available_minecraft_usernames:
|
||||
available_minecraft_usernames = list(MINECRAFT_USERNAMES)
|
||||
# Get a random name from the available ones
|
||||
replacement = random.choice(available_minecraft_usernames)
|
||||
available_minecraft_usernames.remove(replacement)
|
||||
conversation_replacements[orig_name] = replacement
|
||||
replaced = True
|
||||
# Use existing replacement for this conversation
|
||||
text = text.replace(orig_name, conversation_replacements[orig_name])
|
||||
|
||||
if replaced:
|
||||
username_replaced_count += 1
|
||||
return text
|
||||
|
||||
if row.get("input"):
|
||||
messages = parse_json_safely(str(row["input"]))
|
||||
# Apply consistent username replacements to all messages
|
||||
for msg in messages:
|
||||
msg["value"] = replace_usernames_in_message(msg["value"])
|
||||
|
||||
if row.get("output"):
|
||||
output_text = str(row["output"]).strip()
|
||||
output_text = replace_usernames_in_message(output_text)
|
||||
output_text = replace_reasoning_prompt(output_text)
|
||||
messages.append({
|
||||
"from": "gpt",
|
||||
"value": output_text
|
||||
})
|
||||
|
||||
return messages
|
||||
|
||||
def conversation_has_bad_output(messages: List[Dict[str, str]]) -> bool:
|
||||
for msg in messages:
|
||||
if msg["from"] == "gpt" and msg["value"].strip() in BAD_OUTPUTS:
|
||||
return True
|
||||
return False
|
||||
|
||||
def load_image_from_base64(base64_string: str):
|
||||
"""Convert base64 string to PIL Image"""
|
||||
try:
|
||||
if base64_string.startswith('data:'):
|
||||
base64_string = base64_string.split(',')[1]
|
||||
|
||||
image_bytes = base64.b64decode(base64_string)
|
||||
image = Image.open(BytesIO(image_bytes))
|
||||
|
||||
if image.mode in ('RGBA', 'LA', 'P'):
|
||||
image = image.convert('RGB')
|
||||
|
||||
return image
|
||||
except Exception as e:
|
||||
logger.debug(f"Error loading image from base64: {e}")
|
||||
return Image.new('RGB', (224, 224), color='gray')
|
||||
|
||||
def pil_image_to_parquet_dict(image: Image.Image, filename: str) -> Dict:
|
||||
"""Converts a PIL Image to the dictionary format {bytes, path} for Parquet."""
|
||||
img_byte_arr = BytesIO()
|
||||
# Determine a suitable save format
|
||||
save_format = image.format if image.format and image.format in Image.SAVE else 'PNG'
|
||||
|
||||
# Handle specific mode conversions if necessary for the chosen format
|
||||
if save_format == 'PNG' and image.mode not in ['RGB', 'RGBA', 'L', 'P', 'I', 'F']: # Common PNG modes
|
||||
# Convert to a mode PNG supports, e.g., RGBA to preserve transparency
|
||||
image_to_save = image.convert("RGBA")
|
||||
elif save_format == 'JPEG' and image.mode not in ['RGB', 'L', 'CMYK']:
|
||||
# Convert to a mode JPEG supports
|
||||
image_to_save = image.convert("RGB")
|
||||
else:
|
||||
image_to_save = image
|
||||
|
||||
try:
|
||||
image_to_save.save(img_byte_arr, format=save_format)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not save image {filename} in format {save_format} (Error: {e}). Attempting PNG.")
|
||||
save_format = 'PNG'
|
||||
if image_to_save.mode not in ['RGB', 'RGBA', 'L', 'P', 'I', 'F']:
|
||||
image_to_save = image.convert("RGBA") # Default to RGBA for PNG
|
||||
image_to_save.save(img_byte_arr, format=save_format)
|
||||
|
||||
return {"bytes": img_byte_arr.getvalue(), "path": filename}
|
||||
|
||||
def extract_vision_data_from_jsonl(jsonl_path: str) -> List[Dict]:
|
||||
"""Extract vision data from HuggingFace JSONL metadata format"""
|
||||
if not os.path.isfile(jsonl_path):
|
||||
logger.error(f"JSONL file not found: {jsonl_path}")
|
||||
return []
|
||||
|
||||
logger.info(f"Reading vision metadata: {jsonl_path}")
|
||||
|
||||
# Get the directory containing the JSONL file (should contain images folder)
|
||||
base_dir = os.path.dirname(jsonl_path)
|
||||
images_dir = os.path.join(base_dir, 'images')
|
||||
|
||||
if not os.path.isdir(images_dir):
|
||||
logger.error(f"Images directory not found: {images_dir}")
|
||||
return []
|
||||
|
||||
vision_data = []
|
||||
|
||||
with open(jsonl_path, 'r', encoding='utf-8') as f:
|
||||
for line_num, line in enumerate(f, 1):
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
|
||||
try:
|
||||
entry = json.loads(line)
|
||||
|
||||
# Extract required fields - logger.js uses 'input' and 'response', not 'text'
|
||||
file_name = entry.get('file_name', '')
|
||||
input_data = entry.get('input', '')
|
||||
response = entry.get('response', '')
|
||||
|
||||
if not all([file_name, input_data, response]):
|
||||
logger.warning(f"Line {line_num}: Missing required fields (file_name, input, response)")
|
||||
continue
|
||||
|
||||
# Check for bad outputs
|
||||
if response.strip() in BAD_OUTPUTS:
|
||||
logger.debug(f"Line {line_num}: Skipping bad output")
|
||||
continue
|
||||
|
||||
# Load the image
|
||||
image_path = os.path.join(base_dir, file_name)
|
||||
if not os.path.isfile(image_path):
|
||||
logger.warning(f"Line {line_num}: Image file not found: {image_path}")
|
||||
continue
|
||||
|
||||
try:
|
||||
image = Image.open(image_path)
|
||||
if image.mode in ('RGBA', 'LA', 'P') and image.format != 'PNG': # PNG handles these modes well
|
||||
image = image.convert('RGB') # Convert to RGB if not PNG to simplify, or handle more modes in pil_image_to_parquet_dict
|
||||
except Exception as e:
|
||||
logger.warning(f"Line {line_num}: Error loading image {image_path}: {e}")
|
||||
continue
|
||||
|
||||
# Convert PIL image to parquet-compatible dict
|
||||
relative_image_path_for_dict = file_name # Use the relative path from metadata
|
||||
image_dict = pil_image_to_parquet_dict(image, relative_image_path_for_dict)
|
||||
|
||||
# Create a separate conversation_replacements for each vision entry
|
||||
entry_conversation_replacements = {}
|
||||
|
||||
# Replace usernames consistently within this single entry
|
||||
def replace_usernames_in_text(text: str) -> str:
|
||||
global username_replaced_count
|
||||
global available_minecraft_usernames
|
||||
replaced = False
|
||||
|
||||
if not MINECRAFT_USERNAMES:
|
||||
return text
|
||||
|
||||
for orig_name in ORIGINAL_USERNAMES:
|
||||
if orig_name in text:
|
||||
if orig_name not in entry_conversation_replacements:
|
||||
if not available_minecraft_usernames:
|
||||
available_minecraft_usernames = list(MINECRAFT_USERNAMES)
|
||||
replacement = random.choice(available_minecraft_usernames)
|
||||
available_minecraft_usernames.remove(replacement)
|
||||
entry_conversation_replacements[orig_name] = replacement
|
||||
replaced = True
|
||||
text = text.replace(orig_name, entry_conversation_replacements[orig_name])
|
||||
|
||||
if replaced:
|
||||
username_replaced_count += 1
|
||||
return text
|
||||
|
||||
# Parse the input data (conversation history) and build conversation
|
||||
try:
|
||||
# The input_data should be JSON string of conversation history
|
||||
conversation_history = json.loads(input_data)
|
||||
|
||||
# Build the conversation in unsloth format
|
||||
conversation = []
|
||||
|
||||
if isinstance(conversation_history, list):
|
||||
for msg in conversation_history:
|
||||
if isinstance(msg, dict) and 'role' in msg:
|
||||
role = msg['role']
|
||||
# Map system messages to user role for simplicity
|
||||
if role == 'system':
|
||||
role = 'user'
|
||||
|
||||
content_parts = []
|
||||
|
||||
# Handle different content formats
|
||||
if 'content' in msg:
|
||||
content = msg['content']
|
||||
if isinstance(content, str):
|
||||
# Simple string content
|
||||
text_content = replace_usernames_in_text(content)
|
||||
content_parts.append({"type": "text", "text": text_content})
|
||||
elif isinstance(content, list):
|
||||
# Array content (multimodal messages)
|
||||
for part in content:
|
||||
if isinstance(part, dict):
|
||||
if part.get('type') == 'text':
|
||||
text_content = part.get('text', '')
|
||||
if text_content:
|
||||
text_content = replace_usernames_in_text(text_content)
|
||||
content_parts.append({"type": "text", "text": text_content})
|
||||
# Skip image parts from history - we'll add the main image to the user message
|
||||
elif any(key in msg for key in ['text', 'message', 'value']):
|
||||
# Handle other message formats
|
||||
text_content = msg.get('text') or msg.get('message') or msg.get('value', '')
|
||||
if text_content:
|
||||
text_content = replace_usernames_in_text(str(text_content))
|
||||
content_parts.append({"type": "text", "text": text_content})
|
||||
|
||||
if content_parts:
|
||||
conversation.append({
|
||||
"role": role,
|
||||
"content": content_parts
|
||||
})
|
||||
|
||||
# If no conversation history was parsed or it's empty, create a simple user message
|
||||
if not conversation:
|
||||
# Use the raw input data as text
|
||||
text_content = replace_usernames_in_text(str(input_data).strip())
|
||||
conversation.append({
|
||||
"role": "user",
|
||||
"content": [{"type": "text", "text": text_content}]
|
||||
})
|
||||
|
||||
# Add the image to the last user message (or create one if none exists)
|
||||
user_msg_found = False
|
||||
for i in range(len(conversation) - 1, -1, -1):
|
||||
if conversation[i]["role"] == "user":
|
||||
# Add image to this user message
|
||||
conversation[i]["content"].append({"type": "image", "image": image_dict})
|
||||
user_msg_found = True
|
||||
break
|
||||
|
||||
if not user_msg_found:
|
||||
# No user message found, create one with just the image
|
||||
conversation.append({
|
||||
"role": "user",
|
||||
"content": [{"type": "image", "image": image_dict}]
|
||||
})
|
||||
|
||||
# Add the assistant response
|
||||
response_text = replace_usernames_in_text(response)
|
||||
conversation.append({
|
||||
"role": "assistant",
|
||||
"content": [{"type": "text", "text": response_text}]
|
||||
})
|
||||
|
||||
except json.JSONDecodeError:
|
||||
# If input_data is not valid JSON, create simple conversation
|
||||
text_content = replace_usernames_in_text(str(input_data).strip())
|
||||
response_text = replace_usernames_in_text(response)
|
||||
|
||||
conversation = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": text_content},
|
||||
{"type": "image", "image": image_dict}
|
||||
]
|
||||
},
|
||||
{
|
||||
"role": "assistant",
|
||||
"content": [{"type": "text", "text": response_text}]
|
||||
}
|
||||
]
|
||||
except Exception as e:
|
||||
logger.debug(f"Line {line_num}: Error parsing conversation history: {e}")
|
||||
# Fallback to simple conversation
|
||||
text_content = replace_usernames_in_text(str(input_data).strip())
|
||||
response_text = replace_usernames_in_text(response)
|
||||
|
||||
conversation = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": text_content},
|
||||
{"type": "image", "image": image_dict}
|
||||
]
|
||||
},
|
||||
{
|
||||
"role": "assistant",
|
||||
"content": [{"type": "text", "text": response_text}]
|
||||
}
|
||||
]
|
||||
|
||||
vision_data.append(conversation)
|
||||
|
||||
except json.JSONDecodeError as e:
|
||||
logger.warning(f"Line {line_num}: JSON decode error: {e}")
|
||||
continue
|
||||
except Exception as e:
|
||||
logger.warning(f"Line {line_num}: Unexpected error: {e}")
|
||||
continue
|
||||
|
||||
logger.info(f"Successfully processed {len(vision_data)} vision entries")
|
||||
return vision_data
|
||||
|
||||
def extract_vision_conversations_from_csv(csv_input: str) -> List[Dict]:
|
||||
"""Extract vision data from CSV with input,image,output columns"""
|
||||
if not os.path.isfile(csv_input):
|
||||
logger.debug(f"Vision CSV file not found: {csv_input}")
|
||||
return []
|
||||
|
||||
logger.info(f"Reading Vision CSV: {csv_input}")
|
||||
|
||||
try:
|
||||
df = pd.read_csv(csv_input)
|
||||
required_columns = ['input', 'image', 'output']
|
||||
|
||||
if not all(col in df.columns for col in required_columns):
|
||||
logger.debug(f"Vision CSV missing required columns: {required_columns}")
|
||||
return []
|
||||
|
||||
vision_data = []
|
||||
|
||||
for idx, row in df.iterrows():
|
||||
try:
|
||||
input_text = str(row['input']).strip()
|
||||
image_b64 = str(row['image']).strip()
|
||||
output_text = str(row['output']).strip()
|
||||
|
||||
if not all([input_text, image_b64, output_text]):
|
||||
continue
|
||||
|
||||
# Check for bad outputs
|
||||
if output_text in BAD_OUTPUTS:
|
||||
continue
|
||||
|
||||
# Create separate replacements for each row
|
||||
row_conversation_replacements = {}
|
||||
|
||||
# Replace usernames consistently within this single row
|
||||
def replace_usernames_in_text(text: str) -> str:
|
||||
global username_replaced_count
|
||||
global available_minecraft_usernames
|
||||
replaced = False
|
||||
|
||||
if not MINECRAFT_USERNAMES:
|
||||
return text
|
||||
|
||||
for orig_name in ORIGINAL_USERNAMES:
|
||||
if orig_name in text:
|
||||
if orig_name not in row_conversation_replacements:
|
||||
if not available_minecraft_usernames:
|
||||
available_minecraft_usernames = list(MINECRAFT_USERNAMES)
|
||||
replacement = random.choice(available_minecraft_usernames)
|
||||
available_minecraft_usernames.remove(replacement)
|
||||
row_conversation_replacements[orig_name] = replacement
|
||||
replaced = True
|
||||
text = text.replace(orig_name, row_conversation_replacements[orig_name])
|
||||
|
||||
if replaced:
|
||||
username_replaced_count += 1
|
||||
return text
|
||||
|
||||
input_text = replace_usernames_in_text(input_text)
|
||||
output_text = replace_usernames_in_text(output_text)
|
||||
|
||||
# Load image from base64
|
||||
image = load_image_from_base64(image_b64)
|
||||
|
||||
# Convert PIL image to parquet-compatible dict
|
||||
image_filename_for_dict = f"image_from_base64_{idx}.png" # Create a placeholder filename
|
||||
image_dict = pil_image_to_parquet_dict(image, image_filename_for_dict)
|
||||
|
||||
# Create conversation in unsloth format
|
||||
conversation = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": input_text},
|
||||
{"type": "image", "image": image_dict}
|
||||
]
|
||||
},
|
||||
{
|
||||
"role": "assistant",
|
||||
"content": [{"type": "text", "text": output_text}]
|
||||
}
|
||||
]
|
||||
|
||||
vision_data.append(conversation)
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Row {idx}: Error processing vision data: {e}")
|
||||
continue
|
||||
|
||||
logger.info(f"Successfully processed {len(vision_data)} vision entries from CSV")
|
||||
return vision_data
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading vision CSV {csv_input}: {e}")
|
||||
return []
|
||||
|
||||
def extract_conversations_from_csv(csv_input: str) -> List[List[Dict[str, str]]]:
|
||||
if not os.path.isfile(csv_input):
|
||||
logger.debug(f"CSV file not found: {csv_input}")
|
||||
return []
|
||||
|
||||
logger.info(f"Reading CSV: {csv_input}")
|
||||
valid_rows = []
|
||||
extra_issue_rows = 0
|
||||
total_extra_columns = 0
|
||||
|
||||
with open(csv_input, newline='', encoding="utf-8") as csvfile:
|
||||
reader = csv.reader(csvfile)
|
||||
try:
|
||||
header = next(reader)
|
||||
except StopIteration:
|
||||
logger.debug(f"CSV file {csv_input} is empty.")
|
||||
return []
|
||||
|
||||
header_expected = {"input", "output"}
|
||||
header_map = {col: idx for idx, col in enumerate(header)}
|
||||
if not header_expected.issubset(set(header)):
|
||||
logger.debug(f"CSV header does not contain required columns: {header_expected}")
|
||||
return []
|
||||
|
||||
for idx, row in enumerate(reader, start=2):
|
||||
non_empty_count = sum(1 for field in row if field.strip() != "")
|
||||
if non_empty_count > 2:
|
||||
extra = non_empty_count - 2
|
||||
extra_issue_rows += 1
|
||||
total_extra_columns += extra
|
||||
logger.info(f"Row {idx} has {extra} extra filled column(s); row skipped.")
|
||||
continue
|
||||
row_dict = {col: row[header_map[col]] if header_map[col] < len(row) else "" for col in header_expected}
|
||||
valid_rows.append(row_dict)
|
||||
|
||||
logger.info(f"Excluded {extra_issue_rows} row(s) with extra columns (total extra columns: {total_extra_columns}).")
|
||||
df = pd.DataFrame(valid_rows)
|
||||
conversations = []
|
||||
for idx, row in df.iterrows():
|
||||
conv = create_conversation_thread(row)
|
||||
if conversation_has_bad_output(conv):
|
||||
continue
|
||||
conversations.append(conv)
|
||||
return conversations
|
||||
|
||||
def extract_vision_conversations_from_csv(csv_input: str) -> List[Dict]:
|
||||
"""Extract vision data from CSV with input,image,output columns"""
|
||||
if not os.path.isfile(csv_input):
|
||||
logger.debug(f"Vision CSV file not found: {csv_input}")
|
||||
return []
|
||||
|
||||
logger.info(f"Reading Vision CSV: {csv_input}")
|
||||
|
||||
try:
|
||||
df = pd.read_csv(csv_input)
|
||||
required_columns = ['input', 'image', 'output']
|
||||
|
||||
if not all(col in df.columns for col in required_columns):
|
||||
logger.debug(f"Vision CSV missing required columns: {required_columns}")
|
||||
return []
|
||||
|
||||
vision_data = []
|
||||
|
||||
for idx, row in df.iterrows():
|
||||
try:
|
||||
input_text = str(row['input']).strip()
|
||||
image_b64 = str(row['image']).strip()
|
||||
output_text = str(row['output']).strip()
|
||||
|
||||
if not all([input_text, image_b64, output_text]):
|
||||
continue
|
||||
|
||||
# Check for bad outputs
|
||||
if output_text in BAD_OUTPUTS:
|
||||
continue
|
||||
|
||||
# Create separate replacements for each row
|
||||
row_conversation_replacements = {}
|
||||
|
||||
# Replace usernames consistently within this single row
|
||||
def replace_usernames_in_text(text: str) -> str:
|
||||
global username_replaced_count
|
||||
global available_minecraft_usernames
|
||||
replaced = False
|
||||
|
||||
if not MINECRAFT_USERNAMES:
|
||||
return text
|
||||
|
||||
for orig_name in ORIGINAL_USERNAMES:
|
||||
if orig_name in text:
|
||||
if orig_name not in row_conversation_replacements:
|
||||
if not available_minecraft_usernames:
|
||||
available_minecraft_usernames = list(MINECRAFT_USERNAMES)
|
||||
replacement = random.choice(available_minecraft_usernames)
|
||||
available_minecraft_usernames.remove(replacement)
|
||||
row_conversation_replacements[orig_name] = replacement
|
||||
replaced = True
|
||||
text = text.replace(orig_name, row_conversation_replacements[orig_name])
|
||||
|
||||
if replaced:
|
||||
username_replaced_count += 1
|
||||
return text
|
||||
|
||||
input_text = replace_usernames_in_text(input_text)
|
||||
output_text = replace_usernames_in_text(output_text)
|
||||
|
||||
# Load image from base64
|
||||
image = load_image_from_base64(image_b64)
|
||||
|
||||
# Convert PIL image to parquet-compatible dict
|
||||
image_filename_for_dict = f"image_from_base64_{idx}.png" # Create a placeholder filename
|
||||
image_dict = pil_image_to_parquet_dict(image, image_filename_for_dict)
|
||||
|
||||
# Create conversation in unsloth format
|
||||
conversation = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": input_text},
|
||||
{"type": "image", "image": image_dict}
|
||||
]
|
||||
},
|
||||
{
|
||||
"role": "assistant",
|
||||
"content": [{"type": "text", "text": output_text}]
|
||||
}
|
||||
]
|
||||
|
||||
vision_data.append(conversation)
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Row {idx}: Error processing vision data: {e}")
|
||||
continue
|
||||
|
||||
logger.info(f"Successfully processed {len(vision_data)} vision entries from CSV")
|
||||
return vision_data
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading vision CSV {csv_input}: {e}")
|
||||
return []
|
||||
|
||||
def extract_conversations_from_json(json_input: str) -> List[List[Dict[str, str]]]:
|
||||
logger.info(f"Reading JSON: {json_input}")
|
||||
try:
|
||||
with open(json_input, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
except Exception as e:
|
||||
logger.debug(f"Error reading {json_input}: {e}")
|
||||
return []
|
||||
conversations = []
|
||||
for conv in data:
|
||||
messages = []
|
||||
if "system" in conv and conv["system"]:
|
||||
system_text = str(conv["system"]).strip()
|
||||
system_text = replace_reasoning_prompt(system_text)
|
||||
messages.append({"from": "human", "value": system_text})
|
||||
if "user" in conv and conv["user"]:
|
||||
user_text = str(conv["user"]).strip()
|
||||
user_text = replace_reasoning_prompt(user_text)
|
||||
messages.append({"from": "human", "value": user_text})
|
||||
if "assistant" in conv and conv["assistant"]:
|
||||
assistant_text = str(conv["assistant"]).strip()
|
||||
assistant_text = replace_reasoning_prompt(assistant_text)
|
||||
messages.append({"from": "gpt", "value": assistant_text})
|
||||
if messages and not conversation_has_bad_output(messages):
|
||||
conversations.append(messages)
|
||||
return conversations
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Handle vision dataset processing
|
||||
if '--vision' in sys.argv:
|
||||
if not PANDAS_IMAGE_METHODS_AVAILABLE:
|
||||
logger.error("pandas-image-methods is required for --vision flag. Install with: pip install pandas-image-methods")
|
||||
sys.exit(1)
|
||||
|
||||
# Look for vision data files
|
||||
vision_files = []
|
||||
|
||||
# Check for HuggingFace format (metadata.jsonl)
|
||||
metadata_jsonl = "vision_dataset/metadata.jsonl"
|
||||
if os.path.isfile(metadata_jsonl):
|
||||
vision_files.append((metadata_jsonl, 'jsonl'))
|
||||
|
||||
# Check for CSV format vision logs
|
||||
vision_csv = "vision_logs.csv"
|
||||
if os.path.isfile(vision_csv):
|
||||
vision_files.append((vision_csv, 'csv'))
|
||||
|
||||
# Check for numbered files
|
||||
i = 1
|
||||
while True:
|
||||
jsonl_file = f"vision_dataset{i}/metadata.jsonl"
|
||||
csv_file = f"vision_logs{i}.csv"
|
||||
found_any = False
|
||||
|
||||
if os.path.isfile(jsonl_file):
|
||||
vision_files.append((jsonl_file, 'jsonl'))
|
||||
found_any = True
|
||||
if os.path.isfile(csv_file):
|
||||
vision_files.append((csv_file, 'csv'))
|
||||
found_any = True
|
||||
|
||||
if not found_any:
|
||||
break
|
||||
i += 1
|
||||
|
||||
if not vision_files:
|
||||
logger.error("No vision dataset files found for --vision flag!")
|
||||
logger.info("Looking for:")
|
||||
logger.info(" - vision_dataset/metadata.jsonl (HuggingFace format)")
|
||||
logger.info(" - vision_logs.csv (CSV format)")
|
||||
logger.info(" - vision_datasetN/metadata.jsonl")
|
||||
logger.info(" - vision_logsN.csv")
|
||||
sys.exit(1)
|
||||
|
||||
logger.info(f"Found {len(vision_files)} vision files: {[f for f, _ in vision_files]}")
|
||||
|
||||
# Process all vision files
|
||||
all_vision_data = []
|
||||
total_count = 0
|
||||
file_counts = {}
|
||||
|
||||
for file_path, file_type in vision_files:
|
||||
if file_type == 'jsonl':
|
||||
vision_data = extract_vision_data_from_jsonl(file_path)
|
||||
else: # csv
|
||||
vision_data = extract_vision_conversations_from_csv(file_path)
|
||||
|
||||
file_counts[file_path] = len(vision_data)
|
||||
all_vision_data.extend(vision_data)
|
||||
total_count += len(vision_data)
|
||||
|
||||
if not all_vision_data:
|
||||
logger.error("No valid vision data found!")
|
||||
sys.exit(1)
|
||||
|
||||
# Check for tokenization flags
|
||||
do_tokenize = '--tokenize' in sys.argv
|
||||
tokenizer = None
|
||||
device = "cuda" if torch.cuda.is_available() else "cpu"
|
||||
if do_tokenize:
|
||||
logger.info("Loading tokenizer 'unsloth/Llama-3.2-1B-Instruct-bnb-4bit'...")
|
||||
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct-bnb-4bit")
|
||||
|
||||
# Tokenize if requested
|
||||
if do_tokenize and tokenizer:
|
||||
all_texts = []
|
||||
for conversation in all_vision_data:
    # Each entry is a conversation (a list of {"role", "content": [...]} dicts),
    # not a dict with 'input'/'output' keys, so collect the text parts instead.
    for msg in conversation:
        for part in msg.get("content", []):
            if part.get("type") == "text":
                all_texts.append(part["text"])
|
||||
|
||||
total_tokens = 0
|
||||
logger.info("Tokenizing vision data...")
|
||||
for text in tqdm(all_texts, desc="Tokenizing", unit="msg"):
|
||||
encoded = tokenizer(text, return_tensors="pt")
|
||||
input_ids = encoded["input_ids"].to(device)
|
||||
total_tokens += input_ids.shape[-1]
|
||||
logger.info(f"Total tokens across all vision data: {total_tokens}")
|
||||
|
||||
# Remove duplicates based on conversation content
|
||||
unique_vision_data = []
|
||||
seen_keys = set()
|
||||
|
||||
for conversation in all_vision_data:
|
||||
# Create a key from the text content of the conversation
|
||||
key_parts = []
|
||||
for msg in conversation:
|
||||
if msg["role"] in ["user", "assistant"]:
|
||||
for content_part in msg["content"]:
|
||||
if content_part["type"] == "text":
|
||||
key_parts.append(content_part["text"].strip())
|
||||
|
||||
key = tuple(key_parts)
|
||||
if key not in seen_keys:
|
||||
seen_keys.add(key)
|
||||
unique_vision_data.append(conversation)
|
||||
|
||||
all_vision_data = unique_vision_data
|
||||
logger.info(f"After deduplication: {len(all_vision_data)} unique vision conversations")
|
||||
|
||||
# Shuffle the data
|
||||
random.shuffle(all_vision_data)
|
||||
|
||||
# Images are already in parquet-compatible dict format within all_vision_data
|
||||
# No further image processing needed here before creating DataFrame
|
||||
|
||||
# Create DataFrame with conversations column (unsloth format)
|
||||
df_final = pd.DataFrame({"conversations": all_vision_data})
|
||||
|
||||
output_parquet = "Andy_vision_conversations.parquet"
|
||||
|
||||
logger.info(f"Writing vision dataset to {output_parquet}")
|
||||
try:
|
||||
df_final.to_parquet(output_parquet, index=False)
|
||||
abs_path = os.path.abspath(output_parquet)
|
||||
logger.info(f"Successfully wrote vision dataset to: {abs_path}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error writing Parquet file: {e}")
|
||||
sys.exit(1)
|
||||
|
||||
logger.info(
|
||||
f"\n"
|
||||
f"--------------------------------------------------------------------------------------\n"
|
||||
f"Vision conversion complete! Processed {total_count} vision conversations from {len(vision_files)} files.\n"
|
||||
f"Replaced {username_replaced_count} usernames across conversations.\n"
|
||||
f"Total usernames available: {len(MINECRAFT_USERNAMES)}\n"
|
||||
f"Final dataset size: {len(all_vision_data)} unique conversations\n"
|
||||
f"--------------------------------------------------------------------------------------\n"
|
||||
)
|
||||
|
||||
# Log counts per file
|
||||
for file_path, count in file_counts.items():
|
||||
logger.info(f"File '{file_path}' contributed {count} conversations.")
|
||||
|
||||
sys.exit(0)
|
||||
|
||||
# Regular processing for non-vision data
|
||||
base_filename = "Andy_pre"
|
||||
files = []
|
||||
i = 1
|
||||
while True:
|
||||
csv_file = f"{base_filename}{i}.csv"
|
||||
json_file = f"{base_filename}{i}.json"
|
||||
if not os.path.isfile(csv_file) and not os.path.isfile(json_file):
|
||||
break
|
||||
if os.path.isfile(csv_file):
|
||||
files.append((csv_file, 'csv'))
|
||||
if os.path.isfile(json_file):
|
||||
files.append((json_file, 'json'))
|
||||
i += 1
|
||||
|
||||
if not files:
|
||||
logger.info("No CSV or JSON files found with pattern Andy_preN.(csv|json)")
|
||||
sys.exit(1)
|
||||
|
||||
# Check for tokenization flags
|
||||
do_tokenize = '--tokenize' in sys.argv
|
||||
do_tokenize_largest = '--tokenize_largest' in sys.argv
|
||||
tokenizer = None
|
||||
device = "cuda" if torch.cuda.is_available() else "cpu"
|
||||
if do_tokenize or do_tokenize_largest:
|
||||
logger.info("Loading tokenizer 'unsloth/Llama-3.2-1B-Instruct-bnb-4bit'...")
|
||||
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct-bnb-4bit")
|
||||
|
||||
logger.info(f"Found {len(files)} files: {[f for f, _ in files]}")
|
||||
combined_conversations = []
|
||||
total_count = 0
|
||||
file_conversation_counts = {}
|
||||
|
||||
for file, ftype in files:
|
||||
if ftype == 'csv':
|
||||
convs = extract_conversations_from_csv(file)
|
||||
else:
|
||||
convs = extract_conversations_from_json(file)
|
||||
file_conversation_counts[file] = len(convs)
|
||||
combined_conversations.extend(convs)
|
||||
total_count += len(convs)
|
||||
|
||||
# Tokenize all data and count tokens
|
||||
if do_tokenize:
|
||||
all_texts = [msg["value"] for conv in combined_conversations for msg in conv]
|
||||
total_tokens = 0
|
||||
logger.info("Tokenizing all data with progress bar and GPU acceleration...")
|
||||
for text in tqdm(all_texts, desc="Tokenizing", unit="msg"):
|
||||
encoded = tokenizer(text, return_tensors="pt")
|
||||
input_ids = encoded["input_ids"].to(device)
|
||||
total_tokens += input_ids.shape[-1]
|
||||
logger.info(f"Total tokens across all data: {total_tokens}")
|
||||
|
||||
# Tokenize 5 largest conversations
|
||||
if do_tokenize_largest:
|
||||
conv_token_counts = []
|
||||
logger.info("Tokenizing largest conversations with progress bar and GPU acceleration...")
|
||||
for conv in tqdm(combined_conversations, desc="Tokenizing convs", unit="conv"):
|
||||
text = "\n".join(msg["value"] for msg in conv)
|
||||
encoded = tokenizer(text, return_tensors="pt")
|
||||
input_ids = encoded["input_ids"].to(device)
|
||||
conv_token_counts.append((input_ids.shape[-1], conv))
|
||||
# sort and take top 5
|
||||
conv_token_counts.sort(key=lambda x: x[0], reverse=True)
|
||||
top5 = conv_token_counts[:5]
|
||||
max_tokens = max(count for count, _ in top5)
|
||||
for idx, (count, _) in enumerate(top5, 1):
|
||||
logger.info(f"Top {idx} conversation tokens: {count}")
|
||||
logger.info(f"Maximum tokens in top 5: {max_tokens}")
|
||||
|
||||
# Clean up GPT messages
|
||||
for conv in combined_conversations:
|
||||
for msg in conv:
|
||||
if msg["from"] == "gpt":
|
||||
msg["value"] = msg["value"].replace("<think>\nundefined</think>\n", "").replace("<think>\nundefined</think>", "").strip()
|
||||
|
||||
unique_conversations = []
|
||||
seen_keys = set()
|
||||
for conv in combined_conversations:
|
||||
if len(conv) < 2:
|
||||
key = tuple(msg["value"] for msg in conv)
|
||||
else:
|
||||
key = (conv[0]["value"].strip(), conv[-1]["value"].strip())
|
||||
if key not in seen_keys:
|
||||
seen_keys.add(key)
|
||||
unique_conversations.append(conv)
|
||||
combined_conversations = unique_conversations
|
||||
|
||||
random.shuffle(combined_conversations)
|
||||
|
||||
# Handle codeOnly flag
|
||||
if '--codeOnly' in sys.argv:
|
||||
coding = []
|
||||
noncoding = []
|
||||
for conv in combined_conversations:
|
||||
has_code = any("```" in msg["value"] for msg in conv) or (
|
||||
conv and conv[-1]["from"] == "gpt" and "!newAction(" in conv[-1]["value"]
|
||||
)
|
||||
if has_code:
|
||||
coding.append(conv)
|
||||
else:
|
||||
noncoding.append(conv)
|
||||
logger.info(f"Found {len(coding)} coding examples and {len(noncoding)} non-coding examples.")
|
||||
noncoding_count = int(round(0.15 * len(coding)))
|
||||
if noncoding_count > len(noncoding):
|
||||
noncoding_count = len(noncoding)
|
||||
selected_noncoding = random.sample(noncoding, noncoding_count) if noncoding_count > 0 else []
|
||||
final_conversations = coding + selected_noncoding
|
||||
random.shuffle(final_conversations)
|
||||
combined_conversations = final_conversations
|
||||
|
||||
if '--codeOnly' in sys.argv:
|
||||
df_final = pd.DataFrame({"conversations": combined_conversations})
|
||||
output_parquet = "Andy_conversations_codeOnly.parquet"
|
||||
else:
|
||||
df_final = pd.DataFrame({"conversations": combined_conversations})
|
||||
output_parquet = "Andy_conversations.parquet"
|
||||
|
||||
logger.info(f"Writing output to {output_parquet}")
|
||||
try:
|
||||
df_final.to_parquet(output_parquet, index=False)
|
||||
abs_path = os.path.abspath(output_parquet)
|
||||
logger.info(f"Successfully wrote output to: {abs_path}")
|
||||
except Exception as e:
|
||||
logger.debug(f"Error writing Parquet file: {e}")
|
||||
sys.exit(1)
|
||||
|
||||
logger.info(
|
||||
f"\n"
|
||||
f"--------------------------------------------------------------------------------------\n\n"
|
||||
f"Conversion complete! Processed {total_count} conversations from {len(files)} files. \n"
|
||||
f"Replaced {username_replaced_count} usernames across {total_count} conversations. \n"
|
||||
f"Total amount of usernames to choose from: {len(MINECRAFT_USERNAMES)} (removed {duplicate_count} duplicates) \n"
|
||||
f"--------------------------------------------------------------------------------------\n\n"
|
||||
)
|
||||
|
||||
# Log conversation counts per file.
|
||||
for file, count in file_conversation_counts.items():
|
||||
logger.info(f"File '{file}' contributed {count} conversations.")
|
logs/generate_usernames.py (new file, 1117 lines; file diff suppressed because it is too large)

logs/requirements.txt (new file, 18 lines)
@@ -0,0 +1,18 @@

# Core dependencies for convert.py
pandas>=1.3.0
pandas-image-methods>=0.2.0
transformers>=4.20.0
torch>=1.12.0
tqdm>=4.64.0
pillow>=9.0.0
pyarrow>=10.0.0

# Optional dependencies for enhanced functionality
datasets>=2.0.0
dask[complete]>=2022.7.0
distributed>=2022.7.0

# Additional utility dependencies
numpy>=1.21.0
requests>=2.25.0
@@ -10,7 +10,6 @@
     "express": "^4.18.2",
     "google-translate-api-x": "^10.7.1",
     "groq-sdk": "^0.5.0",
-    "mic": "^2.1.2",
     "minecraft-data": "^3.78.0",
     "mineflayer": "^4.26.0",
     "mineflayer-armor-manager": "^2.0.1",

@@ -33,7 +32,8 @@
     "yargs": "^17.7.2"
   },
   "optionalDependencies": {
-    "naudiodon": "^2.3.6"
+    "naudiodon": "^2.3.6",
+    "mic": "^2.1.2"
   },
   "scripts": {
     "postinstall": "patch-package",
settings.js (22 changed lines)
@@ -34,7 +34,7 @@ const settings = {

    "allow_insecure_coding": false, // allows newAction command and model can write/run code on your computer. enable at own risk
    "allow_vision": false, // allows vision model to interpret screenshots as inputs
    "vision_mode": "prompted", // "off", "prompted", or "always"
    "vision_mode": "off", // "off", "prompted", or "always"
    "blocked_actions" : ["!checkBlueprint", "!checkBlueprintLevel", "!getBlueprint", "!getBlueprintLevel"] , // commands to disable and remove from docs. Ex: ["!setMode"]
    "code_timeout_mins": -1, // minutes code is allowed to run. -1 for no timeout
    "relevant_docs_count": 5, // number of relevant code function docs to select for prompting. -1 for all

@@ -46,15 +46,25 @@ const settings = {

    "narrate_behavior": true, // chat simple automatic actions ('Picking up item!')
    "chat_bot_messages": true, // publicly chat messages to other bots

    "stt_transcription": false, // change this to "true" or "false" depending on if you want STT in Mindcraft, STT needs a GroqCloud API key, can be found here: https://console.groq.com/keys
    "stt_username": "SYSTEM", // Change this to the username the model will respond to.
    "stt_agent_name": "" // Change the name here to whatever your agent is named, if left empty, will send message to all agents.
    "speak": false, // allows all bots to speak through system text-to-speech. works on windows, mac, on linux you need to `apt install espeak`
    "speak": false, // enable text-to-speech
    "stt_transcription": false, // enable speech-to-text transcription
    "stt_username": "SERVER", // username for STT messages
    "stt_agent_name": "", // agent name for STT messages, if empty it will send the STT to all bots

    // STT Audio Detection Settings
    "stt_rms_threshold": 3000, // Raised from 1000 to reduce false triggers
    "stt_silence_duration": 2000, // 2 seconds of silence before stopping
    "stt_min_audio_duration": 0.5, // Minimum audio duration in seconds
    "stt_max_audio_duration": 45, // Maximum audio duration in seconds
    "stt_debug_audio": true, // Enable to see what's happening
    "stt_cooldown_ms": 2000, // Minimum time between recordings
    "stt_speech_threshold_ratio": 0.05, // Much lower - 5% instead of 15%
    "stt_consecutive_speech_samples": 3, // Reduced from 5 to 3

    "log_normal_data": false, // Logs all inputs / outputs without reasoning or vision data
    "log_reasoning_data": false, // Logs only reasoning inputs / outputs
    "log_vision_data": false, // Logs only vision inputs / outputs

}

// these environment variables override certain settings
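// Illustrative sketch only (not Mindcraft's STT implementation): how an RMS level computed from
// 16-bit PCM samples is typically compared against a threshold like "stt_rms_threshold" above.
function isSpeechChunk(pcmBuffer, rmsThreshold = 3000) {
    // Interpret the raw Node.js Buffer as signed 16-bit samples.
    const samples = new Int16Array(pcmBuffer.buffer, pcmBuffer.byteOffset, Math.floor(pcmBuffer.length / 2));
    let sumSquares = 0;
    for (const s of samples) sumSquares += s * s;
    const rms = Math.sqrt(sumSquares / Math.max(1, samples.length));
    return rms >= rmsThreshold; // chunk counts as speech when loud enough
}
// A recorder would then stop after roughly "stt_silence_duration" ms of consecutive non-speech
// chunks, discard clips shorter than "stt_min_audio_duration", and wait "stt_cooldown_ms"
// between recordings.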
|
|
@ -100,7 +100,22 @@ export class Claude {
|
|||
if (typeof res === 'string') {
|
||||
res = res.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
|
||||
}
|
||||
log(JSON.stringify(logMessagesForClaude), res);
|
||||
|
||||
if (imageData) { // If imageData was part of this sendRequest call
|
||||
let visionPromptText = ""; // Attempt to find the text prompt associated with the image
|
||||
if (turns.length > 0) {
|
||||
const lastTurn = messages[messages.length - 1]; // `messages` is strictFormat(turns)
|
||||
if (lastTurn.role === 'user' && Array.isArray(lastTurn.content)) {
|
||||
const textPart = lastTurn.content.find(part => part.type === 'text');
|
||||
if (textPart) visionPromptText = textPart.text;
|
||||
} else if (lastTurn.role === 'user' && typeof lastTurn.content === 'string') {
|
||||
visionPromptText = lastTurn.content;
|
||||
}
|
||||
}
|
||||
logVision(logMessagesForClaude, imageData, res, visionPromptText);
|
||||
} else {
|
||||
log(JSON.stringify(logMessagesForClaude), res);
|
||||
}
|
||||
return res;
|
||||
}
|
||||
|
||||
|
@ -121,7 +136,7 @@ export class Claude {
|
|||
const res = await this.sendRequest(turnsForAPIRequest, systemMessage);
|
||||
|
||||
if (imageBuffer && res) {
|
||||
logVision(turns, imageBuffer, res, systemMessage);
|
||||
logVision([{ role: "system", content: systemMessage }].concat(turns), imageBuffer, res, systemMessage);
|
||||
}
|
||||
return res;
|
||||
}
|
||||
|
|
|
@ -98,7 +98,24 @@ export class DeepSeek {
|
|||
if (typeof res === 'string') {
|
||||
res = res.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
|
||||
}
|
||||
log(JSON.stringify(messages), res);
|
||||
|
||||
if (imageData) { // If imageData was part of this sendRequest call
|
||||
const conversationForLogVision = [{ role: "system", content: systemMessage }].concat(turns);
|
||||
let visionPromptText = "";
|
||||
if (turns.length > 0) {
|
||||
const lastTurn = messages[messages.length - 1]; // `messages` is after image processing
|
||||
if (lastTurn.role === 'user' && Array.isArray(lastTurn.content)) {
|
||||
const textPart = lastTurn.content.find(part => part.type === 'text');
|
||||
if (textPart) visionPromptText = textPart.text;
|
||||
} else if (lastTurn.role === 'user' && typeof lastTurn.content === 'string') {
|
||||
// This case might not happen if image is added, as content becomes array
|
||||
visionPromptText = lastTurn.content;
|
||||
}
|
||||
}
|
||||
logVision(conversationForLogVision, imageData, res, visionPromptText);
|
||||
} else {
|
||||
log(JSON.stringify([{ role: "system", content: systemMessage }].concat(turns)), res);
|
||||
}
|
||||
return res;
|
||||
}
|
||||
|
||||
|
|
|
@@ -80,7 +80,21 @@ export class Gemini {
        if (typeof text === 'string') {
            text = text.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        log(JSON.stringify(originalTurnsForLog), text);

        if (imageData) { // If imageData was part of this sendRequest call
            let visionPromptText = ""; // Attempt to find the text prompt associated with the image
            // `contents` is the array sent to the model
            if (contents.length > 0) {
                const lastUserTurnParts = contents[contents.length - 1].parts;
                if (Array.isArray(lastUserTurnParts)) {
                    const textPart = lastUserTurnParts.find(part => part.text);
                    if (textPart) visionPromptText = textPart.text;
                }
            }
            logVision(originalTurnsForLog, imageData, text, visionPromptText);
        } else {
            log(JSON.stringify(originalTurnsForLog), text);
        }
        return text;
    }

@@ -102,7 +116,7 @@ export class Gemini {
        const text = response.text();
        console.log('Received.');
        if (imageBuffer && text) {
            logVision(turns, imageBuffer, text, prompt);
            logVision([{role: 'system', content: systemMessage}, ...turns], imageBuffer, text, prompt);
        }
        if (!text.includes(stop_seq)) return text;
        const idx = text.indexOf(stop_seq);

@@ -118,6 +132,7 @@ export class Gemini {
        if (typeof res === 'string') {
            res = res.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        // For error cases in vision, still use regular log since there's no image to save
        log(JSON.stringify(loggedTurnsForError), res);
        }
        return res;
@@ -75,7 +75,7 @@ export class GLHF {
        if (typeof finalRes === 'string') {
            finalRes = finalRes.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        log(JSON.stringify(messages), finalRes);
        log(JSON.stringify([{ role: 'system', content: systemMessage }].concat(turns)), finalRes);
        return finalRes;
    }
@@ -87,7 +87,25 @@ export class GPT {
        if (typeof res === 'string') {
            res = res.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        log(JSON.stringify(messages), res);

        if (imageData) {
            const conversationForLogVision = [{ role: "system", content: systemMessage }].concat(turns);
            let visionPromptText = "";
            if (turns.length > 0) {
                const lastTurn = turns[turns.length - 1];
                if (lastTurn.role === 'user') {
                    if (typeof lastTurn.content === 'string') {
                        visionPromptText = lastTurn.content;
                    } else if (Array.isArray(lastTurn.content)) {
                        const textPart = lastTurn.content.find(part => part.type === 'text');
                        if (textPart) visionPromptText = textPart.text;
                    }
                }
            }
            logVision(conversationForLogVision, imageData, res, visionPromptText);
        } else {
            log(JSON.stringify([{ role: "system", content: systemMessage }].concat(turns)), res);
        }
        return res;
    }

@@ -107,7 +125,8 @@ export class GPT {
        const res = await this.sendRequest(imageFormattedTurns, systemMessage);

        if (imageBuffer && res) {
            logVision(original_turns, imageBuffer, res, systemMessage);
            // The conversationHistory for logVision should be the state *before* this specific vision interaction's prompt was added.
            logVision([{ role: "system", content: systemMessage }].concat(original_turns), imageBuffer, res, systemMessage);
        }
        return res;
    }
@@ -56,7 +56,7 @@ export class Grok {
        if (typeof finalResponseText === 'string') {
            finalResponseText = finalResponseText.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        log(JSON.stringify(messages), finalResponseText);
        log(JSON.stringify([{ role: "system", content: systemMessage }].concat(turns)), finalResponseText);
        return finalResponseText;
    }

@@ -76,7 +76,7 @@ export class Grok {
        const res = await this.sendRequest(imageFormattedTurns, systemMessage);

        if (imageBuffer && res) {
            logVision(original_turns, imageBuffer, res, systemMessage);
            logVision([{ role: "system", content: systemMessage }].concat(original_turns), imageBuffer, res, systemMessage);
        }
        return res;
    }
@@ -60,7 +60,7 @@ export class GroqCloudAPI {
        if (typeof responseText === 'string') {
            responseText = responseText.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        log(JSON.stringify(messages), responseText);
        log(JSON.stringify([{ role: "system", content: systemMessage }].concat(turns)), responseText);
        // Original cleaning of <think> tags for the *returned* response (not affecting log)
        responseText = responseText.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
        return responseText;

@@ -75,7 +75,7 @@ export class GroqCloudAPI {
        if (typeof res === 'string') {
            res = res.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        log(JSON.stringify(messages), res);
        log(JSON.stringify([{ role: "system", content: systemMessage }].concat(turns)), res);
        return res;
    }
}

@@ -96,7 +96,7 @@ export class GroqCloudAPI {
        const res = await this.sendRequest(imageMessages, systemMessage);

        if (imageBuffer && res) {
            logVision(original_turns, imageBuffer, res, systemMessage);
            logVision([{ role: "system", content: systemMessage }].concat(original_turns), imageBuffer, res, systemMessage);
        }
        return res;
    }
@@ -25,8 +25,7 @@ export class HuggingFace {
        const prompt = toSinglePrompt(turns, null, stop_seq);
        const model_name = this.model_name || 'meta-llama/Meta-Llama-3-8B';
        const logInputMessages = [{role: 'system', content: systemMessage}, ...turns];
        const input = systemMessage + "
" + prompt;
        const input = systemMessage + "" + prompt;
        const maxAttempts = 5;
        let attempt = 0;
        let finalRes = null;
@@ -116,7 +116,7 @@ export class Hyperbolic {
        if (typeof finalRes === 'string') {
            finalRes = finalRes.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        log(JSON.stringify(messages), finalRes);
        log(JSON.stringify([{ role: 'system', content: systemMessage }].concat(turns)), finalRes);
        return finalRes;
    }
@@ -93,7 +93,22 @@ export class Local {
        if (typeof finalRes === 'string') {
            finalRes = finalRes.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        log(JSON.stringify(messages), finalRes);

        if (imageData) { // If imageData was part of this sendRequest call
            // `messages` here already includes the system prompt and image data
            let visionPromptText = "";
            if (messages.length > 0) {
                const lastTurn = messages[messages.length - 1];
                // For Ollama, content is a string, images is a separate array.
                if (lastTurn.role === 'user' && typeof lastTurn.content === 'string') {
                    visionPromptText = lastTurn.content;
                }
            }
            logVision(messages, imageData, finalRes, visionPromptText);
        } else {
            // messages already includes system prompt if no imageData
            log(JSON.stringify(messages), finalRes);
        }
        return finalRes;
    }
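The comment above points at the key difference for the Local (Ollama) provider: Ollama chat turns keep `content` as a plain string and carry images in a separate `images` array, so the vision prompt is just `lastTurn.content`. A sketch of what such a turn looks like (values are invented):

```js
// Sketch of an Ollama chat turn carrying an image; values are invented.
const ollamaUserTurn = {
    role: 'user',
    content: 'What block am I looking at?',          // plain string, not a content array
    images: ['<base64-encoded screenshot>'],         // images travel in their own array
};

// Hence the Local provider reads visionPromptText straight from lastTurn.content,
// while OpenAI-style providers must search a content array for the { type: 'text' } part.
```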
@@ -48,7 +48,7 @@ export class OpenRouter {
            return 'No response received.';
        }

        const logMessages = [{ role: "system", content: processedSystemMessage }].concat(turns);
        const logMessages = [{ role: "system", content: systemMessage }].concat(turns);

        if (completion.choices[0].finish_reason === 'length') {
            throw new Error('Context length exceeded');
@@ -58,23 +58,15 @@ export class OpenRouter {
            try {
                const reasoning = '<think>\n' + completion.choices[0].message.reasoning + '</think>\n';
                const content = completion.choices[0].message.content;

                // --- VISION LOGGING ---
                if (visionImageBuffer) {
                    logVision(turns, visionImageBuffer, reasoning + "\n" + content, visionMessage);
                } else {
                    log(JSON.stringify(logMessages), reasoning + "\n" + content);
                }
                // Standard logging for text-based responses
                log(JSON.stringify(logMessages), reasoning + "\n" + content);
                res = content;
            } catch {}
        } else {
            try {
                res = completion.choices[0].message.content;
                if (visionImageBuffer) {
                    logVision(turns, visionImageBuffer, res, visionMessage);
                } else {
                    log(JSON.stringify(logMessages), res);
                }
                // Standard logging for text-based responses
                log(JSON.stringify(logMessages), res);
            } catch {
                console.warn("Unable to log due to unknown error!");
            }
@@ -101,12 +93,13 @@ export class OpenRouter {
        return finalRes;
    }

    async sendVisionRequest(messages, systemMessage, imageBuffer) {
        const imageMessages = [...messages];
        imageMessages.push({
    async sendVisionRequest(original_turns, systemMessage, imageBuffer) { // Renamed messages to original_turns
        const imageFormattedTurns = [...original_turns];
        imageFormattedTurns.push({
            role: "user",
            content: [
                { type: "text", text: systemMessage },
                // The systemMessage is used as the text prompt accompanying the image here
                { type: "text", text: systemMessage },
                {
                    type: "image_url",
                    image_url: {
@@ -116,10 +109,17 @@ export class OpenRouter {
            ]
        });

        // sendVisionRequest formats its own message array; sendRequest here should not process new imageData.
        // Pass systemMessage and stop_seq as originally intended by sendRequest.
        return this.sendRequest(imageMessages, systemMessage, null, stop_seq);

        // Pass the main systemMessage to sendRequest, as it expects a system prompt.
        // The image-specific prompt is part of imageFormattedTurns.
        const res = await this.sendRequest(imageFormattedTurns, systemMessage, null, stop_seq);

        if (imageBuffer && res) {
            // For logVision, conversationHistory should be the original turns + system prompt.
            // The visionMessage (text prompt for the image) is systemMessage in this context.
            logVision([{ role: "system", content: systemMessage }].concat(original_turns), imageBuffer, res, systemMessage);
        }

        return res;
    }

    async embed(text) {
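The pushed vision turn is cut off by the diff context above; for orientation, an OpenAI-compatible (OpenRouter-style) image turn generally has the shape below. The MIME type and base64 encoding shown here are assumptions, not lifted from the file:

```js
// Sketch of the full vision turn; encoding details are assumptions.
imageFormattedTurns.push({
    role: 'user',
    content: [
        { type: 'text', text: systemMessage },   // the text prompt that accompanies the image
        {
            type: 'image_url',
            image_url: {
                url: `data:image/jpeg;base64,${imageBuffer.toString('base64')}`,
            },
        },
    ],
});
```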
@@ -364,6 +364,9 @@ export class Prompter {
                console.log("Generated response:", generation);
                await this._saveLog(prompt, messages, generation, 'conversation');

                // Remove the incorrect logVision call here since sendRequest should handle it
                // The model's sendRequest method will call logVision if imageData was provided

            } catch (error) {
                console.error('Error during message generation or file writing:', error);
                continue;
@@ -465,26 +468,15 @@ export class Prompter {
    }

    async _saveLog(prompt, messages, generation, tag) {
        // NEW LOGIC STARTS
        switch (tag) {
            case 'conversation':
            case 'coding': // Assuming coding logs fall under normal data
            case 'memSaving':
                if (!settings.log_normal_data) return;
                break;
            // Add case for 'vision' if prompter.js starts logging vision prompts/responses via _saveLog
            // case 'vision':
            //     if (!settings.log_vision_data) return;
            //     break;
            default:
                // If it's an unknown tag, perhaps log it if general logging is on, or ignore.
                // For safety, let's assume if it's not specified, it doesn't get logged unless a general flag is on.
                // However, the goal is to use specific flags. So, if a new tag appears, this logic should be updated.
                // For now, if it doesn't match known tags that map to a setting, it won't log.
                return;
        }
        // NEW LOGIC ENDS

        const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
        let logEntry;
        let task_id = this.agent.task.task_id;
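The switch gates each `_saveLog` tag on a settings flag: `conversation`, `coding`, and `memSaving` all fall under `log_normal_data`, and anything unrecognized is dropped. The same gating could also be written as a lookup table; shown only as an illustrative alternative, not what the diff does:

```js
// Illustrative alternative only: the diff uses a switch, not this table.
const TAG_TO_FLAG = {
    conversation: 'log_normal_data',
    coding: 'log_normal_data',       // coding logs are treated as normal data
    memSaving: 'log_normal_data',
    // vision: 'log_vision_data',    // if _saveLog ever handles vision logs directly
};

function shouldSaveLog(tag) {
    const flag = TAG_TO_FLAG[tag];
    return Boolean(flag && settings[flag]);   // unknown tags are never logged
}
```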
@@ -511,6 +503,4 @@ export class Prompter {
        logFile = path.join(logDir, logFile);
        await fs.appendFile(logFile, String(logEntry), 'utf-8');
    }

}
@@ -85,7 +85,24 @@ export class Qwen {
        if (typeof res === 'string') {
            res = res.replace(/<thinking>/g, '<think>').replace(/<\/thinking>/g, '</think>');
        }
        log(JSON.stringify(messages), res);

        if (imageData) { // If imageData was part of this sendRequest call
            // `messages` here includes system prompt and image data
            let visionPromptText = "";
            if (messages.length > 0) {
                const lastTurn = messages[messages.length - 1];
                if (lastTurn.role === 'user' && Array.isArray(lastTurn.content)) {
                    const textPart = lastTurn.content.find(part => part.text);
                    if (textPart) visionPromptText = textPart.text;
                } else if (lastTurn.role === 'user' && typeof lastTurn.content === 'string') {
                    visionPromptText = lastTurn.content;
                }
            }
            logVision(messages, imageData, res, visionPromptText);
        } else {
            // messages already includes system prompt if no imageData
            log(JSON.stringify(messages), res);
        }
        return res;
    }
@@ -117,4 +134,4 @@ export class Qwen {
            throw new Error('Max retries reached, request failed.');
        }

    }
}
@@ -1,7 +1,5 @@
import settings from '../../settings.js';
import { GroqCloudTTS } from '../models/groq.js';
// import portAudio from 'naudiodon'; // Original static import
// const { AudioIO, SampleFormat16Bit } = portAudio; // Original destructuring
import wav from 'wav';
import fs from 'fs';
import path from 'path';
@@ -12,37 +10,59 @@ import { getIO, getAllInGameAgentNames } from '../server/mind_server.js';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// --- Conditional Naudiodon Import ---
// Import the audio libraries conditionally
let portAudio;
let AudioIO;
let SampleFormat16Bit;
let mic; // For mic library
let activeAudioLibrary = null; // 'naudiodon' or 'mic'

(async () => {
    try {
        const naudiodonModule = await import('naudiodon');
        portAudio = naudiodonModule.default; // CommonJS modules often export functionality on 'default' when imported into ES modules
        portAudio = naudiodonModule.default;
        if (portAudio && typeof portAudio.AudioIO === 'function' && typeof portAudio.SampleFormat16Bit !== 'undefined') {
            AudioIO = portAudio.AudioIO;
            SampleFormat16Bit = portAudio.SampleFormat16Bit;
            activeAudioLibrary = 'naudiodon';
            console.log('[STT] naudiodon loaded successfully.');
        } else if (naudiodonModule.AudioIO && typeof naudiodonModule.SampleFormat16Bit !== 'undefined') {
            // Fallback if 'default' is not used and properties are directly on the module
            AudioIO = naudiodonModule.AudioIO;
            SampleFormat16Bit = naudiodonModule.SampleFormat16Bit;
            portAudio = naudiodonModule; // Assign the module itself to portAudio for consistency if needed elsewhere
            portAudio = naudiodonModule;
            activeAudioLibrary = 'naudiodon';
            console.log('[STT] naudiodon loaded successfully (direct properties).');
        }
        else {
        } else {
            throw new Error('AudioIO or SampleFormat16Bit not found in naudiodon module exports.');
        }
    } catch (err) {
        console.warn(`[STT] Failed to load naudiodon, Speech-to-Text will be disabled. Error: ${err.message}`);
        console.warn(`[STT] Failed to load naudiodon. Error: ${err.message}`);
        portAudio = null;
        AudioIO = null;
        SampleFormat16Bit = null;

        // Attempt to load mic if naudiodon fails
        try {
            const micModule = await import('mic');
            mic = micModule.default; // Assuming mic is also a CommonJS module typically
            if (mic && typeof mic === 'function') { // mic is often a constructor function
                activeAudioLibrary = 'mic';
                console.log('[STT] mic loaded successfully as an alternative.');
            } else if (micModule.Mic) { // Some modules might export it as Mic
                mic = micModule.Mic;
                activeAudioLibrary = 'mic';
                console.log('[STT] mic (Mic) loaded successfully as an alternative.');
            }
            else {
                throw new Error('Mic constructor not found in mic module exports.');
            }
        } catch (micErr) {
            console.warn(`[STT] Failed to load mic as well. Speech-to-Text will be disabled. Error: ${micErr.message}`);
            mic = null;
            activeAudioLibrary = null;
        }
    }
    // Initialize TTS after attempting to load naudiodon
    // Initialize TTS after attempting to load audio libraries
    initTTS();
})();
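The IIFE above implements a two-stage optional dependency: try the native `naudiodon` backend, fall back to the pure-JS `mic` package, and remember which one (if either) loaded so the rest of the module can branch on `activeAudioLibrary`. A condensed sketch of that pattern (function name and return shape are illustrative):

```js
// Condensed sketch of the naudiodon -> mic fallback chain.
async function pickAudioBackend() {
    try {
        const naudiodonModule = await import('naudiodon');
        const naudiodon = naudiodonModule.default ?? naudiodonModule;   // CJS default interop
        if (typeof naudiodon.AudioIO === 'function') {
            return { name: 'naudiodon', lib: naudiodon };
        }
        throw new Error('AudioIO not exported');
    } catch {
        try {
            const micModule = await import('mic');
            const Mic = micModule.default ?? micModule.Mic;             // mic exports a constructor
            if (typeof Mic === 'function') return { name: 'mic', lib: Mic };
        } catch {
            // neither backend could be loaded
        }
    }
    return { name: null, lib: null };   // caller disables STT in this case
}
```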
@@ -59,25 +79,35 @@ for (const file of leftover) {
    }
}

// Configuration
const RMS_THRESHOLD = 500; // Lower threshold for faint audio
const SILENCE_DURATION = 2000; // 2 seconds of silence after speech => stop
// Configuration from settings
const RMS_THRESHOLD = settings.stt_rms_threshold || 8000;
const SILENCE_DURATION = settings.stt_silence_duration || 2000;
const MIN_AUDIO_DURATION = settings.stt_min_audio_duration || 0.5;
const MAX_AUDIO_DURATION = settings.stt_max_audio_duration || 15;
const DEBUG_AUDIO = settings.stt_debug_audio || false;
const COOLDOWN_MS = settings.stt_cooldown_ms || 2000;
const SPEECH_THRESHOLD_RATIO = settings.stt_speech_threshold_ratio || 0.15;
const CONSECUTIVE_SPEECH_SAMPLES = settings.stt_consecutive_speech_samples || 5;
const SAMPLE_RATE = 16000;
const BIT_DEPTH = 16;
const STT_USERNAME = settings.stt_username || "SERVER"; // Name that appears as sender
const STT_AGENT_NAME = settings.stt_agent_name || ""; // If blank, broadcast to all
const STT_USERNAME = settings.stt_username || "SERVER";
const STT_AGENT_NAME = settings.stt_agent_name || "";

// Guards to prevent multiple overlapping recordings
let isRecording = false; // Ensures only one recordAndTranscribeOnce at a time
let sttRunning = false; // Ensures continuousLoop is started only once
let isRecording = false;
let sttRunning = false;
let sttInitialized = false;
let lastRecordingEndTime = 0;

/**
 * Records one session, transcribes, and sends to MindServer as a chat message
 */
async function recordAndTranscribeOnce() {
    // Check cooldown period
    const timeSinceLastRecording = Date.now() - lastRecordingEndTime;
    if (timeSinceLastRecording < COOLDOWN_MS) {
        return null;
    }

    // If another recording is in progress, just skip
    if (isRecording) {
        console.log("[STT] Another recording is still in progress; skipping new record attempt.");
        return null;
    }
    isRecording = true;
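All of the hard-coded tuning constants now come from `settings.js`, with the fallback defaults shown above. A sketch of the corresponding settings block using those defaults; whether every key ships in the stock `settings.js` is an assumption:

```js
// settings.js (sketch) — STT knobs with the defaults used above; key presence is assumed.
const settings = {
    // ...existing settings...
    "stt_transcription": true,            // master switch for speech-to-text
    "stt_username": "SERVER",             // name shown as the message sender
    "stt_agent_name": "",                 // blank = broadcast to all agents
    "stt_rms_threshold": 8000,            // base RMS level treated as speech
    "stt_silence_duration": 2000,         // ms of post-speech silence before stopping
    "stt_min_audio_duration": 0.5,        // seconds; shorter clips are discarded
    "stt_max_audio_duration": 15,         // seconds; hard cap per recording
    "stt_debug_audio": false,             // extra console diagnostics
    "stt_cooldown_ms": 2000,              // minimum gap between recordings
    "stt_speech_threshold_ratio": 0.15,   // fraction of samples that must look like speech
    "stt_consecutive_speech_samples": 5,  // consecutive loud samples counted as speech
};
export default settings;
```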
@@ -89,33 +119,37 @@ async function recordAndTranscribeOnce() {
        bitDepth: BIT_DEPTH
    });

    // This is where AudioIO is crucial
    if (!AudioIO || !SampleFormat16Bit) {
        console.warn("[STT] AudioIO or SampleFormat16Bit not available. Cannot record audio.");
        isRecording = false;
        return null;
    if (!activeAudioLibrary) {
        console.warn("[STT] No audio recording library available.");
        isRecording = false;
        return null;
    }

    const ai = new AudioIO({
        inOptions: {
            channelCount: 1,
            sampleFormat: SampleFormat16Bit,
            sampleRate: SAMPLE_RATE,
            deviceId: -1,
            closeOnError: true
        }
    });

    let audioInterface;
    let audioStream;
    let recording = true;
    let hasHeardSpeech = false;
    let silenceTimer = null;
    let finished = false; // Guard to ensure final processing is done only once
    let maxDurationTimer = null;
    let finished = false;

    // Smart speech detection variables
    let speechSampleCount = 0;
    let totalSampleCount = 0;
    let consecutiveSpeechSamples = 0;
    let speechLevels = [];
    let averageSpeechLevel = 0;
    let adaptiveThreshold = RMS_THRESHOLD;

    // Helper to reset silence timer
    function resetSilenceTimer() {
        if (silenceTimer) clearTimeout(silenceTimer);
        if (hasHeardSpeech) {
            silenceTimer = setTimeout(() => stopRecording(), SILENCE_DURATION);
        // Only start silence timer if actual speech has been detected
        if (hasHeardSpeech && recording) { // also check `recording` to prevent timer after explicit stop
            silenceTimer = setTimeout(() => {
                if (DEBUG_AUDIO) console.log('[STT] Silence timeout reached, stopping recording.');
                stopRecording();
            }, SILENCE_DURATION);
        }
    }
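The `bitDepth: BIT_DEPTH` fragment at the top of this hunk is the tail of the `wav.FileWriter` options; for context, such a writer is typically constructed as below (the output path is illustrative):

```js
import wav from 'wav';

// Sketch: stream incoming PCM chunks into a 16 kHz, 16-bit mono WAV file.
const SAMPLE_RATE = 16000;
const BIT_DEPTH = 16;
const outFile = '/tmp/stt-recording.wav';   // illustrative path
const fileWriter = new wav.FileWriter(outFile, {
    channels: 1,
    sampleRate: SAMPLE_RATE,
    bitDepth: BIT_DEPTH,
});
// Later: audioStream.on('data', chunk => fileWriter.write(chunk));
// and fileWriter.end() once recording stops, which fires the 'finish' handler further down.
```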
@@ -123,14 +157,81 @@ async function recordAndTranscribeOnce() {
    function stopRecording() {
        if (!recording) return;
        recording = false;
        ai.quit();
        fileWriter.end();

        if (silenceTimer) clearTimeout(silenceTimer);
        if (maxDurationTimer) clearTimeout(maxDurationTimer);

        if (activeAudioLibrary === 'naudiodon' && audioInterface) {
            try {
                audioInterface.quit();
            } catch (err) {
                // Silent error handling
            }
        } else if (activeAudioLibrary === 'mic' && audioInterface) {
            try {
                audioInterface.stop();
            } catch (err) {
                // Silent error handling
            }
        }

        if (fileWriter && !fileWriter.closed) {
            fileWriter.end();
        }
    }

    // We wrap everything in a promise so we can await the transcription
    return new Promise((resolve, reject) => {
        // Attach event handlers
        ai.on('data', (chunk) => {
        // Set maximum recording duration timer
        maxDurationTimer = setTimeout(() => {
            stopRecording();
        }, MAX_AUDIO_DURATION * 1000);

        if (activeAudioLibrary === 'naudiodon') {
            if (!AudioIO || !SampleFormat16Bit) {
                isRecording = false;
                return reject(new Error("Naudiodon not available"));
            }
            audioInterface = new AudioIO({
                inOptions: {
                    channelCount: 1,
                    sampleFormat: SampleFormat16Bit,
                    sampleRate: SAMPLE_RATE,
                    deviceId: -1,
                    closeOnError: true
                }
            });
            audioStream = audioInterface;

            audioStream.on('error', (err) => {
                cleanupAndResolve(null);
            });

        } else if (activeAudioLibrary === 'mic') {
            audioInterface = new mic({
                rate: String(SAMPLE_RATE),
                channels: '1',
                bitwidth: String(BIT_DEPTH),
                endian: 'little',
                encoding: 'signed-integer',
                device: 'default',
                debug: false // Don't use mic's debug, we have our own
            });
            audioStream = audioInterface.getAudioStream();

            audioStream.on('error', (err) => {
                cleanupAndResolve(null);
            });

            audioStream.on('processExitComplete', () => {
                // Silent
            });
        }

        // Common event handling for data (applies to both naudiodon ai and micStream)
        audioStream.on('data', (chunk) => {
            if (!recording) return;

            fileWriter.write(chunk);

            // Calculate RMS for threshold detection
@@ -141,40 +242,65 @@ async function recordAndTranscribeOnce() {
                sumSquares += sample * sample;
            }
            const rms = Math.sqrt(sumSquares / sampleCount);
            totalSampleCount++;

            // If RMS passes threshold, we've heard speech
            if (rms > RMS_THRESHOLD) {
                if (!hasHeardSpeech) {
                    hasHeardSpeech = true;
            // Simplified speech detection logic
            if (rms > adaptiveThreshold) {
                speechSampleCount++;
                consecutiveSpeechSamples++;
                speechLevels.push(rms);

                // Update adaptive threshold based on actual speech levels
                if (speechLevels.length > 10) {
                    averageSpeechLevel = speechLevels.reduce((a, b) => a + b, 0) / speechLevels.length;
                    adaptiveThreshold = Math.max(RMS_THRESHOLD, averageSpeechLevel * 0.4); // 40% of average speech level
                }
                resetSilenceTimer();

                // Trigger speech detection much more easily
                if (!hasHeardSpeech) {
                    // Either consecutive samples OR sufficient ratio
                    const speechRatio = speechSampleCount / totalSampleCount;
                    if (consecutiveSpeechSamples >= 3 || speechRatio >= 0.05) { // Much lower thresholds
                        hasHeardSpeech = true;
                        console.log(`[STT] Speech detected! (consecutive: ${consecutiveSpeechSamples}, ratio: ${(speechRatio * 100).toFixed(1)}%)`);
                    }
                }

                if (hasHeardSpeech) {
                    resetSilenceTimer();
                }
            } else {
                consecutiveSpeechSamples = 0; // Reset consecutive counter
            }
        });

        ai.on('error', (err) => {
            console.error("[STT] AudioIO error:", err);
            cleanupListeners();
            // Don't reject here, as continuousLoop should continue. Resolve with null.
            resolve(null);
        });

        fileWriter.on('finish', async () => {
            if (finished) return;
            finished = true;
            lastRecordingEndTime = Date.now();

            try {
                // Check audio duration
                const stats = fs.statSync(outFile);
                const headerSize = 44; // standard WAV header size
                const headerSize = 44;
                const dataSize = stats.size - headerSize;
                const duration = dataSize / (SAMPLE_RATE * (BIT_DEPTH / 8));
                if (duration < 2.75) {
                    console.log("[STT] Audio too short (<2.75s); discarding.");
                    fs.unlink(outFile, () => {});
                    cleanupListeners();
                    return resolve(null);

                const speechPercentage = totalSampleCount > 0 ? (speechSampleCount / totalSampleCount) * 100 : 0;

                if (DEBUG_AUDIO) {
                    console.log(`[STT] Audio processed: ${duration.toFixed(2)}s, speech detected: ${hasHeardSpeech}, speech %: ${speechPercentage.toFixed(1)}%`);
                }

                if (duration < MIN_AUDIO_DURATION) {
                    cleanupAndResolve(null);
                    return;
                }

                if (!hasHeardSpeech || speechPercentage < 3) { // Lowered from 15% to 3%
                    cleanupAndResolve(null);
                    return;
                }

                // Transcribe
                const groqTTS = new GroqCloudTTS();
                const text = await groqTTS.transcribe(outFile, {
                    model: "distil-whisper-large-v3-en",
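The `sumSquares` accumulation above is the middle of an RMS computation over 16-bit PCM samples; the adaptive threshold then floors at `RMS_THRESHOLD` and tracks 40% of the running average level of samples already classified as speech. A self-contained sketch of the RMS step:

```js
// Sketch: RMS of a chunk of 16-bit little-endian PCM samples (what sumSquares above computes).
function computeRms(chunk) {
    const sampleCount = Math.floor(chunk.length / 2);   // 2 bytes per 16-bit sample
    if (sampleCount === 0) return 0;
    let sumSquares = 0;
    for (let i = 0; i < sampleCount; i++) {
        const sample = chunk.readInt16LE(i * 2);
        sumSquares += sample * sample;
    }
    return Math.sqrt(sumSquares / sampleCount);
}

// The adaptive threshold never drops below RMS_THRESHOLD, but rises toward
// 40% of the average RMS of samples already counted as speech.
```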
@@ -184,82 +310,90 @@ async function recordAndTranscribeOnce() {
                    temperature: 0.0
                });

                fs.unlink(outFile, () => {}); // cleanup WAV file

                // Basic check for empty or whitespace
                if (!text || !text.trim()) {
                    console.log("[STT] Transcription empty; discarding.");
                    cleanupListeners();
                    return resolve(null);
                    cleanupAndResolve(null);
                    return;
                }

                // Heuristic checks to determine if the transcription is genuine

                // 1. Ensure at least one alphabetical character
                // Enhanced validation
                if (!/[A-Za-z]/.test(text)) {
                    console.log("[STT] Transcription has no letters; discarding.");
                    cleanupListeners();
                    return resolve(null);
                    cleanupAndResolve(null);
                    return;
                }

                // 2. Check for gibberish repeated sequences
                if (/([A-Za-z])\1{3,}/.test(text)) {
                    console.log("[STT] Transcription looks like gibberish; discarding.");
                    cleanupListeners();
                    return resolve(null);
                    cleanupAndResolve(null);
                    return;
                }

                // Filter out common false positives
                const falsePositives = ["thank you", "thanks", "bye", ".", ",", "?", "!", "um", "uh", "hmm"];
                if (falsePositives.includes(text.trim().toLowerCase())) {
                    cleanupAndResolve(null);
                    return;
                }

                // 3. Check transcription length, with allowed greetings
                const letterCount = text.replace(/[^A-Za-z]/g, "").length;
                const normalizedText = text.trim().toLowerCase();
                const allowedGreetings = new Set(["hi", "hello", "greetings", "hey"]);
                const allowedGreetings = new Set(["hi", "hello", "hey", "yes", "no", "okay"]);

                if (letterCount < 8 && !allowedGreetings.has(normalizedText)) {
                    console.log("[STT] Transcription too short and not an allowed greeting; discarding.");
                    cleanupListeners();
                    return resolve(null);
                if (letterCount < 2 && !allowedGreetings.has(normalizedText)) {
                    cleanupAndResolve(null);
                    return;
                }

                console.log("[STT] Transcription:", text);
                // Only log successful transcriptions
                console.log("[STT] Transcribed:", text);

                // Format message so it looks like: "[SERVER] message"
                const finalMessage = `[${STT_USERNAME}] ${text}`;

                // If STT_AGENT_NAME is empty, broadcast to all agents
                if (!STT_AGENT_NAME.trim()) {
                    const agentNames = getAllInGameAgentNames(); // from mind_server
                    const agentNames = getAllInGameAgentNames();
                    for (const agentName of agentNames) {
                        getIO().emit('send-message', agentName, finalMessage);
                    }
                } else {
                    // Otherwise, send only to the specified agent
                    getIO().emit('send-message', STT_AGENT_NAME, finalMessage);
                }

                cleanupListeners();
                resolve(text);
                cleanupAndResolve(text);
            } catch (err) {
                console.error("[STT] Error during transcription or sending message:", err);
                fs.unlink(outFile, () => {}); // Attempt cleanup even on error
                cleanupListeners();
                reject(err); // Propagate error for continuousLoop to catch
                cleanupAndResolve(null);
            }
        });

        ai.start();
        function cleanupAndResolve(result) {
            if (silenceTimer) clearTimeout(silenceTimer);
            if (maxDurationTimer) clearTimeout(maxDurationTimer);

            try {
                if (fs.existsSync(outFile)) {
                    fs.unlinkSync(outFile);
                }
            } catch (err) {
                // Silent cleanup
            }

        function cleanupListeners() {
            if (ai && typeof ai.removeAllListeners === 'function') {
                ai.removeAllListeners('data');
                ai.removeAllListeners('error');
            if (audioStream && typeof audioStream.removeAllListeners === 'function') {
                audioStream.removeAllListeners();
            }
            if (fileWriter && typeof fileWriter.removeAllListeners === 'function') {
                fileWriter.removeAllListeners('finish');
                fileWriter.removeAllListeners();
            }
            if (silenceTimer) clearTimeout(silenceTimer);

            // release lock
            isRecording = false;
            resolve(result);
        }

        // Start recording
        try {
            if (activeAudioLibrary === 'naudiodon') {
                audioInterface.start();
            } else if (activeAudioLibrary === 'mic') {
                audioInterface.start();
            }
        } catch (err) {
            cleanupAndResolve(null);
        }
    });
}
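Before a transcription is forwarded to the MindServer, it has to pass a chain of heuristics: non-empty, contains letters, no long repeated-character runs, not a known Whisper false positive, and at least two letters unless it is an allowed short reply. A compact sketch of those checks as a single predicate, with the thresholds from the diff:

```js
// Sketch: accept or reject a transcription using the heuristics from the diff.
const FALSE_POSITIVES = ["thank you", "thanks", "bye", ".", ",", "?", "!", "um", "uh", "hmm"];
const ALLOWED_GREETINGS = new Set(["hi", "hello", "hey", "yes", "no", "okay"]);

function isUsableTranscription(text) {
    if (!text || !text.trim()) return false;                    // empty or whitespace
    if (!/[A-Za-z]/.test(text)) return false;                   // no letters at all
    if (/([A-Za-z])\1{3,}/.test(text)) return false;            // "aaaa"-style gibberish
    const normalized = text.trim().toLowerCase();
    if (FALSE_POSITIVES.includes(normalized)) return false;     // common Whisper hallucinations
    const letterCount = text.replace(/[^A-Za-z]/g, "").length;
    return letterCount >= 2 || ALLOWED_GREETINGS.has(normalized);
}
```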
@@ -268,56 +402,69 @@ async function recordAndTranscribeOnce() {
 * Runs recording sessions sequentially, so only one at a time
 */
async function continuousLoop() {
    // This check is now more critical as AudioIO might not be available
    if (!AudioIO) {
        console.warn("[STT] AudioIO not available. STT continuous loop cannot start.");
        sttRunning = false; // Ensure this is marked as not running
    if (!activeAudioLibrary) {
        console.warn("[STT] No audio recording library available. STT disabled.");
        sttRunning = false;
        return;
    }

    while (sttRunning) { // Check sttRunning to allow loop to terminate if STT is disabled later
    console.log("[STT] Speech-to-text active (Groq Whisper)");
    let consecutiveErrors = 0;
    const maxConsecutiveErrors = 3;

    while (sttRunning) {
        try {
            await recordAndTranscribeOnce();
            const result = await recordAndTranscribeOnce();
            consecutiveErrors = 0;

            // Longer delay between recordings
            if (sttRunning) {
                await new Promise(res => setTimeout(res, 1000));
            }
        } catch (err) {
            // Errors from recordAndTranscribeOnce (like transcription errors) are caught here
            console.error("[STT Error in continuousLoop]", err);
            // Potentially add a longer delay or a backoff mechanism if errors are persistent
        }
        // short gap, but only if stt is still supposed to be running
        if (sttRunning) {
            await new Promise(res => setTimeout(res, 1000));
            consecutiveErrors++;

            if (consecutiveErrors >= maxConsecutiveErrors) {
                console.error("[STT] Too many errors, stopping STT.");
                sttRunning = false;
                break;
            }

            if (sttRunning) {
                const delay = 3000 * consecutiveErrors;
                await new Promise(res => setTimeout(res, delay));
            }
        }
    }
    console.log("[STT] Continuous loop ended.");
}

export function initTTS() {
    if (!settings.stt_transcription) {
        console.log("[STT] STT transcription is disabled in settings.");
        sttRunning = false; // Ensure it's marked as not running
        sttRunning = false;
        return;
    }

    // This check is crucial: if AudioIO (from naudiodon) wasn't loaded, STT cannot run.
    if (!AudioIO) {
        console.warn("[STT] AudioIO is not available (naudiodon might have failed to load). STT functionality cannot be initialized.");
        sttRunning = false; // Ensure sttRunning is false if it was somehow true
    if (!activeAudioLibrary) {
        console.warn("[STT] No audio recording library available (naudiodon or mic failed to load). STT functionality cannot be initialized.");
        sttRunning = false;
        return;
    }

    if (sttRunning) {
        console.log("[STT] STT loop already running; skipping re-init.");
    if (sttRunning || sttInitialized) {
        console.log("[STT] STT already initialized; skipping re-init.");
        return;
    }

    console.log("[STT] Initializing STT...");
    sttRunning = true; // Set before starting the loop
    sttRunning = true;
    sttInitialized = true;

    continuousLoop().catch((err) => {
        console.error("[STT] continuousLoop crashed unexpectedly:", err);
        sttRunning = false; // Mark as not running if it crashes
    });
    setTimeout(() => {
        continuousLoop().catch((err) => {
            console.error("[STT] continuousLoop crashed unexpectedly:", err);
            sttRunning = false;
            sttInitialized = false;
        });
    }, 2000);
}

// Moved initTTS() call into the async IIFE after naudiodon import attempt.
// initTTS();
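The reworked loop replaces the old fixed one-second retry with a consecutive-error counter, a linearly growing delay, and a hard stop after three failures in a row. A distilled sketch of that control flow (the `workOnce` callback stands in for `recordAndTranscribeOnce`):

```js
// Distilled sketch of the loop's error handling; workOnce stands in for recordAndTranscribeOnce.
async function runSttLoop(workOnce) {
    const maxConsecutiveErrors = 3;
    let consecutiveErrors = 0;
    let running = true;
    while (running) {
        try {
            await workOnce();
            consecutiveErrors = 0;
            await new Promise(res => setTimeout(res, 1000));                        // gap between recordings
        } catch (err) {
            consecutiveErrors++;
            if (consecutiveErrors >= maxConsecutiveErrors) {
                console.error('[STT] Too many errors, stopping.');
                running = false;                                                    // mirrors sttRunning = false
                break;
            }
            await new Promise(res => setTimeout(res, 3000 * consecutiveErrors));    // linear backoff
        }
    }
}
```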