This commit addresses an issue where Speech-to-Text (STT) functionality would be disabled if the `naudiodon` package failed to build during installation.
The `src/process/tts_process.js` file (which handles STT) has been modified to:
1. Attempt to load `naudiodon` first.
2. If `naudiodon` fails to load, attempt to load the `mic` package as an alternative.
3. The audio recording logic has been adapted to work with both `naudiodon` and `mic` APIs.
Additionally, `package.json` has been updated to move `mic` from `dependencies` to `optionalDependencies`, making its behavior consistent with `naudiodon`.
This change provides a fallback mechanism for audio recording, increasing the robustness of the STT feature across different platforms and environments where `naudiodon` might have build issues.
This commit addresses build failures related to the `naudiodon` package encountered during `npm install`.
Changes Made:
1. **`naudiodon` as Optional Dependency:**
* Moved `naudiodon` from `dependencies` to `optionalDependencies` in `package.json`. This allows `npm install` to succeed even if `naudiodon` fails to build on your system, preventing the installation from being blocked.
2. **Graceful Handling of `naudiodon` Absence:**
* Modified `src/process/tts_process.js` to dynamically import `naudiodon`.
* If `naudiodon` is not found or fails to load, the Speech-to-Text (STT) functionality that relies on it for microphone input will be gracefully disabled.
* The application will log a warning in this case but will otherwise start and run normally.
3. **Documentation of Prerequisites:**
* Updated `README.md` with a new section detailing the system prerequisites for building `naudiodon` successfully on Linux, Windows, and macOS. This includes commands for installing necessary C++ compilers, development tools, and PortAudio libraries.
* Added notes to the README explaining that `naudiodon` is used for STT and is optional.
**Summary of Approach:**
The primary goal was to resolve the `npm install` error caused by `naudiodon`. By making it an optional dependency and ensuring the application handles its absence, you can now install and run the core application without needing to immediately troubleshoot `naudiodon` build issues. If you wish to use the STT feature, you can refer to the updated README for guidance on installing the necessary system dependencies for `naudiodon`.
**Note on Your Feedback (STT Alternatives):**
You expressed a desire for STT to work even without `naudiodon`, possibly using alternative packages. While this commit ensures the application no longer errors out due to `naudiodon` and makes STT optionally functional, it does not replace `naudiodon` with an alternative for STT audio input. Exploring and integrating alternative cross-platform audio input libraries for STT would be a separate task.
This set of changes should improve the installation experience across different platforms.
This commit addresses several aspects of the vision logging system:
1. **Always Active Vision Logging:**
* Ensures that when `settings.vision_mode` is 'always', a vision log entry is created each time a message is handled.
* The full conversation history is now correctly formatted into a JSON string and passed as the `visionMessage` (4th argument) to `logger.logVision`. This ensures the entire input context is logged for these "always active" vision captures, similar to 'normal' and 'reasoning' text logs.
* I implemented this by adding a `formatHistoryForVisionLog` helper function to `Agent.js` and calling it within `handleMessage` to prepare the history string. This approach was chosen due to difficulties in directly modifying `logger.js` to always use its internal full history formatter.
2. **Comments:**
* I added detailed comments in `agent.js` to explain the `formatHistoryForVisionLog` helper function and the logic for "always active" vision logging, including the rationale for the approach.
* I clarified how `latestScreenshotPath` is managed in relation to "always active" logs and other history entries.
3. **General Code Health:**
* I ensured necessary imports (`fs`, `path`, `logger`) are present in `agent.js`.
I tested the changes by simulating the "always active" vision scenario and verifying that `logger.logVision` was called with the correct arguments, including the complete formatted history string.
- I implemented universal logging for all API providers in src/models/, ensuring calls to logger.js for text and vision logs.
- I added transformation of <thinking>...</thinking> tags to <think>...</think> in all provider responses before logging, for correct categorization by logger.js.
- I standardized the input to logger.js's log() function to be a JSON string of the message history (system prompt + turns).
- I removed unnecessary comments from most API provider files, settings.js, and prompter.js to improve readability.
Note: I encountered some issues that prevented final comment cleanup for qwen.js, vllm.js, and logger.js. Their core logging functionality and tag transformations (for qwen.js and vllm.js) are in place from previous steps.
I implemented comprehensive logging across all API providers in src/models/ using logger.js.
This includes:
- Adding log() and logVision() calls to each provider (Claude, DeepSeek, Gemini, GLHF, GPT, Grok, Groq, HuggingFace, Hyperbolic, Local, Mistral, Novita, Qwen, Replicate, VLLM).
- Ensuring logging respects 'log_normal_data', 'log_reasoning_data', and 'log_vision_data' flags in settings.js, which I added.
- I deprecated 'log_all_prompts' in settings.js and updated prompter.js accordingly.
I refactored openrouter.js and prompter.js:
- I removed the experimental reasoning prompt functionality ($REASONING) from openrouter.js.
- I removed a previously implemented (and then reverted) personality injection feature ($PERSONALITY) from prompter.js, openrouter.js, and profile files.
I had to work around some issues:
- I replaced the full file content for glhf.js and hyperbolic.js due to persistent errors with applying changes.
Something I still need to do:
- Based on your latest feedback, model responses containing <thinking>...</thinking> tags need to be transformed to <think>...</think> tags before being passed to logger.js to ensure they are categorized into reasoning_logs.csv. This change is not included in this update.
- Unified logging for `prompter.js` to use granular settings from `settings.js` (e.g., `log_normal_data`) instead of `log_all_prompts`, which has been deprecated.
- Removed the experimental reasoning prompt functionality (formerly triggered by `$REASONING`) from `openrouter.js`.
- Reverted the recently added personality injection feature (`$PERSONALITY` and `getRandomPersonality`) from `prompter.js`, `openrouter.js`, and profile files as per your request.
- Verified that `openrouter.js` correctly utilizes `logger.js` for standard and vision logs.
This update finalizes the implementation of three distinct vision modes:
- "off": This disables all my vision capabilities.
- "prompted": (Formerly "on") This allows me to use vision via explicit commands from you (e.g., !lookAtPlayer), and I will then summarize the image.
- "always": (Formerly "always_active") I will automatically take a screenshot every time you send a prompt and send it with your prompt to a multimodal LLM. If you use a look command in this mode, I will only update my view and take a screenshot for the *next* interaction if relevant, without immediate summarization.
Here are the key changes and improvements:
1. **Bug Fix (Image Path ENOENT)**:
* I've corrected `Camera.capture()` so it returns filenames with the `.jpg` extension.
* I've updated `VisionInterpreter.analyzeImage()` to handle full filenames.
* This resolves the `ENOENT` error that was previously happening in `Prompter.js`.
2. **Vision Mode Renaming**:
* I've renamed the modes in `settings.js` and throughout the codebase: "on" is now "prompted", and "always_active" is now "always".
3. **Core Framework (from previous work, now integrated)**:
* I've added `vision_mode` to `settings.js`.
* `Agent.js` now manages `latestScreenshotPath` and initializes `VisionInterpreter` with `vision_mode`.
* `VisionInterpreter.js` handles different behaviors for each mode.
* My vision commands (`!lookAt...`) respect the `off` mode.
* `History.js` stores `imagePath` with turns, and `Agent.js` manages this path's lifecycle.
* `Prompter.js` reads image files when I'm in "always" mode and passes `imageData` to model wrappers.
4. **Extended Multimodal API Support**:
* `gemini.js`, `gpt.js`, `claude.js`, `local.js` (Ollama), `qwen.js`, and `deepseek.js` have been updated to accept `imageData` in their `sendRequest` method and format it for their respective multimodal APIs. They now include `supportsRawImageInput = true`.
* Other model wrappers (`mistral.js`, `glhf.js`, `grok.js`, etc.) now safely handle the `imageData` parameter in `sendRequest` (by ignoring it and logging a warning) and have `supportsRawImageInput = false` for that method, ensuring consistent behavior.
5. **Testing**: I have a comprehensive plan to verify all modes and functionalities.
This set of changes provides a robust and flexible vision system for me, catering to different operational needs and supporting various multimodal LLMs.
This commit introduces a comprehensive framework for three new vision modes: 'off', 'on', and 'always_active'.
Key changes include:
1. **Settings (`settings.js`)**: Added a `vision_mode` setting.
2. **Agent State (`src/agent/agent.js`)**:
* Added `latestScreenshotPath` to store the most recent screenshot.
* Updated `VisionInterpreter` initialization to use `vision_mode`.
3. **Screenshot Handling**:
* `VisionInterpreter` now updates `agent.latestScreenshotPath` after look commands.
* `Agent.handleMessage` captures screenshots in `always_active` mode for your messages.
4. **VisionInterpreter (`src/agent/vision/vision_interpreter.js`)**:
* Refactored to support distinct behaviors for `off` (disabled), `on` (summarize), and `always_active` (capture-only, no summarization for look commands).
5. **Vision Commands (`src/agent/commands/actions.js`)**:
* `!lookAtPlayer` and `!lookAtPosition` now respect `vision_mode: 'off'` and camera availability.
6. **History Storage (`src/agent/history.js`)**:
* `History.add` now supports an `imagePath` for each turn.
* `Agent.js` correctly passes `latestScreenshotPath` for relevant turns in `always_active` mode and manages its lifecycle.
7. **Prompter Logic (`src/models/prompter.js`)**:
* `Prompter.promptConvo` now reads image files specified in history for `always_active` mode and passes `imageData` to the chat model.
8. **Model API Wrappers (Example: `src/models/gemini.js`)**:
* `gemini.js` updated to accept `imageData` in `sendRequest`.
* Added `supportsRawImageInput` flag to `gemini.js`.
The system is now structured to support these vision modes. The `always_active` mode, where raw images are sent with prompts, is fully implemented for the Gemini API.
Further work will involve extending this raw image support in `always_active` mode to all other capable multimodal API providers as per your feedback.