- 83/83 tests passing (was 32/45) - New: src/http-client.ts (shared HTTP/HTTPS client, fixes C2+H1) - Fixed: proper_noun regex exclusions (C6) - Fixed: shutdown hooks registered in hooks.ts (C3) - Fixed: all timers use .unref() (H6) - Fixed: resolveConfig split into smaller functions (C4) - Fixed: extract() split with processMatch helper (C5) - Fixed: FactStore.addFact isLoaded guard (H3) - Fixed: validateConfig split (H2) - Fixed: type-safe config merge, removed as any (H4) - Added: http-client tests, expanded coverage (H5) - Fixed: LLM batch await (S1), fresh RegExp per call (S2) - 1530 LOC source, 1298 LOC tests, strict TypeScript
374 lines
16 KiB
Markdown
374 lines
16 KiB
Markdown
# Architecture: @vainplex/openclaw-knowledge-engine
|
|
|
|
## 1. Overview and Scope
|
|
|
|
`@vainplex/openclaw-knowledge-engine` is a TypeScript-based OpenClaw plugin for real-time and batch knowledge extraction from conversational data. It replaces a collection of legacy Python scripts with a unified, modern, and tightly integrated solution.
|
|
|
|
The primary goal of this plugin is to identify, extract, and store key information (entities and facts) from user and agent messages. This knowledge is then made available for long-term memory, context enrichment, and improved agent performance. It operates directly within the OpenClaw event pipeline, eliminating the need for external NATS consumers and schedulers.
|
|
|
|
### 1.1. Core Features
|
|
|
|
- **Hybrid Entity Extraction:** Combines high-speed, low-cost regex extraction with optional, high-fidelity LLM-based extraction.
|
|
- **Structured Fact Store:** Manages a durable store of facts with metadata, relevance scoring, and a temporal decay mechanism.
|
|
- **Seamless Integration:** Hooks directly into OpenClaw's lifecycle events (`message_received`, `message_sent`, `session_start`).
|
|
- **Configurable & Maintainable:** All features are configurable via a JSON schema, and the TypeScript codebase ensures type safety and maintainability.
|
|
- **Zero Runtime Dependencies:** Relies only on Node.js built-in APIs, mirroring the pattern of `@vainplex/openclaw-cortex`.
|
|
- **Optional Embeddings:** Can integrate with ChromaDB for semantic search over extracted facts.
|
|
|
|
### 1.2. Out of Scope
|
|
|
|
- **TypeDB Integration:** The legacy TypeDB dependency is explicitly removed and will not be supported.
|
|
- **Direct NATS Consumption:** The plugin relies on OpenClaw hooks, not direct interaction with NATS streams.
|
|
- **UI/Frontend:** This plugin is purely a backend data processing engine.
|
|
|
|
---
|
|
|
|
## 2. Module Breakdown
|
|
|
|
The plugin will be structured similarly to `@vainplex/openclaw-cortex`, with a clear separation of concerns between modules. All source code will reside in the `src/` directory.
|
|
|
|
| File | Responsibility |
|
|
| --------------------- | -------------------------------------------------------------------------------------------------------------- |
|
|
| `index.ts` | Plugin entry point. Registers hooks, commands, and performs initial configuration validation. |
|
|
| `src/hooks.ts` | Main integration logic. Registers and orchestrates all OpenClaw hook handlers. Manages shared state. |
|
|
| `src/types.ts` | Centralized TypeScript type definitions for configuration, entities, facts, and API interfaces. |
|
|
| `src/config.ts` | Provides functions for resolving and validating the plugin's configuration from `openclaw.plugin.json`. |
|
|
| `src/storage.ts` | Low-level file I/O utilities for reading/writing JSON files, ensuring atomic writes and handling debouncing. |
|
|
| `src/entity-extractor.ts`| Implements the entity extraction pipeline. Contains the `EntityExtractor` class. |
|
|
| `src/fact-store.ts` | Implements the fact storage and retrieval logic. Contains the `FactStore` class, including decay logic. |
|
|
| `src/llm-enhancer.ts` | Handles communication with an external LLM (e.g., Ollama) for batched, deep extraction of entities and facts. |
|
|
| `src/embeddings.ts` | Manages optional integration with ChromaDB, including batching and syncing embeddings. |
|
|
| `src/maintenance.ts` | Encapsulates background tasks like fact decay and embeddings sync, triggered by an internal timer. |
|
|
| `src/patterns.ts` | Stores default regex patterns for common entities (dates, names, locations, etc.). |
|
|
|
|
---
|
|
|
|
## 3. Type Definitions
|
|
|
|
Located in `src/types.ts`.
|
|
|
|
```typescript
|
|
// src/types.ts
|
|
|
|
/**
|
|
* The public API exposed by the OpenClaw host to the plugin.
|
|
*/
|
|
export interface OpenClawPluginApi {
|
|
pluginConfig: Record<string, unknown>;
|
|
logger: {
|
|
info: (msg: string) => void;
|
|
warn: (msg: string) => void;
|
|
error: (msg: string) => void;
|
|
};
|
|
on: (event: string, handler: (event: HookEvent, ctx: HookContext) => void, options?: { priority: number }) => void;
|
|
}
|
|
|
|
export interface HookEvent {
|
|
content?: string;
|
|
message?: string;
|
|
text?: string;
|
|
from?: string;
|
|
sender?: string;
|
|
role?: "user" | "assistant";
|
|
[key: string]: unknown;
|
|
}
|
|
|
|
export interface HookContext {
|
|
workspace: string; // Absolute path to the OpenClaw workspace
|
|
}
|
|
|
|
/**
|
|
* Plugin configuration schema, validated from openclaw.plugin.json.
|
|
*/
|
|
export interface KnowledgeConfig {
|
|
enabled: boolean;
|
|
workspace: string;
|
|
extraction: {
|
|
regex: {
|
|
enabled: boolean;
|
|
};
|
|
llm: {
|
|
enabled: boolean;
|
|
model: string;
|
|
endpoint: string;
|
|
batchSize: number;
|
|
cooldownMs: number;
|
|
};
|
|
};
|
|
decay: {
|
|
enabled: boolean;
|
|
intervalHours: number;
|
|
rate: number; // e.g., 0.05 for 5% decay per interval
|
|
};
|
|
embeddings: {
|
|
enabled: boolean;
|
|
endpoint: string;
|
|
syncIntervalMinutes: number;
|
|
collectionName: string;
|
|
};
|
|
storage: {
|
|
maxEntities: number;
|
|
maxFacts: number;
|
|
writeDebounceMs: number;
|
|
};
|
|
}
|
|
|
|
/**
|
|
* Represents an extracted entity.
|
|
*/
|
|
export interface Entity {
|
|
id: string; // e.g., "person:claude"
|
|
type: "person" | "location" | "organization" | "date" | "product" | "concept" | "unknown";
|
|
value: string; // The canonical value, e.g., "Claude"
|
|
mentions: string[]; // Different ways it was mentioned, e.g., ["claude", "Claude's"]
|
|
count: number;
|
|
importance: number; // 0.0 to 1.0
|
|
lastSeen: string; // ISO 8601 timestamp
|
|
source: ("regex" | "llm")[];
|
|
}
|
|
|
|
/**
|
|
* Represents a structured fact.
|
|
*/
|
|
export interface Fact {
|
|
id: string; // UUID v4
|
|
subject: string; // Entity ID
|
|
predicate: string; // e.g., "is-a", "has-property", "works-at"
|
|
object: string; // Entity ID or literal value
|
|
relevance: number; // 0.0 to 1.0, subject to decay
|
|
createdAt: string; // ISO 8601 timestamp
|
|
lastAccessed: string; // ISO 8601 timestamp
|
|
source: "ingested" | "extracted-regex" | "extracted-llm";
|
|
}
|
|
|
|
/**
|
|
* Data structure for entities.json
|
|
*/
|
|
export interface EntitiesData {
|
|
updated: string;
|
|
entities: Entity[];
|
|
}
|
|
|
|
/**
|
|
* Data structure for facts.json
|
|
*/
|
|
export interface FactsData {
|
|
updated: string;
|
|
facts: Fact[];
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## 4. Hook Integration Points
|
|
|
|
The plugin will register handlers for the following OpenClaw core events:
|
|
|
|
| Hook Event | Priority | Handler Logic |
|
|
| ------------------ | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `message_received` | 100 | - Triggers the real-time entity extraction pipeline. <br> - Extracts content and sender. <br> - Adds the message to the `LlmEnhancer` batch if LLM is enabled. |
|
|
| `message_sent` | 100 | - Same as `message_received`. Ensures the agent's own messages are processed for knowledge. |
|
|
| `session_start` | 200 | - Initializes the `Maintenance` service. <br> - Starts the internal timers for fact decay and embeddings sync. <br> - Ensures workspace directories exist. |
|
|
|
|
---
|
|
|
|
## 5. Entity Extraction Pipeline
|
|
|
|
The extraction process runs on every message and is designed to be fast and efficient.
|
|
|
|
### 5.1. Regex Extraction
|
|
|
|
- **Always On (if enabled):** Runs first on every message.
|
|
- **Patterns:** A configurable set of regular expressions will be defined in `src/patterns.ts`. These will cover common entities like dates (`YYYY-MM-DD`), email addresses, URLs, and potentially user-defined patterns.
|
|
- **Performance:** This step is extremely fast and has negligible overhead.
|
|
- **Output:** Produces a preliminary list of potential entities.
|
|
|
|
### 5.2. LLM Enhancement (Batched)
|
|
|
|
- **Optional:** Enabled via configuration.
|
|
- **Batching:** The `LlmEnhancer` class collects messages up to `batchSize` or until `cooldownMs` has passed since the last message. This avoids overwhelming the LLM with single requests.
|
|
- **Process:**
|
|
1. A batch of messages is formatted into a single prompt.
|
|
2. The prompt instructs the LLM to identify entities (person, location, etc.) and structured facts (triples like `Subject-Predicate-Object`).
|
|
3. The request is sent to the configured LLM endpoint (`extraction.llm.endpoint`).
|
|
4. The LLM's JSON response is parsed.
|
|
- **Merging:** LLM-extracted entities are merged with the regex-based results. The `source` array on the `Entity` object is updated to reflect that it was identified by both methods. LLM results are generally given a higher initial `importance` score.
|
|
|
|
---
|
|
|
|
## 6. Fact Store Design
|
|
|
|
The `FactStore` class manages the `facts.json` file, providing an in-memory cache and methods for interacting with facts.
|
|
|
|
### 6.1. Data Structure (`facts.json`)
|
|
|
|
The file will contain a `FactsData` object:
|
|
|
|
```json
|
|
{
|
|
"updated": "2026-02-17T15:30:00Z",
|
|
"facts": [
|
|
{
|
|
"id": "f0a4c1b0-9b1e-4b7b-8f3a-0e9c8d7b6a5a",
|
|
"subject": "person:atlas",
|
|
"predicate": "is-a",
|
|
"object": "sub-agent",
|
|
"relevance": 0.95,
|
|
"createdAt": "2026-02-17T14:00:00Z",
|
|
"lastAccessed": "2026-02-17T15:20:00Z",
|
|
"source": "extracted-llm"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
### 6.2. `FactStore` Class API
|
|
|
|
```typescript
|
|
// src/fact-store.ts
|
|
class FactStore {
|
|
constructor(workspace: string, config: KnowledgeConfig['storage'], logger: Logger);
|
|
|
|
// Load facts from facts.json into memory
|
|
load(): Promise<void>;
|
|
|
|
// Add a new fact or update an existing one
|
|
addFact(fact: Omit<Fact, 'id' | 'createdAt' | 'lastAccessed'>): Fact;
|
|
|
|
// Retrieve a fact by its ID
|
|
getFact(id: string): Fact | undefined;
|
|
|
|
// Query facts based on subject, predicate, or object
|
|
query(query: { subject?: string; predicate?: string; object?: string }): Fact[];
|
|
|
|
// Run the decay algorithm on all facts
|
|
decayFacts(rate: number): { decayedCount: number };
|
|
|
|
// Persist the in-memory store to disk (debounced)
|
|
commit(): Promise<void>;
|
|
}
|
|
```
|
|
|
|
### 6.3. Storage and Persistence
|
|
|
|
- **Debounced Writes:** All modifications to the fact store will trigger a debounced `commit()` call. This ensures that rapid, successive writes (e.g., during a fast-paced conversation) are batched into a single file I/O operation, configured by `storage.writeDebounceMs`.
|
|
- **Atomic Writes:** The `storage.ts` module will use a "write to temp file then rename" strategy to prevent data corruption if the application terminates mid-write.
|
|
|
|
---
|
|
|
|
## 7. Decay Algorithm
|
|
|
|
The decay algorithm prevents the fact store from becoming cluttered with stale, irrelevant information. It is managed by the `Maintenance` service.
|
|
|
|
- **Trigger:** Runs on a schedule defined by `decay.intervalHours`.
|
|
- **Logic:** For each fact, the relevance score is reduced by the `decay.rate`.
|
|
```
|
|
newRelevance = currentRelevance * (1 - decayRate)
|
|
```
|
|
- **Floor:** Relevance will not decay below a certain floor (e.g., 0.1) to keep it in the system.
|
|
- **Promotion:** When a fact is "accessed" (e.g., used to answer a question or mentioned again), its `relevance` score is boosted, and its `lastAccessed` timestamp is updated. A simple boost could be `newRelevance = currentRelevance + (1 - currentRelevance) * 0.5`, pushing it halfway to 1.0.
|
|
- **Pruning:** Facts with a relevance score below a configurable threshold (e.g., 0.05) after decay might be pruned from the store entirely if `storage.maxFacts` is exceeded.
|
|
|
|
---
|
|
|
|
## 8. Embeddings Integration
|
|
|
|
This feature allows for semantic querying of facts and is entirely optional.
|
|
|
|
### 8.1. `Embeddings` Service
|
|
|
|
- **Trigger:** Runs on a schedule defined by `embeddings.syncIntervalMinutes`.
|
|
- **Process:**
|
|
1. The service scans `facts.json` for any facts that have not yet been embedded.
|
|
2. It formats each fact into a natural language string, e.g., "Atlas is a sub-agent."
|
|
3. It sends a batch of these strings to a ChromaDB-compatible vector database via its HTTP API.
|
|
4. The fact's ID is stored as metadata alongside the vector in ChromaDB.
|
|
- **Configuration:** The `embeddings.endpoint` must be a valid URL to the ChromaDB `/api/v1/collections/{name}/add` endpoint.
|
|
- **Decoupling:** The plugin does **not** query ChromaDB. Its only responsibility is to push embeddings. Other plugins or services would be responsible for leveraging the vector store for retrieval-augmented generation (RAG).
|
|
|
|
---
|
|
|
|
## 9. Config Schema
|
|
|
|
The full `openclaw.plugin.json` schema for this plugin.
|
|
|
|
```json
|
|
{
|
|
"id": "@vainplex/openclaw-knowledge-engine",
|
|
"config": {
|
|
"enabled": true,
|
|
"workspace": "~/.clawd/plugins/knowledge-engine",
|
|
"extraction": {
|
|
"regex": {
|
|
"enabled": true
|
|
},
|
|
"llm": {
|
|
"enabled": true,
|
|
"model": "mistral:7b",
|
|
"endpoint": "http://localhost:11434/api/generate",
|
|
"batchSize": 10,
|
|
"cooldownMs": 30000
|
|
}
|
|
},
|
|
"decay": {
|
|
"enabled": true,
|
|
"intervalHours": 24,
|
|
"rate": 0.02
|
|
},
|
|
"embeddings": {
|
|
"enabled": false,
|
|
"endpoint": "http://localhost:8000/api/v1/collections/facts/add",
|
|
"collectionName": "openclaw-facts",
|
|
"syncIntervalMinutes": 15
|
|
},
|
|
"storage": {
|
|
"maxEntities": 5000,
|
|
"maxFacts": 10000,
|
|
"writeDebounceMs": 15000
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## 10. Test Strategy
|
|
|
|
Testing will be comprehensive and follow the patterns of `@vainplex/openclaw-cortex`, using Node.js's built-in test runner.
|
|
|
|
- **Unit Tests:** Each class (`EntityExtractor`, `FactStore`, `LlmEnhancer`, etc.) will have its own test file (e.g., `fact-store.test.ts`). Tests will use mock objects for dependencies like the logger and file system.
|
|
- **Integration Tests:** `hooks.test.ts` will test the end-to-end flow by simulating OpenClaw hook events and asserting that the correct file system changes occur.
|
|
- **Configuration Tests:** `config.test.ts` will verify that default values are applied correctly and that invalid configurations are handled gracefully.
|
|
- **CI/CD:** Tests will be run automatically in a CI pipeline on every commit.
|
|
|
|
---
|
|
|
|
## 11. Migration Guide
|
|
|
|
This section outlines the process for decommissioning the old Python scripts and migrating to the new plugin.
|
|
|
|
1. **Disable Old Services:** Stop and disable the `systemd` services and timers for `entity-extractor-stream.py`, `smart-extractor.py`, `knowledge-engine.py`, and `cortex-loops-stream.py`.
|
|
```bash
|
|
systemctl stop entity-extractor-stream.service smart-extractor.timer knowledge-engine.service cortex-loops.timer
|
|
systemctl disable entity-extractor-stream.service smart-extractor.timer knowledge-engine.service cortex-loops.timer
|
|
```
|
|
|
|
2. **Install the Plugin:** Install the `@vainplex/openclaw-knowledge-engine` plugin into OpenClaw according to standard procedures.
|
|
|
|
3. **Configure the Plugin:** Create a configuration file at `~/.clawd/plugins/openclaw-knowledge-engine.json` (or the equivalent path) using the schema from section 9. Ensure the `workspace` directory is set to the desired location.
|
|
|
|
4. **Data Migration (Optional):**
|
|
- **Entities:** A one-time script (`./scripts/migrate-entities.js`) will be provided to convert the old `~/.cortex/knowledge/entities.json` format to the new `Entity` format defined in `src/types.ts`.
|
|
- **Facts:** As the old `knowledge-engine.py` had a different structure and no durable fact store equivalent to `facts.json`, facts will not be migrated. The system will start with a fresh fact store.
|
|
- **TypeDB:** No migration from TypeDB will be provided.
|
|
|
|
5. **Enable and Restart:** Enable the plugin in OpenClaw's main configuration and restart the OpenClaw instance. Monitor the logs for successful initialization.
|
|
|
|
---
|
|
|
|
## 12. Performance Requirements
|
|
|
|
- **Message Hook Overhead:** The synchronous part of the message hook (regex extraction) must complete in under **5ms** on average to avoid delaying the message processing pipeline.
|
|
- **LLM Latency:** LLM processing is asynchronous and batched, so it does not block the main thread. However, the total time to analyze a batch should be logged and monitored.
|
|
- **Memory Usage:** The plugin's heap size should not exceed **100MB** under normal load, assuming the configured `maxEntities` and `maxFacts` limits.
|
|
- **CPU Usage:** Background maintenance tasks (decay, embeddings sync) should be staggered and have low CPU impact, consuming less than 5% of a single core while running.
|