diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..11eafda --- /dev/null +++ b/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2026 Vainplex + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/README.md b/README.md new file mode 100644 index 0000000..2195881 --- /dev/null +++ b/README.md @@ -0,0 +1,251 @@ +# @vainplex/openclaw-knowledge-engine + +A real-time knowledge extraction plugin for [OpenClaw](https://github.com/openclaw/openclaw). Automatically extracts entities, facts, and relationships from conversations — building a persistent, queryable knowledge base that grows with every message. + +## What it does + +Every message your OpenClaw agent processes flows through the Knowledge Engine: + +1. **Regex Extraction** (instant, zero cost) — Detects people, organizations, technologies, URLs, emails, and other entities using pattern matching +2. 
**LLM Enhancement** (optional, batched) — Groups messages and sends them to a local LLM for deeper entity and fact extraction +3. **Fact Storage** — Stores extracted knowledge as structured subject-predicate-object triples with relevance scoring +4. **Relevance Decay** — Automatically decays old facts so recent knowledge surfaces first +5. **Vector Sync** — Optionally syncs facts to ChromaDB for semantic search +6. **Background Maintenance** — Prunes low-relevance facts, compacts storage, runs cleanup + +``` +User: "We're meeting with Sebastian from Mondo Gate next Tuesday" + │ + ├─ Regex → entities: [Sebastian (person), Mondo Gate (organization)] + └─ LLM → facts: [Sebastian — works-at — Mondo Gate] + [Meeting — scheduled-with — Mondo Gate] +``` + +## Quick Start + +### 1. Install + +```bash +cd ~/.openclaw +npm install @vainplex/openclaw-knowledge-engine +``` + +### 2. Sync to extensions + +OpenClaw loads plugins from the `extensions/` directory: + +```bash +mkdir -p extensions/openclaw-knowledge-engine +cp -r node_modules/@vainplex/openclaw-knowledge-engine/{dist,package.json,openclaw.plugin.json} extensions/openclaw-knowledge-engine/ +``` + +### 3. Configure + +Add to your `openclaw.json`: + +```json +{ + "plugins": { + "entries": { + "openclaw-knowledge-engine": { + "enabled": true, + "config": { + "workspace": "/path/to/your/workspace", + "extraction": { + "regex": { "enabled": true }, + "llm": { + "enabled": true, + "endpoint": "http://localhost:11434/api/generate", + "model": "mistral:7b", + "batchSize": 10, + "cooldownMs": 30000 + } + } + } + } + } + } +} +``` + +### 4. 
Restart gateway + +```bash +openclaw gateway restart +``` + +## Configuration + +| Key | Type | Default | Description | +|-----|------|---------|-------------| +| `enabled` | boolean | `true` | Enable/disable the plugin | +| `workspace` | string | `~/.clawd/plugins/knowledge-engine` | Storage directory for knowledge files | +| `extraction.regex.enabled` | boolean | `true` | High-speed regex entity extraction | +| `extraction.llm.enabled` | boolean | `true` | LLM-based deep extraction | +| `extraction.llm.model` | string | `"mistral:7b"` | Ollama/OpenAI-compatible model | +| `extraction.llm.endpoint` | string | `"http://localhost:11434/api/generate"` | LLM API endpoint (HTTP or HTTPS) | +| `extraction.llm.batchSize` | number | `10` | Messages per LLM batch | +| `extraction.llm.cooldownMs` | number | `30000` | Wait time before sending batch | +| `decay.enabled` | boolean | `true` | Periodic relevance decay | +| `decay.intervalHours` | number | `24` | Hours between decay cycles | +| `decay.rate` | number | `0.02` | Decay rate per interval (2%) | +| `embeddings.enabled` | boolean | `false` | Sync facts to ChromaDB | +| `embeddings.endpoint` | string | `"http://localhost:8000/..."` | ChromaDB API endpoint | +| `embeddings.collectionName` | string | `"openclaw-facts"` | Vector collection name | +| `embeddings.syncIntervalMinutes` | number | `15` | Minutes between vector syncs | +| `storage.maxEntities` | number | `5000` | Max entities before pruning | +| `storage.maxFacts` | number | `10000` | Max facts before pruning | +| `storage.writeDebounceMs` | number | `15000` | Debounce delay for disk writes | + +### Minimal config (regex only, no LLM) + +```json +{ + "openclaw-knowledge-engine": { + "enabled": true, + "config": { + "extraction": { + "llm": { "enabled": false } + } + } + } +} +``` + +This gives you zero-cost entity extraction with no external dependencies. 
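As a rough illustration of what regex-only extraction can catch, here is a minimal sketch. The pattern names and shapes below are illustrative assumptions, not the plugin's actual patterns (those live in `patterns.ts`):

```typescript
// Hypothetical sketch of regex-based entity extraction.
// Pattern names and regexes are illustrative, not the plugin's real ones.
type Entity = { type: string; value: string };

const PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[\w.-]+/g,
  url: /https?:\/\/[^\s)]+/g,
  // Capitalized multi-word runs as a crude proper-noun heuristic
  properNoun: /\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b/g,
};

function extractEntities(text: string): Entity[] {
  const out: Entity[] = [];
  for (const [type, pattern] of Object.entries(PATTERNS)) {
    // Recreate the regex per call so /g lastIndex state never bleeds
    // between invocations (the real plugin solves this with a Proxy).
    const re = new RegExp(pattern.source, pattern.flags);
    for (const m of text.matchAll(re)) out.push({ type, value: m[0] });
  }
  return out;
}

const entities = extractEntities(
  "We're meeting with Sebastian from Mondo Gate next Tuesday, mail sebastian@mondogate.example"
);
console.log(entities);
```

Note the fresh-regex-per-call detail: a shared `/g` regex keeps its `lastIndex` between calls, which silently drops matches, and this is exactly the "state bleed" the plugin's Proxy-based pattern factory guards against.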
+ +### Full config (LLM + ChromaDB) + +```json +{ + "openclaw-knowledge-engine": { + "enabled": true, + "config": { + "workspace": "~/my-agent/knowledge", + "extraction": { + "llm": { + "enabled": true, + "endpoint": "http://localhost:11434/api/generate", + "model": "mistral:7b" + } + }, + "embeddings": { + "enabled": true, + "endpoint": "http://localhost:8000/api/v1/collections/facts/add" + }, + "decay": { + "intervalHours": 12, + "rate": 0.03 + } + } + } +} +``` + +## How it works + +### Extraction Pipeline + +``` +Message received + │ + ├──▶ Regex Engine (sync, <1ms) + │ └─ Extracts: proper nouns, organizations, tech terms, + │ URLs, emails, monetary amounts, dates + │ + └──▶ LLM Batch Queue (async, batched) + └─ Every N messages or after cooldown: + └─ Sends batch to local LLM + └─ Extracts: entities + fact triples + └─ Stores in FactStore +``` + +### Fact Lifecycle + +Facts are stored as structured triples: + +```json +{ + "id": "f-abc123", + "subject": "Sebastian", + "predicate": "works-at", + "object": "Mondo Gate", + "source": "extracted-llm", + "relevance": 0.95, + "createdAt": 1707123456789, + "lastAccessedAt": 1707123456789 +} +``` + +- **Relevance** starts at 1.0 and decays over time +- **Accessed facts** get a relevance boost (LRU-style) +- **Pruning** removes facts below the relevance floor when storage limits are hit +- **Minimum floor** (0.1) prevents complete decay — old facts never fully disappear + +### Storage + +All data is persisted as JSON files in your workspace: + +``` +workspace/ +├── entities.json # Extracted entities with types and counts +└── facts.json # Fact triples with relevance scores +``` + +Writes use atomic file operations (write to `.tmp`, then rename) to prevent corruption. 
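The write-to-`.tmp`-then-rename pattern can be sketched roughly as follows. Function and path names are illustrative, not the plugin's actual `storage.ts` API:

```typescript
import { writeFileSync, renameSync, readFileSync, existsSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Sketch of atomic JSON persistence: write the full payload to a
// temporary file, then rename over the target. On POSIX filesystems
// rename() is atomic, so a reader never observes a half-written file.
function writeJsonAtomic(path: string, data: unknown): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(data, null, 2), "utf8");
  renameSync(tmp, path); // atomically replaces the previous version
}

const target = join(tmpdir(), "facts.json");
writeJsonAtomic(target, { facts: [] });

// Read back to confirm the rename landed and the payload is intact.
const roundTrip = JSON.parse(readFileSync(target, "utf8"));
console.log(roundTrip, existsSync(`${target}.tmp`));
```

A crash between the write and the rename leaves only a stale `.tmp` file behind; the previous `facts.json` stays valid, which is the whole point of the pattern.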
+ +## Architecture + +``` +index.ts → Plugin entry point +src/ +├── types.ts → All TypeScript interfaces +├── config.ts → Config resolution + validation +├── patterns.ts → Regex factories (Proxy-based, no /g state bleed) +├── entity-extractor.ts → Regex-based entity extraction +├── llm-enhancer.ts → Batched LLM extraction with cooldown +├── fact-store.ts → In-memory fact store with decay + pruning +├── hooks.ts → OpenClaw hook registration + orchestration +├── http-client.ts → Shared HTTP/HTTPS transport +├── embeddings.ts → ChromaDB vector sync +├── storage.ts → Atomic JSON I/O with debounce +└── maintenance.ts → Scheduled background tasks +``` + +- **12 modules**, each with a single responsibility +- **Zero runtime dependencies** — Node.js built-ins only +- **TypeScript strict** — no `any` in source code +- **All functions ≤40 lines** + +## Hooks + +| Hook | Priority | Description | +|------|----------|-------------| +| `session_start` | 200 | Loads fact store from disk | +| `message_received` | 100 | Extracts entities + queues LLM batch | +| `message_sent` | 100 | Same extraction on outbound messages | +| `gateway_stop` | 50 | Flushes writes, stops timers | + +## Testing + +```bash +npm test +# Runs 83 tests across 10 test files +``` + +Tests cover: config validation, entity extraction, fact CRUD, decay, pruning, LLM batching, HTTP client, embeddings, storage atomicity, maintenance scheduling, hook orchestration. 
+ +## Part of the Darkplex Plugin Suite + +| # | Plugin | Status | Description | +|---|--------|--------|-------------| +| 1 | [@vainplex/nats-eventstore](https://github.com/alberthild/openclaw-nats-eventstore) | ✅ Published | NATS JetStream event persistence | +| 2 | [@vainplex/openclaw-cortex](https://github.com/alberthild/openclaw-cortex) | ✅ Published | Conversation intelligence (threads, decisions, boot context) | +| 3 | **@vainplex/openclaw-knowledge-engine** | ✅ Published | Real-time knowledge extraction (this plugin) | +| 4 | @vainplex/openclaw-governance | 📋 Planned | Policy enforcement + guardrails | +| 5 | @vainplex/openclaw-memory-engine | 📋 Planned | Unified memory layer | +| 6 | @vainplex/openclaw-health-monitor | 📋 Planned | System health + auto-healing | + +## License + +MIT diff --git a/package.json b/package.json index 7aa7d9b..360b8df 100644 --- a/package.json +++ b/package.json @@ -21,7 +21,7 @@ "license": "MIT", "repository": { "type": "git", - "url": "https://github.com/your-repo/openclaw-knowledge-engine.git" + "url": "https://github.com/alberthild/openclaw-knowledge-engine.git" }, "openclaw": { "id": "@vainplex/openclaw-knowledge-engine"