PatentBot Changelog
2026-02-01 (Evening) - Full Corpus Expansion
Multi-Domain Batch Processing
Completed auto-queue processing of all major corpora:
| Corpus | Chunks | Status |
|---|---|---|
| science | 554,335 | ✅ Complete |
| engineering | 87,746 | ✅ Complete |
| tech | 43,067 | ✅ Complete |
| math | 26,311 | ✅ Complete |
| knowledge_base | 7,127 | ✅ Complete |
| greek | 1,135 | ✅ Complete |
| esoteric | 846 | ✅ Complete |
| quantum_biology | 555 | ✅ Complete |
| biohacking | 153 | ✅ Complete |
Total: 721,275 chunks (4.5GB+ ChromaDB)
Pipeline Standardization
All data processing now uses:
- Embeddings: sentence-transformers/all-MiniLM-L6-v2 (CUDA)
- LLM (all processing): Claude CLI (Max plan - unlimited)
- Fallback: Local Ollama (qwen2.5:7b) if CLI unavailable
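The fallback rule above can be sketched as a small selection step. This is a minimal illustration, assuming the CLI binary is named `claude` (as in the `shutil.which("claude")` check in the earlier entry); the helper name `select_llm_backend` and the returned labels are hypothetical, not PatentBot's own.

```python
import shutil
from typing import Callable, Optional


def select_llm_backend(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Pick the LLM backend: Claude CLI when it is on PATH, else local Ollama.

    Hypothetical helper; `which` is injectable so the rule can be tested
    without the CLI installed.
    """
    if which("claude") is not None:
        return "claude_cli"
    return "ollama"
```

Keeping the lookup injectable means the choice can be checked on a machine with neither backend installed.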
2026-02-01 - Simplification to Claude-CLI Only
Changes Made
Removed External LLM Dependencies:
- ❌ Ollama - Removed all references, config, and the _call_ollama() method
- ❌ Groq API - Removed hardcoded API key, config, and the _call_groq() method
- ❌ Google Gemini - Removed import, config, API key, and the _call_gemini() method
Simplified Configuration:
# LLM: Claude CLI (Max plan) with Ollama fallback
OLLAMA_URL = "http://localhost:11434"
OLLAMA_MODEL = "qwen2.5:7b"
# Paths
ENGINEERING_CORPUS = os.getenv("CORPUS_PATH", "/mnt/storage/books_organized/Engineering")
CHROMA_PATH = os.getenv("CHROMA_PATH", "./chroma_db")
CHUNK_SIZE = 1500
CHUNK_OVERLAP = 300
Simplified PatentBot.__init__():
# Before: Multiple LLM initialization paths (40+ lines)
def __init__(self, persist_dir, use_sonnet, use_gemini, use_groq, use_claude_cli):
    # ... complex conditional LLM setup
# After: Claude CLI only
def __init__(self, persist_dir=CHROMA_PATH, use_claude_cli=True):
    self.use_claude_cli = use_claude_cli and shutil.which("claude") is not None
Simplified _call_llm():
# Before: Complex fallback chain (Ollama → Groq → Claude CLI → Gemini)
# After: Single path through Claude CLI
def _call_llm(self, prompt, max_tokens=2000):
    if self.use_claude_cli:
        result = self._call_claude_cli(prompt, max_tokens)
        if result:
            return result
        # First attempt returned nothing: wait briefly and retry once
        time.sleep(1)
        return self._call_claude_cli(prompt, max_tokens)
    # Claude CLI disabled or not installed: caller falls back to regex extraction
    return ""
Corpus State After Cleanup
| Metric | Value |
|---|---|
| Total chunks | 87,746 |
| Unique PDFs | 993 |
| Unreadable PDFs | 7 (encrypted) |
| Duplicates | 0 |
| DB Size | 1.5 GB |
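The chunk count above depends on the CHUNK_SIZE = 1500 and CHUNK_OVERLAP = 300 settings in the simplified configuration. A minimal sketch of overlapping fixed-size chunking follows. It assumes the sizes are measured in characters (the changelog does not say whether they are characters or tokens), and `chunk_text` is a hypothetical name rather than PatentBot's actual function.

```python
from typing import List

CHUNK_SIZE = 1500     # value from the configuration in this entry
CHUNK_OVERLAP = 300   # value from the configuration in this entry


def chunk_text(text: str, size: int = CHUNK_SIZE,
               overlap: int = CHUNK_OVERLAP) -> List[str]:
    """Split text into windows of `size` characters.

    Each window starts `size - overlap` characters after the previous one,
    so consecutive windows share `overlap` characters.
    Assumption: character-based windows; the real pipeline may differ.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With these settings each new chunk advances 1,200 characters, so a 3,000-character document yields three chunks.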
LLM Provider Distribution:
| Provider | Count | Notes |
|---|---|---|
| none | 81,196 | Basic regex extraction |
| unknown | 6,534 | Legacy runs |
| gemini | 16 | Failed run artifacts |
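A distribution like the one above can be produced by tallying one metadata field across all chunks. The sketch below assumes each chunk carries a metadata dict and that the provider is stored under a key named `llm_provider`; both the key name and the helper `tally_providers` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def tally_providers(metadatas: Iterable[Mapping[str, str]],
                    key: str = "llm_provider") -> Dict[str, int]:
    """Count chunks per LLM provider.

    Chunks with a missing or empty provider field are counted as "unknown".
    Assumption: `key` is an illustrative field name, not the real schema.
    """
    counts = Counter(meta.get(key) or "unknown" for meta in metadatas)
    return dict(counts)
```

The per-provider counts should sum to the collection's total chunk count, which makes this a quick consistency check after a cleanup run.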
Why These Changes
- OOM Issues - Loading 87k chunks + external LLMs caused memory exhaustion
- Complexity - Multiple LLM fallback paths made failures hard to debug
- API Key Exposure - Hardcoded keys in source code (security risk)
- Simplicity - Claude CLI via Max plan is sufficient for all LLM needs
Current Architecture
PatentBot
├── ChromaDB (1.5GB, 87,746 chunks)
│ └── all-MiniLM-L6-v2 embeddings (GPU-accelerated)
├── Claude CLI (optional, for smart chunking/entity extraction)
└── Regex fallback (materials, processes, properties extraction)
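The regex fallback branch can be pictured as a set of keyword patterns, one per entity type. This is only a sketch: the three categories come from the tree above, but the vocabulary in each pattern and the name `extract_entities` are illustrative assumptions, not the patterns PatentBot actually ships.

```python
import re
from typing import Dict, List

# Hypothetical vocabularies; the real patterns are not shown in this changelog.
PATTERNS = {
    "materials": re.compile(
        r"\b(?:titanium|aluminum|steel|copper|nickel)(?:\s+alloy)?\b",
        re.IGNORECASE,
    ),
    "processes": re.compile(
        r"\b(?:heat treatment|annealing|quenching|welding|forging)\b",
        re.IGNORECASE,
    ),
    "properties": re.compile(
        r"\b(?:tensile strength|yield strength|hardness|ductility)\b",
        re.IGNORECASE,
    ),
}


def extract_entities(text: str) -> Dict[str, List[str]]:
    """Return the sorted, de-duplicated, lower-cased matches per category."""
    found: Dict[str, List[str]] = {}
    for label, pattern in PATTERNS.items():
        found[label] = sorted({m.group(0).lower() for m in pattern.finditer(text)})
    return found
```

Because this path needs no LLM call, it can run on every chunk even when the Claude CLI is unavailable.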
Collections
| Collection | Items |
|---|---|
| engineering_corpus | 87,746 |
| knowledge_base | 7,127 |
| greek_corpus | 1,135 |
| esoteric_corpus | 846 |
| quantum_biology | 555 |
Usage
# Query the corpus (no LLM needed)
python patentbot.py query "titanium alloy heat treatment"
# Status check
python patentbot.py status
# Ingest new PDFs (basic mode)
python patentbot.py ingest
# Ingest with Claude CLI enhancement
python patentbot.py ingest --use-claude
Files
- Code: ~/projects/knowledge-rag/patentbot.py
- Database: ~/projects/knowledge-rag/chroma_db/
- Corpus: /mnt/storage/books_organized/Engineering/
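For reference, the commands listed under Usage map onto a small argparse layout. The sketch below reproduces only the subcommands and the `--use-claude` flag shown there. The function name `build_parser`, the argument name `text`, and the help strings are assumptions, and the real patentbot.py may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the query, status, and ingest subcommands."""
    parser = argparse.ArgumentParser(prog="patentbot.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # query "<text>": search the corpus, no LLM needed
    query = sub.add_parser("query", help="query the corpus")
    query.add_argument("text", help="search text")

    # status: takes no arguments
    sub.add_parser("status", help="status check")

    # ingest [--use-claude]: basic mode unless the flag is given
    ingest = sub.add_parser("ingest", help="ingest new PDFs")
    ingest.add_argument("--use-claude", action="store_true",
                        help="enable Claude CLI enhancement")
    return parser
```

Calling `build_parser().parse_args()` in a `main()` function would then dispatch on `args.command`.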