Data Integrity Status
✅ Checkpoint: 2026-02-03 15:12 CST
Decision: Seed data is VALID. Proceeding with enrichment pipeline.
Investigation Summary:
- Previous “compromised” label was due to chat model hallucinations (Ollama was set as default model)
- Embeddings are valid: generated using `sentence-transformers/all-MiniLM-L6-v2` on CUDA
- Log files (`auto_queue.log`, `kruse_processing.log`) confirm MiniLM embeddings, not Qwen/Ollama
- Vector search functionality is unaffected
Action taken: Cleared “compromised” status, resumed full enrichment pipeline on all collections.
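A quick way to spot-check the embedding finding is by vector dimension: all-MiniLM-L6-v2 produces 384-dimensional vectors, so stored vectors of any other length would point to a different model. The following is a minimal sketch, assuming the vectors have already been read out of the store as plain lists of floats. The function name `check_embeddings` is hypothetical and not part of the pipeline. A matching dimension does not prove which model was used, so this complements the log evidence rather than replacing it.

```python
import math

# all-MiniLM-L6-v2 outputs 384-dimensional sentence embeddings.
EXPECTED_DIM = 384


def check_embeddings(vectors, expected_dim=EXPECTED_DIM):
    """Return the indices of vectors with the wrong length or non-finite values.

    `vectors` is assumed to be an iterable of float sequences; how they are
    fetched from the vector store is left to the caller.
    """
    bad = []
    for i, vec in enumerate(vectors):
        wrong_length = len(vec) != expected_dim
        has_bad_value = not all(math.isfinite(x) for x in vec)
        if wrong_length or has_bad_value:
            bad.append(i)
    return bad
```

An empty result on a random sample from each collection is consistent with MiniLM embeddings. Any returned index is worth inspecting by hand.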
Current State (2026-02-03)
✅ COLLECTIONS STATUS (Updated 2026-02-03)
Note: The previous “compromised” status was caused by chat model hallucinations (Ollama default), not by embedding quality. Embeddings were generated with sentence-transformers/all-MiniLM-L6-v2 on CUDA and are valid.
| Collection | Chunks | Status |
|---|---|---|
| quantum_biology | 378,519 | ✅ VERIFIED |
| science_corpus | 554,335 | ✅ OK (embeddings valid) |
| engineering_corpus | 87,746 | ✅ OK (embeddings valid) |
| math_corpus | 26,311 | ✅ OK (embeddings valid) |
| esoteric_corpus | 846 | ✅ OK (embeddings valid) |
| greek_corpus | 1,135 | ✅ OK (embeddings valid) |
Proceeding with enrichment pipeline on all collections.
✅ TRUSTED DATA
| Collection | Chunks | Status | Notes |
|---|---|---|---|
| quantum_biology | 378,519 | ✅ VERIFIED HEALTHY | Integrity check passed 2026-02-02 22:05 CST |
Integrity Report (2026-02-02 22:05 CST):
- All 378,519 chunks validated
- Avg chunk length: 1,446 chars
- Domains: quantum_biology (312k), Medicine (38k), Biology-Molecular (26k), unknown (555), Biology-Neuro (201)
- Sources: 271 unique
- Report: `~/projects/knowledge-rag/qb_integrity_report_20260202_220538.json`
Enrichment in progress via Claude CLI (Max plan).
Enrichment Progress (QB)
As of 2026-02-02 21:12 CST:
- Total chunks: 378,519
- Enriched: 12,600 (~3.3%)
- Process: Stopped, needs restart
- Output: `qb_enrichment_results.jsonl`
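Before restarting the stopped process, it helps to recount what is already in the output file so the run can skip finished chunks. The sketch below assumes each line of the JSONL file is a JSON object carrying a chunk identifier. The field name `chunk_id` is a guess and should be adjusted to the real schema, and both function names are hypothetical.

```python
import json

TOTAL_CHUNKS = 378_519  # QB total from the integrity report


def enriched_ids(path, id_field="chunk_id"):
    """Collect unique chunk IDs from a JSONL results file.

    `id_field` is an assumed key name. Blank or unparsable lines are skipped,
    since a stopped process may have left a truncated final line.
    """
    seen = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue
            if isinstance(record, dict) and id_field in record:
                seen.add(record[id_field])
    return seen


def progress_percent(done, total=TOTAL_CHUNKS):
    """Share of chunks enriched so far, as a percentage."""
    return 100.0 * done / total
```

On restart, the returned set can serve as a skip list so chunks are not enriched twice. For example, `progress_percent(len(enriched_ids("qb_enrichment_results.jsonl")))` gives the current completion percentage.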
Recovery Plan
Phase 1: Complete QB Enrichment ← CURRENT
- Verify QB collection integrity (batched check)
- Resume enrichment pipeline
- Build QB MOC in Obsidian vault
Phase 2: Regenerate Compromised Collections
- Delete corrupted collections from ChromaDB
- Re-process source documents with verified models
- Re-embed with clean pipeline
Phase 3: Full System Rebuild
- Cross-reference all collections
- Build unified knowledge graph
- Deploy PatentBot with clean data
Technical Notes
Memory-Safe Batch Sizes
For 16GB RAM systems:
- Integrity check: 1000 chunks (reduced from 2000)
- Enrichment scan: 5000 chunks
- Query operations: Use HNSW index (no batch limit)
OOM Prevention
The QB integrity check was killed at 10% progress with a batch size of 2000. Use a batch size of 1000 for large collections.
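The batching idea can be sketched as a scan that holds only one batch in memory and keeps running totals. This is an illustration under stated assumptions, not the actual integrity script: `fetch` is a hypothetical callable that returns a list of `(text, metadata)` pairs for a given offset and limit, and the `domain` metadata key is assumed. With ChromaDB, `fetch` could wrap `collection.get(limit=..., offset=..., include=["documents", "metadatas"])`.

```python
from collections import Counter


def batch_ranges(total, batch_size=1000):
    """Yield (offset, limit) pairs that cover `total` items."""
    for offset in range(0, total, batch_size):
        yield offset, min(batch_size, total - offset)


def scan_collection(fetch, total, batch_size=1000):
    """Scan a collection batch by batch, keeping only running totals.

    `fetch(offset, limit)` is a hypothetical callable returning a list of
    (text, metadata) pairs; only one batch is held in memory at a time.
    """
    checked = 0
    empty = 0
    total_chars = 0
    domains = Counter()
    for offset, limit in batch_ranges(total, batch_size):
        for text, meta in fetch(offset, limit):
            checked += 1
            if not text:
                empty += 1  # count empty chunks, exclude them from the average
                continue
            total_chars += len(text)
            # "domain" is an assumed metadata key; missing values become "unknown"
            domains[(meta or {}).get("domain", "unknown")] += 1
    valid = checked - empty
    return {
        "checked": checked,
        "empty": empty,
        "avg_chars": total_chars / valid if valid else 0.0,
        "domains": dict(domains),
    }
```

Because the totals are small and fixed in size, peak memory depends on `batch_size` rather than on the collection size, which is why lowering the batch size helps on a 16GB system.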