Enrichment Pipeline Status
Last updated: 2026-02-03 19:40 CST
Current Status
Screen session: enrichment_all (started 2026-02-03 15:12)
Running: Sequential enrichment via run_enrichment_queue.sh
Issue: ChromaDB Rust bindings segfault on concurrent access (monolith DB)
Collection Progress
| Collection | Enriched | Total | % Complete | Est. Time Left |
|---|---|---|---|---|
| quantum_biology | 43,109 | 378,519 | 11% | 74.5 hrs |
| science_corpus | 1,459 | 554,335 | <1% | 122.8 hrs |
| physicists_corpus | 1,726 | 50,642 | 3% | 10.8 hrs |
| engineering_corpus | 6,479 | 87,746 | 7% | 18.0 hrs |
| math_corpus | 7,003 | 26,311 | 26% | 4.2 hrs |
| knowledgebase | 7,100 | ~7,200 | ~98% | ✅ Near done |
| greek_corpus | 1,156 | 1,156 | 100% | ✅ Done |
| esoteric_corpus | 846 | 846 | 100% | ✅ Done |
| biohacking_corpus | 153 | 153 | 100% | ✅ Done |
| tech_corpus | 1,921 | 1,950 | 98% | ✅ Done |
Total remaining: ~1,032,806 chunks
Estimated time: ~230 hours (~10 days continuous)
Bottlenecks
Big Three (~93% of remaining work)
- science_corpus: 554k chunks, <1% done (122.8 hrs)
- quantum_biology: 378k chunks, 11% done (74.5 hrs)
- engineering_corpus: 88k chunks, 7% done (18.0 hrs)
Completed Collections
- greek_corpus ✅
- esoteric_corpus ✅
- biohacking_corpus ✅
- tech_corpus ✅
- knowledgebase ✅ (near)
Technical Details
Enrichment Rate
- ~25 chunks per batch
- ~20 seconds per batch
- ~0.8 seconds per chunk
- Rate limited by LLM API calls
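The per-collection estimates in the progress table follow from this rate. A minimal sketch of the arithmetic, assuming the rate stays constant at 25 chunks per 20-second batch; `eta_hours` is a hypothetical helper name, not a function in `qb_enrichment.py`:

```python
CHUNKS_PER_BATCH = 25   # from the rate above
SECONDS_PER_BATCH = 20  # from the rate above


def eta_hours(enriched: int, total: int) -> float:
    """Estimate hours left for one collection at the observed batch rate."""
    remaining = max(total - enriched, 0)
    seconds_per_chunk = SECONDS_PER_BATCH / CHUNKS_PER_BATCH  # 0.8 s per chunk
    return remaining * seconds_per_chunk / 3600


# quantum_biology: 378,519 total, 43,109 enriched
print(round(eta_hours(43_109, 378_519), 1))  # 74.5
```

Because the rate is bound by LLM API latency, the real figure will drift with API response times; treat the result as a rough planning number.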
Known Issues
- ChromaDB Segfault
  - Rust bindings crash on concurrent access
  - 17GB monolith database
  - Solution: Sequential processing only
  - Future: Split into multi-DB architecture
- Memory Usage
  - Large collections require careful batching
  - Progress saved to {collection}_enrichment_progress.json
  - Results saved to {collection}_enrichment_results.jsonl
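The progress and results files make a run resumable after a crash. The sketch below shows one way such checkpointing could work; it is not the actual logic in `qb_enrichment.py`. The `"processed"` key in the progress JSON, the `enrich_batch` callback, and all function names are assumptions made for illustration.

```python
import json
from pathlib import Path

BATCH_SIZE = 25  # matches the ~25 chunks per batch noted above


def load_progress(path: Path) -> int:
    """Return how many chunks were already processed (0 if no file yet)."""
    if not path.exists():
        return 0
    # Assumed schema: {"processed": <int>}
    return json.loads(path.read_text())["processed"]


def save_progress(path: Path, processed: int) -> None:
    """Write progress via a temp file so a crash cannot leave a half-written JSON."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"processed": processed}))
    tmp.replace(path)


def enrich_collection(collection, chunks, enrich_batch, workdir):
    """Process chunks sequentially in batches, resuming from saved progress."""
    workdir = Path(workdir)
    progress_path = workdir / f"{collection}_enrichment_progress.json"
    results_path = workdir / f"{collection}_enrichment_results.jsonl"

    start = load_progress(progress_path)
    for offset in range(start, len(chunks), BATCH_SIZE):
        batch = chunks[offset:offset + BATCH_SIZE]
        results = enrich_batch(batch)  # hypothetical callback: one LLM call per batch
        # Append results first, then record progress for the finished batch.
        with results_path.open("a", encoding="utf-8") as out:
            for record in results:
                out.write(json.dumps(record) + "\n")
        save_progress(progress_path, offset + len(batch))
    return load_progress(progress_path)
```

Writing results before progress means a crash between the two steps can duplicate at most one batch in the JSONL file on resume, which is easier to clean up than a silently skipped batch. For the largest collections, `chunks` would need to be read in pages rather than held in memory as a single list.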
File Locations
| File | Purpose |
|---|---|
| ~/projects/knowledge-rag/qb_enrichment.py | Main enrichment script |
| ~/projects/knowledge-rag/run_enrichment_queue.sh | Queue runner |
| ~/projects/knowledge-rag/*_enrichment_progress.json | Progress tracking |
| ~/projects/knowledge-rag/*_enrichment_results.jsonl | Enrichment output |
| /mnt/storage/knowledge-rag/chroma_db | ChromaDB monolith (17GB) |
Resume Instructions
# Check if running
screen -ls
# Attach to monitor
screen -r enrichment_all
# If crashed, restart from queue
cd ~/projects/knowledge-rag
source venv/bin/activate
screen -dmS enrichment_all ./run_enrichment_queue.sh
Queue Order
- ✅ quantum_biology (partial)
- ✅ knowledgebase
- ✅ math_corpus (partial)
- ❌ greek_corpus (run segfaulted, but already 100% enriched from a prior run)
- 🔄 engineering_corpus (IN PROGRESS)
- ⏳ tech_corpus
- ⏳ esoteric_corpus
- ⏳ biohacking_corpus
- ⏳ physicists_corpus
- ⏳ science_corpus
Next Steps
- Manual start only — Shadow will start enrichment runs based on available token budget
- Monitor for segfaults
- Consider multi-DB split for parallel processing
- Prioritize which collections matter most
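Monitoring for segfaults can be partly automated by inspecting the exit status of the process. A minimal sketch, assuming the crash surfaces as SIGSEGV; `classify_exit` and `run_and_classify` are hypothetical helper names, not part of the existing scripts:

```python
import signal
import subprocess
import sys

SEGV = int(signal.SIGSEGV)  # 11 on Linux


def classify_exit(returncode: int) -> str:
    """Map a process exit status to 'ok', 'segfault', or 'error'."""
    if returncode == 0:
        return "ok"
    # subprocess reports death by signal N as -N; a shell wrapper that
    # propagates the failure typically exits with 128 + N instead.
    if returncode in (-SEGV, 128 + SEGV):
        return "segfault"
    return "error"


def run_and_classify(cmd: list[str]) -> str:
    """Run a command to completion and classify how it ended."""
    result = subprocess.run(cmd, check=False)
    return classify_exit(result.returncode)
```

Calling `run_and_classify(["./run_enrichment_queue.sh"])` from the project directory would then distinguish a ChromaDB crash from an ordinary failure. This only reports the crash; in keeping with the manual-start policy, it does not restart anything.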
⚠️ NOTE: Do NOT auto-schedule enrichment via cron. Token costs are significant. Shadow manages when to run based on budget.