Enrichment Pipeline Status

Last updated: 2026-02-03 19:40 CST


Current Status

Screen session: enrichment_all (started 2026-02-03 15:12)
Running: Sequential enrichment via run_enrichment_queue.sh
Issue: ChromaDB Rust bindings segfault on concurrent access (monolith DB)


Collection Progress

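| Collection         | Enriched | Total   | % Complete | Est. Time Left |
|--------------------|----------|---------|------------|----------------|
| quantum_biology    | 43,109   | 378,519 | 11%        | 74.5 hrs       |
| science_corpus     | 1,459    | 554,335 | 0%         | 122.8 hrs      |
| physicists_corpus  | 1,726    | 50,642  | 3%         | 10.8 hrs       |
| engineering_corpus | 6,479    | 87,746  | 7%         | 18.0 hrs       |
| math_corpus        | 7,003    | 26,311  | 26%        | 4.2 hrs        |
| knowledgebase      | 7,100    | ~7,200  | ~98%       | ✅ Near done   |
| greek_corpus       | 1,156    | 1,156   | 100%       | ✅ Done        |
| esoteric_corpus    | 846      | 846     | 100%       | ✅ Done        |
| biohacking_corpus  | 153      | 153     | 100%       | ✅ Done        |
| tech_corpus        | 1,921    | 1,950   | 98%        | ✅ Done        |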

Total remaining: ~1,037,806 chunks
Estimated time: ~230 hours (~10 days continuous)
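
The per-collection estimates follow from (total - enriched) chunks multiplied by the observed rate of about 0.8 seconds per chunk. A minimal Python sketch of that arithmetic, using the counts from the progress table; the function names are illustrative and not taken from the pipeline scripts, and knowledgebase is left out because its total is only approximate:

```python
# Per-collection (enriched, total) counts copied from the progress table.
# knowledgebase is omitted because its total (~7,200) is approximate.
PROGRESS = {
    "quantum_biology": (43_109, 378_519),
    "science_corpus": (1_459, 554_335),
    "physicists_corpus": (1_726, 50_642),
    "engineering_corpus": (6_479, 87_746),
    "math_corpus": (7_003, 26_311),
    "tech_corpus": (1_921, 1_950),
}

# Observed rate from this document: ~20 s per batch of ~25 chunks.
SECONDS_PER_CHUNK = 20 / 25


def remaining_chunks(enriched: int, total: int) -> int:
    """Chunks still to enrich (never negative)."""
    return max(total - enriched, 0)


def eta_hours(chunks: int, seconds_per_chunk: float = SECONDS_PER_CHUNK) -> float:
    """Estimated wall-clock hours for strictly sequential processing."""
    return chunks * seconds_per_chunk / 3600


if __name__ == "__main__":
    total_left = 0
    for name, (enriched, total) in PROGRESS.items():
        left = remaining_chunks(enriched, total)
        total_left += left
        print(f"{name:<20} {left:>9,} chunks  {eta_hours(left):6.1f} h")
    print(f"{'total':<20} {total_left:>9,} chunks  {eta_hours(total_left):6.1f} h")
```

Because the rate is bounded by LLM API calls, the estimate scales linearly with the remaining chunk count; a change in batch latency shifts every row by the same factor.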


Bottlenecks

Big Three (~93% of remaining work)

  1. science_corpus: 554k chunks, 0% done (~123 hrs)
  2. quantum_biology: 379k chunks, 11% done (~75 hrs)
  3. engineering_corpus: 88k chunks, 7% done (~18 hrs)

Completed Collections

  • greek_corpus ✅
  • esoteric_corpus ✅
  • biohacking_corpus ✅
  • tech_corpus ✅
  • knowledgebase ✅ (nearly done, ~98%)

Technical Details

Enrichment Rate

  • ~25 chunks per batch
  • ~20 seconds per batch
  • ~0.8 seconds per chunk
  • Rate limited by LLM API calls

Known Issues

  1. ChromaDB Segfault

    • Rust bindings crash on concurrent access
    • 17GB monolith database
    • Workaround: sequential processing only
    • Future: split the monolith into a multi-DB architecture so collections can be processed in parallel
  2. Memory Usage

    • Large collections require careful batching
    • Progress saved to {collection}_enrichment_progress.json
    • Results saved to {collection}_enrichment_results.jsonl
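
The progress and results files are what make a crashed run resumable. A minimal Python sketch of that checkpointing pattern follows. The file names match the ones above, but the progress schema (a single "processed" counter), the enrich_batch callback, and the function names are assumptions for illustration; the actual format used by qb_enrichment.py is not shown in this document.

```python
import json
from pathlib import Path
from typing import Callable


def load_done_count(progress_path: Path) -> int:
    """Return how many chunks were already processed (0 if no progress file)."""
    if not progress_path.exists():
        return 0
    # Assumed schema: {"processed": <int>}. The real schema may differ.
    return int(json.loads(progress_path.read_text()).get("processed", 0))


def enrich_resumable(
    collection: str,
    chunks: list[str],
    enrich_batch: Callable[[list[str]], list[dict]],  # hypothetical LLM call
    workdir: Path,
    batch_size: int = 25,
) -> int:
    """Process chunks in batches, skipping whatever a prior run completed."""
    progress_path = workdir / f"{collection}_enrichment_progress.json"
    results_path = workdir / f"{collection}_enrichment_results.jsonl"
    done = load_done_count(progress_path)

    while done < len(chunks):
        batch = chunks[done:done + batch_size]
        results = enrich_batch(batch)

        # Append results first, then advance the checkpoint. A crash between
        # the two steps re-runs one batch on resume instead of losing it.
        with results_path.open("a", encoding="utf-8") as f:
            for record in results:
                f.write(json.dumps(record) + "\n")
        done += len(batch)

        # Write the checkpoint to a temp file and rename it into place, so a
        # crash never leaves a half-written progress file behind.
        tmp_path = progress_path.with_name(progress_path.name + ".tmp")
        tmp_path.write_text(json.dumps({"processed": done}))
        tmp_path.replace(progress_path)

    return done
```

Only one batch is held in memory at a time, which keeps memory bounded regardless of collection size. The trade-off is that the results file can contain one duplicated batch after a crash, so downstream consumers should deduplicate.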

File Locations

| File                                                 | Purpose                  |
|------------------------------------------------------|--------------------------|
| ~/projects/knowledge-rag/qb_enrichment.py            | Main enrichment script   |
| ~/projects/knowledge-rag/run_enrichment_queue.sh     | Queue runner             |
| ~/projects/knowledge-rag/*_enrichment_progress.json  | Progress tracking        |
| ~/projects/knowledge-rag/*_enrichment_results.jsonl  | Enrichment output        |
| /mnt/storage/knowledge-rag/chroma_db                 | ChromaDB monolith (17GB) |

Resume Instructions

# Check if running
screen -ls

# Attach to monitor
screen -r enrichment_all

# If crashed, restart the queue in a detached session
# (reuse the enrichment_all name so the attach command still works)
cd ~/projects/knowledge-rag
source venv/bin/activate
screen -dmS enrichment_all ./run_enrichment_queue.sh
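
Since concurrent access to the monolith DB segfaults, a restart must never overlap a run that is still alive. Beyond checking screen -ls by hand, one option is an advisory lock taken inside the Python script. A minimal sketch, assuming a Linux host; the lock path and the single_instance name are hypothetical, and nothing here is taken from qb_enrichment.py:

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path: str):
    """Hold an exclusive, non-blocking advisory lock for the enclosed block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise RuntimeError(
                f"another enrichment run holds {lock_path}"
            ) from None
        yield
    finally:
        # Closing the descriptor releases the lock if this caller held it.
        os.close(fd)
```

Wrapping the script's main loop in "with single_instance(...)" makes a second launch exit with an error instead of opening ChromaDB concurrently. The kernel drops the lock when the process dies, so a segfaulted run does not leave a stale lock behind.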

Queue Order

  1. ✅ quantum_biology (partial)
  2. ✅ knowledgebase
  3. ✅ math_corpus (partial)
  4. ❌ greek_corpus (segfaulted in this run, but already 100% from a prior run)
  5. 🔄 engineering_corpus (IN PROGRESS)
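  6. ⏳ tech_corpus (already 98% per the progress table)
  7. ⏳ esoteric_corpus (already 100% per the progress table)
  8. ⏳ biohacking_corpus (already 100% per the progress table)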
  9. ⏳ physicists_corpus
  10. ⏳ science_corpus
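
The queue processes one collection at a time and moves on when a collection fails, as the greek_corpus segfault shows. The contents of run_enrichment_queue.sh are not reproduced here, so the following is only a Python sketch of that behavior under stated assumptions: run_queue, describe_exit, and the build_cmd callback are hypothetical names, and the real command line for qb_enrichment.py is not shown in this document.

```python
import signal
import subprocess
from typing import Callable, Sequence


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short status string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports death by signal N as return code -N,
        # so a segfault shows up as -SIGSEGV.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"failed (exit {returncode})"


def run_queue(
    collections: Sequence[str],
    build_cmd: Callable[[str], list[str]],  # hypothetical: name -> argv
) -> dict[str, str]:
    """Run one collection at a time; never start two processes at once."""
    statuses: dict[str, str] = {}
    for name in collections:
        # subprocess.run blocks until the child exits, which keeps access
        # to the database strictly sequential.
        proc = subprocess.run(build_cmd(name))
        statuses[name] = describe_exit(proc.returncode)
    return statuses
```

Here build_cmd would return whatever command line the enrichment script actually accepts. Recording the exit status per collection makes it easy to tell a segfault from an ordinary failure when reviewing the queue afterwards.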

Next Steps

  • Manual start only — Shadow will start enrichment runs based on available token budget
  • Monitor for segfaults
  • Consider multi-DB split for parallel processing
  • Prioritize which collections matter most

⚠️ NOTE: Do NOT auto-schedule enrichment via cron. Token costs are significant. Shadow manages when to run based on budget.