ChromaDB Memory-Safe Batching

Problem

ChromaDB operations on large collections (100k+ chunks) can get OOM-killed when all data is loaded at once:

# ❌ THIS WILL OOM ON LARGE COLLECTIONS
all_data = collection.get(
    limit=500000,
    include=["documents", "metadatas"]
)

With 378k+ QB chunks and 554k+ science chunks (over 900k total), a single fetch pulls every document and metadata dict into Python objects at once, which quickly adds up to gigabytes of RAM.

Solution: Batched Fetching

Fetch in pages of 2,000-5,000 chunks:

FETCH_BATCH = 5000
offset = 0
total = collection.count()  # total number of chunks in the collection

while offset < total:
    batch_data = collection.get(
        limit=FETCH_BATCH,
        offset=offset,
        include=["documents", "metadatas"]
    )

    # Stop early if the collection shrank or the offset ran past the end
    if not batch_data["ids"]:
        break

    # Process batch...

    offset += FETCH_BATCH
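
Each iteration holds at most FETCH_BATCH documents in memory, so peak usage stays bounded regardless of collection size; the cost is extra round trips to ChromaDB, which is usually a fair trade against the per-batch processing work.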

Files Modified (2026-02-02)

qb_enrichment.py

Changed the bulk fetch to batched iteration (a sketch follows the list):

  • Fetch batch size: 5,000 chunks
  • Scans through collection incrementally
  • Checks against processed_set per batch
  • Logs progress every batch
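
A minimal sketch of that loop, assuming a collection handle, a processed_set of already-enriched chunk IDs, and a hypothetical enrich() helper; names other than the batch size are illustrative, not the actual qb_enrichment.py code:

FETCH_BATCH = 5000
offset = 0
total = collection.count()

while offset < total:
    batch = collection.get(
        limit=FETCH_BATCH,
        offset=offset,
        include=["documents", "metadatas"]
    )
    if not batch["ids"]:
        break

    # Skip chunks that were already enriched in a previous run
    for chunk_id, doc, meta in zip(batch["ids"], batch["documents"], batch["metadatas"]):
        if chunk_id in processed_set:
            continue
        enrich(chunk_id, doc, meta)   # hypothetical per-chunk enrichment step
        processed_set.add(chunk_id)

    offset += FETCH_BATCH
    print(f"Scanned {min(offset, total)}/{total} chunks", flush=True)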

chromadb_integrity_check.py (NEW)

Created a memory-safe integrity checker (sketched after this list):

  • Batch size: 2,000 chunks
  • Checks all collections sequentially
  • Validates: empty docs, missing metadata, duplicate IDs
  • Reports domain/source distribution
  • Saves JSON report on completion
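
A sketch of the checking loop under the same assumptions: a ChromaDB client handle is already open, the collection names and report file name are hypothetical, and the report structure is illustrative rather than the actual chromadb_integrity_check.py output:

import json
from collections import Counter

BATCH = 2000
report = {}

# Collection names are hypothetical; substitute the real ones
for name in ["qb_chunks", "science_chunks"]:
    coll = client.get_collection(name)
    total = coll.count()
    seen_ids = set()
    stats = {"total": total, "empty_docs": 0, "missing_metadata": 0, "duplicate_ids": 0}
    domains = Counter()

    offset = 0
    while offset < total:
        batch = coll.get(limit=BATCH, offset=offset,
                         include=["documents", "metadatas"])
        if not batch["ids"]:
            break
        for chunk_id, doc, meta in zip(batch["ids"], batch["documents"], batch["metadatas"]):
            if chunk_id in seen_ids:
                stats["duplicate_ids"] += 1
            seen_ids.add(chunk_id)
            if not doc:
                stats["empty_docs"] += 1
            if not meta:
                stats["missing_metadata"] += 1
            domains[(meta or {}).get("domain", "unknown")] += 1
        offset += BATCH

    stats["domain_distribution"] = dict(domains)
    report[name] = stats

# Save the JSON report once all collections are done
with open("integrity_report.json", "w") as f:
    json.dump(report, f, indent=2)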

Key Settings

Operation          Batch Size   Safe for
Integrity check    2,000        16GB RAM
Enrichment scan    5,000        16GB RAM
Query/search       N/A          Uses HNSW index
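
Queries go through the HNSW index and return only the requested n_results, so they need no batching; a minimal example (the query text and result count are illustrative):

results = collection.query(
    query_texts=["photosynthesis light reactions"],
    n_results=10,
    include=["documents", "metadatas", "distances"]
)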

Python Buffering Fix

For real-time output in background processes:

# Add flush=True to print statements so each line is written immediately
from datetime import datetime

def log(msg):
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp}] {msg}", flush=True)

# Or run with the unbuffered flag
python3 -u script.py
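
When the script is launched as a background job (an assumed launch pattern, not taken from the notes above), the unbuffered flag pairs with output redirection so the log can be followed live:

nohup python3 -u script.py > script.log 2>&1 &
tail -f script.log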