# ChromaDB Memory-Safe Batching

## Problem
ChromaDB operations on large collections (100k+ chunks) can be OOM-killed when loading all data at once:

```python
# ❌ THIS WILL OOM ON LARGE COLLECTIONS
all_data = collection.get(
    limit=500000,
    include=["documents", "metadatas"]
)
```

With 378k+ QB chunks and 554k+ science chunks, this tries to load gigabytes into RAM.
## Solution: Batched Fetching

Fetch in pages of 2,000-5,000 chunks:
```python
FETCH_BATCH = 5000
offset = 0
total = collection.count()  # total number of chunks in the collection

while offset < total:
    batch_data = collection.get(
        limit=FETCH_BATCH,
        offset=offset,
        include=["documents", "metadatas"]
    )
    if not batch_data["ids"]:
        break
    # Process batch...
    offset += FETCH_BATCH
```

## Files Modified (2026-02-02)
### qb_enrichment.py

Changed the bulk fetch to batched iteration (sketched below):

- Fetch batch size: 5,000 chunks
- Scans through the collection incrementally
- Checks each batch against `processed_set`
- Logs progress every batch
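
For reference, a minimal sketch of what this batched enrichment loop can look like. The client path, collection name, and the commented-out `enrich()` call are assumptions for illustration; the actual script's identifiers may differ.

```python
import chromadb

# Hypothetical setup; the real script's client path and collection name may differ.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("qb_chunks")

FETCH_BATCH = 5000
processed_set = set()   # IDs enriched on earlier runs (e.g. loaded from a checkpoint)
offset = 0
total = collection.count()

while offset < total:
    batch = collection.get(
        limit=FETCH_BATCH,
        offset=offset,
        include=["documents", "metadatas"],
    )
    if not batch["ids"]:
        break

    for cid, doc, meta in zip(batch["ids"], batch["documents"], batch["metadatas"]):
        if cid in processed_set:
            continue
        # enrich(cid, doc, meta)  # placeholder for the actual enrichment step
        processed_set.add(cid)

    offset += FETCH_BATCH
    print(f"Scanned {min(offset, total)}/{total} chunks", flush=True)
```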
### chromadb_integrity_check.py (NEW)

Created a memory-safe integrity checker (sketched below):

- Batch size: 2,000 chunks
- Checks all collections sequentially
- Validates: empty docs, missing metadata, duplicate IDs
- Reports domain/source distribution
- Saves a JSON report on completion
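
A rough sketch of the batched pass such a checker can make, assuming a 2,000-chunk page size; the collection names, the metadata `domain` key, and the report filename are illustrative, not the actual script's output.

```python
import json
from collections import Counter

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")   # path is an assumption
CHECK_BATCH = 2000
report = {}

# Collection names are placeholders; the real checker walks every collection.
for name in ["qb_chunks", "science_chunks"]:
    collection = client.get_collection(name)
    stats = {"total": collection.count(), "empty_docs": 0,
             "missing_metadata": 0, "duplicate_ids": 0}
    seen_ids = set()
    domains = Counter()
    offset = 0

    while offset < stats["total"]:
        batch = collection.get(limit=CHECK_BATCH, offset=offset,
                               include=["documents", "metadatas"])
        if not batch["ids"]:
            break
        for cid, doc, meta in zip(batch["ids"], batch["documents"], batch["metadatas"]):
            if not doc:
                stats["empty_docs"] += 1
            if not meta:
                stats["missing_metadata"] += 1
            if cid in seen_ids:
                stats["duplicate_ids"] += 1
            seen_ids.add(cid)
            domains[(meta or {}).get("domain", "unknown")] += 1  # "domain" key is assumed
        offset += CHECK_BATCH

    stats["domain_distribution"] = dict(domains)
    report[name] = stats

with open("integrity_report.json", "w") as f:
    json.dump(report, f, indent=2)
```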
## Key Settings

| Operation | Batch Size | Safe For |
|---|---|---|
| Integrity check | 2,000 | 16 GB RAM |
| Enrichment scan | 5,000 | 16 GB RAM |
| Query/search | N/A | Uses HNSW index |
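
For contrast, vector queries need no manual batching: ChromaDB answers them through the HNSW index and returns only the top-k hits. A minimal example, reusing the `collection` handle from the snippets above (the query text and `n_results` are arbitrary):

```python
results = collection.query(
    query_texts=["mitochondrial electron transport chain"],
    n_results=10,
    include=["documents", "metadatas", "distances"],
)
# Results are lists-of-lists, one inner list per query text.
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"{dist:.3f}  {doc[:80]}", flush=True)
```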
## Python Buffering Fix

For real-time output in background processes, either flush each print or run the interpreter unbuffered:

```python
from datetime import datetime

# Add flush=True to print statements so output appears immediately
def log(msg):
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp}] {msg}", flush=True)
```

```bash
# Or run the whole script with the unbuffered flag
python3 -u script.py
```