ChromaDB Memory-Safe Batching

Problem

ChromaDB operations on large collections (100k+ chunks) can get OOM-killed when all data is loaded at once:

# ❌ THIS WILL OOM ON LARGE COLLECTIONS
all_data = collection.get(
    limit=500000,
    include=["documents", "metadatas"]
)

With 378k+ QB chunks and 554k+ science chunks (over 900k total), a single fetch pulls every document and metadata dict into Python objects at once, which quickly adds up to gigabytes of RAM.

Solution: Batched Fetching

Fetch in pages of 2,000-5,000 chunks:

FETCH_BATCH = 5000
offset = 0
total = collection.count()  # total number of chunks in the collection

while offset < total:
    batch_data = collection.get(
        limit=FETCH_BATCH,
        offset=offset,
        include=["documents", "metadatas"]
    )

    # Stop early if the collection shrank or the offset ran past the end
    if not batch_data["ids"]:
        break

    # Process batch...

    offset += FETCH_BATCH
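
Each iteration holds at most FETCH_BATCH documents in memory, so peak usage stays bounded regardless of collection size; the cost is extra round trips to ChromaDB, which is usually a fair trade against the per-batch processing work.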

Files Modified (2026-02-02)

qb_enrichment.py

Changed the bulk fetch to batched iteration (a sketch follows the list):

  • Fetch batch size: 5,000 chunks
  • Scans through collection incrementally
  • Checks against processed_set per batch
  • Logs progress every batch
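
A minimal sketch of that loop, assuming a collection handle, a processed_set of already-enriched chunk IDs, and a hypothetical enrich() helper; names other than the batch size are illustrative, not the actual qb_enrichment.py code:

FETCH_BATCH = 5000
offset = 0
total = collection.count()

while offset < total:
    batch = collection.get(
        limit=FETCH_BATCH,
        offset=offset,
        include=["documents", "metadatas"]
    )
    if not batch["ids"]:
        break

    # Skip chunks that were already enriched in a previous run
    for chunk_id, doc, meta in zip(batch["ids"], batch["documents"], batch["metadatas"]):
        if chunk_id in processed_set:
            continue
        enrich(chunk_id, doc, meta)   # hypothetical per-chunk enrichment step
        processed_set.add(chunk_id)

    offset += FETCH_BATCH
    print(f"Scanned {min(offset, total)}/{total} chunks", flush=True)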

chromadb_integrity_check.py (NEW)

Created a memory-safe integrity checker (sketched after this list):

  • Batch size: 2,000 chunks
  • Checks all collections sequentially
  • Validates: empty docs, missing metadata, duplicate IDs
  • Reports domain/source distribution
  • Saves JSON report on completion
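
A sketch of the checking loop under the same assumptions: a ChromaDB client handle is already open, the collection names and report file name are hypothetical, and the report structure is illustrative rather than the actual chromadb_integrity_check.py output:

import json
from collections import Counter

BATCH = 2000
report = {}

# Collection names are hypothetical; substitute the real ones
for name in ["qb_chunks", "science_chunks"]:
    coll = client.get_collection(name)
    total = coll.count()
    seen_ids = set()
    stats = {"total": total, "empty_docs": 0, "missing_metadata": 0, "duplicate_ids": 0}
    domains = Counter()

    offset = 0
    while offset < total:
        batch = coll.get(limit=BATCH, offset=offset,
                         include=["documents", "metadatas"])
        if not batch["ids"]:
            break
        for chunk_id, doc, meta in zip(batch["ids"], batch["documents"], batch["metadatas"]):
            if chunk_id in seen_ids:
                stats["duplicate_ids"] += 1
            seen_ids.add(chunk_id)
            if not doc:
                stats["empty_docs"] += 1
            if not meta:
                stats["missing_metadata"] += 1
            domains[(meta or {}).get("domain", "unknown")] += 1
        offset += BATCH

    stats["domain_distribution"] = dict(domains)
    report[name] = stats

# Save the JSON report once all collections are done
with open("integrity_report.json", "w") as f:
    json.dump(report, f, indent=2)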

Key Settings

Operation          Batch Size   Safe for
Integrity check    2,000        16GB RAM
Enrichment scan    5,000        16GB RAM
Query/search       N/A          Uses HNSW index
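
Queries go through the HNSW index and return only the requested n_results, so they need no batching; a minimal example (the query text and result count are illustrative):

results = collection.query(
    query_texts=["photosynthesis light reactions"],
    n_results=10,
    include=["documents", "metadatas", "distances"]
)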

Python Buffering Fix

For real-time output in background processes:

# Add flush=True to print statements so each line is written immediately
from datetime import datetime

def log(msg):
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp}] {msg}", flush=True)

# Or run with the unbuffered flag
python3 -u script.py
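
When the script is launched as a background job (an assumed launch pattern, not taken from the notes above), the unbuffered flag pairs with output redirection so the log can be followed live:

nohup python3 -u script.py > script.log 2>&1 &
tail -f script.log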