OCR Pipeline for RAG Ingestion
Handles scanned PDFs that fail text extraction during initial processing.
Overview
    Source PDFs → Initial Extraction Attempt
                          ↓
                   Text extracted?
                     /         \
                   Yes      No (scanned)
                    ↓            ↓
                Chunk &     Copy to Paperless
                Ingest      consume folder
                                 ↓
                           Paperless OCR
                                 ↓
                           Watcher cron
                                 ↓
                          Chunk & Ingest
                                 ↓
                           Notify Shadow
Components
1. Paperless-ngx (OCR Engine)
- Location: localhost:8000 (Pop!_OS Docker)
- Consume folder: ~/docker-services/paperless/consume/
- Credentials: admin / changeme123
- Features: GPU-accelerated OCR, automatic text extraction
2. Paperless Watcher (paperless_watcher.py)
- Location: ~/projects/knowledge-rag/paperless_watcher.py
- Schedule: Every 10 minutes via cron
- Function:
- Polls Paperless API for new documents
- Matches against watch patterns (Bohm, Greek, Loeb, etc.)
- Retrieves OCR’d content
- Chunks and ingests to appropriate ChromaDB collection
- Sends notification when complete
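The polling and filtering steps above could be sketched roughly as follows. This is not the actual contents of paperless_watcher.py. The function names `fetch_documents` and `select_new_documents` are hypothetical, and the sketch assumes the `/api/documents/` response carries a `results` list whose entries have `id`, `title`, and `content` fields.

```python
import base64
import json
import urllib.request

# Endpoint taken from the Monitoring section of this document.
API_URL = "http://localhost:8000/api/documents/"


def fetch_documents(username: str, password: str, url: str = API_URL) -> list[dict]:
    """Fetch one page of documents from the Paperless API using basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Assumption: the payload holds a "results" list of document objects.
    return payload.get("results", [])


def select_new_documents(documents: list[dict], processed_ids: set[int]) -> list[dict]:
    """Keep documents that are not yet ingested and already have OCR text."""
    selected = []
    for doc in documents:
        if doc["id"] in processed_ids:
            continue  # already ingested on an earlier run
        if not (doc.get("content") or "").strip():
            continue  # OCR may still be running; pick it up on the next poll
        selected.append(doc)
    return selected
```

Skipping documents with empty content, rather than marking them processed, means a slow OCR job is simply retried on the next cron run.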
3. Collection Mapping
| Pattern | Target Collection |
|---|---|
| Bohm | physicists_corpus |
| Greek, greek | greek_corpus |
| Loeb, loeb | greek_corpus |
| Perseus | greek_corpus |
| (default) | knowledge_base |
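A minimal sketch of how the mapping table might be applied in code. The name `route_collection` is hypothetical, and the sketch assumes case-sensitive substring matching against the document title or filename, checked in table order. The real watcher may match differently (for example, on the prefix only).

```python
# Ordered (pattern, collection) pairs copied from the mapping table.
WATCH_PATTERNS = [
    ("Bohm", "physicists_corpus"),
    ("Greek", "greek_corpus"),
    ("greek", "greek_corpus"),
    ("Loeb", "greek_corpus"),
    ("loeb", "greek_corpus"),
    ("Perseus", "greek_corpus"),
]
DEFAULT_COLLECTION = "knowledge_base"


def route_collection(title: str) -> str:
    """Return the target collection for a document title or filename."""
    for pattern, collection in WATCH_PATTERNS:
        # Assumption: a pattern matches anywhere in the title, case-sensitively.
        if pattern in title:
            return collection
    return DEFAULT_COLLECTION


if __name__ == "__main__":
    for name in ("Bohm-Paper_Title_Year.pdf", "Loeb-Homer_Iliad.pdf", "Invoice.pdf"):
        print(name, "->", route_collection(name))
```

Keeping the patterns in an ordered list makes precedence explicit when a filename contains more than one pattern.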
Usage
Adding Scanned PDFs for OCR
- Identify skipped files in processing logs:
    grep "Skipped (no text)" ~/projects/knowledge-rag/*_process.log
- Copy to Paperless consume folder:
    cp "/path/to/scanned.pdf" ~/docker-services/paperless/consume/
- Name files with collection hints:
    # Good naming for auto-routing:
    Bohm-Paper_Title_Year.pdf → physicists_corpus
    Greek-Author_Work.pdf → greek_corpus
    Loeb-Homer_Iliad.pdf → greek_corpus
- Wait for OCR (watch logs):
    docker logs paperless --tail 50 -f | grep -i ocr
- Watcher auto-ingests and notifies when complete
Manual Trigger
cd ~/projects/knowledge-rag && source venv/bin/activate
python3 paperless_watcher.py
Cron Jobs
# Current cron entries
*/5 * * * * docker exec paperless python3 manage.py document_consumer --oneshot
*/10 * * * * cd ~/projects/knowledge-rag && ./venv/bin/python3 paperless_watcher.py
Monitoring
Check Paperless Queue
curl -s -u admin:changeme123 "http://localhost:8000/api/documents/" | jq '.count'
Check Watcher Logs
tail -f ~/projects/knowledge-rag/logs/paperless_watcher.log
Check Notification File
cat ~/projects/knowledge-rag/PAPERLESS_NOTIFY.txt
State File
paperless_watch_state.json tracks:
- processed_ids: Documents already ingested (prevents duplicates)
- last_check: Timestamp of last poll
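One way the state file could be read and updated is sketched below. The helper names `load_state` and `record_processed` are hypothetical, and the sketch assumes `processed_ids` is stored as a JSON list of integers and `last_check` as an ISO 8601 UTC timestamp. The actual on-disk format may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_state(path: Path) -> dict:
    """Load the watcher state, or return an empty state on first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"processed_ids": [], "last_check": None}


def record_processed(path: Path, new_ids) -> dict:
    """Merge newly ingested document IDs into the state and stamp the poll time."""
    state = load_state(path)
    # A set union keeps the list free of duplicates across runs.
    state["processed_ids"] = sorted(set(state["processed_ids"]) | set(new_ids))
    state["last_check"] = datetime.now(timezone.utc).isoformat()
    # Write to a temp file, then rename, so a crash cannot leave a half-written file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)
    return state
```

Calling `record_processed(Path("paperless_watch_state.json"), [12, 15])` after a successful ingest would make the next poll skip those two documents.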
Troubleshooting
OCR Not Starting
# Check Paperless container
docker logs paperless --tail 100
# Force consumer run
docker exec paperless python3 manage.py document_consumer --oneshot
Document Has No Content
- OCR may still be processing (wait 5-10 min for large docs)
- Check Paperless web UI for status: http://localhost:8000
Wrong Collection
- Rename file with correct pattern prefix before adding to consume
- Or manually move chunks between collections
Files
| File | Purpose |
|---|---|
| paperless_watcher.py | Main watcher script |
| paperless_to_chroma.py | Manual ingest (deprecated) |
| paperless_watch_state.json | State tracking |
| logs/paperless_watcher.log | Cron output |
| PAPERLESS_NOTIFY.txt | Notification flag |