OCR Pipeline for RAG Ingestion

Handles scanned PDFs that fail text extraction during initial processing.

Overview

Source PDFs → Initial Extraction Attempt
                     ↓
              Text extracted? 
                /         \
              Yes          No (scanned)
               ↓            ↓
           Chunk &      Copy to Paperless
           Ingest       consume folder
                            ↓
                      Paperless OCR
                            ↓
                      Watcher cron
                            ↓
                      Chunk & Ingest
                            ↓
                      Notify Shadow
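
The "Text extracted?" branch in the diagram could be sketched as a small check on the output of the initial extraction attempt. This is only an illustration: it assumes the extractor yields one string per page, and the names needs_ocr, route, and the min_chars threshold are hypothetical rather than taken from the pipeline.

```python
def needs_ocr(page_texts, min_chars=20):
    """Return True when a PDF looks scanned (little or no extractable text).

    page_texts: one extracted string per page (assumed extractor output).
    min_chars: hypothetical threshold; fewer total characters counts as "no text".
    """
    total = sum(len(text.strip()) for text in page_texts if text)
    return total < min_chars


def route(page_texts):
    # Mirrors the two branches of the diagram: OCR via Paperless, or direct ingest.
    return "paperless_consume" if needs_ocr(page_texts) else "chunk_and_ingest"
```

A threshold above zero is used because scanned PDFs sometimes carry a few stray characters (page stamps, watermarks) that would otherwise pass as extracted text.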

Components

1. Paperless-ngx (OCR Engine)

  • Location: localhost:8000 (Pop!_OS Docker)
  • Consume folder: ~/docker-services/paperless/consume/
  • Credentials: admin / changeme123
  • Features: GPU-accelerated OCR, automatic text extraction

2. Paperless Watcher (paperless_watcher.py)

  • Location: ~/projects/knowledge-rag/paperless_watcher.py
  • Schedule: Every 10 minutes via cron
  • Function:
    • Polls Paperless API for new documents
    • Matches against watch patterns (Bohm, Greek, Loeb, etc.)
    • Retrieves OCR’d content
    • Chunks and ingests to appropriate ChromaDB collection
    • Sends notification when complete
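
A minimal sketch of two of the steps above: picking out documents that still need ingesting, and chunking their OCR'd text. It assumes the watcher works on the "results" list returned by the Paperless /api/documents/ endpoint, whose entries carry "id" and "content" fields. The function names and the fixed-size, overlapping chunking strategy are assumptions, not the actual contents of paperless_watcher.py. Pattern matching, ChromaDB ingestion, and notification are left out.

```python
def select_new_documents(results, processed_ids):
    """Pick documents from one page of API results that still need ingesting."""
    selected = []
    for doc in results:
        if doc["id"] in processed_ids:
            continue  # already ingested on an earlier run
        if not (doc.get("content") or "").strip():
            continue  # OCR has not produced text yet; try again on the next poll
        selected.append(doc)
    return selected


def chunk_text(text, size=1000, overlap=100):
    """Split text into fixed-size character windows that overlap (assumed strategy)."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Skipping documents with empty content, instead of marking them processed, lets a later cron run pick them up once OCR has finished.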

3. Collection Mapping

Pattern         Target Collection
Bohm            physicists_corpus
Greek, greek    greek_corpus
Loeb, loeb      greek_corpus
Perseus         greek_corpus
(default)       knowledge_base
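
The mapping above could be expressed as an ordered rule list, as in the sketch below. It assumes patterns are matched as case-sensitive substrings of the document title (which would explain why both "Greek" and "greek" are listed) and that the first match wins. COLLECTION_RULES and collection_for are hypothetical names.

```python
# (substring, collection) pairs in the order of the mapping; first match wins.
COLLECTION_RULES = [
    ("Bohm", "physicists_corpus"),
    ("Greek", "greek_corpus"),
    ("greek", "greek_corpus"),
    ("Loeb", "greek_corpus"),
    ("loeb", "greek_corpus"),
    ("Perseus", "greek_corpus"),
]
DEFAULT_COLLECTION = "knowledge_base"


def collection_for(title):
    """Return the ChromaDB collection name for a document title."""
    for pattern, collection in COLLECTION_RULES:
        if pattern in title:
            return collection
    return DEFAULT_COLLECTION  # nothing matched: fall back to the default row
```

For example, collection_for("Loeb-Homer_Iliad") selects greek_corpus, while a title with no known pattern falls through to knowledge_base.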

Usage

Adding Scanned PDFs for OCR

  1. Identify skipped files in processing logs:
grep "Skipped (no text)" ~/projects/knowledge-rag/*_process.log
  2. Copy to Paperless consume folder:
cp "/path/to/scanned.pdf" ~/docker-services/paperless/consume/
  3. Name files with collection hints (rename before copying, so the watcher can route them):
# Good naming for auto-routing (filename → collection):
Bohm-Paper_Title_Year.pdf  → physicists_corpus
Greek-Author_Work.pdf      → greek_corpus
Loeb-Homer_Iliad.pdf       → greek_corpus
  4. Wait for OCR (watch logs):
docker logs paperless --tail 50 -f | grep -i ocr
  5. The watcher auto-ingests the document and sends a notification when complete

Manual Trigger

cd ~/projects/knowledge-rag && source venv/bin/activate
python3 paperless_watcher.py

Cron Jobs

# Current cron entries
*/5 * * * *  docker exec paperless python3 manage.py document_consumer --oneshot
*/10 * * * * cd ~/projects/knowledge-rag && ./venv/bin/python3 paperless_watcher.py

Monitoring

Check Paperless Queue

curl -s -u admin:changeme123 "http://localhost:8000/api/documents/" | jq '.count'

Check Watcher Logs

tail -f ~/projects/knowledge-rag/logs/paperless_watcher.log

Check Notification File

cat ~/projects/knowledge-rag/PAPERLESS_NOTIFY.txt

State File

paperless_watch_state.json tracks:

  • processed_ids: Documents already ingested (prevents duplicates)
  • last_check: Timestamp of last poll
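
A sketch of how a state file with these two keys could be read and updated. The on-disk layout (a JSON list for processed_ids, an ISO 8601 string for last_check) and the helper names are assumptions; the real file may differ.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def load_state(path):
    """Read the state file, or return an empty state if it does not exist yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            state = json.load(fh)
    except FileNotFoundError:
        state = {}
    state.setdefault("processed_ids", [])
    state.setdefault("last_check", None)
    return state


def mark_processed(state, doc_id):
    """Record a document as ingested and refresh the poll timestamp."""
    if doc_id not in state["processed_ids"]:
        state["processed_ids"].append(doc_id)
    state["last_check"] = datetime.now(timezone.utc).isoformat()


def save_state(path, state):
    """Write via a temp file so an interrupted run cannot leave a truncated file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_path, path)
```

Saving the state only after a document has been ingested means a failed run retries that document on the next poll rather than skipping it.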

Troubleshooting

OCR Not Starting

# Check Paperless container
docker logs paperless --tail 100
 
# Force consumer run
docker exec paperless python3 manage.py document_consumer --oneshot

Document Has No Content

  • OCR may still be processing (wait 5-10 min for large docs)
  • Check Paperless web UI for status: http://localhost:8000

Wrong Collection

  • Rename file with correct pattern prefix before adding to consume
  • Or manually move chunks between collections
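
Moving chunks by hand could look roughly like the sketch below, which uses the ChromaDB collection methods get, add, and delete. It assumes each chunk's metadata records its originating file under a "source" key; that key name is hypothetical and should be replaced with whatever the ingest step actually stores. move_chunks is likewise an illustrative helper, not part of the project.

```python
def move_chunks(source, target, where):
    """Copy chunks matching a metadata filter into target, then delete them from source.

    source, target: chromadb Collection objects (e.g. from client.get_collection).
    where: ChromaDB metadata filter, e.g. {"source": "Loeb-Homer_Iliad.pdf"}
           (the "source" key is an assumption about the stored metadata).
    Returns the number of chunks moved.
    """
    found = source.get(where=where, include=["documents", "metadatas"])
    ids = found["ids"]
    if not ids:
        return 0
    # Embeddings are not copied, so the target collection re-embeds the text.
    target.add(ids=ids, documents=found["documents"], metadatas=found["metadatas"])
    # Delete only after the add succeeded, so a failure never loses chunks.
    source.delete(ids=ids)
    return len(ids)
```

Letting the target collection re-embed avoids mixing vectors if the two collections were built with different embedding functions, at the cost of recomputing embeddings for the moved chunks.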

Files

File                          Purpose
paperless_watcher.py          Main watcher script
paperless_to_chroma.py        Manual ingest (deprecated)
paperless_watch_state.json    State tracking
logs/paperless_watcher.log    Cron output
PAPERLESS_NOTIFY.txt          Notification flag