OCR Pipeline for RAG Ingestion

Handles scanned PDFs that fail text extraction during initial processing.

Overview

Source PDFs → Initial Extraction Attempt
                     ↓
              Text extracted? 
                /         \
              Yes          No (scanned)
               ↓            ↓
           Chunk &      Copy to Paperless
           Ingest       consume folder
                            ↓
                      Paperless OCR
                            ↓
                      Watcher cron
                            ↓
                      Chunk & Ingest
                            ↓
                      Notify Shadow
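
The branch at the top of the diagram can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `route_pdf` and the `min_chars` threshold are hypothetical names, and `extracted_text` is assumed to come from whatever extractor the initial pass uses.

```python
import shutil
from pathlib import Path

# Consume folder as listed under Components; adjust if yours differs.
CONSUME_DIR = Path.home() / "docker-services/paperless/consume"


def route_pdf(pdf_path, extracted_text, consume_dir=CONSUME_DIR, min_chars=1):
    """Decide what happens to a PDF after the initial extraction attempt.

    Returns "ingest" when usable text was extracted, or "ocr" after copying
    the file into the Paperless consume folder. `min_chars` is a hypothetical
    cutoff for treating a PDF as scanned.
    """
    if len(extracted_text.strip()) >= min_chars:
        return "ingest"
    consume_dir = Path(consume_dir)
    consume_dir.mkdir(parents=True, exist_ok=True)
    # Copy rather than move, so the source PDF stays where it was.
    shutil.copy2(pdf_path, consume_dir / Path(pdf_path).name)
    return "ocr"
```

Copying instead of moving mirrors the "Copy to Paperless consume folder" step, since Paperless typically removes files from the consume folder once it has processed them.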

Components

1. Paperless-ngx (OCR Engine)

Location: localhost:8000 (Pop!_OS Docker)
Consume folder: ~/docker-services/paperless/consume/
Credentials: admin / changeme123
Features: GPU-accelerated OCR, automatic text extraction

2. Paperless Watcher (`paperless_watcher.py`)

Location: ~/projects/knowledge-rag/paperless_watcher.py
Schedule: Every 10 minutes via cron
Function:
- Polls Paperless API for new documents
- Matches against watch patterns (Bohm, Greek, Loeb, etc.)
- Retrieves OCR’d content
- Chunks and ingests to appropriate ChromaDB collection
- Sends notification when complete
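
The polling and de-duplication steps might look something like the sketch below. It is not taken from `paperless_watcher.py`: the function names are hypothetical, and it assumes the same basic-auth access to `/api/documents/` that the monitoring command in this note uses. Pattern matching, chunking, and ingestion are left out.

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:8000/api/documents/"


def fetch_documents(username, password, url=API_URL):
    """Fetch all documents from the Paperless REST API, following pagination."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    documents = []
    while url:
        request = urllib.request.Request(
            url, headers={"Authorization": f"Basic {token}"}
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            page = json.load(response)
        documents.extend(page["results"])
        url = page.get("next")  # None on the last page
    return documents


def select_new_documents(documents, processed_ids):
    """Drop documents whose id has already been ingested."""
    seen = set(processed_ids)
    return [doc for doc in documents if doc["id"] not in seen]
```

Keeping the filter separate from the HTTP call makes it possible to exercise the selection logic without a running Paperless instance.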

3. Collection Mapping

Pattern         Target Collection
Bohm            physicists_corpus
Greek, greek    greek_corpus
Loeb, loeb      greek_corpus
Perseus         greek_corpus
(default)       knowledge_base
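
As an illustration, the mapping above could be expressed as an ordered rule list. The sketch assumes case-sensitive substring matching against the document title; the watcher's real matching logic is not shown in this note, and `collection_for` is a hypothetical name.

```python
# (patterns, collection) pairs taken from the mapping table, checked in order.
COLLECTION_RULES = [
    (("Bohm",), "physicists_corpus"),
    (("Greek", "greek"), "greek_corpus"),
    (("Loeb", "loeb"), "greek_corpus"),
    (("Perseus",), "greek_corpus"),
]
DEFAULT_COLLECTION = "knowledge_base"


def collection_for(title):
    """Return the target collection for a document title."""
    for patterns, collection in COLLECTION_RULES:
        # Assumption: a pattern matches anywhere in the title, case-sensitively.
        if any(pattern in title for pattern in patterns):
            return collection
    return DEFAULT_COLLECTION
```

For example, `collection_for("Loeb-Homer_Iliad")` falls through the first two rules and lands in `greek_corpus`, while a title with no listed pattern goes to `knowledge_base`.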

Usage

Adding Scanned PDFs for OCR

Identify skipped files in processing logs:

grep "Skipped (no text)" ~/projects/knowledge-rag/*_process.log

Copy to Paperless consume folder:

cp "/path/to/scanned.pdf" ~/docker-services/paperless/consume/

Name files with collection hints:

# Good naming for auto-routing:
Bohm-Paper_Title_Year.pdf      → physicists_corpus
Greek-Author_Work.pdf          → greek_corpus
Loeb-Homer_Iliad.pdf           → greek_corpus

Wait for OCR (watch logs):

docker logs paperless --tail 50 -f 2>&1 | grep -i ocr

The watcher picks up the OCR'd document on its next cron run (every 10 minutes), ingests it, and sends a notification when complete.
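
This note does not describe how the watcher chunks text before ingesting it. One common approach, shown here purely as an assumption, is fixed-size character windows with overlap. `chunk_text`, the 1000-character size, and the 200-character overlap are all hypothetical.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into overlapping character windows.

    Each chunk shares `overlap` characters with the previous one, so a
    sentence cut at a boundary still appears whole in one of the chunks.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size].strip()
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

OCR output tends to be noisy around page breaks, so some overlap between chunks is a cheap way to avoid losing context at the boundaries.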

Manual Trigger

cd ~/projects/knowledge-rag && source venv/bin/activate
python3 paperless_watcher.py

Cron Jobs

# Current cron entries
*/5 * * * *  docker exec paperless python3 manage.py document_consumer --oneshot
*/10 * * * * cd ~/projects/knowledge-rag && ./venv/bin/python3 paperless_watcher.py

Monitoring

Check Paperless Queue

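# Prints the total number of documents Paperless has indexed (not the files still waiting in the consume folder)
curl -s -u admin:changeme123 "http://localhost:8000/api/documents/" | jq '.count'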

Check Watcher Logs

tail -f ~/projects/knowledge-rag/logs/paperless_watcher.log

Check Notification File

cat ~/projects/knowledge-rag/PAPERLESS_NOTIFY.txt

State File

paperless_watch_state.json tracks:

- processed_ids: Documents already ingested (prevents duplicates)
- last_check: Timestamp of last poll
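
A minimal sketch of reading and updating such a state file follows. The exact on-disk format is an assumption (a JSON object holding a list of ids and an ISO 8601 timestamp), and `load_state`, `save_state`, and `mark_processed` are hypothetical helper names.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

STATE_PATH = Path("paperless_watch_state.json")


def load_state(path=STATE_PATH):
    """Return the saved state, or an empty state on first run."""
    path = Path(path)
    if not path.exists():
        return {"processed_ids": [], "last_check": None}
    return json.loads(path.read_text())


def save_state(state, path=STATE_PATH):
    """Write to a temp file and rename, so a crash cannot leave a half-written file."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)


def mark_processed(state, doc_id):
    """Record a document id and refresh the poll timestamp."""
    if doc_id not in state["processed_ids"]:
        state["processed_ids"].append(doc_id)
    state["last_check"] = datetime.now(timezone.utc).isoformat()
    return state
```

Saving after each document, rather than once at the end of a run, would keep an interrupted cron run from re-ingesting documents it had already finished.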

Troubleshooting

OCR Not Starting

# Check Paperless container
docker logs paperless --tail 100
 
# Force consumer run
docker exec paperless python3 manage.py document_consumer --oneshot

Document Has No Content

- OCR may still be running; large documents can take 5-10 minutes.
- Check the document's status in the Paperless web UI: http://localhost:8000
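
One way a watcher could tolerate this case is to set aside documents whose `content` field is still empty and leave them out of `processed_ids`, so the next poll retries them. Whether `paperless_watcher.py` behaves this way is not stated here; `split_ready` and `min_chars` are hypothetical.

```python
def split_ready(documents, min_chars=1):
    """Separate documents that have OCR text from those still lacking it."""
    ready, pending = [], []
    for doc in documents:
        # `content` may be missing, None, or whitespace while OCR is unfinished.
        content = (doc.get("content") or "").strip()
        if len(content) >= min_chars:
            ready.append(doc)
        else:
            pending.append(doc)
    return ready, pending
```

Only the `ready` documents would then be chunked and recorded as processed; the `pending` ones are simply seen again on the following run.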

Wrong Collection

- Rename the file with the correct pattern prefix before copying it to the consume folder.
- Alternatively, move the chunks between collections manually.
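
Moving chunks by hand could look roughly like the sketch below. It relies only on the ChromaDB collection methods `get`, `add`, and `delete`; the `move_chunks` helper and the metadata filter are hypothetical, and the sketch assumes each chunk carries metadata that identifies its source document.

```python
def move_chunks(source, target, where):
    """Move chunks matching a metadata filter from one collection to another.

    `source` and `target` are ChromaDB collection objects. Embeddings are not
    copied; the target collection re-embeds the documents when they are added.
    Returns the number of chunks moved.
    """
    found = source.get(where=where, include=["documents", "metadatas"])
    ids = found["ids"]
    if not ids:
        return 0
    target.add(ids=ids, documents=found["documents"], metadatas=found["metadatas"])
    # Delete only after the add succeeded, so a failure does not lose chunks.
    source.delete(ids=ids)
    return len(ids)
```

A call might be `move_chunks(knowledge_base, greek_corpus, {"source": "Loeb-Homer_Iliad.pdf"})`, where the `source` metadata key is an assumption about how chunks were tagged at ingest time.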

Files

File                            Purpose
`paperless_watcher.py`          Main watcher script
`paperless_to_chroma.py`        Manual ingest (deprecated)
`paperless_watch_state.json`    State tracking
`logs/paperless_watcher.log`    Cron output
`PAPERLESS_NOTIFY.txt`          Notification flag
