RAG Corpus Inventory

Complete inventory of book categories available for Knowledge RAG ingestion. Source: /mnt/storage/books_organized/


📊 Summary

| Metric | Value |
| --- | --- |
| Total Categories | 24 |
| Total Files | 8,673 |
| Total Size | ~178 GB |
| Source | /mnt/storage/books_organized/ |
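The summary figures can be recomputed from the source directory. The following is a minimal sketch, assuming each immediate subdirectory of the source path is one category; `inventory` and `summarize` are hypothetical helper names and are not part of patentbot.py.

```python
from pathlib import Path


def inventory(root):
    """Return {category: (file_count, total_bytes)} per top-level directory.

    Assumption: every immediate subdirectory of `root` is one category.
    """
    stats = {}
    for category in sorted(Path(root).iterdir()):
        if not category.is_dir():
            continue
        # Count regular files at any depth below the category directory.
        files = [p for p in category.rglob("*") if p.is_file()]
        stats[category.name] = (len(files), sum(p.stat().st_size for p in files))
    return stats


def summarize(stats):
    """Collapse per-category stats into (categories, files, bytes)."""
    total_files = sum(count for count, _ in stats.values())
    total_bytes = sum(size for _, size in stats.values())
    return len(stats), total_files, total_bytes
```

Calling `summarize(inventory("/mnt/storage/books_organized/"))` should yield the category count, file count, and byte total to compare against the table above.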

🗂️ Category Inventory

🔴 Priority 1: Active Corpora

| Category | Files | Size | Corpus Name | Status |
| --- | --- | --- | --- | --- |
| physicists_intellectuals | 380 | 1.8G | physicists_corpus | 🔄 Indexing (132/380) |
| Medicine | 171 | 11G | medical_corpus | 📋 Staged |
| Botany | 44 | 865M | medical_corpus | 📋 Staged |
| Biohacking | 7 | 14M | medical_corpus | 📋 Staged |
| History | 58 | 1.6G | greek_corpus (partial) | 📋 Partial |
| Esoteric | 26 | 368M | esoteric_corpus | 📋 Planned |

🟡 Priority 2: High Value

| Category | Files | Size | Potential Corpus | Notes |
| --- | --- | --- | --- | --- |
| Science | 1,055 | 28G | science_corpus | Biology, chemistry, physics |
| Engineering | 1,008 | 34G | engineering_corpus | Broad technical coverage |
| Tech | 783 | 15G | tech_corpus | Programming, systems |
| Math | 329 | 7.1G | math_corpus | Foundational |

🟢 Priority 3: Reference

| Category | Files | Size | Potential Corpus | Notes |
| --- | --- | --- | --- | --- |
| Finance | 72 | 2.3G | finance_corpus | Trading, economics |
| Survival | 23 | 745M | survival_corpus | Preparedness |
| Food | 14 | 556M | food_corpus | Cooking, nutrition |
| Philosophy | 6 | 97M | philosophy_corpus | Ethics, logic |
| Conspiracy | 17 | 159M | conspiracy_corpus | Alternative history |
| Games | 43 | 543M | | RPG, board games |

⚪ Lower Priority / Archive

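| Category | Files | Size | Notes |
| --- | --- | --- | --- |
| Magazines | 2,226 | 55G | Periodicals, less structured |
| Self-Help | 910 | 3.5G | Mixed quality |
| Fiction | 798 | 7.7G | Novels, stories |
| Uncategorized | 635 | 9.6G | Needs sorting |
| Regional | 52 | 943M | Local interest |
| Economics | 12 | 13M | Small collection |
| Psychology | 2 | 11M | Minimal |
| Politics | 2 | 68M | Minimal |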

🎯 Corpus Pipeline Status

Active Corpora

| Corpus | Sources | Chunks | Status |
| --- | --- | --- | --- |
| quantum_biology (kruse) | Kruse Patreon | 40,002 | ✅ Enriched |
| science_corpus | Mixed science | ~7,000 | ✅ Ready |
| physicists_corpus | physicists_intellectuals | ~17k+ | 🔄 Indexing |
| osint_corpus | OSINT books | ~2,000 | ✅ Ready |
| antiquities_corpus | Greek texts | ~1,500 | ✅ Ready |

Planned Corpora

| Corpus | Sources | Est. Chunks | Priority |
| --- | --- | --- | --- |
| medical_corpus | Medicine + Botany + Biohacking | ~325k | 🔴 Critical |
| greek_corpus | History/Ancient + scraped | ~50k | 🔴 Critical |
| esoteric_corpus | Esoteric + Occult texts | ~15k | 🟡 High |
| engineering_corpus | Engineering | ~200k | 🟡 High |
| tech_corpus | Tech | ~150k | 🟡 High |

📁 Detailed Breakdowns

Science (1,055 files, 28G)

Science/
├── Biology-Molecular/     # Molecular biology, genetics
├── Biology-General/       # General biology
├── Physics/               # Classical & modern physics
├── Physics-Quantum/       # Quantum mechanics
├── Physics-Thermo/        # Thermodynamics
├── Chemistry/             # General chemistry
└── (other subdirs)

Engineering (1,008 files, 34G)

Engineering/
├── Electrical/            # EE, circuits
├── Mechanical/            # ME, materials
├── Civil/                 # Structures
├── Nanotechnology/        # Nano-scale
├── Chemical/              # ChemE
└── (other subdirs)

Medicine (171 files, 11G)

Medicine/
├── Cardiology/            # Heart
├── Clinical/              # Clinical practice
├── Psychiatry/            # Mental health
├── Surgery/               # Surgical texts
├── Pharmacology texts
├── Anatomy atlases
└── Reference handbooks

Tech (783 files, 15G)

Tech/
├── Programming/           # Languages, frameworks
├── Systems/               # OS, infrastructure
├── Security/              # Cybersec
├── Networking/            # Networks
├── DevOps/                # CI/CD, containers
└── (other subdirs)

History/Ancient (Greek subset of the 58 History files)

History/Ancient/
├── Greek grammar & language
├── Greek Magical Papyri (PGM)
├── Hermetica
├── Orphic texts
├── Classical authors
└── Ancient medicine refs

🔗 Cross-Domain Mapping

┌─────────────────────────────────────────────────────────────────┐
│                    CORPUS RELATIONSHIPS                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  greek_corpus ◄─────────► medical_corpus                        │
│       │           Pharmakon Miner                                │
│       │                    │                                     │
│       ▼                    ▼                                     │
│  esoteric_corpus    science_corpus                              │
│       │                    │                                     │
│       └────────┬───────────┘                                     │
│                ▼                                                  │
│       physicists_corpus                                          │
│                │                                                  │
│                ▼                                                  │
│       quantum_biology (Kruse)                                    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
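The relationships in the diagram can also be written down as a small directed graph, which makes it easy to ask which corpora sit downstream of a given one. This is a sketch under one reading of the arrows (the greek/medical link treated as bidirectional, all others pointing downward); `LINKS` and `downstream` are illustrative names, not existing pipeline configuration.

```python
from collections import deque

# Directed edges as read from the diagram (an interpretation, not confirmed
# pipeline config). greek_corpus and medical_corpus link in both directions.
LINKS = {
    "greek_corpus": ["medical_corpus", "esoteric_corpus"],
    "medical_corpus": ["greek_corpus", "science_corpus"],
    "esoteric_corpus": ["physicists_corpus"],
    "science_corpus": ["physicists_corpus"],
    "physicists_corpus": ["quantum_biology"],
    "quantum_biology": [],
}


def downstream(start, links=LINKS):
    """Return every corpus reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        for nxt in links.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

For example, `downstream("esoteric_corpus")` walks through physicists_corpus to quantum_biology, mirroring the lower half of the diagram.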

🛠️ Ingest Commands

Medical Corpus (Priority)

python patentbot.py ingest \
  --corpus medical_corpus \
  --source /mnt/storage/books_organized/Medicine/ \
  --source /mnt/storage/books_organized/Botany/ \
  --source /mnt/storage/books_organized/Biohacking/

Science Corpus

python patentbot.py ingest \
  --corpus science_corpus \
  --source /mnt/storage/books_organized/Science/

Engineering Corpus

python patentbot.py ingest \
  --corpus engineering_corpus \
  --source /mnt/storage/books_organized/Engineering/

Esoteric Corpus

python patentbot.py ingest \
  --corpus esoteric_corpus \
  --source /mnt/storage/books_organized/Esoteric/ \
  --source /mnt/storage/books_organized/History/Ancient/
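
The commands above all follow one pattern: one `--corpus` flag plus one `--source` flag per category directory. A minimal sketch of generating them from a mapping, assuming only the `ingest`, `--corpus`, and `--source` arguments shown above; `CORPUS_SOURCES`, `ingest_argv`, and `ingest_command` are hypothetical names introduced for illustration.

```python
import shlex

ROOT = "/mnt/storage/books_organized"

# Corpus -> category directories, copied from the commands above.
CORPUS_SOURCES = {
    "medical_corpus": ["Medicine", "Botany", "Biohacking"],
    "science_corpus": ["Science"],
    "engineering_corpus": ["Engineering"],
    "esoteric_corpus": ["Esoteric", "History/Ancient"],
}


def ingest_argv(corpus, sources=CORPUS_SOURCES, root=ROOT):
    """Build the argument list for one ingest run (nothing is executed)."""
    argv = ["python", "patentbot.py", "ingest", "--corpus", corpus]
    for category in sources[corpus]:
        argv += ["--source", f"{root}/{category}/"]
    return argv


def ingest_command(corpus):
    """Render the same arguments as a single shell-quoted line for review."""
    return shlex.join(ingest_argv(corpus))
```

Printing `ingest_command(name)` for each key gives a dry-run listing; passing `ingest_argv(name)` to `subprocess.run` would execute it.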

📋 Ingestion Priority Queue

  1. [ACTIVE] physicists_corpus — 132/380 indexing
  2. [NEXT] medical_corpus — ~325k chunks staged
  3. [NEXT] greek_corpus — Scrapers + local texts
  4. [PLANNED] esoteric_corpus — Occult + Hermetic
  5. [PLANNED] science_corpus expansion
  6. [PLANNED] engineering_corpus
  7. [PLANNED] tech_corpus

📊 Chunk Estimates

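| Corpus | Source Files | Est. Pages | Est. Chunks |
| --- | --- | --- | --- |
| medical | 222 | ~50k | ~325k |
| science | 1,055 | ~200k | ~1.3M |
| engineering | 1,008 | ~180k | ~1.2M |
| tech | 783 | ~120k | ~800k |
| esoteric | 84 | ~15k | ~100k |
| Total potential | | | ~3.7M chunks |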
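
The page and chunk columns imply roughly 6.5 chunks per page (for example, ~325k chunks from ~50k pages). The sketch below reproduces the estimates under that assumption; the ratio is inferred from the table and is not a documented chunker setting, and the names are illustrative.

```python
# Inferred from the table (~325k chunks / ~50k pages); an assumption,
# not a documented chunker setting.
CHUNKS_PER_PAGE = 6.5

# Estimated page counts, copied from the table above.
EST_PAGES = {
    "medical": 50_000,
    "science": 200_000,
    "engineering": 180_000,
    "tech": 120_000,
    "esoteric": 15_000,
}


def estimate_chunks(pages, chunks_per_page=CHUNKS_PER_PAGE):
    """Scale a page count by the assumed chunks-per-page ratio."""
    return round(pages * chunks_per_page)


estimates = {name: estimate_chunks(pages) for name, pages in EST_PAGES.items()}
total = sum(estimates.values())
```

With these inputs the total comes to 3,672,500 chunks, consistent with the ~3.7M figure after rounding.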


Comprehensive inventory for multi-domain RAG system.