RAG Corpus Inventory
Complete inventory of book categories available for Knowledge RAG ingestion. Source: /mnt/storage/books_organized/
📊 Summary
| Metric | Value |
|---|---|
| Total Categories | 24 |
| Total Files | 8,673 |
| Total Size | ~178 GB |
| Source | /mnt/storage/books_organized/ |
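The per-category file counts and sizes in this inventory can be regenerated with a short script. The following is a minimal sketch, assuming each top-level directory under the source path is one category; `summarize_categories` and `format_size` are hypothetical helper names, not part of any existing tool.

```python
import os
from pathlib import Path


def summarize_categories(root):
    """Return {category: (file_count, total_bytes)} for each top-level directory."""
    summary = {}
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            continue  # loose files at the top level are not categories
        count = 0
        size = 0
        for dirpath, _dirnames, filenames in os.walk(entry):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue  # skip symlinks so nothing is counted twice
                count += 1
                size += os.path.getsize(path)
        summary[entry.name] = (count, size)
    return summary


def format_size(num_bytes):
    """Format a byte count with binary units, capped at G (e.g. 1.8G)."""
    value = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if value < 1024 or unit == "G":
            break
        value /= 1024
    return f"{value:.1f}{unit}"


if __name__ == "__main__":
    root = "/mnt/storage/books_organized/"
    if os.path.isdir(root):
        for category, (count, size) in summarize_categories(root).items():
            print(f"| {category} | {count:,} | {format_size(size)} |")
```

The output rows follow the same `| Category | Files | Size |` column order as the tables in this inventory, so they can be pasted in after a rescan.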
🗂️ Category Inventory
🔴 Priority 1: Active Corpora
| Category | Files | Size | Corpus Name | Status |
|---|---|---|---|---|
| physicists_intellectuals | 380 | 1.8G | physicists_corpus | 🔄 Indexing (132/380) |
| Medicine | 171 | 11G | medical_corpus | 📋 Staged |
| Botany | 44 | 865M | medical_corpus | 📋 Staged |
| Biohacking | 7 | 14M | medical_corpus | 📋 Staged |
| History | 58 | 1.6G | greek_corpus (partial) | 📋 Partial |
| Esoteric | 26 | 368M | esoteric_corpus | 📋 Planned |
🟡 Priority 2: High Value
| Category | Files | Size | Potential Corpus | Notes |
|---|---|---|---|---|
| Science | 1,055 | 28G | science_corpus | Biology, chemistry, physics |
| Engineering | 1,008 | 34G | engineering_corpus | Broad technical coverage |
| Tech | 783 | 15G | tech_corpus | Programming, systems |
| Math | 329 | 7.1G | math_corpus | Foundational |
🟢 Priority 3: Reference
| Category | Files | Size | Potential Corpus | Notes |
|---|---|---|---|---|
| Finance | 72 | 2.3G | finance_corpus | Trading, economics |
| Survival | 23 | 745M | survival_corpus | Preparedness |
| Food | 14 | 556M | food_corpus | Cooking, nutrition |
| Philosophy | 6 | 97M | philosophy_corpus | Ethics, logic |
| Conspiracy | 17 | 159M | conspiracy_corpus | Alternative history |
| Games | 43 | 543M | — | RPG, board games |
⚪ Lower Priority / Archive
| Category | Files | Size | Notes |
|---|---|---|---|
| Magazines | 2,226 | 55G | Periodicals, less structured |
| Self-Help | 910 | 3.5G | Mixed quality |
| Fiction | 798 | 7.7G | Novels, stories |
| Uncategorized | 635 | 9.6G | Needs sorting |
| Regional | 52 | 943M | Local interest |
| Economics | 12 | 13M | Small collection |
| Psychology | 2 | 11M | Minimal |
| Politics | 2 | 68M | Minimal |
🎯 Corpus Pipeline Status
Active Corpora
| Corpus | Sources | Chunks | Status |
|---|---|---|---|
| quantum_biology (Kruse) | Kruse Patreon | 40,002 | ✅ Enriched |
| science_corpus | Mixed science | ~7,000 | ✅ Ready |
| physicists_corpus | physicists_intellectuals | ~17k+ | 🔄 Indexing |
| osint_corpus | OSINT books | ~2,000 | ✅ Ready |
| antiquities_corpus | Greek texts | ~1,500 | ✅ Ready |
Planned Corpora
| Corpus | Sources | Est. Chunks | Priority |
|---|---|---|---|
| medical_corpus | Medicine + Botany + Biohacking | ~325k | 🔴 Critical |
| greek_corpus | History/Ancient + scraped | ~50k | 🔴 Critical |
| esoteric_corpus | Esoteric + Occult texts | ~15k | 🟡 High |
| engineering_corpus | Engineering | ~200k | 🟡 High |
| tech_corpus | Tech | ~150k | 🟡 High |
📁 Detailed Breakdowns
Science (1,055 files, 28G)
Science/
├── Biology-Molecular/ # Molecular biology, genetics
├── Biology-General/ # General biology
├── Physics/ # Classical & modern physics
├── Physics-Quantum/ # Quantum mechanics
├── Physics-Thermo/ # Thermodynamics
├── Chemistry/ # General chemistry
└── (other subdirs)
Engineering (1,008 files, 34G)
Engineering/
├── Electrical/ # EE, circuits
├── Mechanical/ # ME, materials
├── Civil/ # Structures
├── Nanotechnology/ # Nano-scale
├── Chemical/ # ChemE
└── (other subdirs)
Medicine (171 files, 11G)
Medicine/
├── Cardiology/ # Heart
├── Clinical/ # Clinical practice
├── Psychiatry/ # Mental health
├── Surgery/ # Surgical texts
├── Pharmacology texts
├── Anatomy atlases
└── Reference handbooks
Tech (783 files, 15G)
Tech/
├── Programming/ # Languages, frameworks
├── Systems/ # OS, infrastructure
├── Security/ # Cybersec
├── Networking/ # Networks
├── DevOps/ # CI/CD, containers
└── (other subdirs)
History/Ancient (58 files in History, Greek subset)
History/Ancient/
├── Greek grammar & language
├── Greek Magical Papyri (PGM)
├── Hermetica
├── Orphic texts
├── Classical authors
└── Ancient medicine refs
🔗 Cross-Domain Mapping
┌─────────────────────────────────────────────────────────────────┐
│                      CORPUS RELATIONSHIPS                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   greek_corpus ◄─────────► medical_corpus                       │
│        │         Pharmakon Miner                                │
│        │                    │                                   │
│        ▼                    ▼                                   │
│   esoteric_corpus     science_corpus                            │
│        │                    │                                   │
│        └────────┬───────────┘                                   │
│                 ▼                                               │
│         physicists_corpus                                       │
│                 │                                               │
│                 ▼                                               │
│      quantum_biology (Kruse)                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
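For tooling that needs these relationships in machine-readable form, the diagram can be written down as an adjacency map. This is only a sketch of one reading of the arrows: the direction of each edge is an assumption, the two-way greek/medical link (labelled "Pharmakon Miner") is stored in both directions, and `CORPUS_LINKS` and `reachable_from` are hypothetical names.

```python
from collections import deque

# Edges as read from the diagram; directions are an interpretation, not a spec.
CORPUS_LINKS = {
    "greek_corpus": ["medical_corpus", "esoteric_corpus"],
    "medical_corpus": ["greek_corpus", "science_corpus"],
    "esoteric_corpus": ["physicists_corpus"],
    "science_corpus": ["physicists_corpus"],
    "physicists_corpus": ["quantum_biology"],
    "quantum_biology": [],
}


def reachable_from(start, links=CORPUS_LINKS):
    """Return corpora reachable from start in breadth-first order, excluding start."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in links.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


if __name__ == "__main__":
    print(reachable_from("esoteric_corpus"))
    # → ['physicists_corpus', 'quantum_biology']
```

A lookup like this could help decide which neighbouring corpora to consult when a query lands in one domain, though how the RAG system actually routes queries is not specified here.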
🛠️ Ingest Commands
Medical Corpus (Priority)
python patentbot.py ingest \
--corpus medical_corpus \
--source /mnt/storage/books_organized/Medicine/ \
--source /mnt/storage/books_organized/Botany/ \
--source /mnt/storage/books_organized/Biohacking/
Science Corpus
python patentbot.py ingest \
--corpus science_corpus \
--source /mnt/storage/books_organized/Science/
Engineering Corpus
python patentbot.py ingest \
--corpus engineering_corpus \
--source /mnt/storage/books_organized/Engineering/
Esoteric Corpus
python patentbot.py ingest \
--corpus esoteric_corpus \
--source /mnt/storage/books_organized/Esoteric/ \
--source /mnt/storage/books_organized/History/Ancient/
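The commands above all follow one pattern: a `--corpus` name followed by one `--source` flag per category directory. A minimal sketch of generating them from a single mapping is shown below. It uses only the subcommand and flags that appear above; `CORPUS_SOURCES` and `build_ingest_command` are hypothetical names introduced for illustration.

```python
import shlex

BOOKS_ROOT = "/mnt/storage/books_organized"

# Corpus name -> category directories, copied from the commands in this section.
CORPUS_SOURCES = {
    "medical_corpus": ["Medicine", "Botany", "Biohacking"],
    "science_corpus": ["Science"],
    "engineering_corpus": ["Engineering"],
    "esoteric_corpus": ["Esoteric", "History/Ancient"],
}


def build_ingest_command(corpus, root=BOOKS_ROOT):
    """Return the argv list for ingesting one corpus."""
    if corpus not in CORPUS_SOURCES:
        raise KeyError(f"unknown corpus: {corpus}")
    argv = ["python", "patentbot.py", "ingest", "--corpus", corpus]
    for category in CORPUS_SOURCES[corpus]:
        argv += ["--source", f"{root}/{category}/"]
    return argv


if __name__ == "__main__":
    # Print the commands for review rather than running them.
    for corpus in CORPUS_SOURCES:
        print(shlex.join(build_ingest_command(corpus)))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` from the directory that contains `patentbot.py` once the printed commands look right.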
📋 Ingestion Priority Queue
- [ACTIVE] physicists_corpus: indexing (132/380)
- [NEXT] medical_corpus: ~325k chunks staged
- [NEXT] greek_corpus: scrapers + local texts
- [PLANNED] esoteric_corpus: Occult + Hermetic
- [PLANNED] science_corpus expansion
- [PLANNED] engineering_corpus
- [PLANNED] tech_corpus
📊 Chunk Estimates
| Corpus | Source Files | Est. Pages | Est. Chunks |
|---|---|---|---|
| medical | 222 | ~50k | ~325k |
| science | 1,055 | ~200k | ~1.3M |
| engineering | 1,008 | ~180k | ~1.2M |
| tech | 783 | ~120k | ~800k |
| esoteric | 84 | ~15k | ~100k |
| Total potential | — | — | ~3.7M chunks |
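The chunk figures in this table are consistent with roughly 6.5 chunks per estimated page (325k / 50k for the medical row). That ratio is inferred from the table, not taken from a documented chunker setting, so the sketch below treats it as an assumption and simply reproduces the arithmetic.

```python
# Assumption: ratio inferred from the medical row (325k chunks / 50k pages).
CHUNKS_PER_PAGE = 6.5

# Estimated page counts from the table above.
EST_PAGES = {
    "medical": 50_000,
    "science": 200_000,
    "engineering": 180_000,
    "tech": 120_000,
    "esoteric": 15_000,
}


def estimate_chunks(pages, chunks_per_page=CHUNKS_PER_PAGE):
    """Estimate the number of chunks for a given page count."""
    return round(pages * chunks_per_page)


estimates = {name: estimate_chunks(pages) for name, pages in EST_PAGES.items()}
total = sum(estimates.values())

if __name__ == "__main__":
    for name, chunks in estimates.items():
        print(f"{name}: ~{chunks:,} chunks")
    print(f"total: ~{total:,} chunks")  # total: ~3,672,500 chunks
```

The computed rows (325,000; 1,300,000; 1,170,000; 780,000; 97,500) round to the values listed in the table, and the sum rounds to the ~3.7M total.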
Comprehensive inventory for multi-domain RAG system.