Tech Corpus Pipeline
Strategy for processing tech library assets into specialized RAG corpora.
Source Assets
| Category | Files | Size | Location |
|---|---|---|---|
| Tech (root) | 757 | 15G | /mnt/storage/books_organized/Tech/ |
| Tech/AI-ML | ~50 | ~2G | Dedicated AI/ML subfolder |
| Tech/Security | ~30 | ~1G | Dedicated security subfolder |
| Tech/Programming | ~100 | ~3G | Languages, frameworks |
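The counts and sizes in the table can be re-checked with a small inventory script. This is a minimal sketch, assuming the books are regular files under the Tech root. `inventory` is a hypothetical helper name and is not part of the pipeline's own tooling.

```python
import os
from collections import defaultdict


def inventory(root):
    """Return {category: (file_count, total_bytes)} for a library root.

    Files directly under `root` are grouped as "(root)"; files inside a
    subfolder are grouped under the first path component (e.g. "AI-ML").
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        category = "(root)" if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                counts[category] += 1
                sizes[category] += os.path.getsize(path)
    return {cat: (counts[cat], sizes[cat]) for cat in counts}


if __name__ == "__main__":
    # Path taken from the table above; adjust if the library lives elsewhere.
    report = inventory("/mnt/storage/books_organized/Tech/")
    for cat, (n, size) in sorted(report.items()):
        print(f"{cat:15s} {n:6d} files {size / 1e9:8.2f} GB")
```

Running it before each ingest gives a quick check on whether the approximate figures (~50, ~30, ~100) still hold.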
Corpus Strategy
1. osint_corpus Enhancement
- Source: OSINT-specific books (already indexed)
- Add: Hacking/OSINT crossover from Tech root
- Books to add:
  - Practical Approach to Open Source Intelligence Vol 1 & 2
  - Grey Area: Dark Web Data Collection and OSINT
  - Hacking Web Intelligence
  - Digital Forensics for Enterprises Beyond Kali Linux
2. ai_corpus (NEW): Dreadbot Self-Improvement
- Purpose: Self-replicating upgrade reference for Dreadbot
- Source: /mnt/storage/books_organized/Tech/AI-ML/ + AI books from root
- Key texts:
  - Building AI Agents with LLMs, RAG, and Knowledge Graphs
  - Building Agentic AI Systems
  - Agentic AI: Theories and Practices
  - Building LLM Agents with RAG, Knowledge Graphs and Reflection
  - LLMOps: Managing Large Language Models in Production
  - Generative AI with LangChain
  - Knowledge Graphs and LLMs in Action
  - Large Language Models: The Hard Parts
3. security_corpus: NTS Security Library
- Purpose: NTS security consulting reference
- Source: /mnt/storage/books_organized/Tech/Security/ + security books from root
- Key texts:
  - CompTIA Security+ guides
  - CISM/CISA study guides
  - Metasploit: The Penetration Tester's Guide
  - Santos: Redefining Hacking (comprehensive)
  - Infrastructure Attack Strategies for Ethical Hacking
  - Pentesting Active Directory
  - Vulnerability Assessment and Penetration Testing (VAPT)
  - Offensive Security Using Python
  - Cryptography Algorithms
4. tech_corpus: General Tech Reference
- Purpose: NTS infrastructure knowledge base
- Source: Remaining Tech books
- Focus areas:
  - Cloud (AWS, Azure, GCP)
  - DevOps/Kubernetes
  - Programming best practices
  - Networking
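The split between corpora can be previewed on book titles before anything is ingested. The sketch below reuses the keyword alternations from this document's `--filter` options. It assumes the filter is a case-sensitive regular expression applied to the title, which this document does not confirm. Anchoring each keyword at a word start (`\b`) is an addition made for this sketch, and `route_title` is a hypothetical helper, not a `patentbot.py` function.

```python
import re

# Keyword alternations copied from the ingest commands in this document.
# The leading \b added in route_title() is an assumption of this sketch:
# it keeps short tokens such as "AI" from matching inside longer words
# (for example "RAID").
CORPUS_FILTERS = {
    "ai_corpus": r"AI|LLM|Agent|Machine Learning|Deep Learning|Neural",
    "security_corpus": r"Security|Hacking|Penetration|CISM|CompTIA|Forensic",
}
FALLBACK_CORPUS = "tech_corpus"


def route_title(title):
    """Return the first corpus whose keyword pattern matches the title."""
    for corpus, pattern in CORPUS_FILTERS.items():
        if re.search(rf"\b(?:{pattern})", title):
            return corpus
    return FALLBACK_CORPUS


if __name__ == "__main__":
    for title in (
        "Building AI Agents with LLMs, RAG, and Knowledge Graphs",
        "Offensive Security Using Python",
        "Pentesting Active Directory",
    ):
        print(f"{route_title(title):16s} <- {title}")
```

Under these assumptions, "Pentesting Active Directory" and "Cryptography Algorithms" match none of the security keywords and fall through to `tech_corpus`, so it may be worth checking each filter against its key texts before a full run.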
Ingest Commands
cd ~/projects/knowledge-rag
source venv/bin/activate
# AI Corpus (Dreadbot upgrades)
python patentbot.py ingest \
  --corpus ai_corpus \
  --source /mnt/storage/books_organized/Tech/AI-ML/ \
  --filter "AI|LLM|Agent|Machine Learning|Deep Learning|Neural"
# Security Corpus (NTS)
python patentbot.py ingest \
  --corpus security_corpus \
  --source /mnt/storage/books_organized/Tech/Security/ \
  --filter "Security|Hacking|Penetration|CISM|CompTIA|Forensic"
# Tech General (NTS infrastructure)
python patentbot.py ingest \
  --corpus tech_corpus \
  --source /mnt/storage/books_organized/Tech/ \
  --exclude "AI-ML|Security"
NTS Integration
| Corpus | NTS Service | Use Case |
|---|---|---|
| security_corpus | Security Consulting | Pentest methodology, compliance |
| ai_corpus | AI Consulting | Implementation guidance |
| tech_corpus | Infrastructure | Cloud/DevOps best practices |
| osint_corpus | OSINT Services | Investigation techniques |
Dreadbot Self-Improvement Loop
ai_corpus → Query patterns → Identify gaps
  → Read source material → Update AGENTS.md/TOOLS.md
  → Implement improvements → Log to memory/
Self-reference queries:
- "How to improve RAG retrieval accuracy"
- "Best practices for agentic AI systems"
- "LLM prompt engineering techniques"
- "Knowledge graph integration patterns"
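The "Identify gaps" step of the loop could be approximated by running the self-reference queries against ai_corpus and flagging any query whose best hit scores poorly. This is a sketch only: the `search` callable, its `(score, text)` return shape, the 0 to 1 score scale, and the `min_score` threshold are all assumptions and do not describe `patentbot.py`'s actual interface.

```python
from typing import Callable, Iterable, List, Tuple

Hit = Tuple[float, str]  # (similarity score, chunk text); shape is assumed

SELF_REFERENCE_QUERIES = [
    "How to improve RAG retrieval accuracy",
    "Best practices for agentic AI systems",
    "LLM prompt engineering techniques",
    "Knowledge graph integration patterns",
]


def find_gaps(
    search: Callable[[str], List[Hit]],
    queries: Iterable[str],
    min_score: float = 0.5,
) -> List[str]:
    """Return the queries whose best hit scores below `min_score`.

    A query with no hits at all is treated as a gap as well.
    """
    gaps = []
    for query in queries:
        hits = search(query)
        best = max((score for score, _text in hits), default=0.0)
        if best < min_score:
            gaps.append(query)
    return gaps
```

The returned queries are candidates for the "Read source material" step. The threshold would need tuning against whatever scores the real retriever produces.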
Priority Queue
- physicists_corpus: Currently indexing
- ai_corpus: Dreadbot upgrades (CRITICAL)
- security_corpus: NTS launch prep
- tech_corpus: General reference
- osint_corpus expansion
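If the queue is tracked in code rather than by hand, a plain ordered list is enough. The sketch below keeps the order and notes from the list above. The `status` values are assumptions: the list only states that physicists_corpus is currently indexing. `next_pending` is a hypothetical helper name.

```python
# Order and notes follow the priority list in this document. The "status"
# field is an assumption of this sketch, not something the pipeline records.
QUEUE = [
    {"corpus": "physicists_corpus", "note": "Currently indexing", "status": "indexing"},
    {"corpus": "ai_corpus", "note": "Dreadbot upgrades (CRITICAL)", "status": "pending"},
    {"corpus": "security_corpus", "note": "NTS launch prep", "status": "pending"},
    {"corpus": "tech_corpus", "note": "General reference", "status": "pending"},
    {"corpus": "osint_corpus", "note": "Expansion", "status": "pending"},
]


def next_pending(queue):
    """Return the name of the first corpus still marked pending, or None."""
    for item in queue:
        if item["status"] == "pending":
            return item["corpus"]
    return None
```

Marking an entry as done after its ingest finishes and calling `next_pending` again walks the queue in priority order.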
Build the knowledge. Upgrade the bot. Secure the clients.