Tech Corpus Pipeline

Strategy for processing tech library assets into specialized RAG corpora.


📁 Source Assets

Category          Files  Size  Location
Tech (root)       757    15G   /mnt/storage/books_organized/Tech/
Tech/AI-ML        ~50    ~2G   Dedicated AI/ML subfolder
Tech/Security     ~30    ~1G   Dedicated security subfolder
Tech/Programming  ~100   ~3G   Languages, frameworks
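
Most of the counts and sizes above are approximate, so it may be worth re-measuring them before ingesting. The following is a minimal sketch, assuming the standard `find`, `wc`, `du` and `cut` utilities are available; `asset_summary` is a hypothetical helper name and is not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<file count><TAB><size><TAB><dir>" for one directory.
# The size is whatever `du -sh` reports; file names containing newlines
# would be over-counted by `wc -l`.
asset_summary() {
  local dir="$1"
  local count size
  count=$(find "$dir" -type f | wc -l | tr -d ' ')
  size=$(du -sh "$dir" | cut -f1)
  printf '%s\t%s\t%s\n' "$count" "$size" "$dir"
}

# ROOT defaults to the Tech root location given above.
ROOT="${ROOT:-/mnt/storage/books_organized/Tech}"
if [ -d "$ROOT" ]; then
  asset_summary "$ROOT"
  for sub in AI-ML Security Programming; do
    if [ -d "$ROOT/$sub" ]; then
      asset_summary "$ROOT/$sub"
    fi
  done
fi
```

The root line counts files recursively, so it includes everything in the subfolders as well.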

🎯 Corpus Strategy

1. osint_corpus Enhancement

  • Source: OSINT-specific books (already indexed)
  • Add: Hacking/OSINT crossover from Tech root
  • Books to add:
    • Practical Approach to Open Source Intelligence Vol 1 & 2
    • Grey Area: Dark Web Data Collection and OSINT
    • Hacking Web Intelligence
    • Digital Forensics for Enterprises Beyond Kali Linux

2. ai_corpus (NEW): Dreadbot Self-Improvement

  • Purpose: Self-replicating upgrade reference for Dreadbot
  • Source: /mnt/storage/books_organized/Tech/AI-ML/ + AI books from root
  • Key texts:
    • Building AI Agents with LLMs, RAG, and Knowledge Graphs
    • Building Agentic AI Systems
    • Agentic AI: Theories and Practices
    • Building LLM Agents with RAG, Knowledge Graphs and Reflection
    • LLMOps: Managing Large Language Models in Production
    • Generative AI with LangChain
    • Knowledge Graphs and LLMs in Action
    • Large Language Models: The Hard Parts

3. security_corpus: NTS Security Library

  • Purpose: NTS security consulting reference
  • Source: /mnt/storage/books_organized/Tech/Security/ + security books from root
  • Key texts:
    • CompTIA Security+ guides
    • CISM/CISA study guides
    • Metasploit: The Penetration Tester's Guide
    • Santos: Redefining Hacking (comprehensive)
    • Infrastructure Attack Strategies for Ethical Hacking
    • Pentesting Active Directory
    • Vulnerability Assessment and Penetration Testing (VAPT)
    • Offensive Security Using Python
    • Cryptography Algorithms

4. tech_corpus: General Tech Reference

  • Purpose: NTS infrastructure knowledge base
  • Source: Remaining Tech books
  • Focus areas:
    • Cloud (AWS, Azure, GCP)
    • DevOps/Kubernetes
    • Programming best practices
    • Networking

📋 Ingest Commands

cd ~/projects/knowledge-rag
source venv/bin/activate
 
# AI Corpus (Dreadbot upgrades)
python patentbot.py ingest \
  --corpus ai_corpus \
  --source /mnt/storage/books_organized/Tech/AI-ML/ \
  --filter "AI|LLM|Agent|Machine Learning|Deep Learning|Neural"
 
# Security Corpus (NTS)
python patentbot.py ingest \
  --corpus security_corpus \
  --source /mnt/storage/books_organized/Tech/Security/ \
  --filter "Security|Hacking|Penetration|CISM|CompTIA|Forensic"
 
# Tech General (NTS infrastructure)
python patentbot.py ingest \
  --corpus tech_corpus \
  --source /mnt/storage/books_organized/Tech/ \
  --exclude "AI-ML|Security"
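
Before running an ingest, it can help to preview which files a pattern would select. The sketch below makes an assumption the source does not confirm: that `--filter` is matched against file names and `--exclude` against paths, both as case-sensitive extended regular expressions. `preview_filter` and `preview_exclude` are hypothetical helpers built on `find` and `grep -E`; they do not call `patentbot.py`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of patentbot.py.

# List files under $1 whose file name (not directory) matches regex $2.
preview_filter() {
  local source_dir="$1" pattern="$2"
  (cd "$source_dir" && find . -type f) \
    | grep -E -- "/[^/]*(${pattern})[^/]*\$" | LC_ALL=C sort
}

# List files under $1 whose relative path does NOT match regex $2.
preview_exclude() {
  local source_dir="$1" pattern="$2"
  (cd "$source_dir" && find . -type f) \
    | grep -Ev -- "${pattern}" | LC_ALL=C sort
}

# Example calls, using the patterns from the commands above:
#   preview_filter  /mnt/storage/books_organized/Tech/AI-ML/ "AI|LLM|Agent|Machine Learning|Deep Learning|Neural"
#   preview_exclude /mnt/storage/books_organized/Tech/ "AI-ML|Security"
```

If the tool turns out to match case-insensitively, a short alternative such as `AI` would also hit titles containing words like "Training"; adding `-i` to the `grep` call shows how much that widens the selection.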

🔗 NTS Integration

Corpus           NTS Service          Use Case
security_corpus  Security Consulting  Pentest methodology, compliance
ai_corpus        AI Consulting        Implementation guidance
tech_corpus      Infrastructure       Cloud/DevOps best practices
osint_corpus     OSINT Services       Investigation techniques

🤖 Dreadbot Self-Improvement Loop

ai_corpus → Query patterns → Identify gaps
  → Read source material → Update AGENTS.md/TOOLS.md
  → Implement improvements → Log to memory/

Self-reference queries:

  • “How to improve RAG retrieval accuracy”
  • “Best practices for agentic AI systems”
  • “LLM prompt engineering techniques”
  • “Knowledge graph integration patterns”
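
The final step of the loop, logging to memory/, could look something like the sketch below. The source does not describe the log format, so the one-file-per-day layout, the entry fields and the `log_improvement` name are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one entry to memory/<YYYY-MM-DD>.md.
# MEMORY_DIR defaults to "memory" relative to the current directory.
log_improvement() {
  local query="$1" gap="$2" change="$3"
  local dir="${MEMORY_DIR:-memory}"
  mkdir -p "$dir"
  printf '%s\n' "- query: ${query} | gap: ${gap} | change: ${change}" \
    >> "$dir/$(date +%Y-%m-%d).md"
}

# Example (values are placeholders):
#   log_improvement "How to improve RAG retrieval accuracy" \
#     "no reranking step" "noted reranking option in TOOLS.md"
```

Appending rather than overwriting keeps every pass of the loop on record, so later passes can check which gaps were already addressed.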

📊 Priority Queue

  1. ✅ physicists_corpus: Currently indexing
  2. 🔜 ai_corpus: Dreadbot upgrades (CRITICAL)
  3. 🔜 security_corpus: NTS launch prep
  4. 📋 tech_corpus: General reference
  5. 📋 osint_corpus expansion
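
The three ingests that already have documented commands can be chained in this priority order. The sketch below is one way to do that, using only the `patentbot.py ingest` flags shown in the Ingest Commands section; the `ingest` and `run_queue` wrappers and the `DRY_RUN` switch are hypothetical additions. It stops at the first failing ingest and, by default, only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: run the documented ingests in priority order.
# DRY_RUN=1 (the default here) prints each command instead of running it.
TECH_ROOT="/mnt/storage/books_organized/Tech"

ingest() {
  # All arguments are passed unchanged to `patentbot.py ingest`.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'python patentbot.py ingest'
    printf " '%s'" "$@"
    printf '\n'
  else
    python patentbot.py ingest "$@"
  fi
}

run_queue() {
  ingest --corpus ai_corpus --source "$TECH_ROOT/AI-ML/" \
    --filter "AI|LLM|Agent|Machine Learning|Deep Learning|Neural" || return 1
  ingest --corpus security_corpus --source "$TECH_ROOT/Security/" \
    --filter "Security|Hacking|Penetration|CISM|CompTIA|Forensic" || return 1
  ingest --corpus tech_corpus --source "$TECH_ROOT/" \
    --exclude "AI-ML|Security" || return 1
}
```

Calling `run_queue` prints the three commands for review. Setting `DRY_RUN=0` first executes them, which assumes the shell is already in ~/projects/knowledge-rag with the virtual environment activated.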


Build the knowledge. Upgrade the bot. Secure the clients.