Academic & Research Resource Sites

Currently Integrated Sources

Greek Texts

Source                   URL                                                Scraper                  Status
Perseus Digital Library  https://www.perseus.tufts.edu                      perseus_scraper.py       Active
First1KGreek (GitHub)    https://github.com/OpenGreekAndLatin/First1KGreek  clone_greek_repos.sh     Active
Canonical Greek Lit      https://github.com/PerseusDL/canonical-greekLit    clone_greek_repos.sh     Active
Internet Archive (Loeb)  https://archive.org                                archive_loeb_scraper.py  Active

Social/Blog Content

Source     URL                  Scraper                          Status
X/Twitter  https://x.com        x-scraper.js, x-scraper-deep.js  Active
Patreon    https://patreon.com  patreon-scraper.js               Active

Overnight Cron Jobs

  • greek_overnight.sh - Daily acquisition of new Greek texts

Paper Acquisition Channels (To Research)

Z-Library / Zeta Channel

Anna’s Archive

Library Genesis (LibGen)

Sci-Hub

arXiv

PubMed Central

Semantic Scholar


Local Book Corpus

Location

~/projects/knowledge-rag/scraped/
├── github_greek/
│   ├── canonical-greekLit/
│   └── First1KGreek/
├── loeb/
├── perseus/
└── x/

ChromaDB Collections

Collection         Chunks   Description
quantum_biology    378,519  Dr. Kruse content
physicists_corpus  50,642   Physics papers
greek_corpus       TBD      Greek texts
knowledge_base     TBD      General knowledge
+ 6 more           -        Various domains

Citation Extraction Pipeline (TODO)

Goals

  1. Extract citations from local book corpus
  2. Cross-reference with paper sources
  3. Auto-acquire missing referenced papers
  4. Build citation graph in ChromaDB

Proposed Pipeline

Local Books → Citation Extractor → DOI/Title List
                                        ↓
                              Paper Acquisition
                              (Z-Lib/Anna's/Sci-Hub)
                                        ↓
                              ChromaDB Ingestion

Citation Patterns to Extract

  • DOI: 10.xxxx/xxxxx
  • PubMed: PMID: xxxxxxxx
  • arXiv: arXiv:xxxx.xxxxx
  • ISBN: 978-x-xxxx-xxxx-x
  • Standard citations: Author et al. (Year)
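The patterns above could be prototyped with plain regular expressions before any NLP is added. The following is a minimal sketch, not a finished extractor: the name extract_citations and the exact regexes are assumptions for illustration, and the patterns are deliberately loose, so they will need tuning against real book text.

```python
import re

# Hypothetical patterns mirroring the identifier formats listed above.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "pmid": re.compile(r"\bPMID:\s*(\d{1,8})\b"),
    "arxiv": re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE),
    "isbn": re.compile(
        r"\b97[89][-\s]?\d{1,5}[-\s]?\d{1,7}[-\s]?\d{1,7}[-\s]?\d\b"
    ),
    "author_year": re.compile(r"\b([A-Z][a-z]+(?: et al\.)?) \((\d{4})\)"),
}


def extract_citations(text):
    """Return a dict mapping citation kind to a sorted list of unique matches."""
    results = {}
    for kind, pattern in PATTERNS.items():
        hits = set()
        for match in pattern.finditer(text):
            if kind == "doi":
                # Drop sentence punctuation that the loose pattern swallows.
                hits.add(match.group(0).rstrip(".,;)"))
            elif kind == "isbn":
                hits.add(match.group(0))
            elif kind == "author_year":
                hits.add(f"{match.group(1)} ({match.group(2)})")
            else:
                # pmid and arxiv capture the bare identifier in group 1.
                hits.add(match.group(1))
        results[kind] = sorted(hits)
    return results


if __name__ == "__main__":
    sample = (
        "See Smith et al. (2019), doi: 10.1000/xyz123. "
        "Also PMID: 12345678 and arXiv:2101.00001v2."
    )
    for kind, values in extract_citations(sample).items():
        print(kind, values)
```

The author-year pattern only catches single-surname forms such as "Smith (2019)" or "Smith et al. (2019)"; multi-author and bracketed numeric styles would need the NLP pass mentioned under Next Steps.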

Cron Jobs Needed

Daily

  • Greek overnight (EXISTS: greek_overnight.sh)
  • arXiv new papers (category-based)
  • PubMed new papers (keyword-based)
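For the category-based arXiv job, one option is the public arXiv API, which accepts a cat: search term and can sort by submission date. The sketch below only builds the query URL; the function name build_arxiv_query and the quant-ph example category are assumptions, and fetching, Atom parsing, and rate limiting are left out.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_query(category, max_results=50):
    """Build an arXiv API URL listing the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    # "quant-ph" is only an example category, not one named in these notes.
    print(build_arxiv_query("quant-ph", max_results=10))
```

The response is an Atom feed, so a daily job would still need an XML or feed parser and a record of which entry IDs were already ingested.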

Weekly

  • Citation gap analysis
  • Missing paper acquisition
  • Corpus integrity check
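The weekly citation gap analysis reduces to a set difference between the identifiers cited in the corpus and the identifiers already held. A minimal sketch for DOIs follows, assuming both sides are available as plain lists of strings; normalize_doi and find_missing are hypothetical helper names, and how the held list is read out of ChromaDB is not covered.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi.strip()


def find_missing(cited, held):
    """Return the sorted DOIs that are cited but not present in the corpus."""
    held_norm = {normalize_doi(d) for d in held}
    cited_norm = {normalize_doi(d) for d in cited}
    return sorted(cited_norm - held_norm)


if __name__ == "__main__":
    cited = ["10.1000/ABC", "doi:10.1000/def", "https://doi.org/10.1000/ghi"]
    held = ["10.1000/abc", "10.1000/GHI"]
    print(find_missing(cited, held))
```

Normalizing first matters because DOIs are case-insensitive and appear in books both bare and as doi.org links.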

On-Demand

  • DOI-based paper fetch
  • Bulk acquisition from reading list

Next Steps

  1. Research Z-Library API - Account setup, rate limits, automation options
  2. Build citation extractor - Regex + NLP for citation parsing
  3. Create paper acquisition script - Multi-source fallback (Anna’s → Z-Lib → Sci-Hub)
  4. Set up cron jobs - Daily/weekly acquisition schedules
  5. Cross-reference pipeline - Match citations to existing corpus, flag missing
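The multi-source fallback in step 3 can be kept independent of any particular source by passing in an ordered list of fetch functions. This is a sketch under that assumption: fetch_with_fallback and the Fetcher type are hypothetical names, and each real fetcher would still have to be written and checked against the source's terms and rate limits.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes an identifier (e.g. a DOI) and returns file bytes or None.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(
    identifier: str, sources: Sequence[Tuple[str, Fetcher]]
) -> Optional[Tuple[str, bytes]]:
    """Try each source in order; return (source_name, payload) on first success."""
    for name, fetch in sources:
        try:
            payload = fetch(identifier)
        except Exception as exc:  # one failing source should not stop the chain
            logging.warning("%s failed for %s: %s", name, identifier, exc)
            continue
        if payload:
            return name, payload
    return None


if __name__ == "__main__":
    # Stub fetchers standing in for real ones; they perform no network access.
    def first_source(_identifier):
        return None

    def second_source(_identifier):
        return b"stub-bytes"

    result = fetch_with_fallback(
        "10.1000/xyz123",
        [("first", first_source), ("second", second_source)],
    )
    print(result)
```

Returning the source name alongside the payload makes it possible to log which channel supplied each paper, which helps when tuning the order later.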

Last updated: 2026-02-03