Academic & Research Resource Sites
Currently Integrated Sources
Greek Texts
| Source | URL | Scraper | Status |
|---|---|---|---|
| Perseus Digital Library | https://www.perseus.tufts.edu | perseus_scraper.py | Active |
| First1KGreek (GitHub) | https://github.com/OpenGreekAndLatin/First1KGreek | clone_greek_repos.sh | Active |
| Canonical Greek Lit | https://github.com/PerseusDL/canonical-greekLit | clone_greek_repos.sh | Active |
| Internet Archive (Loeb) | https://archive.org | archive_loeb_scraper.py | Active |
Social/Blog Content
| Source | URL | Scraper | Status |
|---|---|---|---|
| X/Twitter | https://x.com | x-scraper.js, x-scraper-deep.js | Active |
| Patreon | https://patreon.com | patreon-scraper.js | Active |
Overnight Cron Jobs
- greek_overnight.sh - Daily acquisition of new Greek texts
Paper Acquisition Channels (To Research)
Z-Library / Zeta Channel
- Primary: https://z-lib.io / https://singlelogin.re
- Mirrors: https://zlibrary-global.se
- API: Needs research - personal account required
- Status: ⚠️ NOT YET INTEGRATED
Anna's Archive
- URL: https://annas-archive.org
- Features: Aggregates Z-Lib, LibGen, Sci-Hub
- API: https://annas-archive.org/datasets
- Status: ⚠️ NOT YET INTEGRATED
Library Genesis (LibGen)
- URL: https://libgen.is / https://libgen.rs
- API: http://libgen.is/json.php
- Mirrors: https://libgen.li
- Status: ⚠️ NOT YET INTEGRATED
Sci-Hub
- URL: https://sci-hub.se / https://sci-hub.st
- Usage: DOI-based paper retrieval
- Status: ⚠️ NOT YET INTEGRATED
arXiv
- URL: https://arxiv.org
- API: https://export.arxiv.org/api/query
- Features: Open preprints, bulk download
- Status: ⚠️ NOT YET INTEGRATED
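A minimal sketch of how a category-based query against the arXiv API endpoint above might be built. The function name `build_arxiv_query` and the example category are illustrative assumptions, not part of the existing scrapers; `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` are query parameters of the arXiv API. The sketch only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_arxiv_query(category: str, start: int = 0, max_results: int = 25) -> str:
    """Return a URL listing the newest submissions in one arXiv category.

    `build_arxiv_query` is a hypothetical helper for this sketch.
    """
    params = {
        "search_query": f"cat:{category}",  # e.g. cat:quant-ph
        "start": start,                     # zero-based offset for paging
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_arxiv_query("quant-ph", max_results=5))
```

The API responds with an Atom feed, so a daily job would still need an XML or feed parser and a pause between requests.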
PubMed Central
- URL: https://www.ncbi.nlm.nih.gov/pmc/
- API: E-utilities (free, rate-limited)
- Status: ⚠️ NOT YET INTEGRATED
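A keyword-based PubMed Central search through E-utilities could start from something like the sketch below. It assumes the `esearch.fcgi` endpoint with the `pmc` database; the helper name `build_pmc_search` and the search term are hypothetical. Only the URL is built here, so rate limiting is left to the caller.

```python
from typing import Optional
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_pmc_search(term: str, retmax: int = 20, api_key: Optional[str] = None) -> str:
    """Return an ESearch URL for PubMed Central (hypothetical helper)."""
    params = {
        "db": "pmc",        # search PubMed Central rather than PubMed
        "term": term,
        "retmax": retmax,   # maximum number of IDs to return
        "retmode": "json",
    }
    if api_key:
        # An NCBI API key is optional and raises the allowed request rate.
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(build_pmc_search("quantum biology", retmax=5))
```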
Semantic Scholar
- URL: https://www.semanticscholar.org
- API: https://api.semanticscholar.org (free tier)
- Features: Citation graphs, paper recommendations
- Status: ⚠️ NOT YET INTEGRATED
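For the citation-graph use case, one possible starting point is the Graph API's per-paper references endpoint, addressed by DOI. The sketch below assumes that endpoint and the `DOI:` identifier prefix; `build_references_url`, the chosen fields, and the placeholder DOI are illustrative only.

```python
from typing import Sequence
from urllib.parse import quote, urlencode

S2_GRAPH = "https://api.semanticscholar.org/graph/v1"


def build_references_url(
    doi: str,
    fields: Sequence[str] = ("title", "year", "externalIds"),
    limit: int = 100,
) -> str:
    """Return a URL listing the papers cited by `doi` (hypothetical helper)."""
    # Keep ':' and '/' unescaped so the DOI stays readable in the path.
    paper_id = quote(f"DOI:{doi}", safe=":/")
    query = urlencode({"fields": ",".join(fields), "limit": limit})
    return f"{S2_GRAPH}/paper/{paper_id}/references?{query}"


if __name__ == "__main__":
    # 10.1000/example is a placeholder, not a real paper.
    print(build_references_url("10.1000/example"))
```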
Local Book Corpus
Location
~/projects/knowledge-rag/scraped/
├── github_greek/
│   ├── canonical-greekLit/
│   └── First1KGreek/
├── loeb/
├── perseus/
└── x/
ChromaDB Collections
| Collection | Chunks | Description |
|---|---|---|
| quantum_biology | 378,519 | Dr. Kruse content |
| physicists_corpus | 50,642 | Physics papers |
| greek_corpus | TBD | Greek texts |
| knowledge_base | TBD | General knowledge |
| + 6 more | - | Various domains |
Citation Extraction Pipeline (TODO)
Goals
- Extract citations from local book corpus
- Cross-reference with paper sources
- Auto-acquire missing referenced papers
- Build citation graph in ChromaDB
Proposed Pipeline
Local Books → Citation Extractor → DOI/Title List
↓
Paper Acquisition
(Z-Lib/Anna's/Sci-Hub)
↓
ChromaDB Ingestion
Citation Patterns to Extract
- DOI: 10.xxxx/xxxxx
- PubMed: PMID: xxxxxxxx
- arXiv: arXiv:xxxx.xxxxx
- ISBN: 978-x-xxxx-xxxx-x
- Standard citations: Author et al. (Year)
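The patterns above could be captured with regular expressions along these lines. This is a rough sketch, not the planned extractor: the expressions are deliberately loose, `extract_citations` is a hypothetical name, and the sample text uses made-up identifiers. Old-style arXiv IDs (such as `hep-th/9901001`), ISBN-10s, and citation styles other than "Author et al. (Year)" are not covered.

```python
import re
from typing import Dict, List

# One loose pattern per identifier shape listed above.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "pmid": re.compile(r"\bPMID:\s*(\d{1,8})\b"),
    "arxiv": re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE),
    "isbn": re.compile(r"\b97[89](?:-?\d){10}\b"),
    "author_year": re.compile(r"\b[A-Z][A-Za-z'-]+ et al\. \(\d{4}\)"),
}


def extract_citations(text: str) -> Dict[str, List[str]]:
    """Return de-duplicated matches per citation type, in order of appearance."""
    found: Dict[str, List[str]] = {}
    for kind, pattern in PATTERNS.items():
        hits = []
        for match in pattern.finditer(text):
            # Use the capture group when the pattern has one, else the whole match.
            value = match.group(1) if pattern.groups else match.group(0)
            if kind == "doi":
                # Heuristic: drop sentence punctuation glued to the end of a DOI.
                value = value.rstrip(".,;)")
            hits.append(value)
        found[kind] = list(dict.fromkeys(hits))  # de-duplicate, keep order
    return found


if __name__ == "__main__":
    sample = (
        "See Smith et al. (2019), doi 10.1000/xyz123. Also PMID: 12345678, "
        "arXiv:2101.12345v2 and ISBN 978-0-123-45678-9."
    )
    for kind, values in extract_citations(sample).items():
        print(kind, values)
```

Regex alone will miss free-form references, which is where the NLP step mentioned under Next Steps would come in.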
Cron Jobs Needed
Daily
- Greek overnight (EXISTS: greek_overnight.sh)
- arXiv new papers (category-based)
- PubMed new papers (keyword-based)
Weekly
- Citation gap analysis
- Missing paper acquisition
- Corpus integrity check
On-Demand
- DOI-based paper fetch
- Bulk acquisition from reading list
Next Steps
- Research Z-Library API - Account setup, rate limits, automation options
- Build citation extractor - Regex + NLP for citation parsing
- Create paper acquisition script - Multi-source fallback (Anna's → Z-Lib → Sci-Hub)
- Set up cron jobs - Daily/weekly acquisition schedules
- Cross-reference pipeline - Match citations to existing corpus, flag missing
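The cross-reference step could be sketched as a set difference over normalized DOIs. The sketch below assumes the corpus side is already available as a plain list of DOIs (how they would be read out of ChromaDB is not shown), and both function names are hypothetical. DOIs are compared case-insensitively, with common resolver prefixes stripped first.

```python
from typing import Iterable, List

# Prefixes commonly found in front of a bare DOI.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Lowercase a DOI and strip resolver prefixes so variants compare equal."""
    doi = raw.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def find_missing(cited: Iterable[str], corpus: Iterable[str]) -> List[str]:
    """Return cited DOIs absent from the corpus, de-duplicated, in first-seen order."""
    have = {normalize_doi(d) for d in corpus}
    seen = set()
    missing = []
    for raw in cited:
        doi = normalize_doi(raw)
        if doi and doi not in have and doi not in seen:
            seen.add(doi)
            missing.append(doi)
    return missing


if __name__ == "__main__":
    # Placeholder DOIs for illustration only.
    cited = ["10.1000/ABC", "doi:10.1000/abc", "https://doi.org/10.1000/xyz"]
    print(find_missing(cited, corpus=["10.1000/xyz"]))
```

The resulting list is what a weekly gap-analysis job would hand to the acquisition script.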
Last updated: 2026-02-03