Greek Text Acquisition Pipeline
Ethical acquisition of ancient Greek texts from explicitly open sources only.
⚠️ CRITICAL: robots.txt Compliance
ALWAYS check robots.txt before any automated access.
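A minimal sketch of this check, using Python's standard `urllib.robotparser`. The user-agent string `greek-corpus-bot` and both helper names are illustrative assumptions, not part of the pipeline itself.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "greek-corpus-bot"  # hypothetical user-agent name


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_and_check(base_url: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Download robots.txt from base_url (network call), then check url against it."""
    parser = RobotFileParser()
    parser.set_url(base_url.rstrip("/") + "/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)
```

`is_allowed` works on robots.txt text that is already in hand, which keeps the rule testable offline; `fetch_and_check` performs the actual request and should run once per domain before any other access.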
❌ DO NOT SCRAPE (robots.txt disallows)
| Source | Status | Alternative |
|---|---|---|
| Perseus Digital Library | ❌ BLOCKED | Use GitHub mirror (canonical-greekLit) |
| CMG Online | ⚠️ Check first | Download PDFs manually if allowed |
| Remacle.org | ⚠️ Check first | Manual download only |
✅ APPROVED SOURCES (Open/API/Download)
| Source | Method | License |
|---|---|---|
| GitHub: OpenGreekAndLatin | git clone | CC BY-SA ✅ |
| GitHub: PerseusDL | git clone | CC BY-SA ✅ |
| GitHub: First1KGreek | git clone | Open ✅ |
| Archive.org | API (internetarchive) | Public Domain ✅ |
| Wikisource | API | CC ✅ |
📚 Approved Source Details
1. GitHub Repositories ⭐ (PRIMARY)
These repositories are published explicitly for download and reuse:
#!/bin/bash
# clone_greek_repos.sh - Clone APPROVED Greek text repos
DEST_DIR="$HOME/projects/knowledge-rag/greek_texts/github"
mkdir -p "$DEST_DIR"
# Abort if the directory is unavailable, so nothing is cloned into the wrong place
cd "$DEST_DIR" || exit 1
# First1KGreek - Greek texts including medical
git clone --depth 1 https://github.com/OpenGreekAndLatin/First1KGreek.git
# Perseus canonical Greek (OFFICIAL MIRROR - scraping alternative)
git clone --depth 1 https://github.com/PerseusDL/canonical-greekLit.git
# Open Greek and Latin: CSEL (a Latin-language corpus; skip if only Greek is needed)
git clone --depth 1 https://github.com/OpenGreekAndLatin/csel-dev.git
echo "Done! TEI XML files in $DEST_DIR"

What you get:
- Complete Perseus Greek collection as TEI XML
- First1KGreek (First Thousand Years of Greek) medical texts (Galen, Hippocrates, Oribasius)
- Properly licensed, designed for reuse
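To show how the cloned TEI XML might be consumed, the sketch below pulls plain text out of a TEI `<body>` with the standard library only. It assumes the standard TEI namespace and that Greek editions carry `grc` in their file names; both helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterator, Union

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_tei_text(xml_source: Union[str, bytes]) -> str:
    """Return the whitespace-normalised text of the TEI <body>, or "" if absent."""
    root = ET.fromstring(xml_source)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # Join text nodes with spaces so adjacent lines do not run together
    return " ".join(" ".join(body.itertext()).split())


def iter_greek_files(repo_dir: Path) -> Iterator[Path]:
    """Yield XML files whose name contains 'grc' (assumed naming convention)."""
    yield from sorted(repo_dir.rglob("*grc*.xml"))
```

For files on disk, passing `path.read_bytes()` lets the parser honour the encoding declared in the XML header.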
2. Archive.org (Public Domain Loebs)
Loeb Classical Library volumes published before 1927 are in the public domain in the United States:
#!/usr/bin/env python3
"""archive_loeb_download.py - Download public domain Loeb volumes via API"""
import internetarchive as ia
from pathlib import Path

OUTPUT_DIR = Path("greek_texts/archive_org")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Archive.org explicitly provides API for this.
# Every query carries a year bound so that only pre-1927 items are matched.
SEARCH_QUERIES = [
    "collection:medicalheritagelibrary AND hippocrates AND year:[1850 TO 1926]",
    "collection:americana AND creator:Galen AND year:[1850 TO 1926]",
]


def download_public_domain(query: str, max_items: int = 10):
    """Download via official Archive.org API"""
    results = ia.search_items(query)
    for i, item in enumerate(results):
        if i >= max_items:
            break
        identifier = item['identifier']
        print(f"Downloading: {identifier}")
        # 'Text' selects plain-text files; scanned books often expose OCR text as 'DjVuTXT'
        ia.download(identifier, destdir=str(OUTPUT_DIR), formats=['Text'])


if __name__ == "__main__":
    for q in SEARCH_QUERIES:
        download_public_domain(q)

3. Wikisource Greek Texts
Wikisource has Greek texts with API access:
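# Use the MediaWiki API - explicitly allowed
import requests


def get_wikisource_greek(title: str):
    """Fetch from Wikisource via official API"""
    url = "https://el.wikisource.org/w/api.php"
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",  # request the main content slot (current API format)
        "format": "json",
    }
    # Identify the client; the contact address below is a placeholder
    headers = {"User-Agent": "greek-corpus-pipeline/0.1 (contact@example.org)"}
    response = requests.get(url, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()

🎯 Acquisition Priorities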
Tier 1: Medical/Pharmacological
| Author | Works | Approved Source |
|---|---|---|
| Galen | Medical treatises | First1KGreek (GitHub) |
| Hippocrates | Complete Corpus | canonical-greekLit (GitHub) |
| Dioscorides | De Materia Medica | Archive.org (pre-1927) |
| Theophrastus | Historia Plantarum | canonical-greekLit (GitHub) |
Tier 2: Philosophy
| Author | Approved Source |
|---|---|
| Aristotle | canonical-greekLit (GitHub) |
| Plato | canonical-greekLit (GitHub) |
| Neoplatonists | First1KGreek (GitHub) |
🛠️ Pipeline Architecture
┌─────────────────────────────────────────────────────────┐
│           ETHICAL GREEK ACQUISITION PIPELINE            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │   GitHub    │    │ Archive.org │    │ Wikisource  │  │
│  │  git clone  │    │     API     │    │     API     │  │
│  │  (TEI XML)  │    │ (pre-1927)  │    │   (Greek)   │  │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘  │
│         │                  │                  │         │
│         └──────────────────┼──────────────────┘         │
│                            ▼                            │
│                ┌───────────────────────┐                │
│                │    TEI/XML Parser     │                │
│                │ - Extract Greek text  │                │
│                │ - Preserve metadata   │                │
│                └───────────┬───────────┘                │
│                            ▼                            │
│                ┌───────────────────────┐                │
│                │     Chunk & Embed     │                │
│                └───────────┬───────────┘                │
│                            ▼                            │
│                ┌───────────────────────┐                │
│                │     greek_corpus      │                │
│                │      (ChromaDB)       │                │
│                └───────────────────────┘                │
│                                                         │
└─────────────────────────────────────────────────────────┘
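The "Chunk & Embed" stage is not specified further here. As one possible approach, the sketch below splits extracted text into overlapping word windows and attaches source metadata to each chunk. The window sizes, the record layout, and the function names are all assumptions; embedding and the ChromaDB write are left out.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def make_records(doc_id: str, text: str, source: str, **chunk_kwargs) -> list[dict]:
    """Pair every chunk with an id and metadata (hypothetical record layout)."""
    records = []
    for n, chunk in enumerate(chunk_words(text, **chunk_kwargs)):
        records.append({
            "id": f"{doc_id}-{n:04d}",
            "text": chunk,
            "metadata": {"doc_id": doc_id, "source": source, "chunk": n},
        })
    return records
```

Keeping the source and document id on every record means a retrieved chunk can always be traced back to its licensed origin.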
📋 Checklist Before Any Acquisition
- Check robots.txt at target domain
- Verify license allows automated download
- Use official APIs when available
- Prefer git clone for GitHub-hosted corpora
- Rate limit any HTTP requests (minimum 2s delay)
- Log everything for audit trail
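The last two checklist items can be sketched together: a limiter that enforces the 2-second minimum delay and a logger that records each request. The class name, the `polite_get` helper, the logger name, and the user-agent string are hypothetical; only standard-library calls are used.

```python
import logging
import time
import urllib.request

log = logging.getLogger("acquisition")  # hypothetical logger name


class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_delay: float = 2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


def polite_get(url: str, limiter: RateLimiter, user_agent: str = "greek-corpus-bot") -> bytes:
    """Fetch url after honouring the rate limit, logging the request for the audit trail."""
    slept = limiter.wait()
    log.info("GET %s (waited %.2fs)", url, slept)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

Passing the clock and sleep functions into the limiter keeps it testable without real delays. For a persistent audit trail, the logger can be pointed at a file with `logging.basicConfig(filename=..., level=logging.INFO)`.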
🚫 What NOT To Do
# ❌ WRONG - Never scrape sites that disallow it
requests.get("https://www.perseus.tufts.edu/hopper/text?doc=...")
# ✅ CORRECT - Use the official GitHub mirror instead
# git clone https://github.com/PerseusDL/canonical-greekLit.git

📊 Current Corpus Status
| Source | Texts | Status |
|---|---|---|
| canonical-greekLit (GitHub) | ~500 | ✅ Clone ready |
| First1KGreek (GitHub) | ~200 | ✅ Clone ready |
| Archive.org Loebs | ~50 | ✅ API ready |
| greek_corpus (ChromaDB) | 1,156 chunks | ✅ Exists |
Ethical acquisition only. Use approved sources. Respect robots.txt.