Greek Text Acquisition Pipeline

Ethical acquisition of ancient Greek texts from explicitly open sources only.


⚠️ CRITICAL: robots.txt Compliance

ALWAYS check robots.txt before any automated access.
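
The check can be automated with Python's standard-library `urllib.robotparser`. The sketch below is one possible shape for it, not a required implementation: the user-agent string, the function name `is_allowed`, and the sample rules are all illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent name; substitute whatever your client actually sends.
USER_AGENT = "greek-corpus-bot"

# Made-up rules for illustration only; always fetch the real file from the target domain.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /hopper/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "https://example.org/hopper/text"))  # disallowed path
    print(is_allowed(SAMPLE_ROBOTS, "https://example.org/about"))        # unrestricted path
```

To check a live site instead of a string, `RobotFileParser(url).read()` downloads the file before `can_fetch` is called. A passing check only covers robots.txt; the license still has to be verified separately.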

❌ DO NOT SCRAPE (robots.txt disallows or is unverified)

| Source | Status | Alternative |
|---|---|---|
| Perseus Digital Library | ❌ BLOCKED | Use GitHub mirror (canonical-greekLit) |
| CMG Online | ⚠️ Check first | Download PDFs manually if allowed |
| Remacle.org | ⚠️ Check first | Manual download only |

✅ APPROVED SOURCES (Open/API/Download)

| Source | Method | License |
|---|---|---|
| GitHub: OpenGreekAndLatin | git clone | CC BY-SA ✅ |
| GitHub: PerseusDL | git clone | CC BY-SA ✅ |
| GitHub: First1KGreek | git clone | Open ✅ |
| Archive.org | API (internetarchive) | Public Domain ✅ |
| Wikisource | API | CC ✅ |

📚 Approved Source Details

1. GitHub Repositories ⭐ (PRIMARY)

These repositories are published explicitly for download and reuse:

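#!/bin/bash
# clone_greek_repos.sh - Clone APPROVED Greek text repos

set -euo pipefail  # stop on the first failed command (e.g. a failed cd)

DEST_DIR="$HOME/projects/knowledge-rag/greek_texts/github"
mkdir -p "$DEST_DIR"
cd "$DEST_DIR"

# Shallow-clone a repo unless it is already present, so the script can be re-run
clone_repo() {
    local url="$1"
    local name
    name="$(basename "$url" .git)"
    if [ -d "$name/.git" ]; then
        echo "Skipping $name (already cloned)"
    else
        git clone --depth 1 "$url"
    fi
}

# First1KGreek - Greek texts including medical
clone_repo https://github.com/OpenGreekAndLatin/First1KGreek.git

# Perseus canonical Greek (OFFICIAL MIRROR - scraping alternative)
clone_repo https://github.com/PerseusDL/canonical-greekLit.git

# Open Greek and Latin: csel-dev (CSEL, a Latin patristic corpus)
clone_repo https://github.com/OpenGreekAndLatin/csel-dev.git

echo "Done! TEI XML files in $DEST_DIR"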

What you get:

  • Complete Perseus Greek collection as TEI XML
  • First1KGreek (First Thousand Years of Greek) medical texts (Galen, Hippocrates, Oribasius)
  • Properly licensed, designed for reuse
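
As a rough illustration of working with those TEI XML files, the sketch below pulls the title, author, and running text out of a TEI document using only the standard library. The function name `extract_tei_text` and the sample document are assumptions made for this example; real files in these repositories carry richer headers and citation structure than shown here.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Standard TEI P5 namespace; older TEI P4 files may have no namespace at all.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny hand-written sample, not taken from any repository.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Aphorismi</title><author>Hippocrates</author>
  </titleStmt></fileDesc></teiHeader>
  <text><body><div type="edition"><p>Ὁ βίος βραχύς,
    ἡ δὲ τέχνη μακρή.</p></div></body></text>
</TEI>"""


def extract_tei_text(xml_source: str) -> Dict[str, str]:
    """Return title, author, and whitespace-normalized body text of a TEI document."""
    root = ET.fromstring(xml_source)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    author = root.find(".//tei:titleStmt/tei:author", TEI_NS)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    # itertext() walks every descendant, so editorial notes are included too.
    text = " ".join("".join(body.itertext()).split()) if body is not None else ""
    return {
        "title": (title.text or "").strip() if title is not None else "",
        "author": (author.text or "").strip() if author is not None else "",
        "text": text,
    }


if __name__ == "__main__":
    print(extract_tei_text(SAMPLE_TEI))
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. Filtering out `<note>` elements or keeping `<div>` citation attributes would be natural next steps, but they depend on how each edition is encoded.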

2. Archive.org (Public Domain Loebs)

Loeb Classical Library volumes published before 1927 are in the public domain in the United States (the cutoff year advances each January 1):

#!/usr/bin/env python3
"""archive_loeb_download.py - Download public domain Loeb volumes via API"""

from itertools import islice
from pathlib import Path

import internetarchive as ia

OUTPUT_DIR = Path("greek_texts/archive_org")

# Archive.org explicitly provides an API for this
SEARCH_QUERIES = [
    # No year filter on this query: verify each item's rights before use
    "collection:medicalheritagelibrary AND hippocrates",
    "collection:americana AND creator:Galen AND year:[1850 TO 1926]",
]


def download_public_domain(query: str, max_items: int = 10) -> None:
    """Download text files for the first max_items hits via the official Archive.org API"""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for item in islice(ia.search_items(query), max_items):
        identifier = item["identifier"]
        print(f"Downloading: {identifier}")
        # "DjVuTXT" is the format name of OCR text on scanned books;
        # "Text" only matches files uploaded as plain text.
        ia.download(identifier, destdir=str(OUTPUT_DIR), formats=["DjVuTXT", "Text"])


if __name__ == "__main__":
    for q in SEARCH_QUERIES:
        download_public_domain(q)
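
Because the public-domain cutoff moves every year, a hard-coded year range in a query goes stale. The sketch below computes the cutoff from the current date and filters metadata records by it. It assumes the simplified US rule for older published works (a work published in year Y enters the public domain on 1 January of Y + 96) and assumes records shaped like dictionaries with `identifier` and `year` keys; the function names are illustrative, and the rule is not a substitute for checking an item's stated rights.

```python
from datetime import date
from typing import Iterable, List, Optional

# Assumption: 95-year US term, so a work from year Y is free on 1 January of Y + 96.
US_TERM_OFFSET = 96


def public_domain_cutoff(today: Optional[date] = None) -> int:
    """Return the latest publication year that is US public domain on `today`."""
    today = today or date.today()
    return today.year - US_TERM_OFFSET


def filter_public_domain(records: Iterable[dict], today: Optional[date] = None) -> List[dict]:
    """Keep only records whose 'year' is at or before the cutoff."""
    cutoff = public_domain_cutoff(today)
    kept = []
    for record in records:
        try:
            year = int(record.get("year"))
        except (TypeError, ValueError):
            continue  # unknown or malformed year: skip rather than guess
        if year <= cutoff:
            kept.append(record)
    return kept


if __name__ == "__main__":
    sample = [{"identifier": "a", "year": "1912"}, {"identifier": "b", "year": 1950}]
    print(filter_public_domain(sample, today=date(2022, 6, 1)))
```

Records with a missing or unparseable year are dropped on purpose, so anything ambiguous needs a manual rights check before it enters the corpus.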

3. Wikisource Greek Texts

The Greek Wikisource (el.wikisource.org) exposes its texts through the MediaWiki API:

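# Use the MediaWiki API that Wikisource exposes - explicitly allowed
import requests

API_URL = "https://el.wikisource.org/w/api.php"
# Wikimedia asks API clients to send a descriptive User-Agent; this value is a placeholder
HEADERS = {"User-Agent": "greek-corpus-pipeline/0.1 (contact: you@example.org)"}


def get_wikisource_greek(title: str) -> dict:
    """Fetch a page's latest revision from Wikisource via the official API"""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    response = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.json()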
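
The JSON returned by `get_wikisource_greek` still has to be unpacked before the wikitext can be used. A minimal sketch follows, assuming the legacy response layout that this parameter set produces (pages keyed by page ID, revision content under the `"*"` key). The helper name `extract_wikitext` and the sample response are made up for illustration.

```python
from typing import Optional

# Hand-written response in the legacy layout; not real API output.
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "123": {
                "pageid": 123,
                "title": "Example",
                "revisions": [{"contentmodel": "wikitext", "*": "μῆνιν ἄειδε θεά"}],
            }
        }
    }
}


def extract_wikitext(response: dict) -> Optional[str]:
    """Return the wikitext of the first existing page in the response, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" in page:
            continue  # the API reports nonexistent titles with a "missing" flag
        revisions = page.get("revisions") or []
        if revisions:
            return revisions[0].get("*")
    return None


if __name__ == "__main__":
    print(extract_wikitext(SAMPLE_RESPONSE))
```

If the request is later changed to pass `rvslots` or `formatversion=2`, the content moves to a different place in the response and this helper would need to follow.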

🎯 Acquisition Priorities

Tier 1: Medical/Pharmacological

| Author | Works | Approved Source |
|---|---|---|
| Galen | Medical treatises | First1KGreek (GitHub) |
| Hippocrates | Complete Corpus | canonical-greekLit (GitHub) |
| Dioscorides | De Materia Medica | Archive.org (pre-1927) |
| Theophrastus | Historia Plantarum | canonical-greekLit (GitHub) |

Tier 2: Philosophy

| Author | Approved Source |
|---|---|
| Aristotle | canonical-greekLit (GitHub) |
| Plato | canonical-greekLit (GitHub) |
| Neoplatonists | First1KGreek (GitHub) |

🛠️ Pipeline Architecture

┌─────────────────────────────────────────────────────────┐
│            ETHICAL GREEK ACQUISITION PIPELINE           │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │
│  │   GitHub    │  │ Archive.org │  │  Wikisource │    │
│  │  git clone  │  │    API      │  │     API     │    │
│  │  (TEI XML)  │  │  (pre-1927) │  │   (Greek)   │    │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘    │
│         │                │                │            │
│         └────────────────┼────────────────┘            │
│                          ▼                             │
│              ┌───────────────────────┐                 │
│              │   TEI/XML Parser      │                 │
│              │  - Extract Greek text │                 │
│              │  - Preserve metadata  │                 │
│              └───────────┬───────────┘                 │
│                          ▼                             │
│              ┌───────────────────────┐                 │
│              │   Chunk & Embed       │                 │
│              └───────────┬───────────┘                 │
│                          ▼                             │
│              ┌───────────────────────┐                 │
│              │    greek_corpus       │                 │
│              │     (ChromaDB)        │                 │
│              └───────────────────────┘                 │
│                                                         │
└─────────────────────────────────────────────────────────┘
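
The chunking half of the "Chunk & Embed" stage could look something like the sketch below, which splits extracted text into overlapping word windows and attaches source metadata to each chunk. The window sizes, the record layout, and the names `chunk_text` and `build_records` are assumptions for illustration; the embedding and ChromaDB insertion steps are deliberately left out.

```python
from typing import Dict, List


def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of up to max_words words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final window already reaches the end of the text
    return chunks


def build_records(doc_id: str, text: str, metadata: Dict[str, str], **kwargs) -> List[dict]:
    """Pair each chunk with a stable ID and a copy of the source metadata."""
    return [
        {"id": f"{doc_id}-{i}", "text": chunk, "metadata": dict(metadata, chunk_index=i)}
        for i, chunk in enumerate(chunk_text(text, **kwargs))
    ]


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for record in build_records("demo", sample, {"author": "Galen"}, max_words=4, overlap=1):
        print(record)
```

Word-based windows are a simple default. Splitting on TEI citation units (book, chapter, section) would keep chunks aligned with how the texts are cited, at the cost of uneven chunk sizes.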

📋 Checklist Before Any Acquisition

  • Check robots.txt at target domain
  • Verify license allows automated download
  • Use official APIs when available
  • Prefer git clone for GitHub-hosted corpora
  • Rate limit any HTTP requests (minimum 2s delay)
  • Log everything for audit trail
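
The rate-limit and logging items can be enforced in code rather than by convention. The sketch below is one way to do it with the standard library: a small limiter that guarantees the 2-second minimum gap, and a wrapper that writes an audit line before each request. The class and function names, the logger name, and the caller-supplied `fetch` callable are all assumptions for this example.

```python
import logging
import time
from typing import Callable, Optional

MIN_DELAY_SECONDS = 2.0  # minimum gap between requests, per the checklist

audit_log = logging.getLogger("acquisition.audit")  # hypothetical logger name


class RateLimiter:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_delay: float = MIN_DELAY_SECONDS,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep until min_delay has passed since the last call; return seconds slept."""
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.min_delay - (self._clock() - self._last))
            if pause > 0:
                self._sleep(pause)
        self._last = self._clock()
        return pause


def polite_fetch(url: str, fetch: Callable[[str], object], limiter: RateLimiter) -> object:
    """Wait for the limiter, record the request in the audit log, then call fetch(url)."""
    waited = limiter.wait()
    audit_log.info("GET %s (waited %.1fs)", url, waited)
    return fetch(url)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
    limiter = RateLimiter()
    for u in ["https://example.org/a", "https://example.org/b"]:
        polite_fetch(u, lambda url: None, limiter)  # stand-in fetch that does nothing
```

The clock and sleep functions are injectable so the limiter can be tested without real delays. Pointing `logging` at a file handler instead of the console would give a persistent audit trail.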

🚫 What NOT To Do

# ❌ WRONG - Never scrape sites that disallow it
requests.get("https://www.perseus.tufts.edu/hopper/text?doc=...")
 
# ✅ CORRECT - Use the official GitHub mirror instead
# git clone https://github.com/PerseusDL/canonical-greekLit.git

📊 Current Corpus Status

| Source | Texts | Status |
|---|---|---|
| canonical-greekLit (GitHub) | ~500 | ✅ Clone ready |
| First1KGreek (GitHub) | ~200 | ✅ Clone ready |
| Archive.org Loebs | ~50 | ✅ API ready |
| greek_corpus (ChromaDB) | 1,156 chunks | ✅ Exists |


Ethical acquisition only. Use approved sources. Respect robots.txt.