Greek Text Acquisition Pipeline

Ethical acquisition of ancient Greek texts from explicitly open sources only.

⚠️ CRITICAL: robots.txt Compliance

ALWAYS check robots.txt before any automated access.
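
A minimal sketch of that check, using Python's standard `urllib.robotparser`. The user-agent string and the sample robots.txt body are illustrative assumptions, not the rules of any real site.

```python
from urllib import robotparser


def is_allowed(robots_txt: str, url: str, user_agent: str = "greek-corpus-bot") -> bool:
    """Return True if the given robots.txt body lets user_agent fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt body, used only to illustrate the check.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /hopper/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "https://example.org/hopper/text?doc=1"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.org/about"))
```

For a live site, calling `set_url()` with the site's robots.txt address and then `read()` on the parser fetches the file instead of parsing a string.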

❌ DO NOT SCRAPE (robots.txt disallows)

Source	Status	Alternative
Perseus Digital Library	❌ BLOCKED	Use GitHub mirror (canonical-greekLit)
CMG Online	⚠️ Check first	Download PDFs manually if allowed
Remacle.org	⚠️ Check first	Manual download only

✅ APPROVED SOURCES (Open/API/Download)

Source	Method	License
GitHub: OpenGreekAndLatin	`git clone`	CC BY-SA ✅
GitHub: PerseusDL	`git clone`	CC BY-SA ✅
GitHub: First1KGreek	`git clone`	Open ✅
Archive.org	API (`internetarchive`)	Public Domain ✅
Wikisource	API	CC ✅

📚 Approved Source Details

1. GitHub Repositories ⭐ (PRIMARY)

These repositories are published for bulk download and reuse:

#!/bin/bash
# clone_greek_repos.sh - Clone APPROVED Greek text repos

DEST_DIR="$HOME/projects/knowledge-rag/greek_texts/github"
mkdir -p "$DEST_DIR"
# Abort if the directory is unavailable, so nothing is cloned into the wrong place
cd "$DEST_DIR" || exit 1

# First1KGreek - Greek texts including medical
git clone --depth 1 https://github.com/OpenGreekAndLatin/First1KGreek.git

# Perseus canonical Greek (official GitHub mirror; use this instead of scraping)
git clone --depth 1 https://github.com/PerseusDL/canonical-greekLit.git

# Open Greek and Latin: csel-dev (CSEL, a Latin corpus; not a Greek source)
git clone --depth 1 https://github.com/OpenGreekAndLatin/csel-dev.git

echo "Done! TEI XML files in $DEST_DIR"

What you get:

The Perseus Greek collection as TEI XML (canonical-greekLit)
First1KGreek texts, including medical authors (Galen, Hippocrates, Oribasius)
Openly licensed material, published for reuse
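
Once the repositories are cloned, the Greek editions have to be separated from translations and metadata files. The sketch below assumes the usual layout of these repositories: a `data/` directory, with Greek editions carrying `-grc` in the file name (for example `tlg0012.tlg001.perseus-grc2.xml`). Verify that against your checkout; the function name is hypothetical.

```python
from pathlib import Path


def find_greek_editions(repo_root: Path) -> list[Path]:
    """Return sorted TEI XML files whose names mark them as Greek editions."""
    data_dir = repo_root / "data"
    # Assumption: Greek editions contain "-grc" in the file name, while
    # translations ("-eng") and "__cts__.xml" metadata files do not.
    return sorted(p for p in data_dir.rglob("*.xml") if "-grc" in p.name)


if __name__ == "__main__":
    root = Path("greek_texts/github/canonical-greekLit")
    for path in find_greek_editions(root)[:10]:
        print(path)
```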

2. Archive.org (Public Domain Loebs)

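Loeb Classical Library volumes published before 1927 are in the public domain in the United States (the cutoff year advances each January 1, so later volumes may now qualify as well):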

#!/usr/bin/env python3
"""archive_loeb_download.py - Download public domain Loeb volumes via API"""

import time
from itertools import islice
from pathlib import Path

import internetarchive as ia

OUTPUT_DIR = Path("greek_texts/archive_org")
REQUEST_DELAY_S = 2  # minimum pause between items, per the checklist

# Archive.org provides an official API and client library for this
SEARCH_QUERIES = [
    "collection:medicalheritagelibrary AND hippocrates",
    "collection:americana AND creator:Galen AND year:[1850 TO 1926]",
]


def download_public_domain(query: str, max_items: int = 10) -> None:
    """Download the text files of the first max_items search hits."""
    # search_items yields one dict per hit, each with an 'identifier' key
    for item in islice(ia.search_items(query), max_items):
        identifier = item["identifier"]
        print(f"Downloading: {identifier}")
        # 'Text' is the format the original script requested; scanned books
        # often expose their OCR text as 'DjVuTXT', so both are accepted here
        ia.download(identifier, destdir=str(OUTPUT_DIR), formats=["DjVuTXT", "Text"])
        time.sleep(REQUEST_DELAY_S)


if __name__ == "__main__":
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for q in SEARCH_QUERIES:
        download_public_domain(q)
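
Only one of the queries above restricts by year, so it can help to check each item's publication year before downloading it. The following is a sketch under assumptions: it supposes item metadata carries a `year` or `date` field, and the helper names and default cutoff are illustrative.

```python
import re
from typing import Optional


def publication_year(metadata: dict) -> Optional[int]:
    """Best-effort four-digit year from an item's 'year' or 'date' field."""
    for key in ("year", "date"):
        match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", str(metadata.get(key, "")))
        if match:
            return int(match.group(1))
    return None


def is_before_cutoff(metadata: dict, cutoff_year: int = 1927) -> bool:
    """True only when a year is present and earlier than cutoff_year."""
    year = publication_year(metadata)
    return year is not None and year < cutoff_year
```

Items with no recognizable year are rejected rather than assumed safe, which errs on the side of skipping a download.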

3. Wikisource Greek Texts

Greek Wikisource (el.wikisource.org) exposes its texts through the MediaWiki API:

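# Use the MediaWiki API that Wikisource exposes - explicitly allowed
import requests

API_URL = "https://el.wikisource.org/w/api.php"
# Wikimedia asks API clients to send a descriptive User-Agent; replace this placeholder
HEADERS = {"User-Agent": "greek-corpus-pipeline/0.1 (contact: you@example.org)"}


def get_wikisource_greek(title: str) -> dict:
    """Fetch a page's latest revision from Wikisource via the official API"""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    response = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.json()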
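
The function above returns the raw API response, so the page text still has to be pulled out of the nested JSON. The helper below is a sketch (its name is hypothetical) that handles the two response shapes the MediaWiki query API commonly returns: the legacy form with the content under `"*"` on the revision, and the slot-based form with it under `slots.main`.

```python
from typing import Optional


def extract_wikitext(response: dict) -> Optional[str]:
    """Return the wikitext of the first existing page in a query response."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" in page:
            continue  # the requested title does not exist
        revisions = page.get("revisions", [])
        if not revisions:
            continue
        revision = revisions[0]
        # Slot-based responses nest the content under slots.main;
        # legacy responses keep it directly on the revision.
        slot = revision.get("slots", {}).get("main", revision)
        return slot.get("*")
    return None
```

Returning `None` for missing pages lets the caller log the miss and move on without raising.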

🎯 Acquisition Priorities

Tier 1: Medical/Pharmacological

Author	Works	Approved Source
Galen	Medical treatises	First1KGreek (GitHub)
Hippocrates	Complete Corpus	canonical-greekLit (GitHub)
Dioscorides	De Materia Medica	Archive.org (pre-1927)
Theophrastus	Historia Plantarum	canonical-greekLit (GitHub)

Tier 2: Philosophy

Author	Approved Source
Aristotle	canonical-greekLit (GitHub)
Plato	canonical-greekLit (GitHub)
Neoplatonists	First1KGreek (GitHub)

🛠️ Pipeline Architecture

┌─────────────────────────────────────────────────────────┐
│            ETHICAL GREEK ACQUISITION PIPELINE           │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │
│  │   GitHub    │  │ Archive.org │  │  Wikisource │    │
│  │  git clone  │  │    API      │  │     API     │    │
│  │  (TEI XML)  │  │  (pre-1927) │  │   (Greek)   │    │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘    │
│         │                │                │            │
│         └────────────────┼────────────────┘            │
│                          ▼                             │
│              ┌───────────────────────┐                 │
│              │   TEI/XML Parser      │                 │
│              │  - Extract Greek text │                 │
│              │  - Preserve metadata  │                 │
│              └───────────┬───────────┘                 │
│                          ▼                             │
│              ┌───────────────────────┐                 │
│              │   Chunk & Embed       │                 │
│              └───────────┬───────────┘                 │
│                          ▼                             │
│              ┌───────────────────────┐                 │
│              │    greek_corpus       │                 │
│              │     (ChromaDB)        │                 │
│              └───────────────────────┘                 │
│                                                         │
└─────────────────────────────────────────────────────────┘
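
The parser and chunking stages in the diagram can be sketched with the standard library alone. This is an illustration under assumptions, not the pipeline's actual code: it assumes TEI P5 files in the `http://www.tei-c.org/ns/1.0` namespace, keeps only the title as metadata, and uses a simple word-window chunker whose sizes are arbitrary. Embedding and the ChromaDB write are left out.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _flatten(element) -> str:
    """Join all text under an element and collapse runs of whitespace."""
    if element is None:
        return ""
    # Joining with spaces keeps adjacent elements apart; it may split a
    # word that is interrupted by inline markup.
    return " ".join(" ".join(element.itertext()).split())


def extract_tei(xml_text: str) -> dict:
    """Return the title and body text of a TEI document."""
    root = ET.fromstring(xml_text)
    return {
        "title": _flatten(root.find(".//tei:titleStmt/tei:title", TEI_NS)),
        "text": _flatten(root.find(".//tei:body", TEI_NS)),
    }


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk would then be embedded and stored with its title so that retrieval results can be traced back to a source text.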

📋 Checklist Before Any Acquisition

Check robots.txt at target domain
Verify license allows automated download
Use official APIs when available
Prefer git clone for GitHub-hosted corpora
Rate limit any HTTP requests (minimum 2s delay)
Log everything for audit trail
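
The rate-limit and audit-log items can be combined in a small helper. This is a sketch: the class name, logger name and log format are assumptions. The clock and sleep functions are injectable so the behaviour can be tested without waiting.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("acquisition")


class RateLimiter:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(
        self,
        min_delay: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_delay - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


def log_access(url: str, outcome: str) -> None:
    """Record one access attempt for the audit trail."""
    logger.info("url=%s outcome=%s", url, outcome)
```

A caller would invoke `wait()` immediately before each HTTP request and `log_access()` immediately after it, so the log records every attempt in order.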

🚫 What NOT To Do

# ❌ WRONG - Never scrape sites that disallow it
requests.get("https://www.perseus.tufts.edu/hopper/text?doc=...")
 
# ✅ CORRECT - Use the official GitHub mirror instead
# git clone https://github.com/PerseusDL/canonical-greekLit.git

📊 Current Corpus Status

Source	Texts	Status
canonical-greekLit (GitHub)	~500	✅ Clone ready
First1KGreek (GitHub)	~200	✅ Clone ready
Archive.org Loebs	~50	✅ API ready
greek_corpus (ChromaDB)	1,156 chunks	✅ Exists

Ethical acquisition only. Use approved sources. Respect robots.txt.
