Cross-Domain Lexicon System
Custom lexicon built from multi-corpus entity extraction. Recursive refinement captures expert-level domain intelligence through iterative enrichment.
Purpose
Traditional lexicons are static. Ours grows through:
- Multi-corpus extraction: terms emerge from actual usage
- Cross-domain mapping: same term, different meanings
- Recursive refinement: each pass adds context
- Expert system capture: implicit knowledge made explicit
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                  CROSS-DOMAIN LEXICON PIPELINE                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ greek_   │  │ science_ │  │physicists│  │ osint_   │         │
│  │ corpus   │  │ corpus   │  │ _corpus  │  │ corpus   │         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       └─────────────┴──────┬──────┴─────────────┘               │
│                            ▼                                    │
│               ┌─────────────────────────┐                       │
│               │    ENTITY EXTRACTION    │                       │
│               │   (Pass 1: Raw terms)   │                       │
│               └────────────┬────────────┘                       │
│                            ▼                                    │
│               ┌─────────────────────────┐                       │
│               │     CROSS-REFERENCE     │                       │
│               │  (Find term overlaps)   │                       │
│               └────────────┬────────────┘                       │
│                            ▼                                    │
│               ┌─────────────────────────┐                       │
│               │  RECURSIVE REFINEMENT   │◄──────┐               │
│               │  (Pass N: Add context)  │       │               │
│               └────────────┬────────────┘       │               │
│                            │                    │               │
│                            ├────────────────────┘               │
│                            ▼ (iterate)                          │
│               ┌─────────────────────────┐                       │
│               │     MASTER LEXICON      │                       │
│               │    (ChromaDB + JSON)    │                       │
│               └─────────────────────────┘                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Lexicon Schema
Term Entry Structure
{
  "term": "mitochondria",
  "canonical": "mitochondrion",
  "domains": {
    "quantum_biology": {
      "frequency": 4550,
      "context": "quantum coherence, electron transport, biophoton emission",
      "related_terms": ["ATP", "ETC", "CCO", "heteroplasmy"],
      "sample_chunks": ["chunk_id_1", "chunk_id_2"]
    },
    "greek_medical": {
      "frequency": 0,
      "greek_equivalent": null,
      "note": "Ancient Greeks had no microscopy; humoral model instead"
    },
    "physicists": {
      "frequency": 45,
      "context": "cellular energy, thermodynamics",
      "related_terms": ["entropy", "dissipative structures"]
    }
  },
  "cross_domain_notes": "Bridge between ancient humoral theory (χυμός) and modern bioenergetics",
  "refinement_passes": 3,
  "confidence": 0.92,
  "last_updated": "2026-02-02"
}
Cross-Reference Entry
{
  "mapping_id": "pharmakon_mitochondria_001",
  "greek_term": "φάρμακον",
  "modern_terms": ["drug", "medicine", "pharmaceutical", "toxin"],
  "mechanism_bridge": {
    "ancient_concept": "substance that heals or harms",
    "modern_mechanism": "mitochondrial modulation, receptor binding",
    "quantum_angle": "electron transport chain interference"
  },
  "corpus_evidence": {
    "greek_corpus": ["chunk_123", "chunk_456"],
    "science_corpus": ["chunk_789"],
    "physicists_corpus": ["chunk_012"]
  },
  "confidence": 0.78,
  "refinement_history": [
    {"pass": 1, "date": "2026-02-01", "added": "basic mapping"},
    {"pass": 2, "date": "2026-02-02", "added": "mechanism bridge"},
    {"pass": 3, "date": "2026-02-02", "added": "quantum angle"}
  ]
}
Recursive Refinement Algorithm
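Each refinement pass is expected to leave a record in an entry's `refinement_history`, as in the cross-reference entry above, but the pass functions in this section do not write that record themselves. A minimal sketch of one way to do it follows. The helper name `record_refinement` and its parameters are assumptions for illustration and are not part of the original pipeline.

```python
from datetime import date
from typing import Optional


def record_refinement(entry: dict, added: str, on: Optional[date] = None) -> dict:
    """Append one record to entry["refinement_history"] (hypothetical helper)."""
    history = entry.setdefault("refinement_history", [])
    # The next pass number follows the highest one already recorded.
    next_pass = max((h["pass"] for h in history), default=0) + 1
    history.append({
        "pass": next_pass,
        "date": (on or date.today()).isoformat(),
        "added": added,
    })
    return entry


entry = {
    "mapping_id": "pharmakon_mitochondria_001",
    "refinement_history": [
        {"pass": 1, "date": "2026-02-01", "added": "basic mapping"},
    ],
}
record_refinement(entry, "mechanism bridge", on=date(2026, 2, 2))
print(entry["refinement_history"][-1])
```

Deriving the pass number from the existing history, rather than passing it in, keeps the numbering consistent even if a pass is re-run.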
Pass 1: Raw Extraction
def pass_1_extract(corpus: str) -> dict:
    """Extract raw entities from corpus"""
    # get_all_chunks and llm_extract_entities are project helpers defined elsewhere
    entities = {}
    for chunk in get_all_chunks(corpus):
        extracted = llm_extract_entities(chunk.text)
        for entity in extracted:
            if entity not in entities:
                entities[entity] = {
                    "frequency": 0,
                    "chunks": [],
                    "contexts": []
                }
            entities[entity]["frequency"] += 1
            entities[entity]["chunks"].append(chunk.id)
            # Keep only the first 200 characters as a context snippet
            entities[entity]["contexts"].append(chunk.text[:200])
    return entities
Pass 2: Cross-Domain Mapping
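The cross-mapping function in this section calls `find_similar_terms`, which this document does not define. The sketch below shows one possible shape for it, assuming cosine similarity over term embeddings. The `embed` parameter and the `toy_embed` placeholder (character-bigram counts) are assumptions for illustration. A real pipeline would presumably pass in a sentence-embedding model instead.

```python
import math
from collections import Counter
from typing import Callable, Iterable, Set


def toy_embed(term: str) -> Counter:
    """Placeholder embedding: character-bigram counts (not a real model)."""
    padded = f" {term.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_similar_terms(
    term: str,
    candidates: Iterable[str],
    threshold: float = 0.85,
    embed: Callable[[str], Counter] = toy_embed,
) -> Set[str]:
    """Return the candidates whose similarity to `term` is at least `threshold`."""
    target = embed(term)
    return {
        candidate
        for candidate in candidates
        if candidate != term and cosine(target, embed(candidate)) >= threshold
    }


matches = find_similar_terms(
    "mitochondria", ["mitochondrion", "entropy"], threshold=0.8
)
print(matches)
```

Because `candidates` is any iterable of strings, a per-corpus entity dict can be passed directly and its keys are compared. With the toy embedding, "mitochondria" and "mitochondrion" score about 0.83, so the right threshold depends heavily on the embedding model used.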
from itertools import combinations

def pass_2_crossmap(all_entities: dict) -> dict:
    """Find same/similar terms across domains"""
    mappings = {}
    # Visit each unordered pair of corpora exactly once
    for (corpus_a, entities_a), (corpus_b, entities_b) in combinations(all_entities.items(), 2):
        # Exact matches
        overlap = set(entities_a) & set(entities_b)
        # Semantic matches (embedding similarity); matched terms keep corpus_b's spelling
        for term_a in entities_a:
            similar = find_similar_terms(term_a, entities_b, threshold=0.85)
            overlap.update(similar)
        for term in overlap:
            # Merge into any existing mapping so a term shared by three or
            # more corpora keeps every domain instead of being overwritten
            entry = mappings.setdefault(term, {"domains": [], "frequencies": {}})
            for corpus, entities in ((corpus_a, entities_a), (corpus_b, entities_b)):
                if corpus not in entry["domains"]:
                    entry["domains"].append(corpus)
                entry["frequencies"][corpus] = entities.get(term, {}).get("frequency", 0)
    return mappings
Pass 3+: Contextual Enrichment
def pass_n_enrich(term: str, current_entry: dict, n: int) -> dict:
    """Recursively enrich term with deeper context"""
    # Get all chunks containing term
    chunks = get_chunks_containing(term)
    # Extract co-occurring entities (assumed to return a ranked list)
    cooccurrence = extract_cooccurrence(chunks)
    # Ask LLM for deeper analysis
    enrichment_prompt = f"""
    Term: {term}
    Current understanding: {current_entry}
    Sample contexts: {chunks[:5]}
    Co-occurring terms: {cooccurrence[:20]}
    Provide:
    1. Refined definition incorporating all domains
    2. Key relationships not yet captured
    3. Cross-domain bridges (how ancient/modern concepts connect)
    4. Confidence assessment
    """
    # llm_analyze is assumed to return a dict with "bridges",
    # "relationships" and "confidence" keys
    enriched = llm_analyze(enrichment_prompt)
    current_entry["refinement_passes"] = n
    current_entry["cross_domain_notes"] = enriched["bridges"]
    # setdefault avoids a KeyError on entries with no top-level related_terms yet
    current_entry.setdefault("related_terms", []).extend(enriched["relationships"])
    current_entry["confidence"] = enriched["confidence"]
    return current_entry
Iteration Controller
def recursive_refinement(lexicon: dict, max_passes: int = 5) -> dict:
    """Iterate until convergence or max passes"""
    for n in range(1, max_passes + 1):
        changes = 0
        for term, entry in lexicon.items():
            # Read the old value first: pass_n_enrich updates the entry in place
            old_confidence = entry.get("confidence", 0)
            enriched = pass_n_enrich(term, entry, n)
            # Check if significantly changed
            if abs(enriched["confidence"] - old_confidence) > 0.05:
                changes += 1
            lexicon[term] = enriched
        print(f"Pass {n}: {changes} terms updated")
        # Convergence check
        if changes < len(lexicon) * 0.01:  # <1% changed
            print(f"Converged at pass {n}")
            break
    return lexicon
Domain-Specific Lexicons
Quantum Biology Lexicon
| Term | Frequency | Key Context |
|---|---|---|
| mitochondria | 4,550 | Quantum coherence, biophotons |
| melanin | 3,822 | Quantum antenna, evolution |
| deuterium | 2,349 | Kinetic isotope effect |
| heteroplasmy | 403 | mtDNA mutation load |
| proton tunneling | 392 | Grotthuss mechanism |
Greek Medical Lexicon
| Greek | Transliteration | Modern Mapping |
|---|---|---|
| φάρμακον | pharmakon | drug/toxin (dose-dependent) |
| χυμός | chymos | biochemical milieu |
| θεραπεία | therapeia | therapeutic intervention |
| κρᾶσις | krasis | homeostatic balance |
| δύναμις | dynamis | bioactive potency |
Physics Lexicon
| Term | Frequency | Cross-Domain Bridge |
|---|---|---|
| coherence | 1,200+ | QBio: biological coherence |
| tunneling | 800+ | QBio: enzyme catalysis |
| entropy | 600+ | QBio: negentropy of life |
| field | 2,000+ | QBio: biofield, nnEMF |
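The per-domain tables above can in principle be derived from master-lexicon entries rather than maintained by hand. The sketch below is one way to do that, assuming the master lexicon is a dict keyed by term whose values follow the Term Entry Structure. The function name `top_terms` is hypothetical.

```python
from typing import List, Tuple


def top_terms(lexicon: dict, domain: str, k: int = 5) -> List[Tuple[str, int]]:
    """Return (term, frequency) pairs for `domain`, most frequent first."""
    rows = []
    for term, entry in lexicon.items():
        freq = entry.get("domains", {}).get(domain, {}).get("frequency", 0)
        if freq > 0:  # skip terms never attested in this domain
            rows.append((term, freq))
    # Sort by descending frequency, then alphabetically for stable ties.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:k]


# Frequencies taken from the Quantum Biology table and the sample term entry.
lexicon = {
    "melanin": {"domains": {"quantum_biology": {"frequency": 3822}}},
    "mitochondria": {
        "domains": {
            "quantum_biology": {"frequency": 4550},
            "greek_medical": {"frequency": 0},
            "physicists": {"frequency": 45},
        }
    },
    "deuterium": {"domains": {"quantum_biology": {"frequency": 2349}}},
}

for term, freq in top_terms(lexicon, "quantum_biology", k=2):
    print(f"| {term} | {freq:,} |")
```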
Expert System Capture
Implicit Knowledge Extraction
The lexicon captures expert-level knowledge through:
- Co-occurrence patterns: what experts mention together
- Contextual usage: how terms are actually used
- Cross-domain bridges: connections only experts see
- Confidence gradients: which mappings are solid vs speculative
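The first item, co-occurrence, is the most mechanical of the four and can be sketched directly. The example below assumes each chunk has already been reduced to a list of extracted entities and counts how often each pair appears in the same chunk. The function name `cooccurrence_counts` is an illustration and is not necessarily how the pipeline's `extract_cooccurrence` works.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def cooccurrence_counts(chunk_entities: Iterable[List[str]]) -> Counter:
    """Count how often each unordered pair of entities shares a chunk."""
    pairs = Counter()
    for entities in chunk_entities:
        # De-duplicate within a chunk and sort so (a, b) and (b, a) collapse.
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs


chunks = [
    ["mitochondria", "ATP", "heteroplasmy"],
    ["mitochondria", "ATP"],
    ["entropy", "mitochondria"],
]
pairs = cooccurrence_counts(chunks)
for (a, b), count in pairs.most_common(3):
    print(f"{a} + {b}: {count}")
```

Raw counts favor frequent terms. A weighting such as pointwise mutual information may be a better signal for what experts specifically mention together, though that choice is outside what the original describes.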
Example: Theriac → Modern Pharmacology
Ancient term: θηριακή (thēriakē) - "beast medicine"
├── Greek corpus context: antidote to venomous bites, 60+ ingredients
├── Cross-reference: opium, viper flesh, botanical compounds
├── Modern mapping:
│   ├── Polypharmacy (multi-compound formulation)
│   ├── Mithridatism (graduated poison tolerance)
│   └── Hormesis (low-dose stimulation)
├── Quantum biology angle:
│   ├── Mitochondrial hormesis
│   └── Adaptive stress response
└── Confidence: 0.72 (solid historical, speculative mechanism)
Implementation TODO
Phase 1: Foundation
- Build lexicon ChromaDB collection
- Pass 1 extraction on all corpora
- JSON export for inspection
Phase 2: Cross-Mapping
- Implement semantic similarity matching
- Greek→English term mapping
- Build cross-reference index
Phase 3: Recursive Refinement
- LLM enrichment pipeline
- Convergence detection
- Confidence scoring
Phase 4: Expert System
- Query interface for lexicon
- Integration with Pharmakon Miner
- Novel connection discovery
Tools
lexicon_builder.py
#!/usr/bin/env python3
"""Build cross-domain lexicon from corpora"""
import json

class LexiconBuilder:
    def __init__(self, chroma_client, corpora: list):
        self.client = chroma_client
        self.corpora = corpora
        self.lexicon = {}

    def build(self, max_passes: int = 5):
        # extract_entities, build_cross_map and recursive_refine are not
        # shown here; they wrap the pass functions described above
        # Pass 1: Extract from all corpora
        for corpus in self.corpora:
            self.lexicon[corpus] = self.extract_entities(corpus)
        # Pass 2: Cross-map
        self.cross_mappings = self.build_cross_map()
        # Pass 3+: Recursive refinement
        self.master_lexicon = self.recursive_refine(max_passes)
        return self.master_lexicon

    def export(self, path: str):
        # UTF-8 with ensure_ascii=False keeps Greek terms readable in the JSON
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(self.master_lexicon, f, indent=2, ensure_ascii=False)
Metrics
| Metric | Target | Current |
|---|---|---|
| Unique terms | 10,000+ | TBD |
| Cross-domain mappings | 2,000+ | TBD |
| Greek→Modern bridges | 500+ | TBD |
| Avg confidence | >0.75 | TBD |
| Refinement passes | 3-5 | TBD |
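Once a master lexicon exists, the "Current" column could be filled in by a small script. The sketch below assumes entries follow the Term Entry Structure and, as a stand-in for the cross-domain mapping count, counts terms attested (frequency above zero) in at least two domains. Both that definition and the name `lexicon_metrics` are assumptions.

```python
from typing import Optional


def lexicon_metrics(lexicon: dict) -> dict:
    """Summarize a master lexicon keyed by term (hypothetical helper)."""
    confidences = [e["confidence"] for e in lexicon.values() if "confidence" in e]

    def attested_domains(entry: dict) -> int:
        # Number of domains in which the term actually occurs.
        return sum(
            1 for d in entry.get("domains", {}).values() if d.get("frequency", 0) > 0
        )

    avg: Optional[float] = (
        sum(confidences) / len(confidences) if confidences else None
    )
    return {
        "unique_terms": len(lexicon),
        "cross_domain_terms": sum(
            1 for e in lexicon.values() if attested_domains(e) >= 2
        ),
        "avg_confidence": avg,
    }


# Illustrative data only; the melanin confidence is made up for the example.
sample = {
    "mitochondria": {
        "domains": {
            "quantum_biology": {"frequency": 4550},
            "greek_medical": {"frequency": 0},
            "physicists": {"frequency": 45},
        },
        "confidence": 0.92,
    },
    "melanin": {
        "domains": {"quantum_biology": {"frequency": 3822}},
        "confidence": 0.5,
    },
}
metrics = lexicon_metrics(sample)
print(metrics)
```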
Related
Recursive refinement: each pass makes the lexicon smarter.