A comprehensive Python package for morphological analysis combining derivational stemming, inflectional analysis, and cross-lingual etymology.
- Rust-accelerated derivational stemming backend via PyO3
- Automatic fallback to pure-Python derivational logic when Rust extension is unavailable
- Backend parity coverage for `stem`, `get_derivations`, and `get_word_family`
- Updated benchmark harness with active-backend vs Python-fallback comparisons
- Production-stable `1.0.0` packaging metadata
Crosstem finds true linguistic roots across part-of-speech boundaries, which is something traditional stemmers and lemmatizers cannot do.

    # Traditional stemmers (Porter, Lancaster) - Rule-based, prone to errors
    Porter: "organization" → "organ"  # Overstemming loses meaning

    # Lemmatizers (WordNet, spaCy) - Only handle inflections, not derivations
    WordNet: "organization" → "organization"  # Can't cross POS boundaries
    WordNet: "beautiful" → "beautiful"  # Stuck at adjective form

    # Crosstem - Linguistically accurate, crosses POS boundaries
    Crosstem: "organization" → "organize"  # Noun → Verb (true root)
    Crosstem: "beautiful" → "beauty"  # Adjective → Noun (semantic base)

- Cross-POS derivational stemming: Only library that finds roots across parts of speech
- Linguistic accuracy: Uses MorphyNet morphological data, not brittle rules
- Etymology tracing: 4.2M relationships across 2,265 languages (unique feature)
- Word families: Discover complete derivational networks (e.g., organize → 43 related words)
- Fast hybrid runtime: Rust-accelerated derivational engine with automatic pure-Python fallback
- 15 languages: Multilingual morphology support out of the box
We compared Crosstem against the widely used Porter stemmer on 44 English words with 1,000 iterations each.

    Crosstem: ~0.036s (~1,217,000 words/sec)
    Porter:   ~0.490s (~90,000 words/sec)
    ⚡ Crosstem is ~13× FASTER than Porter

Why? Crosstem resolves each word with O(1) average-case hash lookups in dictionaries loaded from JSON, while Porter applies a sequence of pattern-matching rules to every word.
Note: Results averaged over multiple runs; ±3% variance is normal due to system load.
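
The throughput figures follow from the workload size and the elapsed time. A minimal sketch of such a harness, assuming any callable that maps a word to its stem (`benchmark` and `words_per_second` are illustrative helpers, not part of Crosstem or its benchmark script):

```python
import time

def words_per_second(n_words, iterations, elapsed_seconds):
    """Throughput for a run that stems n_words words, iterations times."""
    return n_words * iterations / elapsed_seconds

def benchmark(stem, words, iterations=1000):
    """Time `stem` over `words`, repeated `iterations` times; return seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        for word in words:
            stem(word)
    return time.perf_counter() - start

# Arithmetic behind the reported Porter figure:
# 44 words x 1,000 iterations in ~0.490 s.
print(round(words_per_second(44, 1000, 0.490)))  # 89796
```

Passing `stemmer.stem` for each stemmer under test to `benchmark` and feeding the result to `words_per_second` gives comparable numbers on your own machine.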
| Word | Crosstem | Porter | Winner |
|---|---|---|---|
| organization | organize | organ | ✅ Crosstem (finds true root) |
| organizational | organize | organiz | ✅ Crosstem (multi-hop) |
| beautiful | beauty | beauti | ✅ Crosstem (crosses POS) |
| destruction | destruct | destruct | ⚖️ Tie |
| democracy | democracy | democraci | ✅ Crosstem (avoids error) |
| computerization | compute | computer | ✅ Crosstem (deeper root) |
| happiness | happy | happi | ✅ Crosstem (productivity filter avoids "hap") |
| redness | red | red | ⚖️ Tie |
Key Findings:
- Cross-POS stemming: Crosstem finds roots across parts of speech (`organization` → `organize`, verb), Porter cannot
- Overstemming prevention: Porter creates non-words (`beauti`, `organiz`), Crosstem always produces real words
- Data quality: Crosstem filters bad roots (`democrat`), Porter has no quality control
- Multi-hop: Crosstem traverses multiple derivations (`organizational` → `organization` → `organize`), Porter applies suffix rules with no knowledge of the derivational chain
Choose Crosstem when:
- ✅ Need linguistically accurate roots
- ✅ Working with derivational families (organize/organizer/organization)
- ✅ Building semantic search, clustering, or word embeddings
- ✅ Quality matters more than simplicity
- ✅ Multilingual support needed (15 languages)
Choose Porter when:
- ✅ Legacy system compatibility required
- ✅ Working with noisy/misspelled text (rule-based is robust)
- ✅ Only need basic suffix normalization
- ✅ Want the absolute simplest possible solution
Note: In the benchmark above, Crosstem is both faster and more accurate than Porter, which makes it the better choice for most modern NLP applications.
- Derivational Stemming: Find roots across part-of-speech boundaries (organization → organize)
- Inflectional Analysis: Lemmatization and grammatical forms (running → run)
- Cross-Lingual Etymology: Trace word origins across 2,265 languages
- Word Family Analysis: Discover complete derivational networks

Install with pip:

    pip install crosstem

Crosstem now supports an accelerated Rust derivational backend (PyO3).
When a prebuilt wheel includes the extension, it is used automatically.
If the Rust extension is unavailable, Crosstem falls back to the pure-Python
derivational implementation.
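
The fallback behaviour described here is a common pattern for optional compiled extensions. A minimal sketch, assuming a hypothetical extension module name and a stand-in pure-Python class (neither is Crosstem's actual internal layout; only the `use_rust_backend` flag name is borrowed from the public API):

```python
import importlib

def load_backend(module_name, fallback, use_rust_backend=True):
    """Return (backend, name): the compiled module if importable, else fallback."""
    if use_rust_backend:
        try:
            return importlib.import_module(module_name), "rust"
        except ImportError:
            pass  # extension not built, or not shipped in this wheel
    return fallback, "python"

class PurePythonBackend:
    """Stand-in for a pure-Python implementation (hypothetical)."""
    def stem(self, word):
        return word

# "crosstem_rust_ext_example" is a made-up module name, so the import
# fails and the pure-Python fallback is selected.
backend, name = load_backend("crosstem_rust_ext_example", PurePythonBackend())
print(name)  # python
```

Catching `ImportError` only at import time keeps the decision in one place, so the rest of the code can call the chosen backend without checking which one it is.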

To build the Rust extension locally from a source checkout:

    pip install maturin
    maturin develop --manifest-path rust/Cargo.toml

To force the pure-Python path for comparison/debugging:

    from crosstem import DerivationalStemmer

    stemmer = DerivationalStemmer("eng", use_rust_backend=False)

Etymology features require additional data (~1 GB) that is downloaded separately:

    from crosstem import download_etymology

    # One-time download (saves to package data directory)
    download_etymology()

Or from the command line:

    python -m crosstem.download

Basic analysis (no etymology data needed):

    from crosstem import MorphologyAnalyzer

    # Works immediately - no etymology needed
    analyzer = MorphologyAnalyzer('eng', load_etymology=False)
    result = analyzer.analyze('organizations')
    print(result['derivational_stem'])  # 'organize'
    print(result['inflectional_lemma'])  # 'organization'

Analysis with etymology:

    from crosstem import MorphologyAnalyzer, download_etymology

    # Download etymology data first (one-time)
    if not MorphologyAnalyzer.is_etymology_available():
        download_etymology()

    # Now etymology features work
    analyzer = MorphologyAnalyzer('eng', load_etymology=True)
    result = analyzer.analyze('portmanteau')
    print(result['etymology'])  # Shows Middle French origin

Derivational stemming:

    from crosstem import DerivationalStemmer

    stemmer = DerivationalStemmer('eng')
    stemmer.stem('organization')  # 'organize'
    stemmer.stem('beautiful')  # 'beauty'
    family = stemmer.get_word_family('organize')
    # ['organize', 'organizer', 'organization', ...]

Inflectional analysis:

    from crosstem import InflectionAnalyzer

    inflector = InflectionAnalyzer('eng')
    inflector.get_lemma('running')  # 'run'
    forms = inflector.get_inflections('run')
    # [{'form': 'runs', 'pos': 'V', ...}, ...]

Etymology tracing:

    from crosstem import EtymologyLinker, download_etymology

    # Download etymology data first (one-time, ~1 GB)
    download_etymology()

    linker = EtymologyLinker()
    chain = linker.trace_origin_chain('portmanteau', 'English')
    # [{'term': 'portmanteau', 'lang': 'English'},
    #  {'term': 'portemanteau', 'lang': 'Middle French', ...}]

15 languages with derivational morphology:
- English (eng), Russian (rus), French (fra), German (deu), Spanish (spa)
- Portuguese (por), Italian (ita), Polish (pol), Czech (ces)
- Serbo-Croatian (hbs), Hungarian (hun), Finnish (fin)
- Swedish (swe), Mongolian (mon), Catalan (cat)
Plus 2,265 languages with etymology data.
Crosstem is built on three pillars of morphological linguistics:
Unlike inflection (which modifies words grammatically), derivation creates new words by adding affixes or converting between parts of speech:
- `organize` + `-ation` → `organization` (verb → noun)
- `beauty` + `-ful` → `beautiful` (noun → adjective)
- `organize` + `-er` → `organizer` (agent noun)
Crosstem models this as a directed graph where:
- Nodes = word forms with POS tags
- Edges = derivational relationships (affixes, conversions)
- Stemming = graph traversal to find the root (preferring verbs and shorter forms)

The `organize` family as a graph:

                      ┌─────────────┐
                      │  organize   │ ← ROOT (verb, shortest)
                      │     (V)     │
                      └──────┬──────┘
        ┌─────────────┬──────┴───────┬────────────┐
        ▼             ▼              ▼            ▼
    organizer    organization    organized    reorganize
       (N)           (N)           (ADJ)         (V)
                      │
                      ▼
               organizational
                    (ADJ)

This graph-based approach ensures linguistically accurate roots, avoiding the overstemming problem of rule-based stemmers.
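
The graph model can be illustrated with plain dictionaries. This is a minimal sketch using only the `organize` family from the example above; the `DERIVED_FROM` mapping, `POS` table, and helper names are assumptions made for illustration and do not reflect Crosstem's internal data layout:

```python
from collections import deque

# child -> parents ("derived from"), plus a POS tag per node.
DERIVED_FROM = {
    "organizer": ["organize"],
    "organization": ["organize"],
    "organized": ["organize"],
    "reorganize": ["organize"],
    "organizational": ["organization"],
}
POS = {"organize": "V", "organizer": "N", "organization": "N",
       "organized": "ADJ", "reorganize": "V", "organizational": "ADJ"}

def build_children(derived_from):
    """Invert child -> parents into parent -> children."""
    children = {}
    for child, parents in derived_from.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    return children

def word_family(root, derived_from):
    """Collect root plus everything reachable through derivation edges (BFS)."""
    children = build_children(derived_from)
    seen, queue = {root}, deque([root])
    while queue:
        word = queue.popleft()
        for child in children.get(word, []):
            if child not in seen:  # guards against cycles in the data
                seen.add(child)
                queue.append(child)
    return sorted(seen)

for word in word_family("organize", DERIVED_FROM):
    print(f"{word} ({POS[word]})")
```

Storing edges in the child → parent direction makes stemming a walk upward, while the inverted index serves word-family queries that walk downward.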
Inflection expresses grammatical categories without changing core meaning:
- Number: `cat` → `cats`
- Tense: `run` → `ran`, `running`
- Comparison: `good` → `better`, `best`
Crosstem stores inflectional paradigms as lemma → forms mappings:

    {
      "run": {
        "pos": "V",
        "forms": {
          "runs": [{"pos": "V", "features": "PRS;3;SG"}],
          "running": [{"pos": "V", "features": "V.PTCP;PRS"}],
          "ran": [{"pos": "V", "features": "PST"}]
        }
      }
    }

This enables both lemmatization (running → run) and paradigm generation (run → all forms).
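
Both directions can be served from the same paradigm data by inverting it once. A minimal sketch, assuming the paradigm layout shown above; the standalone functions here only mirror the names of the `InflectionAnalyzer` methods and are not the library's implementation:

```python
PARADIGMS = {
    "run": {
        "pos": "V",
        "forms": {
            "runs": [{"pos": "V", "features": "PRS;3;SG"}],
            "running": [{"pos": "V", "features": "V.PTCP;PRS"}],
            "ran": [{"pos": "V", "features": "PST"}],
        },
    }
}

def build_form_index(paradigms):
    """Invert lemma -> forms into form -> list of lemmas."""
    index = {}
    for lemma, entry in paradigms.items():
        index.setdefault(lemma, []).append(lemma)  # a lemma maps to itself
        for form in entry["forms"]:
            index.setdefault(form, []).append(lemma)
    return index

def get_lemma(form, index):
    """Return the first known lemma for form, or the form itself if unknown."""
    lemmas = index.get(form)
    return lemmas[0] if lemmas else form

def get_inflections(lemma, paradigms):
    """Return every stored form of lemma together with its analyses."""
    forms = paradigms.get(lemma, {}).get("forms", {})
    return [{"form": f, **a} for f, analyses in forms.items() for a in analyses]

index = build_form_index(PARADIGMS)
print(get_lemma("running", index))  # run
print([f["form"] for f in get_inflections("run", PARADIGMS)])  # ['runs', 'running', 'ran']
```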
Etymology traces how words evolve and transfer across languages:
- Borrowing: English `portmanteau` ← Middle French `portemanteau`
- Cognates: Dutch `woordenboek` ↔ German `Wörterbuch` (shared Germanic ancestor)
- Inheritance: Latin `mater` → French `mère`, Italian `madre`, Spanish `madre`
Crosstem represents this as a multilingual graph with typed edges:

    English: "portmanteau" ──borrowed_from──→ Middle French: "portemanteau"
                                              │
                                              has_root: "porter" (to carry)
                                              │
                                              has_root: "manteau" (coat)

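
Chain-following over typed edges can be sketched as below. The edge table reuses the `portmanteau` example; the `inherited_from` relation name, the language assigned to the two roots, and the function body are assumptions for illustration, not the behaviour of `EtymologyLinker`:

```python
# (term, language) -> list of (relation, (term, language)) edges.
EDGES = {
    ("portmanteau", "English"): [
        ("borrowed_from", ("portemanteau", "Middle French")),
    ],
    ("portemanteau", "Middle French"): [
        # Language of the roots is assumed here for illustration.
        ("has_root", ("porter", "Middle French")),
        ("has_root", ("manteau", "Middle French")),
    ],
}

# Relations that step back to an ancestor term (names partly assumed).
ANCESTRY = {"borrowed_from", "inherited_from"}

def trace_origin_chain(term, lang, edges, ancestry=ANCESTRY):
    """Follow the first ancestry edge from each node until none remains."""
    chain, node, seen = [], (term, lang), set()
    while node is not None and node not in seen:
        seen.add(node)  # cycle detection
        chain.append({"term": node[0], "lang": node[1]})
        node = next(
            (target for relation, target in edges.get(node, [])
             if relation in ancestry),
            None,
        )
    return chain

print(trace_origin_chain("portmanteau", "English", EDGES))
```

Typed edges let the traversal ignore `has_root` links when building the historical chain while still keeping them available for compound analysis.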
All three frameworks are implemented as fast JSON lookups with graph traversal algorithms:
- Preprocessing: TSV/CSV data → optimized JSON dictionaries
- Indexing: Multi-level indices (word → derivations, lemma → inflections, term+lang → etymology)
- Traversal: BFS for word families, chain-following for etymology
- Filtering: POS preference (verbs), length minimization, cycle detection
Result: ~0.0008ms per-word stemming on benchmark runs with Rust acceleration enabled.
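
The preprocessing and indexing steps can be sketched with the standard library. The column layout below (source word, derived word, source POS, derived POS, affix, affix type) is an assumption about the input TSV, and `build_indices` is a hypothetical helper rather than Crosstem's build script:

```python
import csv
import io
import json

# Assumed column layout (not confirmed by this README):
# source word, derived word, source POS, derived POS, affix, affix type.
SAMPLE_TSV = (
    "organize\torganization\tV\tN\tation\tsuffix\n"
    "organize\torganizer\tV\tN\ter\tsuffix\n"
    "organization\torganizational\tN\tADJ\tal\tsuffix\n"
)

def build_indices(tsv_text):
    """Build word -> derivations and word -> parents lookup tables."""
    derivations, parents = {}, {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for source, target, source_pos, target_pos, affix, kind in reader:
        derivations.setdefault(source, []).append(
            {"word": target, "pos": target_pos, "affix": affix, "type": kind}
        )
        parents.setdefault(target, []).append({"word": source, "pos": source_pos})
    return {"derivations": derivations, "parents": parents}

indices = build_indices(SAMPLE_TSV)
serialized = json.dumps(indices, ensure_ascii=False)  # what would be written to disk
reloaded = json.loads(serialized)
print(reloaded["parents"]["organization"])  # [{'word': 'organize', 'pos': 'V'}]
```

Building both directions up front trades disk space for lookup speed: stemming reads `parents`, word-family queries read `derivations`, and neither needs to scan the raw rows.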
Crosstem uses a breadth-first search over the derivation graph to find the optimal root:

    # Example: organizational → organization → organize
    organizational (14 chars, ADJ)
        ↓ (depth 1)
    organization (12 chars, N, productivity=16)  ← candidate
        ↓ (depth 2)
    organize (8 chars, V, productivity=13)  ← BEST (verb, shortest)

Algorithm steps:
- Start from input word, add to queue
- Expand all DERIVED_FROM relationships (parents in morphology graph)
- Score each candidate:
  - Length (shorter is better)
  - POS (verbs score -10, nouns -5)
  - Depth (penalize by +2 per hop)
- Filter by productivity threshold (language-specific):
  - English: Verbs ≥5, Others ≥9
  - French/Italian: Verbs ≥4, Others ≥5
  - German: Verbs ≥4, Others ≥3 (compound-heavy)
  - Spanish/Portuguese: Verbs ≥3, Others ≥4
  - Russian/Slavic: Verbs ≥3, Others ≥2-3 (lower productivity)
- Continue traversal through low-productivity nodes (enables multi-hop)
- Return lowest-scoring candidate that's shorter than input or a verb
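
The steps above can be sketched as a small breadth-first search. This is an illustration under stated assumptions, not Crosstem's implementation: the score is taken to be length plus POS bonus plus 2 per hop, the toy graph holds only the words discussed in this section, and productivity values of 0 are placeholders for input words that are never candidates themselves.

```python
from collections import deque

# word -> (POS, productivity, parents). Values for "organization",
# "organize", and "democrat" come from this section; 0 is a placeholder.
GRAPH = {
    "organizational": ("ADJ", 0, ["organization"]),
    "organization": ("N", 16, ["organize"]),
    "organize": ("V", 13, []),
    "democracy": ("N", 0, ["democrat"]),
    "democrat": ("N", 7, []),
}
POS_BONUS = {"V": -10, "N": -5}
VERB_MIN, OTHER_MIN = 5, 9  # English thresholds

def find_root(word, graph=GRAPH):
    """Return the lowest-scoring acceptable ancestor, or word if none exists."""
    best, best_score = word, None
    seen, queue = {word}, deque([(word, 0)])
    while queue:
        current, depth = queue.popleft()
        for parent in graph.get(current, ("", 0, []))[2]:
            if parent in seen or parent not in graph:
                continue
            seen.add(parent)
            queue.append((parent, depth + 1))  # keep walking even if filtered
            pos, productivity, _ = graph[parent]
            if productivity < (VERB_MIN if pos == "V" else OTHER_MIN):
                continue  # fails the productivity filter
            if not (len(parent) < len(word) or pos == "V"):
                continue  # must be shorter than the input or a verb
            score = len(parent) + POS_BONUS.get(pos, 0) + 2 * (depth + 1)
            if best_score is None or score < best_score:
                best, best_score = parent, score
    return best

print(find_root("organizational"))  # organize
print(find_root("democracy"))       # democracy
```

With these weights `organization` scores 12 - 5 + 2 = 9 and `organize` scores 8 - 10 + 4 = 2, so the verb wins despite being one hop further away; `democrat` (productivity 7) is filtered, so `democracy` is returned unchanged.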
Problem: MorphyNet contains archaic roots (e.g., hap) and data errors (e.g., democracy → democrat)
Solution: Use productivity as a quality signal. Words with many derivations are more likely to be modern, correct roots.
Examples (English thresholds: V≥5, N≥9):
| Word | Productivity | POS | Threshold | Result |
|---|---|---|---|---|
| `red` | 18 derivations | N | ≥9 | ✅ PASS (productive noun) |
| `run` | 33 derivations | V | ≥5 | ✅ PASS (very productive verb) |
| `destruct` | 6 derivations | V | ≥5 | ✅ PASS (verb threshold) |
| `hap` | 8 derivations | N | ≥9 | ❌ FILTERED (archaic) |
| `democrat` | 7 derivations | N | ≥9 | ❌ FILTERED (data error) |
This data-driven approach avoids hard-coded rules while maintaining quality.
Language-Specific Calibration: Thresholds are adjusted for each language based on morphological richness. Languages with lower overall productivity (Russian, Spanish) use lower thresholds to avoid over-filtering, while English uses higher thresholds due to rich derivational data.
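
The per-language calibration amounts to a small lookup table. A minimal sketch using the thresholds listed earlier for the languages where they are stated unambiguously (the Slavic ranges are left out); `THRESHOLDS` and `is_acceptable_root` are illustrative names, not part of the Crosstem API:

```python
# language code -> (verb threshold, threshold for every other POS)
THRESHOLDS = {
    "eng": (5, 9),
    "fra": (4, 5),
    "ita": (4, 5),
    "deu": (4, 3),
    "spa": (3, 4),
    "por": (3, 4),
}

def is_acceptable_root(pos, n_derivations, lang="eng"):
    """True if a candidate root has enough derivations for its language and POS."""
    verb_min, other_min = THRESHOLDS[lang]
    return n_derivations >= (verb_min if pos == "V" else other_min)

# Rows from the English examples table.
for word, pos, count in [("red", "N", 18), ("run", "V", 33),
                         ("destruct", "V", 6), ("hap", "N", 8),
                         ("democrat", "N", 7)]:
    print(word, "PASS" if is_acceptable_root(pos, count) else "FILTERED")
```

The same count can pass in one language and fail in another: 4 derivations is enough for a Spanish noun but not for an English one.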
Traditional stemmers fail because they use brittle suffix rules:

    # Porter stemmer rule: -ation → (remove suffix)
    "organization" → "organ"  # Lost the "ize" (overstemming)

    # Lancaster stemmer: even more aggressive
    "organization" → "org"  # Completely loses meaning

Crosstem succeeds because it uses linguistic knowledge:

    # Graph data knows: organization DERIVED_FROM organize
    "organization" → "organize"  # Preserves semantic relationship

- MorphyNet v1.0: Derivational and inflectional morphology (CC BY-SA 4.0)
- Wiktionary: Cross-lingual etymology data (CC BY-SA 3.0)
If you use this library in your research, please cite:

    @software{crosstem2025,
      title={Crosstem: Comprehensive Morphological Analysis for Python},
      author={Avinash Bhojanapalli},
      year={2025},
      url={https://github.com/droidmaximus/crosstem},
      note={A Python package for derivational stemming, inflectional analysis, and cross-lingual etymology}
    }

    @inproceedings{batsuren2021morphynet,
      title={MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology},
      author={Batsuren, Khuyagbaatar and Bella, Gábor and Giunchiglia, Fausto},
      booktitle={Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology},
      pages={39--48},
      year={2021}
    }

    @misc{wiktionary2025,
      title={Wiktionary, The Free Dictionary},
      author={{Wiktionary contributors}},
      year={2025},
      url={https://en.wiktionary.org/},
      note={Etymology data extracted from Wiktionary dumps}
    }

- Code: MIT License
- Data: CC BY-SA 4.0 (MorphyNet), CC BY-SA 3.0 (Wiktionary)