ComancheNLP

Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language

Authors: Jesus Alvarez C, Daua Karajeanes, Ashley Prado, John Ruttan, Ivory Yang, Sean O’Brien, Vasu Sharma, Kevin Zhu

Explore how we accelerate Comanche NLP by combining synthetic text generation with language identification to overcome the data scarcity typical of endangered languages.

🚀 Clone the Repo

git clone https://github.com/comanchegenerate/ComancheSynthetic.git
cd ComancheSynthetic

Datasets/: A 412-phrase Comanche-English corpus, the first such corpus for this language.
comanche_synthetic_generation.py: Generates validated synthetic Comanche text via GPT-4 few-shot prompting.
language_identification.ipynb: Language identification experiments showing that few-shot examples increase accuracy.
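
To illustrate what few-shot prompting looks like for language identification, here is a minimal sketch of how such a prompt might be assembled. It is not taken from the repository: the function name `build_few_shot_prompt`, the prompt wording, and the placeholder strings are all assumptions made for illustration, and no model call is included.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble a few-shot language-identification prompt (hypothetical format).

    Each example is a (text, language label) pair placed before the query,
    so the model sees labeled demonstrations before the unlabeled text.
    """
    lines = ["Identify the language of each text."]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Language: {label}")
    # The query is left unlabeled; the model is expected to complete the label.
    lines.append(f"Text: {query}")
    lines.append("Language:")
    return "\n".join(lines)


# Placeholder examples; in practice these would be drawn from Datasets/.
examples = [
    ("<Comanche phrase from the corpus>", "Comanche"),
    ("How are you today?", "English"),
]
prompt = build_few_shot_prompt(examples, "<text to classify>")
print(prompt)
```

The resulting string would then be sent to whichever model is under evaluation. Varying the number of entries in `examples` is one simple way to compare zero-shot and few-shot accuracy.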

Feedback and pull requests welcome!