1 | | -URIEL+: Expanding Feature Coverage and Improving Usability of URIEL |
2 | | -====== |
3 | | -URIEL+ expands upon the original URIEL knowledge base from the paper [(Littell et al., EACL 2017)](https://aclanthology.org/E17-2002/), which describes language distances through typological vectors and has been cited over 200 times. This expansion addresses previous limitations in feature coverage and usability, with a particular focus on improving support for low-resource languages. |
4 | | - |
5 | | -Key Features |
6 | | ------------ |
7 | | -1. **Improved Language Coverage:** URIEL+ integrates five additional databases, including Grambank, BDPROTO, APiCS, and eWAVE, significantly enhancing its typological feature coverage and adding new morphological data for nearly 2500 languages. |
8 | | -2. **Advanced Imputation Method:** Alongside the original kNN imputation, URIEL+ provides interfaces for MIDASpy and SoftImpute, which yield more accurate imputed values for missing data. |
9 | | -3. **Customizable Feature Selection:** Users can include or exclude specific features when calculating linguistic distances, allowing for tailored analyses depending on the use case. |
10 | | -4. **Improved Usability:** Instead of relying on precomputed distances, URIEL+ computes distances dynamically, ensuring they reflect the most current data. Each calculated distance is also accompanied by a confidence score, helping users assess the reliability of the results. |
11 | | - |
12 | | -Applications |
13 | | ------------- |
14 | | -URIEL+ has been evaluated across several downstream tasks, including performance prediction (PerfPred, ProxyLM), transfer language selection (LangRank), and typological feature-driven language analysis (LinguAlchemy), where it consistently performs on par with URIEL, if not better. |
15 | | - |
16 | | - |
17 | | - |
18 | | -Installation |
19 | | ------------- |
20 | | -The data are stored in the `npz` file format, which is larger than GitHub's file size limit, so you will have to run the `update_ALL()` function in URIEL manually. First clone the repository and run `setup.py`. |
21 | | -~~~ |
22 | | -git clone https://github.com/Masonshipton25/URIELPlusPlus |
23 | | -cd URIELPlusPlus |
24 | | -python3 setup.py install |
25 | | -~~~ |
26 | | -Important note: after cloning this repo locally, make sure to download Glottolog v5.0 from https://zenodo.org/records/10804357 and place it directly under your repository root. |
| 1 | +# [URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base](https://arxiv.org/abs/2409.18472) |
| 2 | + |
| 3 | + |
| 4 | + |
| 5 | +URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies. |
| 6 | + |
| 7 | +For more information, check out our [full paper](https://arxiv.org/abs/2409.18472). |
| 8 | + |
| 9 | +## Contents |
| 10 | + |
| 11 | ++ [Environment](#environment) |
| 12 | ++ [Setup Instruction](#setup-instruction) |
| 13 | ++ [Configuration Options Examples](#configuration-options-examples) |
| 14 | ++ [Retrieving Loaded Features Examples](#retrieving-loaded-features-examples) |
| 15 | ++ [Database Integration Examples](#database-integration-examples) |
| 16 | ++ [Imputation Examples](#imputation-examples) |
| 17 | ++ [Language Distance Calculation Examples](#language-distance-calculation-examples) |
| 18 | ++ [Citation](#citation) |
| 19 | + |
| 20 | +## Environment |
| 21 | + |
| 22 | +Requires Python 3.10.4 or higher. Dependencies are listed in `requirements.txt`. |
| 23 | + |
| 24 | +## Setup Instruction |
| 25 | + |
| 26 | ++ To get started with URIEL+: |
| 27 | + ```bash |
| 28 | + pip install urielplus |
| 29 | + ``` |
| 30 | + |
| 31 | + ```python |
| 32 | + from urielplus.urielplus import URIELPlus |
| 33 | +
| 34 | + u = URIELPlus() |
| 35 | + ``` |
| 36 | + |
| 37 | +## Configuration Options Examples |
| 38 | + |
| 39 | ++ URIEL+ offers various configurations that you can adjust: |
| 40 | + - Caching: Enable or disable caching (True or False). |
| 41 | + - Aggregation Method: Choose the method for aggregating data across sources ('U' for union, 'A' for average). |
| 42 | + - Fill Missing Data: Decide whether to fill missing data using parent language data (True or False). |
| 43 | + - Distance Metric: Specify the distance metric to be used ("angular" or "cosine"). |
| 44 | + |
| 45 | ++ Changing A Configuration: |
| 46 | + ```python |
| 47 | + u.set_{configuration}({option}) |
| 48 | + ``` |
| 49 | + |
| 50 | ++ Checking A Configuration: |
| 51 | + ```python |
| 52 | + u.get_{configuration}({option}) |
| 53 | + ``` |
| 54 | + |
| 55 | ++ Replace `{configuration}` with `cache`, `aggregation`, `fill_with_base_lang`, or `distance_metric`. |
| 56 | ++ Replace `{option}` with your desired value for the selected configuration. |
| 57 | ++ Note: the default configurations are `cache=False`, `aggregation='U'`, `fill_with_base_lang=True`, and `distance_metric="angular"`. |
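
The README does not spell out how the two `distance_metric` options are computed. As a rough illustration only, the sketch below uses the textbook definitions: cosine distance as one minus the cosine similarity, and angular distance as the angle between the vectors divided by pi. The helper names are hypothetical and are not part of URIEL+, and the package's own implementation may normalize differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def cosine_distance(a, b):
    """Textbook cosine distance: 1 - cosine similarity (assumed definition)."""
    return 1.0 - cosine_similarity(a, b)


def angular_distance(a, b):
    """Angle between the vectors, scaled to [0, 1] (assumed definition)."""
    return math.acos(cosine_similarity(a, b)) / math.pi


# Two toy binary feature vectors (made-up values, not real URIEL+ data).
lang_a = [1, 0, 1, 1]
lang_b = [1, 1, 0, 1]
print(cosine_distance(lang_a, lang_b), angular_distance(lang_a, lang_b))
```

Both measures grow as the vectors diverge, but the angular form is a proper metric on the angle itself, which may be one reason to prefer it as a default.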
| 58 | + |
| 59 | +## Retrieving Loaded Features Examples |
| 60 | + |
| 61 | ++ Retrieving A Loaded Feature: |
| 62 | + ```python |
| 63 | + u.get_{vector_type}_{feature_type}_array() |
| 64 | + ``` |
| 65 | ++ Replace `{vector_type}` with `phylogeny`, `typological`, or `geography`. |
| 66 | ++ Replace `{feature_type}` with `features`, `languages`, `data`, or `sources`. |
| 67 | + |
| 68 | ++ Example: |
| 69 | + ```python |
| 70 | + u.get_typological_languages_array() |
| 71 | + ``` |
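
To make the naming pattern above concrete, the following sketch expands the `{vector_type}` and `{feature_type}` placeholders into every getter name and, optionally, calls each one on a `URIELPlus` instance. The helpers `getter_names` and `fetch_all` are hypothetical, and the sketch assumes that every combination of the listed values exists as a method, which the README implies but does not state outright.

```python
from itertools import product

# Placeholder values listed in this section.
VECTOR_TYPES = ("phylogeny", "typological", "geography")
FEATURE_TYPES = ("features", "languages", "data", "sources")


def getter_names():
    """Expand get_{vector_type}_{feature_type}_array for all listed values."""
    return [
        f"get_{vector_type}_{feature_type}_array"
        for vector_type, feature_type in product(VECTOR_TYPES, FEATURE_TYPES)
    ]


def fetch_all(u):
    """Call each getter on `u` and collect the results by method name.

    Assumes `u` exposes every generated name as a zero-argument method.
    """
    return {name: getattr(u, name)() for name in getter_names()}


for name in getter_names():
    print(name)
```

With a loaded instance, `fetch_all(u)["get_typological_languages_array"]` would return the same value as the direct call shown in the example above.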
| 72 | + |
| 73 | +## Database Integration Examples |
| 74 | + |
| 75 | ++ Integrating One Database: |
| 76 | + ```python |
| 77 | + u.integrate_{database}() |
| 78 | + ``` |
| 79 | ++ Integrating Some Databases: |
| 80 | + ```python |
| 81 | + u.integrate_custom_databases({databases}) |
| 82 | + ``` |
| 83 | ++ Integrating All Databases: |
| 84 | + ```python |
| 85 | + u.integrate_databases() |
| 86 | + ``` |
| 87 | ++ Set Language Codes to Glottocodes: |
| 88 | + ```python |
| 89 | + u.set_glottocodes() |
| 90 | + ``` |
| 91 | ++ Reset all changes: |
| 92 | + ```python |
| 93 | + u.reset() |
| 94 | + ``` |
| 95 | + |
| 96 | ++ Replace `{database}` with `saphon`, `bdproto`, `grambank`, `apics`, or `ewave`. |
| 97 | ++ Replace `{databases}` with arguments `"UPDATED_SAPHON"`, `"BDPROTO"`, `"GRAMBANK"`, `"APICS"`, and/or `"EWAVE"` (e.g., `"UPDATED_SAPHON"`, `"BDPROTO"`, `"EWAVE"`). |
| 98 | + |
| 99 | +## Imputation Examples |
| 100 | + |
| 101 | ++ Aggregate Typological Data: |
| 102 | + ```python |
| 103 | + u.set_aggregation({aggregation}) |
| 104 | + u.aggregate() |
| 105 | + ``` |
| 106 | + |
| 107 | ++ Impute Missing Values: |
| 108 | + ```python |
| 109 | + u.{imputation_strategy}_imputation() |
| 110 | + ``` |
| 111 | + |
| 112 | ++ Replace `{aggregation}` with `'U'` (union) or `'A'` (average). |
| 113 | ++ Replace `{imputation_strategy}` with `midaspy`, `knn`, `softimpute`, or `mean`. |
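
For intuition about what the `'U'` and `'A'` options and the `mean` strategy do, here is a minimal, self-contained sketch on toy data. It assumes that union takes the maximum observed value across sources, that average takes the mean of observed values, and that mean imputation fills a gap with the column mean. These are common definitions rather than a description of URIEL+ internals, and the function names are hypothetical.

```python
def aggregate(values, method="U"):
    """Combine one feature's values from several sources.

    `values` holds one entry per source; None marks a missing value.
    'U' (union) keeps the maximum observed value, 'A' (average) the mean.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return None  # no source reports this feature
    if method == "U":
        return max(observed)
    if method == "A":
        return sum(observed) / len(observed)
    raise ValueError(f"unknown aggregation method: {method!r}")


def mean_impute(rows):
    """Fill None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not None]
        means.append(sum(observed) / len(observed) if observed else None)
    return [
        [means[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


# One feature reported by three sources (toy values).
print(aggregate([1, 0, None], "U"), aggregate([1, 0, None], "A"))

# Languages as rows, features as columns (toy values).
print(mean_impute([[1, None], [0, 1], [None, 1]]))
```

The more elaborate strategies (`knn`, `softimpute`, `midaspy`) replace the column mean with estimates that draw on similar languages or on the low-rank structure of the data.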
| 114 | + |
| 115 | +## Language Distance Calculation Examples |
| 116 | + |
| 117 | ++ Calculate a Specific Distance: |
| 118 | + ```python |
| 119 | + print(u.new_distance({distance_type}, {languages})) |
| 120 | + ``` |
| 121 | + |
| 122 | ++ Calculate Distance Using Specific Features: |
| 123 | + ```python |
| 124 | + print(u.new_custom_distance({features}, {languages}, {source})) |
| 125 | + ``` |
| 126 | + |
| 127 | ++ Retrieve Language Vectors: |
| 128 | + ```python |
| 129 | + u.get_vector({distance_type}, {languages}) |
| 130 | + ``` |
| 131 | + |
| 132 | ++ View URIEL+ Feature Coverage: |
| 133 | + ```python |
| 134 | + u.feature_coverage() |
| 135 | + ``` |
| 136 | + |
| 137 | ++ Calculate Confidence Scores for Distances: |
| 138 | + ```python |
| 139 | + print(u.confidence_score({language 1}, {language 2}, {distance_type})) |
| 140 | + ``` |
| 141 | + |
| 142 | ++ Replace `{distance_type}` with a distance type (e.g., `"featural"`) or a list of distance types (e.g., `["syntactic", "phonological"]`). It must be a single distance type when retrieving language vectors. |
| 143 | ++ Replace `{features}` with a list of features (e.g., `["F_Germanic", "S_SVO", "P_NASAL_VOWELS"]`). |
| 144 | ++ Replace `{languages}`, `{language 1}`, and `{language 2}` with language codes (e.g., `"stan1293"`, `"hind1269"`). |
| 145 | ++ Replace `{source}` with one database (e.g., `"WALS"`) or all databases (`'A'`). |
| 146 | ++ Note: the default `{source}` is all databases. |
| 147 | + |
| 148 | +## Citation |
| 149 | + |
| 150 | +<u>If you use this code for your research, please cite the following work:</u> |
| 151 | + |
| 152 | +```bibtex |
| 153 | +@article{khan2024urielplus, |
| 154 | + title={URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base}, |
| 155 | + author={Khan, Aditya and Shipton, Mason and Anugraha, David and Duan, Kaiyao and Hoang, Phuong H. and Khiu, Eric and Doğruöz, A. Seza and Lee, En-Shiun Annie}, |
| 156 | + journal={arXiv preprint arXiv:2409.18472}, |
| 157 | + year={2024} |
| 158 | +} |
| 159 | +``` |
| 160 | + |
| 161 | +If you have any questions, you can open a [GitHub Issue](https://github.com/Masonshipton25/URIELPlus/issues) or send us an [email](mailto:masonshipton25@gmail.com). |