Commit 90f3a43

Release v1.0
1 parent 636f5f9 commit 90f3a43

43 files changed

Lines changed: 3332 additions & 4627 deletions

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
# This workflow uses actions that are not certified by GitHub.
2+
# They are provided by a third-party and are governed by
3+
# separate terms of service, privacy policy, and support
4+
# documentation.
5+
6+
# GitHub recommends pinning actions to a commit SHA.
7+
# To get a newer version, you will need to update the SHA.
8+
# You can also reference a tag or branch, but the action may change without warning.
9+
10+
name: Upload Python Package
11+
12+
on:
13+
  release:
14+
    types: [published]
15+
16+
permissions:
17+
  contents: read
18+
19+
jobs:
20+
  release-build:
21+
    runs-on: ubuntu-latest
22+
23+
    steps:
24+
      - uses: actions/checkout@v4
25+
26+
      - uses: actions/setup-python@v5
27+
        with:
28+
          python-version: "3.x"
29+
30+
      - name: Build release distributions
31+
        run: |
32+
          # NOTE: put your own distribution build steps here.
33+
          python -m pip install build
34+
          python -m build
35+
36+
      - name: Upload distributions
37+
        uses: actions/upload-artifact@v4
38+
        with:
39+
          name: release-dists
40+
          path: dist/
41+
42+
  pypi-publish:
43+
    runs-on: ubuntu-latest
44+
45+
    needs:
46+
      - release-build
47+
48+
    permissions:
49+
      # IMPORTANT: this permission is mandatory for trusted publishing
50+
      id-token: write
51+
52+
    # Dedicated environments with protections for publishing are strongly recommended.
53+
    environment:
54+
      name: pypi
55+
      # OPTIONAL: uncomment and update to include your PyPI project URL in the deployment status:
56+
      url: https://pypi.org/p/urielplus/
57+
58+
    steps:
59+
      - name: Retrieve release distributions
60+
        uses: actions/download-artifact@v4
61+
        with:
62+
          name: release-dists
63+
          path: dist/
64+
65+
      - name: Publish release distributions to PyPI
66+
        uses: pypa/gh-action-pypi-publish@release/v1

MANIFEST.in

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
include urielplus/database/urielplus_csvs/*.csv
2+
include urielplus/database/original_uriel/*.npz

README.md

Lines changed: 161 additions & 26 deletions
@@ -1,26 +1,161 @@
1-
URIEL+:Expanding Feature Coverage and Improving Usability of URIEL
2-
======
3-
URIEL+ is expanded upon the original URIEL knowledge base from the paper [(Littell et al., EACL 2017)](https://aclanthology.org/E17-2002/), focusing on describing languages distances through typological vectors, which has been cited over 200+ times. This expansion addresses previous limitaions in feature coverage and usability, particularly focusing on improving support for low-resource languages.
4-
5-
Key Features
6-
-----------
7-
1. **Improved Language Coverage:** URIEL+ integrates five additional databases, including Grambank, BDPROTO, APiCS, and eWAVE, significantly enhancing its typological feature coverage and adds new orphological data for nearly 2500 languages.
8-
2. **Advanced Imputation Method:** alongside the original kNN imputation, URIEL+ provided interface for MIDASpy and SoftImpute, providing more accurate imputed data for missing values.
9-
3. **Customizable Fetaure Selection:** users can choose or exclude specific features when calculating linguistic distances, allowing for tailed analyses depending on your usecase.
10-
4. **Improved Usability:** Instead do precomputed distances, URIEL+ computes distances dynamically ensuring they reflect the most current data. Each calculated distance is also accompnaied by a confidence score, helping users assess the reliability of the results.
11-
12-
Applications
13-
------------
14-
URIEL+ has evaluated across several downstream tasks, including performance prdiction(PerfPred, ProxyLM), transfer language selection (LangRank), and typological feature-driven language analysis (LinguAlchemy), where it always demonstrates a performance on par to URIEL, if not better.
15-
16-
17-
18-
Installation
19-
------------
20-
The data are store in an `npz` file format, which comes out to be larger than github's size limit. Hence you will have to manually run the `update_ALL()` function in URIEL to First clone the repository and run `setup.py`.
21-
~~~
22-
git clone https://github.com/Masonshipton25/URIELPlusPlus
23-
cd URIELPlusPlus
24-
python3 setup.py install
25-
~~~
26-
Important Note: Upon cloning this repo locally, then immediately under your repository root, make sure to download glottolog v5.0 from https://zenodo.org/records/10804357
1+
# [URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base](https://arxiv.org/abs/2409.18472)
2+
3+
![knowledge base for natural language processing](./logo.png)
4+
5+
URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
6+
7+
For more information, see our [full paper](https://arxiv.org/abs/2409.18472).
8+
9+
## Contents
10+
11+
+ [Environment](#environment)
12+
+ [Setup Instruction](#setup-instruction)
13+
+ [Configuration Options Examples](#configuration-options-examples)
14+
+ [Retrieving Loaded Features Examples](#retrieving-loaded-features-examples)
15+
+ [Database Integration Examples](#database-integration-examples)
16+
+ [Imputation Examples](#imputation-examples)
17+
+ [Language Distance Calculation Examples](#language-distance-calculation-examples)
18+
+ [Citation](#citation)
19+
20+
## Environment
21+
22+
Requires Python 3.10.4 or higher. Dependencies are listed in `requirements.txt`.
23+
24+
## Setup Instruction
25+
26+
+ To get started with URIEL+:
27+
```bash
28+
pip install urielplus
29+
```
30+
31+
```python
32+
from urielplus.urielplus import URIELPlus
33+
34+
u = URIELPlus()
35+
```
36+
37+
## Configuration Options Examples
38+
39+
+ URIEL+ offers various configurations that you can adjust:
40+
- Caching: Enable or disable caching (True or False).
41+
- Aggregation Method: Choose the method for aggregating data across sources ('U' for union, 'A' for average).
42+
- Fill Missing Data: Decide whether to fill missing data using parent language data (True or False).
43+
- Distance Metric: Specify the distance metric to be used ("angular" or "cosine").
44+
45+
+ Changing A Configuration:
46+
```python
47+
u.set_{configuration}({option})
48+
```
49+
50+
+ Checking A Configuration:
51+
```python
52+
u.get_{configuration}()
53+
```
54+
55+
+ Replace `{configuration}` with `cache`, `aggregation`, `fill_with_base_lang`, or `distance_metric`.
56+
+ Replace `{option}` with your desired value for the selected configuration.
57+
+ Note: the default configurations are `cache=False`, `aggregation='U'`, `fill_with_base_lang=True`, and `distance_metric="angular"`.
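
The `fill_with_base_lang` option is described above only as filling missing data from a parent language. As a rough illustration of that idea (not URIEL+'s actual implementation), the sketch below copies a parent's known values into a child's missing slots. The function name `fill_from_parent`, the example vectors, and the `-1.0` missing-value marker are all assumptions made for this example.

```python
MISSING = -1.0  # assumed marker for "no data"; not taken from this README


def fill_from_parent(child, parent, missing=MISSING):
    """Return a copy of `child` where missing entries take the parent's value."""
    if len(child) != len(parent):
        raise ValueError("child and parent must have the same number of features")
    return [p if c == missing else c for c, p in zip(child, parent)]


# Hypothetical feature vectors for a dialect and its parent language.
dialect = [1.0, MISSING, 0.0, MISSING]
parent = [0.0, 1.0, 1.0, MISSING]

filled = fill_from_parent(dialect, parent)
print(filled)
```

Values already known for the dialect are never overwritten, and a slot stays missing when the parent has no data for it either.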
58+
59+
## Retrieving Loaded Features Examples
60+
61+
+ Retrieving A Loaded Feature:
62+
```python
63+
u.get_{vector_type}_{feature_type}_array()
64+
```
65+
+ Replace `{vector_type}` with `phylogeny`, `typological`, or `geography`.
66+
+ Replace `{feature_type}` with `features`, `languages`, `data`, or `sources`.
67+
68+
+ Example:
69+
```python
70+
u.get_typological_languages_array()
71+
```
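
The `{vector_type}` and `{feature_type}` placeholders above combine into twelve accessor names. The sketch below only builds those names from the template; it does not import or call URIEL+, and the helper `accessor_names` is a hypothetical name used for illustration.

```python
from itertools import product

# Placeholder values as listed in this README.
VECTOR_TYPES = ("phylogeny", "typological", "geography")
FEATURE_TYPES = ("features", "languages", "data", "sources")


def accessor_names():
    """Build every get_{vector_type}_{feature_type}_array name from the template."""
    return [
        f"get_{vector_type}_{feature_type}_array"
        for vector_type, feature_type in product(VECTOR_TYPES, FEATURE_TYPES)
    ]


names = accessor_names()
for name in names:
    print(name)
```

Assuming each of these methods exists on a `URIELPlus` instance `u`, as the template suggests, `getattr(u, name)()` would call one by name.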
72+
73+
## Database Integration Examples
74+
75+
+ Integrating One Database:
76+
```python
77+
u.integrate_{database}()
78+
```
79+
+ Integrating Some Databases:
80+
```python
81+
u.integrate_custom_databases({databases})
82+
```
83+
+ Integrating All Databases:
84+
```python
85+
u.integrate_databases()
86+
```
87+
+ Set Language Codes to Glottocodes:
88+
```python
89+
u.set_glottocodes()
90+
```
91+
+ Reset all changes:
92+
```python
93+
u.reset()
94+
```
95+
96+
+ Replace `{database}` with `saphon`, `bdproto`, `grambank`, `apics`, or `ewave`.
97+
+ Replace `{databases}` with arguments `"UPDATED_SAPHON"`, `"BDPROTO"`, `"GRAMBANK"`, `"APICS"`, and/or `"EWAVE"` (e.g., `"UPDATED_SAPHON"`, `"BDPROTO"`, `"EWAVE"`).
98+
99+
## Imputation Examples
100+
101+
+ Aggregate Typological Data:
102+
```python
103+
u.set_aggregation({aggregation})
104+
u.aggregate()
105+
```
106+
107+
+ Impute Missing Values:
108+
```python
109+
u.{imputation_strategy}_imputation()
110+
```
111+
112+
+ Replace `{aggregation}` with `'U'` (union) or `'A'` (average).
113+
+ Replace `{imputation_strategy}` with `midaspy`, `knn`, `softimpute`, or `mean`.
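
To make the `'U'` (union) and `'A'` (average) labels and the `mean` strategy more concrete, here is a minimal sketch of what such operations could look like on plain Python lists. It is an illustration under stated assumptions, not the code URIEL+ runs: the function names are hypothetical, features are assumed to be 0/1 values, and `-1.0` is assumed to mark missing data.

```python
MISSING = -1.0  # assumed marker for "no data"; not taken from this README


def aggregate_union(values, missing=MISSING):
    """Union-style sketch: the maximum over sources that report a value.

    For 0/1 features this is 1.0 if any source reports 1.0.
    """
    known = [v for v in values if v != missing]
    return max(known) if known else missing


def aggregate_average(values, missing=MISSING):
    """Average-style sketch: the mean over sources that report a value."""
    known = [v for v in values if v != missing]
    return sum(known) / len(known) if known else missing


def mean_impute(column, missing=MISSING):
    """Replace missing entries in one feature column with the mean of known entries."""
    known = [v for v in column if v != missing]
    if not known:
        return list(column)  # nothing to estimate from; leave the column as-is
    mean = sum(known) / len(known)
    return [mean if v == missing else v for v in column]


# One feature for one language, as reported by three hypothetical sources.
per_source = [1.0, 0.0, MISSING]
print(aggregate_union(per_source), aggregate_average(per_source))

# One feature across four hypothetical languages, after aggregation.
print(mean_impute([1.0, MISSING, 0.0, 1.0]))
```

The `midaspy`, `knn`, and `softimpute` strategies estimate missing values from other features or languages rather than from a single column mean, so this sketch covers only the simplest case.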
114+
115+
## Language Distance Calculation Examples
116+
117+
+ Calculate a Specific Distance:
118+
```python
119+
print(u.new_distance({distance_type}, {languages}))
120+
```
121+
122+
+ Calculate Distance Using Specific Features:
123+
```python
124+
print(u.new_custom_distance({features}, {languages}, {source}))
125+
```
126+
127+
+ Retrieve Language Vectors:
128+
```python
129+
u.get_vector({distance_type}, {languages})
130+
```
131+
132+
+ View URIEL+ Feature Coverage:
133+
```python
134+
u.feature_coverage()
135+
```
136+
137+
+ Calculate Confidence Scores for Distances:
138+
```python
139+
print(u.confidence_score({language 1}, {language 2}, {distance_type}))
140+
```
141+
142+
+ Replace `{distance_type}` with a distance type (e.g., `"featural"`) or a list of distance types (e.g., `["syntactic", "phonological"]`). When retrieving language vectors, it must be a single distance type.
143+
+ Replace `{features}` with a list of features (e.g., `["F_Germanic", "S_SVO", "P_NASAL_VOWELS"]`).
144+
+ Replace `{languages}`, `{language 1}`, and `{language 2}` with language codes (e.g., `"stan1293"`, `"hind1269"`).
145+
+ Replace `{source}` with one database (e.g., `"WALS"`) or all databases (`'A'`).
146+
+ Note: the default `{source}` is all databases.
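
The `"angular"` and `"cosine"` distance metrics mentioned under Configuration Options can be illustrated on two small vectors. This is a sketch using common textbook definitions (cosine distance as one minus cosine similarity, angular distance as the angle divided by pi); the exact normalization URIEL+ applies is not stated in this README, so treat the formulas and the example vectors as assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Clamp to guard against floating-point drift just outside [-1, 1].
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def angular_distance(a, b):
    """Angle between the vectors divided by pi, so the result lies in [0, 1]."""
    return math.acos(cosine_similarity(a, b)) / math.pi


# Hypothetical binary feature vectors for two languages.
lang_a = [1.0, 0.0, 1.0, 1.0]
lang_b = [1.0, 1.0, 0.0, 1.0]

print(cosine_distance(lang_a, lang_b))
print(angular_distance(lang_a, lang_b))
```

Both metrics rank language pairs in the same order because each is a monotonic function of the cosine similarity. Under these definitions, the angular form is a proper metric that satisfies the triangle inequality.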
147+
148+
## Citation
149+
150+
<u>If you use this code for your research, please cite the following work:</u>
151+
152+
```bibtex
153+
@article{khan2024urielplus,
154+
title={URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base},
155+
author={Khan, Aditya and Shipton, Mason and Anugraha, David and Duan, Kaiyao and Hoang, Phuong H. and Khiu, Eric and Doğruöz, A. Seza and Lee, En-Shiun Annie},
156+
journal={arXiv preprint arXiv:2409.18472},
157+
year={2024}
158+
}
159+
```
160+
161+
If you have any questions, you can open a [GitHub Issue](https://github.com/Masonshipton25/URIELPlus/issues) or send us an [email](mailto:masonshipton25@gmail.com).
