Aligning Embedding Spaces Across Languages to Identify Word Level Equivalents in the Context of Concreteness and Emotion
Poster | OpenReview | BibTeX
Josephine Kaminaga *, Jingyi Wu *, Daniel Yeung *, Simon Todd
*equal contribution; authors listed in alphabetical order
- Clone the repo
git clone https://github.com/Jenniebn/wordLevelTrans.git
cd wordLevelTrans
- Environment
conda create -n wordLevelTrans python=3.10 -y
conda activate wordLevelTrans
- Install dependencies
pip install -r requirements.txt
- Download models and data
pip install --upgrade gdown && bash ./download.sh
This downloads all the preprocessed data and the model checkpoint that produce the paper's results. Note that some of the files in data/ are preprocessed from the pretrained Tencent AI Lab embeddings, which are no longer publicly available; we therefore provide them for download here. We also provide a version with infrequent words removed, which reduces the file size from approximately 1.7 GB to 80 MB.
mkdir -p tencent
cd tencent
Go to the Google Drive link and download the files into this directory.
cd ..
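If you want to reproduce the trimmed embedding file yourself, the sketch below keeps only the most frequent vectors. It assumes the standard word2vec text format (a `<vocab_size> <dim>` header followed by one `word v1 ... vdim` line per word) and that the file is ordered by frequency; the `keep` count is illustrative, not the exact cutoff used for the 80 MB file.

```python
def trim_embeddings(src_path, dst_path, keep=200_000):
    """Keep only the first `keep` vectors of a word2vec-format text file.

    Assumes the standard format: a header line "<vocab> <dim>", then one
    "<word> <v1> ... <vdim>" line per word, ordered most-frequent-first
    (an assumption about the embedding file, not guaranteed by this repo).
    """
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        vocab, dim = src.readline().split()
        keep = min(keep, int(vocab))
        dst.write(f"{keep} {dim}\n")  # rewrite header with the new vocab size
        for _ in range(keep):
            dst.write(src.readline())
```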
- Run code on the provided test set
python code/test.py --conf_path confs/test.yaml
Find the output in a timestamped directory under ./output/. To translate any English word in the golden set, run the following:
python code/test.py --conf_path confs/test.yaml --en_word cantaloupe
Note that the command will throw an error if the provided English word is not in data/golden_set.json. To see the list of translatable English words, format the JSON file in your editor (e.g., in VS Code: SHIFT + OPTION + F on macOS, ALT + SHIFT + F on Windows).
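Instead of formatting the JSON by hand, you can also list the translatable words programmatically. A minimal sketch, assuming the golden set's top-level JSON keys are the English words (inspect the file first if its structure differs):

```python
import json

def golden_words(path="data/golden_set.json"):
    """Return the English words in the golden set, sorted.

    Assumes the file's top-level JSON keys are the English words
    (an assumption; check the actual file structure).
    """
    with open(path, encoding="utf-8") as f:
        return sorted(json.load(f))

# Example: print(golden_words())  # every word test.py will accept via --en_word
```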
- To see descriptions for the available flags, run
python code/train.py --conf_path confs/train.yaml --help_config
- code/: the Python code to train and test the English-to-Chinese encoder-decoder model.
- confs/: configuration files for training and translation tasks.
- dictionary/: all the scripts needed to scrape and process the four online dictionaries (Cambridge, MDBG, Yabla, and Facebook AI) we used.
  - dictionary/golden_set/: notebooks used to combine the four individual translation dictionaries into the full, final dictionary used for model training, along with that translation dictionary itself.
  - dictionary/**/* Scrapers/: directories containing the notebooks used to scrape the online dictionaries.
  - dictionary/**/JSON Data/: the unique English-Mandarin translations scraped from each dictionary.
- notebooks/: the original development Google Colab notebooks, which can run the same pipeline as the code in code/; note that you will need to set up the folder structure yourself.
- static/: static files.
Expect the scripts in dictionary/ to take around 8 hours to fully execute, as each requests around 100k webpages from the various online dictionaries.
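Most of that runtime comes from pacing the requests rather than processing them; if you adapt the scrapers, keep a delay between requests so as not to overload the dictionary sites. A generic stdlib-only sketch (the `delay` value and iteration scheme are illustrative, not taken from this repo's notebooks):

```python
import time
import urllib.request

def fetch_all(urls, delay=0.3):
    """Fetch pages sequentially with a fixed delay between requests.

    At ~0.3 s per request, 100k pages take roughly 8 hours, matching
    the runtime noted above. `delay` is an illustrative value.
    """
    pages = []
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            pages.append(resp.read())
        time.sleep(delay)  # throttle to be polite to the remote server
    return pages
```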
@inproceedings{kaminaga2025aligning,
title={Aligning Embedding Spaces Across Languages to Identify Word Level Equivalents in the Context of Concreteness and Emotion},
author={Josephine Kaminaga and Jennie Wu and Daniel Yeung and Simon Todd},
booktitle={2025 Meeting of the Society for Computation in Linguistics},
year={2025},
url={https://openreview.net/forum?id=pFJXNDwxrQ}
}
We'd like to thank Professors Simon Todd and Mary Bucholtz for their insightful guidance.
This repository builds on the relative representations (relreps) code from Moschella et al. and the Tencent AI Lab embeddings.
This work is funded by UCSB URCA Grant FR-2023-918.
