GitHub - rudranshtripathi9/Text-Normalization: https://www.kaggle.com/c/text-normalization-challenge-english-language

Dataset

en_train.csv

sentence_id, token_id, class, before, after

en_test.csv

sentence_id, token_id, before

en_sample_submission.csv

id, after
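
A minimal sketch of reading files with these columns using Python's standard csv module. The sample rows are made up for illustration, and the helper names are hypothetical. It also assumes that the submission id is the sentence_id and token_id joined by an underscore, which the column lists above do not state.

```python
import csv
import io

# Made-up rows in the en_train.csv layout (illustrative only).
TRAIN_SAMPLE = '''"sentence_id","token_id","class","before","after"
0,0,"PLAIN","Brillantaisia","Brillantaisia"
0,1,"PLAIN","is","is"
0,2,"DATE","2006","two thousand six"
'''

def read_rows(handle):
    """Return every CSV row as a dict keyed by the header names."""
    return list(csv.DictReader(handle))

def submission_id(row):
    # Assumption: the submission id is "<sentence_id>_<token_id>".
    return f"{row['sentence_id']}_{row['token_id']}"

rows = read_rows(io.StringIO(TRAIN_SAMPLE))
ids = [submission_id(r) for r in rows]
```

For the real files, the io.StringIO object would be replaced by an opened file handle, for example open("en_train.csv", newline="", encoding="utf-8").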

EDA

EDA: exploratory data analysis

https://www.kaggle.com/headsortails/watch-your-language-update-feature-engineering

Baseline

Change nothing: 0.9231

normalizer{0, 0_}

Change to the most frequent pattern in training set: 0.9867

normalizer1
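
The two baselines above can be sketched roughly as follows. This is not the repository's normalizer1 code, which is not shown here; the function names and the tiny training list are hypothetical.

```python
from collections import Counter, defaultdict

def fit_most_frequent(pairs):
    """Map each `before` string to its most frequent `after` in training."""
    counts = defaultdict(Counter)
    for before, after in pairs:
        counts[before][after] += 1
    return {b: c.most_common(1)[0][0] for b, c in counts.items()}

def normalize(token, table):
    # Tokens never seen in training fall back to the "change nothing" baseline.
    return table.get(token, token)

# Made-up training pairs (before, after), for illustration only.
train_pairs = [("Dr.", "doctor"), ("Dr.", "doctor"), ("Dr.", "drive"), ("2", "two")]
table = fit_most_frequent(train_pairs)
```

With this table, "Dr." maps to its majority form and an unseen token such as "cat" is returned unchanged.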

More data && num2words && measure rules:

normalizer2
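
A measure rule of this kind might look like the sketch below. Everything in it is an assumption: the unit table is hypothetical, the regular expression only covers one- or two-digit integers, and the small spell function stands in for num2words so that the example runs with the standard library alone.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

# Hypothetical unit table (singular, plural); a real one would be far longer.
UNITS = {"km": ("kilometer", "kilometers"),
         "kg": ("kilogram", "kilograms"),
         "in": ("inch", "inches")}

def spell(n):
    """Spell 0..99 in words (a stand-in for num2words)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

MEASURE = re.compile(r"^(\d{1,2})\s?([A-Za-z]+)$")

def normalize_measure(token):
    """Return the spoken form of e.g. '5 km', or None if no rule applies."""
    m = MEASURE.match(token)
    if not m or m.group(2) not in UNITS:
        return None
    n = int(m.group(1))
    singular, plural = UNITS[m.group(2)]
    return f"{spell(n)} {singular if n == 1 else plural}"
```

Returning None when no rule matches lets the caller fall back to the dictionary lookup.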

decimal, digit, money rules:

Actually this is of no use, since only the 4 inches instance did not appear in the training set.

Method

normalizer

Big data && num2words && measure rules: 0.9954

a_letter

$ grep dot baseline_ext_en.csv | wc -l
934

Change all a_letter to a: 0.9957
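
As a rough illustration of this post-processing step, assuming a_letter appears as a whole space-separated word in the predicted output (the function name is hypothetical):

```python
def strip_letter_marker(after):
    """Replace the 'a_letter' marker with a plain 'a', word by word."""
    return " ".join("a" if word == "a_letter" else word
                    for word in after.split(" "))
```

Splitting on words keeps strings that merely contain the marker as a substring untouched.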

ambiguous

LI: fifty one, fifty first, the fifty first, l i

Build another dictionary with (before, class) as the key: 0.9978
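
A minimal sketch of such a two-level lookup. The rows, the pairing of "LI" with particular classes, and the function names are made up for illustration. Since en_test.csv has no class column, the cls argument would have to come from a separate classifier.

```python
from collections import Counter, defaultdict

def most_common_value(counters):
    """Reduce each Counter to its single most frequent key."""
    return {key: c.most_common(1)[0][0] for key, c in counters.items()}

def fit_tables(rows):
    """Build lookups keyed by (before, class) and by before alone."""
    by_pair, by_token = defaultdict(Counter), defaultdict(Counter)
    for cls, before, after in rows:
        by_pair[(before, cls)][after] += 1
        by_token[before][after] += 1
    return most_common_value(by_pair), most_common_value(by_token)

def normalize(before, cls, pair_table, token_table):
    # Prefer the class-specific entry, then the token-only entry, then no change.
    if (before, cls) in pair_table:
        return pair_table[(before, cls)]
    return token_table.get(before, before)

# Made-up rows (class, before, after) for the ambiguous token "LI".
rows = [("CARDINAL", "LI", "fifty one"), ("ORDINAL", "LI", "fifty first"),
        ("LETTERS", "LI", "l i"), ("LETTERS", "LI", "l i")]
pair_table, token_table = fit_tables(rows)
```

The class resolves the ambiguity when it is known, and the token-only table keeps the earlier most-frequent behaviour as a fallback.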

test_2

0.9924, rank 44

Rank 21 Solution

https://www.kaggle.com/c/text-normalization-challenge-english-language/discussion/43901

Use XGBoost with context label data to predict the class of each test token.
For some classes, apply a customized normalization function.
Use XGBoost to resolve binary ambiguous cases, such as - and :.
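
The features used in that solution are not described here, so the following is only a guess at what context features for a class predictor could look like: a window of neighbouring tokens plus a coarse character shape. Both helpers are hypothetical, and the XGBoost training step itself is left out.

```python
def context_features(tokens, i, window=2, pad="<PAD>"):
    """Return the token at position i with `window` neighbours on each side."""
    padded = [pad] * window + list(tokens) + [pad] * window
    center = i + window
    return padded[center - window: center + window + 1]

def token_shape(token):
    """Collapse characters to a coarse shape, e.g. 'Dr.' -> 'Aa.'."""
    shape = []
    for ch in token:
        if ch.isupper():
            kind = "A"
        elif ch.islower():
            kind = "a"
        elif ch.isdigit():
            kind = "9"
        else:
            kind = ch
        # Merge runs of the same kind so that '2006' becomes '9'.
        if not shape or shape[-1] != kind:
            shape.append(kind)
    return "".join(shape)
```

These strings would still need to be encoded as numbers (for example as integer ids) before being passed to a gradient-boosted tree model.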

Rank 19 Solution

https://www.kaggle.com/c/text-normalization-challenge-english-language/discussion/44049

Reclassing: 40 classes
Sub-tokens: -17.5 -> - 17 . 5
Classifier: LSTM with token index, sub-token type indexes, sub-token length indexes, token w2v embedding.
Limiting output types: by adding a large positive constant to valid classes.
Ensemble: two models, one used for the "same" (unchanged) replacement class, with all other classes taken from the second model.
20% of the training data held out for validation. Adam optimizer. Cross-entropy loss. Batch size 64. 10 epochs. 6 hours on a GTX 1060. The public dataset is used.
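
Two of the steps above, sub-token splitting and limiting the output types, can be sketched in plain Python. The regular expression and the function names are assumptions chosen to reproduce the -17.5 example, not the solution's actual code.

```python
import re

# Digit runs, letter runs, or any other single non-space character.
SUBTOKEN = re.compile(r"\d+|[A-Za-z]+|\S")

def split_subtokens(token):
    """Split a token into digit runs, letter runs and single symbols."""
    return SUBTOKEN.findall(token)

def mask_logits(logits, valid, bonus=1e6):
    """Add a large constant to the valid classes so argmax picks one of them."""
    return [x + bonus if i in valid else x for i, x in enumerate(logits)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Class 1 has the highest raw score but is not valid for this token.
logits = [2.0, 5.0, 1.0]
best = argmax(mask_logits(logits, valid={0, 2}))
```

Adding a constant to every valid class preserves their relative order, so the prediction is the best-scoring class among the valid ones.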

Rank 4 Solution

https://www.kaggle.com/c/text-normalization-challenge-english-language/discussion/43963

Statistical approach: possible transformations for each word with context. Plain text and common transformations.
Pattern based approach: regular expression. Dates, times, numbers, phones, URLs.
ML approach: several LightGBM models for decoding on ambiguous cases, mostly binary decisions.
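
The pattern-based stage could be approximated by a small router like the one below. The patterns are hypothetical and far simpler than a competition-grade rule set; they only detect which kind of token is present and do not produce the spoken form.

```python
import re

# Hypothetical patterns, tried in order; the first match wins.
PATTERNS = [
    ("URL", re.compile(r"^(https?://|www\.)\S+$", re.I)),
    ("TIME", re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$")),
    ("DATE", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("NUMBER", re.compile(r"^-?\d+(\.\d+)?$")),
    ("PHONE", re.compile(r"^\+?\d[\d\s()-]{6,}\d$")),
]

def match_pattern(token):
    """Return the name of the first matching pattern, or None."""
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return None
```

Tokens that match no pattern would fall through to the statistical lookup or to the models used for ambiguous cases.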

