Text-Normalization/README.md at master · rudranshtripathi9/Text-Normalization

Dataset

en_train.csv

sentence_id, token_id, class, before, after

en_test.csv

sentence_id, token_id, before

en_sample_submission.csv

id, after
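A minimal sketch of writing a submission from test rows. It assumes that `id` is `sentence_id` and `token_id` joined with an underscore; that format is an assumption, not something stated in this README. `write_submission` and the toy rows are hypothetical.

```python
import csv
import io


def write_submission(test_rows, normalize, out):
    """Write (id, after) rows for every (sentence_id, token_id, before) row."""
    writer = csv.writer(out)
    writer.writerow(["id", "after"])
    for sentence_id, token_id, before in test_rows:
        # Assumption: id is "<sentence_id>_<token_id>".
        writer.writerow([f"{sentence_id}_{token_id}", normalize(before)])


# Toy rows invented for illustration (not taken from en_test.csv).
toy_test = [(0, 0, "Another"), (0, 1, "religious"), (0, 2, ".")]
buffer = io.StringIO()
write_submission(toy_test, lambda before: before, buffer)
print(buffer.getvalue())
```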

EDA

EDA: exploratory data analysis

https://www.kaggle.com/headsortails/watch-your-language-update-feature-engineering

Baseline

Change nothing: 0.9231
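A minimal sketch of the "change nothing" baseline and how its accuracy can be measured. The function names and toy rows are invented for illustration; the 0.9231 score above comes from the real data, not from this snippet.

```python
def identity_normalize(before):
    """Return the token unchanged (the 'change nothing' baseline)."""
    return before


def accuracy(rows, normalize):
    """Fraction of (before, after) pairs whose prediction matches exactly."""
    rows = list(rows)
    correct = sum(1 for before, after in rows if normalize(before) == after)
    return correct / len(rows)


# Toy rows invented for illustration (not taken from en_train.csv).
toy_rows = [
    ("the", "the"),
    ("cat", "cat"),
    ("2", "two"),
    (".", "."),
]
print(accuracy(toy_rows, identity_normalize))  # 0.75
```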

normalizer{0, 0_}

Change to the most frequent pattern in training set: 0.9867
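A sketch of the most-frequent-pattern lookup, assuming the training rows are available as (before, after) pairs. `build_most_frequent` and `normalize` are hypothetical names, and unseen tokens are assumed to fall back to being left unchanged.

```python
from collections import Counter, defaultdict


def build_most_frequent(train_rows):
    """Map each 'before' token to its most frequent 'after' in training."""
    counts = defaultdict(Counter)
    for before, after in train_rows:
        counts[before][after] += 1
    return {before: c.most_common(1)[0][0] for before, c in counts.items()}


def normalize(before, table):
    """Look the token up; unseen tokens are returned unchanged."""
    return table.get(before, before)


# Toy rows invented for illustration (not taken from en_train.csv).
toy_train = [
    ("2", "two"), ("2", "two"), ("2", "second"),
    ("Dr.", "doctor"), ("Dr.", "drive"), ("Dr.", "doctor"),
    ("cat", "cat"),
]
table = build_most_frequent(toy_train)
print(normalize("2", table), normalize("Dr.", table), normalize("dog", table))
```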

normalizer1

More data && num2words && measure rules:
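A sketch of what a measure rule could look like. To stay self-contained it uses a tiny hand-written speller for 0..999 as a stand-in for num2words; the unit table, the regular expression, and the output wording are assumptions, not the rules used in this repository.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_cardinal(n):
    """Spell 0..999 in words; a small stand-in for num2words."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head if rest == 0 else head + " " + spell_cardinal(rest)


# Hypothetical unit table: abbreviation -> (singular, plural).
UNITS = {
    "km": ("kilometer", "kilometers"),
    "kg": ("kilogram", "kilograms"),
    "in": ("inch", "inches"),
}
MEASURE_RE = re.compile(r"^(\d{1,3})\s*([A-Za-z%]+)$")


def normalize_measure(before):
    """Return the spoken form of '<number><unit>', or None if no rule applies."""
    match = MEASURE_RE.match(before)
    if match is None or match.group(2) not in UNITS:
        return None
    number = int(match.group(1))
    singular, plural = UNITS[match.group(2)]
    return f"{spell_cardinal(number)} {singular if number == 1 else plural}"


print(normalize_measure("25kg"))
```

Returning None when no rule applies lets a caller fall back to the dictionary lookup.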

normalizer2

decimal, digit, money rules:

In practice this is of no use, since only the "4 inches" instance did not appear in the training set.

Method

normalizer

Big data && num2words && measure rules: 0.9954

a_letter

$ grep dot baseline_ext_en.csv | wc -l
934

Change all a_letter to a: 0.9957

ambiguous

LI: fifty one, fifty first, the fifty first, l i

Build another dictionary keyed by (before, class): 0.9978
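A sketch of the (before, class) dictionary, with a fallback to a before-only table when the pair was never seen. It assumes a class is available for each token at prediction time; the class labels attached to the LI examples are assumptions for illustration, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def most_frequent(counters):
    """Reduce {key: Counter} to {key: most frequent value}."""
    return {key: c.most_common(1)[0][0] for key, c in counters.items()}


def build_tables(train_rows):
    """Build a (before, class) table and a before-only fallback table."""
    by_pair = defaultdict(Counter)
    by_token = defaultdict(Counter)
    for token_class, before, after in train_rows:
        by_pair[(before, token_class)][after] += 1
        by_token[before][after] += 1
    return most_frequent(by_pair), most_frequent(by_token)


def normalize(before, token_class, pair_table, token_table):
    """Prefer the class-specific entry, then the token entry, then no change."""
    if (before, token_class) in pair_table:
        return pair_table[(before, token_class)]
    return token_table.get(before, before)


# Toy rows invented for illustration; class labels here are assumptions.
toy_train = [
    ("CARDINAL", "LI", "fifty one"),
    ("ORDINAL", "LI", "fifty first"),
    ("LETTERS", "LI", "l i"),
    ("LETTERS", "LI", "l i"),
    ("PLAIN", "the", "the"),
]
pair_table, token_table = build_tables(toy_train)
print(normalize("LI", "CARDINAL", pair_table, token_table))
print(normalize("LI", "LETTERS", pair_table, token_table))
```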

test_2

0.9924, rank 44

Rank 21 Solution

https://www.kaggle.com/c/text-normalization-challenge-english-language/discussion/43901

Use XGBoost with context label data to predict the class of each test token.
For some classes, apply a customized normalization function.
Use XGBoost to resolve binary ambiguous cases, such as - and :.
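The linked discussion has the actual feature set. As a rough illustration only, context features for a token classifier could be built as below; the feature names, the window size, and the padding token are all assumptions, and the model training step is left out.

```python
PAD = "<pad>"  # hypothetical padding token for sentence boundaries


def token_features(token):
    """A few coarse surface features of a single token."""
    return {
        "length": len(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_upper": token.isupper(),
        "is_punct": all(not ch.isalnum() for ch in token),
    }


def context_rows(sentence, window=1):
    """One feature row per token, including its neighbours within the window."""
    padded = [PAD] * window + list(sentence) + [PAD] * window
    rows = []
    for i in range(window, len(padded) - window):
        row = {}
        for offset in range(-window, window + 1):
            for name, value in token_features(padded[i + offset]).items():
                # Keys look like "has_digit@-1", "has_digit@+0", "has_digit@+1".
                row[f"{name}@{offset:+d}"] = value
        rows.append(row)
    return rows


# "-" between two numbers: the neighbours carry the disambiguating signal.
rows = context_rows(["May", "5", "-", "7"])
print(rows[2])
```

Rows like these could then be converted to a numeric matrix and passed to a gradient-boosted classifier.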

Rank 19 Solution

https://www.kaggle.com/c/text-normalization-challenge-english-language/discussion/44049

Reclassing: 40 classes
Sub-tokens: -17.5 -> - 17 . 5
Classifier: LSTM with token index, sub-token type indexes, sub-token length indexes, token w2v embedding.
Limiting output types: by adding a large positive constant to valid classes.
Ensemble: two models. One handles the "same" class (token left unchanged); all other classes come from the second model.
20% of the training data is held out for validation. Adam optimizer, cross-entropy loss, batch size 64, 10 epochs, 6 hours on a GTX 1060. A public dataset is used.
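Two of the steps above, sub-token splitting and limiting output types, can be sketched without a deep learning framework. The regular expression, the function names, and the bonus value are assumptions; how the original solution decides which classes are valid for a token is not described here.

```python
import re

# Runs of digits, runs of letters, or any single other non-space character.
SUB_TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\S")


def split_sub_tokens(token):
    """Split a token into sub-tokens, e.g. '-17.5' -> ['-', '17', '.', '5']."""
    return SUB_TOKEN_RE.findall(token)


def limited_argmax(scores, valid_classes, bonus=1e6):
    """Pick the best class after adding a large constant to the valid ones."""
    adjusted = [
        score + bonus if index in valid_classes else score
        for index, score in enumerate(scores)
    ]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


print(split_sub_tokens("-17.5"))

# Class 1 has the highest raw score, but only classes 0 and 3 are allowed.
scores = [2.0, 5.0, 1.0, 3.0]
print(limited_argmax(scores, valid_classes={0, 3}))
```

Adding a constant rather than masking invalid classes keeps the ranking among the valid classes unchanged.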

Rank 4 Solution

https://www.kaggle.com/c/text-normalization-challenge-english-language/discussion/43963

Statistical approach: possible transformations for each word with context. Plain text and common transformations.
Pattern based approach: regular expression. Dates, times, numbers, phones, URLs.
ML approach: several LightGBM models for decoding on ambiguous cases, mostly binary decisions.
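As an illustration of the pattern-based step, a token can be routed by the first regular expression it matches. The patterns below are simplified assumptions, not the expressions used in the Rank 4 solution, and they only classify the token; verbalizing it is a separate step.

```python
import re

# Order matters: more specific patterns come before more general ones.
PATTERNS = [
    ("URL", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
    ("DATE", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("TIME", re.compile(r"^\d{1,2}:\d{2}$")),
    ("NUMBER", re.compile(r"^-?\d+(\.\d+)?$")),
    ("PHONE", re.compile(r"^\+?\d[\d\- ]{6,}\d$")),
]


def match_pattern(token):
    """Return the name of the first matching pattern, or None."""
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return None


for token in ["2017-11-20", "12:30", "-17.5", "555-123-4567", "www.example.com", "cat"]:
    print(token, match_pattern(token))
```

Tokens that match no pattern (None) would fall through to the statistical lookup or the ML models.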
