Training set columns: sentence_id, token_id, class, before, after
Test set columns: sentence_id, token_id, before
Submission columns: id, after
EDA: exploratory data analysis
https://www.kaggle.com/headsortails/watch-your-language-update-feature-engineering
Change nothing: 0.9231
Change to the most frequent pattern in training set: 0.9867
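A minimal sketch of the most-frequent-pattern baseline, assuming the training rows have already been read into (before, after) pairs. The function names and the toy rows are made up for illustration and are not taken from the actual script.

```python
from collections import Counter, defaultdict


def build_lookup(rows):
    """Map each 'before' token to its most frequent 'after' in training."""
    counts = defaultdict(Counter)
    for before, after in rows:
        counts[before][after] += 1
    # Keep only the single most frequent replacement per token.
    return {before: c.most_common(1)[0][0] for before, c in counts.items()}


def normalize(before, lookup):
    # Tokens never seen in training are left unchanged ("change nothing").
    return lookup.get(before, before)


# Toy rows for illustration only, not real competition data.
train = [("2", "two"), ("2", "two"), ("2", "second"), ("Dr.", "doctor"), ("the", "the")]
lookup = build_lookup(train)
```

Unseen tokens fall back to the "change nothing" behaviour, so this can only improve on that baseline for tokens that were seen in training.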
More data && num2words && measure rules:
decimal, digit, money rules:
This turned out to be of no use, since only 4 "inches" instances did not appear in the training set.
Big data && num2words && measure rules: 0.9954
$ grep dot baseline_ext_en.csv | wc -l
934
Change all a_letter to a: 0.9957
LI: fifty one, fifty first, the fifty first, l i
Build another dictionary with (before, class) as the key: 0.9978
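A rough sketch of the (before, class) dictionary idea, with a fallback to the plain before-only dictionary. The test set has no class column, so `cls` is assumed to come from some separate classifier. The function names, the toy rows and the class labels attached to them are illustrative assumptions.

```python
from collections import Counter, defaultdict


def most_frequent(table):
    # Reduce each Counter to its single most frequent 'after' value.
    return {key: c.most_common(1)[0][0] for key, c in table.items()}


def build_lookups(rows):
    """rows: iterable of (before, class, after) triples from training."""
    by_pair = defaultdict(Counter)
    by_token = defaultdict(Counter)
    for before, cls, after in rows:
        by_pair[(before, cls)][after] += 1
        by_token[before][after] += 1
    return most_frequent(by_pair), most_frequent(by_token)


def normalize_with_class(before, cls, pair_lookup, token_lookup):
    # Prefer the class-specific entry, then the token-only entry,
    # and finally leave the token unchanged.
    if (before, cls) in pair_lookup:
        return pair_lookup[(before, cls)]
    return token_lookup.get(before, before)


# Toy rows for illustration only.
train = [
    ("LI", "CARDINAL", "fifty one"),
    ("LI", "ORDINAL", "fifty first"),
    ("LI", "LETTERS", "l i"),
    ("LI", "LETTERS", "l i"),
]
pair_lookup, token_lookup = build_lookups(train)
```

With the class in the key, an ambiguous token such as LI can map to a different output per class instead of always taking the globally most frequent one.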
0.9924, rank 44
https://www.kaggle.com/c/text-normalization-challenge-english-language/discussion/43901
- Use XGBoost with context label data to predict each test case's class.
- For some classes, apply a customized normalization function.
- Use XGBoost to deal with binary ambiguous cases, like "-" and ":".
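One way the context for such a classifier could be assembled is a sliding window over each sentence. This is only a guess at what "context" means in that write-up. The window size, the padding token and the function name are assumptions, and the windows would still need a numeric encoding before being fed to a gradient-boosted model.

```python
def context_windows(tokens, size=1, pad="<PAD>"):
    """Return, for each token, a tuple of the token and its neighbours."""
    # Pad both ends so that edge tokens get a full-width window too.
    padded = [pad] * size + list(tokens) + [pad] * size
    width = 2 * size + 1
    return [tuple(padded[i:i + width]) for i in range(len(tokens))]


# Toy sentence for illustration: here "-" sits between two numbers.
windows = context_windows(["from", "1990", "-", "1995"], size=1)
```

The window around "-" carries both neighbouring numbers, which is the kind of signal a binary classifier would need to choose between two readings of the symbol.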
https://www.kaggle.com/c/text-normalization-challenge-english-language/discussion/44049
- Reclassing: 40 classes
- Sub-tokens: -17.5->-17.5
- Classifier: LSTM with token index, sub-token type indexes, sub-token length indexes, token w2v embedding.
- Limiting output types: by adding a large positive constant to valid classes.
- Ensemble: two models, one deciding the "same" replacement (token left unchanged), with all other classes taken from the second model.
- 20% of the training data held out for validation. Adam optimizer. Cross-entropy loss. Batch size 64. 10 epochs. 6 hours on a GTX 1060. The public dataset is used.
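The "limiting output types" trick from the list above can be sketched as follows, assuming the classifier returns one score per class and that the set of valid class indexes for a token is known. The function name and the size of the constant are illustrative and not taken from the original solution.

```python
def constrained_argmax(scores, valid, bonus=1e6):
    """Pick the best-scoring class among the valid ones.

    scores: list of per-class scores from the classifier.
    valid:  set of class indexes allowed for this token.
    """
    # Adding a large constant to valid classes guarantees that any valid
    # class outranks every invalid one, while keeping their relative order.
    boosted = [s + bonus if i in valid else s for i, s in enumerate(scores)]
    return max(range(len(boosted)), key=boosted.__getitem__)


scores = [2.0, 5.0, 1.0, 3.0]
best = constrained_argmax(scores, valid={0, 3})
```

If `valid` is empty, nothing is boosted and the function reduces to a plain argmax.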
https://www.kaggle.com/c/text-normalization-challenge-english-language/discussion/43963
- Statistical approach: possible transformations for each word with context. Plain text and common transformations.
- Pattern-based approach: regular expressions for dates, times, numbers, phones, URLs.
- ML approach: several LightGBM models to decide ambiguous cases, mostly binary decisions.
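A small sketch of what the pattern-based step might look like: try a few regular expressions in order and report which one matches. The patterns and labels here are simplified assumptions for illustration. They are not the patterns from that solution, and the labels are not the competition's class names.

```python
import re

# Ordered list: more specific patterns come before more general ones.
PATTERNS = [
    ("URL", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
    ("TIME", re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$")),
    ("DATE", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("NUMBER", re.compile(r"^-?\d+(\.\d+)?$")),
]


def match_pattern(token):
    """Return the label of the first pattern matching the whole token."""
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return None  # No pattern applies; fall back to another approach.
```

Tokens for which this returns None would be left to the statistical or ML approaches.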