| Model | Max sequence length | Epochs (max : stopped) | Early stopping patience | Batch size | Learning rate | Weight decay | Optimizer | Accuracy | Precision | Recall | F1 | Training time (h:mm:ss) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 128 | 4 | n/a (fixed epochs) | 4 | 1e-5 | 0.01 | AdamW | 0.857484 | 0.855020 | 0.857484 | 0.855786 | 7:04:08 |
| FLAN-T5 | 128 | 4 | n/a (fixed epochs) | 4 | 1e-5 | 0.01 | AdamW | 0.850928 | 0.846126 | 0.850928 | 0.846314 | 12:15:38 |
| GPT2 | 128 | 50 : 7 | 3 | 4 | 1e-5 | 0.01 | AdamW | 0.858421 | 0.854322 | 0.858421 | 0.854667 | 14:23:23 |
| FUNNEL | 128 | 50 : 6 | 3 | 32 | 1e-5 | 0.01 | AdamW | 0.859218 | 0.856484 | 0.859218 | 0.857384 | 3:16:23 |
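For reference, a minimal sketch of how the hyperparameters in this table map onto the Hugging Face `Trainer` API. The FUNNEL row is shown (batch size 32, up to 50 epochs with early-stopping patience 3, AdamW with learning rate 1e-5 and weight decay 0.01); the output directory and the `train_dataset`/`eval_dataset` variables are illustrative and assumed to come from the preprocessing step described further below.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Illustrative checkpoint; the BERT / FLAN-T5 / GPT2 / FUNNEL runs differ only in
# the checkpoint, batch size, and whether early stopping is used.
checkpoint = "funnel-transformer/small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

args = TrainingArguments(
    output_dir="out",                      # illustrative
    num_train_epochs=50,                   # upper bound; early stopping ends sooner
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-5,
    weight_decay=0.01,                     # AdamW is the Trainer's default optimizer
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,           # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",     # stop when eval loss stops improving
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,           # assumed: tokenized output of preprocessing
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```

For the BERT and FLAN-T5 rows, drop the callback and set `num_train_epochs=4` and the batch size to 4.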
- `dataset/raw`: contains the original dataset; do not edit these files.
- `dataset/preprocess`: contains the dataset that has been processed by `scripts/preprocessing.ipynb`.
To get the training data, either:

- unzip `dataset/preprocess/github-labels-top3-803k-train.csv.zip`, or
- run the script `scripts/preprocessing.ipynb`.
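To take a quick look at the extracted file, a minimal sketch assuming pandas is installed (the path matches the unzipped CSV above):

```python
import pandas as pd

# Load the preprocessed training split; pandas can also read the .csv.zip directly.
train_df = pd.read_csv("dataset/preprocess/github-labels-top3-803k-train.csv")
print(train_df.shape)
print(train_df.columns.tolist())
print(train_df.head())
```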
- BERT
- FLAN-T5
- GPT2
- FUNNEL
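Below is a minimal sketch of loading any of the four architectures with a 3-label classification head via Hugging Face's `AutoModelForSequenceClassification`. The checkpoint names are assumptions (the repo may pin different ones), and T5-style sequence-classification heads require a fairly recent `transformers` release.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint names for the four architectures; adjust to whatever the repo uses.
CHECKPOINTS = {
    "BERT": "bert-base-uncased",
    "FLAN-T5": "google/flan-t5-base",
    "GPT2": "gpt2",
    "FUNNEL": "funnel-transformer/small",
}

def load_model(name: str):
    """Return (tokenizer, model) for one of the four models, with a 3-label head."""
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
    # GPT-2 has no padding token by default; reuse EOS so batched inputs can be padded.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```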
- id
- issue_url
- issue_label
- issue_created_at
- issue_author_association
- repository_url
- issue_title
- issue_body
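As a quick check (a hypothetical helper, not part of the repo) that a loaded raw DataFrame carries exactly these columns:

```python
EXPECTED_COLUMNS = [
    "id", "issue_url", "issue_label", "issue_created_at",
    "issue_author_association", "repository_url", "issue_title", "issue_body",
]

def check_schema(df) -> None:
    """Raise if any of the raw columns listed above is missing."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
```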
- drop rows with empty/NaN values in `issue_body` or `issue_title` (the full pipeline is sketched in code after this list)
- drop rows whose label is not in [bug, enhancement, question]
- concatenate `issue_title` and `issue_body` into one field: `issue_data`
- replace tabs and line breaks in `issue_data` with spaces, then remove repeating whitespace
- tokenize `issue_data` using `BertTokenizer`
- split data
  - 85% training data
  - 15% testing data
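A minimal sketch of these steps with pandas, scikit-learn, and `BertTokenizer`. Column and label names follow the lists above; the checkpoint name, random seed, and helper function are illustrative, and the split is done just before tokenization here so each split is encoded separately.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

LABELS = ["bug", "enhancement", "question"]

def preprocess(df: pd.DataFrame):
    """Apply the preprocessing steps listed above to the raw issues DataFrame."""
    # 1. Drop rows with empty/NaN issue_title or issue_body.
    df = df.dropna(subset=["issue_title", "issue_body"])
    df = df[(df["issue_title"].str.strip() != "") & (df["issue_body"].str.strip() != "")]

    # 2. Keep only the three target labels.
    df = df[df["issue_label"].isin(LABELS)].copy()

    # 3. Concatenate title and body into one field, issue_data.
    df["issue_data"] = df["issue_title"] + " " + df["issue_body"]

    # 4. Replace tabs and line breaks with spaces, then collapse repeated whitespace.
    df["issue_data"] = (
        df["issue_data"]
        .str.replace(r"[\t\r\n]+", " ", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

    # 5. Stratified 85/15 train/test split on the label.
    train_df, test_df = train_test_split(
        df, test_size=0.15, stratify=df["issue_label"], random_state=42
    )

    # 6. Tokenize issue_data with BertTokenizer (max sequence length 128, as in the table).
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    train_enc = tokenizer(
        train_df["issue_data"].tolist(), truncation=True, padding="max_length", max_length=128
    )
    test_enc = tokenizer(
        test_df["issue_data"].tolist(), truncation=True, padding="max_length", max_length=128
    )
    return train_df, test_df, train_enc, test_enc
```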
- How does DistilBERT/BERT work?
- What are the methods for handling class imbalance?
- How to choose a split stratified by label?
- What other NLP models are there besides BERT?
- Tell me more about ELECTRA
- How do I decide the number of epochs?
- I am training a FLAN-T5 model; please tell me what's wrong.
- In the Hugging Face Trainer, do I need to explicitly set `fp16=True`?