GitHub Issue Report Classification

Comparison

Model	Max sequence length	Epochs:Stopped epoch	Early stopping (patience)	Batch size	Learning rate	Weight decay	Optimizer	Accuracy	Precision	Recall	F1	Training time (h:mm:ss)
BERT	128	4	- (fixed epochs)	4	1e-5	0.01	AdamW	0.857484	0.855020	0.857484	0.855786	7:04:08
FLAN-T5	128	4	- (fixed epochs)	4	1e-5	0.01	AdamW	0.850928	0.846126	0.850928	0.846314	12:15:38
GPT2	128	50:7	3	4	1e-5	0.01	AdamW	0.858421	0.854322	0.858421	0.854667	14:23:23
FUNNEL	128	50:6	3	32	1e-5	0.01	AdamW	0.859218	0.856484	0.859218	0.857384	3:16:23

Dataset

dataset/raw: contains the original dataset; do not edit these files.
dataset/preprocess: contains the dataset after it has been processed by scripts/preprocessing.ipynb.

To get the training data, either:

unzip dataset/preprocess/github-labels-top3-803k-train.csv.zip, or
run the notebook scripts/preprocessing.ipynb
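
For the unzip route, a minimal sketch using only the Python standard library might look like the following. The archive path is the one given above; the helper name `extract_archive` and the choice to extract next to the archive are assumptions.

```python
import zipfile
from pathlib import Path

# Path taken from this README; adjust if the repository layout differs.
ARCHIVE = Path("dataset/preprocess/github-labels-top3-803k-train.csv.zip")


def extract_archive(archive, out_dir=None):
    """Extract every member of `archive` and return the extracted paths."""
    archive = Path(archive)
    # Assumption: extract beside the archive unless told otherwise.
    out_dir = Path(out_dir) if out_dir is not None else archive.parent
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
        return [out_dir / name for name in zf.namelist()]


if __name__ == "__main__":
    if ARCHIVE.exists():
        for path in extract_archive(ARCHIVE):
            print(path)
    else:
        print(f"{ARCHIVE} not found; run this from the repository root")
```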

Models

BERT
FLAN-T5
GPT2
FUNNEL
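
The GPT2 and FUNNEL rows of the Comparison table list 50 epochs, a patience of 3, and a stop at epoch 7 or 6. How the training scripts implement this is not shown in this README. The following is a minimal, framework-free sketch of patience-based early stopping, assuming the monitored quantity is a validation loss where lower is better. `EarlyStopper`, `run`, and the loss values are hypothetical.

```python
class EarlyStopper:
    """Signal a stop once the loss has failed to improve `patience` times in a row."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            # New best: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run(val_losses, max_epochs=50, patience=3):
    """Return the 1-based epoch at which training would stop."""
    stopper = EarlyStopper(patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.should_stop(loss):
            return epoch
    return min(len(val_losses), max_epochs)


# Made-up validation losses (not the project's actual curve).
losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.40]
print(run(losses))
```

With this made-up curve the best loss occurs at epoch 4 and the next three epochs fail to improve on it, so the loop stops at epoch 7, well before the 50-epoch cap.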

Dataset columns

id
issue_url
issue_label
issue_created_at
issue_author_association
repository_url
issue_title
issue_body
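
A quick way to confirm that a file has the columns listed above is to compare them against its header row. This sketch assumes the raw files are CSV with a header line; the helper name `missing_columns` and the in-memory sample are illustrative.

```python
import csv
import io

# Column names as listed in this README.
EXPECTED_COLUMNS = [
    "id", "issue_url", "issue_label", "issue_created_at",
    "issue_author_association", "repository_url", "issue_title", "issue_body",
]


def missing_columns(csv_file):
    """Return the expected columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in EXPECTED_COLUMNS if name not in header]


# In-memory sample (assumption) whose header lacks issue_body.
sample = io.StringIO(",".join(EXPECTED_COLUMNS[:-1]) + "\n")
print(missing_columns(sample))
```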

Preprocessing

data cleaning

drop rows where issue_title or issue_body is empty or NaN
drop rows whose label is not one of [bug, enhancement, question]
concatenate issue_title and issue_body into a single field, issue_data
replace tabs and line breaks in issue_data with spaces, then collapse repeated whitespace
tokenize issue_data using BertTokenizer
split data
- 85% training data
- 15% testing data
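
The cleaning and splitting steps above can be sketched with the standard library as follows. This is not the code in scripts/preprocessing.ipynb. It assumes each row is a dict keyed by the column names, joins title and body with a single space, and uses a seeded random shuffle for the split, since the README does not say whether the split is random, stratified, or seeded. Tokenization with BertTokenizer is left out because it requires the transformers package. `clean_rows` and `split_rows` are hypothetical names.

```python
import random
import re

KEEP_LABELS = {"bug", "enhancement", "question"}


def _text(value):
    """Return a stripped string, or '' for None, NaN, or other non-strings."""
    return value.strip() if isinstance(value, str) else ""


def clean_rows(rows):
    """Apply the cleaning steps to an iterable of dict rows."""
    cleaned = []
    for row in rows:
        title = _text(row.get("issue_title"))
        body = _text(row.get("issue_body"))
        if not title or not body:
            continue  # drop rows with an empty or missing title/body
        if row.get("issue_label") not in KEEP_LABELS:
            continue  # keep only the three target labels
        # \s+ covers tabs, line breaks, and runs of spaces in one pass.
        text = re.sub(r"\s+", " ", f"{title} {body}").strip()
        cleaned.append({"issue_label": row["issue_label"], "issue_data": text})
    return cleaned


def split_rows(rows, train_fraction=0.85, seed=0):
    """Shuffle with a fixed seed and split into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up rows for illustration.
raw = [
    {"issue_title": "Crash on\tstart", "issue_body": "App  fails\nwhen opened", "issue_label": "bug"},
    {"issue_title": "Docs", "issue_body": float("nan"), "issue_label": "question"},
    {"issue_title": "Typo", "issue_body": "Fix readme", "issue_label": "documentation"},
]
print(clean_rows(raw))
```

If label balance between the two splits matters, a stratified split (for example, splitting each label's rows separately) would be a natural replacement for the plain shuffle.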

AI-Assistance log

How does DistilBERT/BERT work?
What methods are there for handling class imbalance?
How do I stratify a split by label?
What other NLP models are there besides BERT?
Tell me more about ELECTRA.
How do I decide the number of epochs?
I am training a FLAN-T5 model; please tell me what's wrong.
In the Hugging Face Trainer, do I need to explicitly set fp16=True?

Repository: ZanderZhan/ML-Project
