EDSA-2201-2207-classification-hackathon

Overview

The Explore Data Science Academy Classification Hackathon is a challenge to identify the language of a given text, where the language is one of South Africa's 11 official languages. This is an example of language identification, the NLP task of determining which natural language a piece of text is written in.

South Africa is a multilingual country, and most South Africans speak two or more of the official languages. The challenge aims to build a system that can communicate in multiple languages, to deepen democracy and contribute to the social, cultural, intellectual, economic and political life of South African society.

The dataset used for this challenge is the NCHLT Text Corpora, collected by the South African Department of Arts and Culture and the Centre for Text Technology (CTexT, North-West University, South Africa). The training set was improved through additional cleaning by Praekelt. Each record contains a language ID and a text sample. The text is in various states of cleanliness, so some NLP techniques will be needed to clean up the data.
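As a rough illustration of the kind of clean-up mentioned above, here is a minimal sketch of a text-normalisation helper. The function name `clean_text` and the specific steps (lowercasing, removing URLs, digits and ASCII punctuation) are assumptions chosen for illustration, not necessarily what the notebook does.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Hypothetical cleaner: lowercase, drop URLs, digits and ASCII punctuation."""
    text = text.lower()
    # Remove web addresses, which carry no language signal.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Remove digits.
    text = re.sub(r"\d+", " ", text)
    # Replace ASCII punctuation with spaces; accented letters are left untouched.
    text = text.translate(_PUNCT_TABLE)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Hello,  World! 123 http://x.co"))
```

Only ASCII punctuation is stripped, so letters with diacritics, which can help distinguish some of the languages, are preserved.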

File descriptions

The repository contains the following files:

train_set.csv: The training set, used to train the model.

test_set.csv: The test set, used to evaluate the model's performance.

You can use the Language Identification Hack Solution(LebusoTsilo).ipynb Jupyter notebook to train the model and make predictions.
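For readers who want a feel for the workflow before opening the notebook, the following is a minimal sketch of a language-identification pipeline using scikit-learn. It assumes character n-gram TF-IDF features with a Multinomial Naive Bayes classifier. The toy sentences, the label codes (`eng`, `afr`, `zul`) and the helper name `build_language_id_pipeline` are illustrative assumptions that stand in for the real contents of train_set.csv, and the notebook's actual features and models may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_language_id_pipeline():
    """Hypothetical pipeline: character n-gram TF-IDF + Multinomial Naive Bayes."""
    return make_pipeline(
        # Character n-grams within word boundaries capture spelling patterns
        # that differ between languages.
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3), lowercase=True),
        MultinomialNB(alpha=0.1),
    )


# Toy stand-in for the training set (assumed layout: one text and one label per row).
train_texts = [
    "good morning, how are you today",
    "thank you very much for your help",
    "goeie môre, hoe gaan dit met jou",
    "baie dankie vir jou hulp",
    "sawubona, unjani namhlanje",
    "ngiyabonga kakhulu",
]
train_labels = ["eng", "eng", "afr", "afr", "zul", "zul"]

model = build_language_id_pipeline()
model.fit(train_texts, train_labels)

# Predict the language of unseen text.
print(model.predict(["good morning"]))
```

Character n-grams are a common choice for this task because they work on short texts and need no language-specific tokeniser.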

Note: The model is saved in models/model.pkl.
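A saved model of this kind can typically be written and read back with Python's `pickle` module. The sketch below assumes the file is a standard pickle at the path given in the note. The helper names `save_model` and `load_model` are hypothetical and are not taken from the notebook.

```python
import pickle
from pathlib import Path

# Path taken from the note above; everything else here is an assumption.
MODEL_PATH = Path("models/model.pkl")


def save_model(model, path=MODEL_PATH):
    """Serialise a fitted model to disk, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path=MODEL_PATH):
    """Load a previously pickled model from disk."""
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.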

Conclusion

This repository contains the code to solve the Explore Data Science Academy Classification Hackathon challenge. By using NLP techniques, we were able to accurately identify the language of a given text from any of South Africa's 11 official languages. This model can be used to build

