The Explore Data Science Academy Classification Hackathon is a challenge to identify which of South Africa's 11 official languages a given text is written in. This is an example of Language Identification, the NLP task of determining the natural language that a piece of text is written in.
South Africa is a multilingual country, and most South Africans can speak at least two of the official languages. The challenge aims to build a system that can communicate in multiple languages, in order to deepen democracy and contribute to the social, cultural, intellectual, economic and political life of South African society.
The dataset used for this challenge is the NCHLT Text Corpora, collected by the South African Department of Arts and Culture and the Centre for Text Technology (CTexT, North-West University, South Africa). The training set was improved through additional cleaning done by Praekelt. Each record contains a language ID and a text sample. The text is in various states of cleanliness, so some NLP preprocessing is needed to clean up the data before modelling.
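As an illustration of the kind of cleaning mentioned above, here is a minimal sketch using only the Python standard library. The function name `clean_text` and the specific steps (lowercasing, removing URLs, digits and ASCII punctuation, collapsing whitespace) are assumptions chosen for illustration; they are not necessarily the steps used in the notebook.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Hypothetical cleaning helper: lowercase, drop URLs, digits and
    ASCII punctuation, then collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)                    # remove digits
    text = text.translate(_PUNCT_TABLE)                 # punctuation -> spaces
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace


if __name__ == "__main__":
    print(clean_text("Ek sê: dit is 2 goed."))
```

Only ASCII punctuation is removed, so accented letters such as ê or š, which carry signal for several of the languages, are left intact. Whether stripping apostrophes and hyphens helps or hurts accuracy is something to check on the actual data.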
The repository contains the following files:
train_set.csv: This is the training set that is used to train the model.
test_set.csv: This is the test set that is used to evaluate the model's performance.
To train the model and make predictions, use the Jupyter notebook Language Identification Hack Solution(LebusoTsilo).ipynb.
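For readers who want a feel for how a language identifier can work before opening the notebook, the sketch below trains a small character n-gram Naive Bayes classifier in pure Python. This is an assumed, simplified technique shown for illustration only, not a description of the notebook's model. The class `NgramNaiveBayes`, the toy sentences, and the label codes `afr` and `eng` are all hypothetical; in practice the texts and labels would come from train_set.csv.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padding with spaces so that
    word starts and ends are captured."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                                # n-gram length
        self.alpha = alpha                        # Laplace smoothing constant
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * vocab_size
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data invented for this example; not taken from the NCHLT corpus.
train_texts = [
    "die kat sit op die mat",
    "ek is lief vir jou",
    "the cat sits on the mat",
    "i love you very much",
]
train_labels = ["afr", "afr", "eng", "eng"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)

if __name__ == "__main__":
    print(model.predict("the dog sits on the mat"))
    print(model.predict("die hond sit op die mat"))
```

Character n-grams are a common choice for this task because they capture spelling patterns without needing a tokenizer for each language. A library implementation would normally replace this hand-written class on the full dataset.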
Note: The model is saved in models/model.pkl.
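A saved pickle such as the one noted above can typically be loaded back with Python's standard `pickle` module. The sketch below assumes the file was written with `pickle.dump`; the helper name `load_model` is hypothetical, and the type and interface of the loaded object depend on what the notebook actually saved.

```python
import pickle
from pathlib import Path

# Path taken from the note above; adjust if the repository layout differs.
MODEL_PATH = Path("models/model.pkl")


def load_model(path=MODEL_PATH):
    """Hypothetical helper: load a pickled object from disk.
    Only unpickle files from a source you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    if MODEL_PATH.exists():
        model = load_model()
        print(type(model))
    else:
        print(f"No model found at {MODEL_PATH}")
```

Unpickling generally requires the same library versions that were installed when the model was saved, so loading may fail in a different environment.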
This repository contains the code to solve the Explore Data Science Academy Classification Hackathon challenge. By using NLP techniques, we were able to accurately identify the language of a given text from any of South Africa's 11 official languages. This model can be used to build