philiase/EDSA-2201-2207-classification-hackathon

EDSA-2201-2207-classification-hackathon

Overview

The Explore Data Science Academy Classification Hackathon is a challenge to identify which of South Africa's 11 official languages a given text is written in. This is an instance of language identification, the NLP task of determining the natural language that a piece of text is written in.
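
As a rough illustration of language identification, the sketch below scores a text against per-language character trigram counts using a small Naive Bayes model with add-one smoothing. It is a minimal standard-library example, not the method used in this repository's notebook; the class name `NgramLanguageIdentifier`, the helper `char_ngrams`, and the labels `afr`, `eng` and `zul` are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Toy Naive Bayes over character n-grams (uniform class prior, add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # label -> Counter of n-grams seen for that label
        self.totals = {}    # label -> total number of n-grams seen for that label
        self.vocab = set()  # every n-gram seen during training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {label: sum(c.values()) for label, c in self.counts.items()}
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)

        def log_likelihood(label):
            counts, total = self.counts[label], self.totals[label]
            # Add-one smoothing keeps unseen n-grams from zeroing out the score.
            return sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)

        return max(self.counts, key=log_likelihood)


# Hypothetical toy data; the real training data lives in train_set.csv.
model = NgramLanguageIdentifier().fit(
    ["die kat sit op die mat", "the cat sits on the mat", "ikati lihlezi phezu kocansi"],
    ["afr", "eng", "zul"],
)
predicted = model.predict("die kat sit op die mat")
```

Character n-grams are a common choice for this task because they capture spelling patterns without needing a tokenizer for each language. A real solution would train on far more text per language.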

South Africa is a multilingual country, and most South Africans speak two or more of the official languages. The challenge aims to build a system that can communicate in multiple languages, in order to deepen democracy and contribute to the social, cultural, intellectual, economic and political life of South African society.

The dataset used for this challenge is the NCHLT Text Corpora, collected by the South African Department of Arts and Culture & Centre for Text Technology (CTexT, North-West University, South Africa). The training set was improved through additional cleaning done by Praekelt. Each record contains a language ID and a piece of text, and the text is in various states of cleanliness, so some NLP preprocessing is needed to clean up the data.
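
One possible cleaning step is sketched below. It is an assumption about what "cleaning" could involve here (lowercasing, removing URLs, ASCII punctuation and digits, collapsing whitespace), not a record of the preprocessing actually used in the notebook; the function name `clean_text` is hypothetical.

```python
import re
import string

_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_DIGIT_RE = re.compile(r"\d+")
_WHITESPACE_RE = re.compile(r"\s+")
# Map every ASCII punctuation character to a space so adjacent words stay separated.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Lowercase the text and strip URLs, ASCII punctuation, digits and extra spaces."""
    text = text.lower()
    text = _URL_RE.sub(" ", text)
    text = text.translate(_PUNCT_TABLE)
    text = _DIGIT_RE.sub(" ", text)
    return _WHITESPACE_RE.sub(" ", text).strip()


cleaned = clean_text("Hallo, Wêreld!  Besoek https://example.com 123")
```

Only ASCII punctuation is removed, so accented letters such as `ê` or `š` are preserved. Stripping apostrophes and hyphens may discard cues that help tell some languages apart, so it is worth checking whether this step helps or hurts validation accuracy.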

File descriptions

The repository contains the following files:

train_set.csv: This is the training set that is used to train the model.

test_set.csv: This is the test set that is used to evaluate the model's performance.

Language Identification Hack Solution(LebusoTsilo).ipynb: This is the Jupyter notebook that you can use to train the model and make predictions.

Note: The model is saved in models/model.pkl.
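
The files listed above might be used along the following lines. This is a hedged sketch using only the standard library: the column names `lang_id` and `text` are assumptions about the CSV header, and the helpers `load_rows`, `save_model` and `load_model` are hypothetical rather than functions from the notebook.

```python
import csv
import pickle
from pathlib import Path


def load_rows(csv_path, label_col="lang_id", text_col="text"):
    """Read texts and labels from a CSV file; the column names are assumptions."""
    texts, labels = [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            texts.append(row[text_col])
            labels.append(row[label_col])
    return texts, labels


def save_model(model, path="models/model.pkl"):
    """Pickle a trained model, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path="models/model.pkl"):
    """Load a pickled model; only unpickle files from a source you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

A typical flow would call `load_rows("train_set.csv")`, fit a classifier on the result, and pass the fitted object to `save_model`. Unpickling can execute arbitrary code, so `load_model` should only be used on files you created or trust.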

Conclusion

This repository contains the code to solve the Explore Data Science Academy Classification Hackathon challenge. By using NLP techniques, we were able to accurately identify the language of a given text from any of South Africa's 11 official languages. This model can be used to build
