
Hords01/Data_Mining


🗞️ Text Processing and Information Retrieval

This repository demonstrates key steps in natural language processing and information retrieval using Turkish text data. It covers text tokenization, TF-IDF vectorization, and sparse matrix generation.

📌 This project was submitted as a midterm replacement assignment by Emirkan Beyaz for the course "BGG - Bilgi Geri Getirimine Giriş (Introduction to Information Retrieval)" taken during the 2023–2024 Spring Semester, taught by Asst. Prof. Tolga Berber.


🎯 Project Objectives

  • Process Turkish textual data.
  • Apply BPE-based tokenization.
  • Calculate term frequencies and inverse document frequencies.
  • Construct a sparse TF-IDF matrix for information retrieval purposes.
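
The term-frequency and inverse-document-frequency objectives can be illustrated with a minimal sketch. The weighting variant used in this repository is not stated here, so the sketch assumes relative term frequency and an unsmoothed natural-log IDF, `idf(t) = ln(N / df(t))`. The function name `tf_idf` and the sample tokens are hypothetical and stand in for the tokenized news articles.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of token lists.

    Assumed weighting (not confirmed by the repository):
      tf(t, d) = count of t in d / number of tokens in d
      idf(t)   = ln(N / df(t)), where df(t) is the number of documents containing t
    Returns (idf, rows), where rows[i] maps each term of document i to its weight.
    """
    n_docs = len(docs)

    # Document frequency: count each term at most once per document.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        rows.append({term: (c / total) * idf[term] for term, c in counts.items()})
    return idf, rows


# Hypothetical toy corpus; real input would be the token lists of the articles.
docs = [
    ["ekonomi", "haber", "borsa"],
    ["spor", "haber", "futbol"],
    ["ekonomi", "borsa", "borsa", "para"],
]
idf, rows = tf_idf(docs)
```

Each row stores only the terms that occur in that document, which is already a sparse representation: terms absent from a document have weight zero and are simply not stored.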

🧩 Project Structure

  1. Data Loading
  2. Text Preprocessing
  3. Text Fragmentation
  4. Numerical Encoding
  5. TF-IDF Calculation
  6. Normalization
  7. Sparse Matrix Conversion
  8. Fullness Rate Calculation
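
Steps 6 to 8 can be sketched in plain Python. This is an illustration under assumptions, not the repository's code: normalization is taken to be per-row L2 normalization, the sparse format is a CSR-style triple of lists, and the "fullness rate" is read as the share of non-zero entries in the matrix (its density). All function names are hypothetical.

```python
import math


def l2_normalize(row):
    """Scale a row to unit Euclidean length; all-zero rows are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)


def to_csr(dense):
    """Convert a dense row-major matrix into CSR-style lists.

    data    : the non-zero values, row by row
    indices : the column index of each value in data
    indptr  : row i occupies data[indptr[i]:indptr[i + 1]]
    """
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


def fullness_rate(data, n_rows, n_cols):
    """Fraction of matrix cells that hold a non-zero value."""
    return len(data) / (n_rows * n_cols)


# Hypothetical 3 x 4 weight matrix (documents x terms).
dense = [
    [3.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 0.0],
]
normalized = [l2_normalize(row) for row in dense]
data, indices, indptr = to_csr(normalized)
rate = fullness_rate(data, len(dense), len(dense[0]))
```

With a realistic vocabulary the fullness rate is typically very small, which is why storing only the non-zero entries pays off.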

🛠️ Tools & Resources Used

  • 🧠 BPE-32 Tokenizer (Turkish)
    Byte-Pair Encoding tokenizer used for efficient subword tokenization, created in class under the guidance of Asst. Prof. Tolga Berber.

  • 📰 Turkish News Dataset
    A set of Turkish news articles in .txt format provided as course material.
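
For readers unfamiliar with Byte-Pair Encoding, the toy sketch below shows the core idea of learning merges: repeatedly find the most frequent adjacent symbol pair and fuse it into one symbol. It is not the course tokenizer and does not reproduce its vocabulary or API. The function name `learn_bpe` and the sample words are hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words.

    Returns (merges, corpus): the merged pairs in the order they were learned,
    and a Counter mapping each segmented word (a tuple of symbols) to its frequency.
    """
    # Every word starts as a sequence of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge

        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus

    return merges, corpus


# Hypothetical sample: three Turkish words sharing the stem "haber" (news).
merges, corpus = learn_bpe(["haber", "haberler", "haberci"], num_merges=3)
```

In this sample the pair ("e", "r") is the most frequent at the start, so it is merged first. Production tokenizers usually operate on bytes and learn tens of thousands of merges from a large corpus.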


📌 Citation / Attribution

If you use this repository, any part of the code, or the provided materials in your own projects, research, or publication, please cite or give proper credit as follows:

🔹 Code & Project

Emirkan Beyaz, "Text Processing and Information Retrieval Project", GitHub Repository, 2025.
Please cite this repository or mention the author when using the code or project methodology.

🔹 Dataset & Tokenizer Attribution

The Turkish dataset and BPE tokenizer used in this project were provided by:

Asst. Prof. Tolga Berber, "BPE Tokenizer and Turkish News Dataset",
Karadeniz Technical University, Department of Statistics and Computer Science, 2024.
If you use these resources, please cite or acknowledge Tolga Berber accordingly.


📬 Contact

For any questions, suggestions, or issues, feel free to open an issue or contact me via GitHub.

