VatsalMithal/TB_Project

Tuberculosis Prediction System – Data Science Project

Project Description

The Tuberculosis Prediction System is a healthcare data science project designed to predict the likelihood of tuberculosis (TB) using machine learning techniques and image analysis. The project analyzes patient health data and applies classification algorithms to identify patterns associated with tuberculosis infection. Tuberculosis is a serious infectious disease caused by Mycobacterium tuberculosis that primarily affects the lungs. Early detection is critical to reduce disease spread and ensure timely treatment. This project demonstrates how data science, machine learning, and image-based analysis can assist healthcare professionals by identifying potential TB cases through predictive analytics and automated detection systems.


Table of Contents

  1. Project Description
  2. Problem Statement
  3. Features
  4. Technologies Used
  5. Installation
  6. Usage
  7. Machine Learning Models Used
  8. Model Evaluation
  9. Image Analysis for TB Detection
  10. Project Workflow
  11. Key Insights
  12. Future Improvements
  13. Credits
  14. Author

Problem Statement

Tuberculosis remains one of the leading infectious diseases worldwide. Early detection can significantly improve treatment outcomes and reduce transmission rates. However, diagnosing TB early can be challenging due to delayed symptom recognition and limited access to diagnostic tools. This project aims to:

  • Analyze healthcare datasets related to tuberculosis
  • Identify relationships between symptoms and TB diagnosis
  • Apply machine learning classification algorithms to predict TB cases
  • Provide image upload functionality to assist in detection
  • Generate insights that can support healthcare decision-making

Features

  • Data cleaning and preprocessing
  • Exploratory Data Analysis (EDA)
  • Visualization of health indicators
  • Machine learning-based TB prediction using multiple classification models
  • Image upload functionality for TB detection assistance
  • Model comparison and evaluation
  • Insight generation from healthcare datasets

Technologies Used

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Scikit-learn
  • Streamlit / Web interface (for image upload and prediction)
  • Jupyter Notebook

Installation

  • Step 1 – Clone the repository: git clone https://github.com/VatsalMithal/TB_Project
  • Step 2 – Navigate to the project folder: cd TB_Project
  • Step 3 – Install required libraries: pip install -r requirements.txt
  • Step 4 – Run the application: streamlit run app.py (the notebook can instead be opened in Jupyter Notebook)

Usage

After running the project, users can:

  • Load and explore the tuberculosis dataset
  • Perform exploratory data analysis on symptoms and patient information
  • Train machine learning models for TB prediction
  • Upload medical images to assist TB detection
  • View prediction results and analysis outputs

This demonstrates how machine learning models can support healthcare analytics and early disease detection.

Machine Learning Models Used

This project applies multiple classification algorithms to predict whether a patient is likely to have tuberculosis.

The models implemented include:

  • Logistic Regression: Baseline model for predicting TB probability.
  • Decision Tree Classifier: Generates decision rules based on symptoms and health indicators.
  • Random Forest Classifier: Ensemble learning method that improves prediction accuracy by combining multiple decision trees.
  • K-Nearest Neighbors (KNN): Classifies cases based on similarity to nearby data points.
  • Support Vector Machine (SVM): Finds the optimal boundary separating TB and non-TB cases.
  • Naive Bayes: Probabilistic classification model suitable for medical datasets.
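The six models listed above can be trained and compared in a single loop. The following is a minimal sketch, not the project's actual training code: the data comes from scikit-learn's `make_classification` as a synthetic stand-in for the TB dataset, and the feature count, split ratio, and hyperparameters are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the TB dataset (assumption: 8 numeric features, binary label).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One estimator per model named in the list; hyperparameters are illustrative defaults.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and record accuracy on the held-out split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

# Print models from highest to lowest held-out accuracy.
for name, acc in sorted(test_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {acc:.3f}")
```

Because the data here is synthetic, the printed ranking says nothing about how these models perform on real TB data.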

Model Evaluation

The models were evaluated using classification metrics including:

  • Accuracy Score
  • Confusion Matrix
  • Precision
  • Recall
  • F1-Score

Example comparison of model performance:

Model                  Accuracy
Logistic Regression    ~85%
Decision Tree          ~87%
Random Forest          ~90%
KNN                    ~86%
SVM                    ~88%
Naive Bayes            ~84%

Among these models, Random Forest typically achieved the highest accuracy (~90%), making it the strongest candidate for TB prediction in this project by that metric.
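To make the listed metrics concrete, here is a small worked example that derives accuracy, precision, recall, and F1-score from confusion-matrix counts. It is a sketch in plain Python with made-up labels (1 = TB, 0 = non-TB); the helper names `confusion_counts` and `classification_metrics` are hypothetical and not taken from the project code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # TB case correctly flagged
        elif pred == positive:
            fp += 1  # healthy case flagged as TB
        elif truth == positive:
            fn += 1  # TB case missed
        else:
            tn += 1  # healthy case correctly cleared
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for ten patients (illustration only, not project results).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.80
# precision: 0.75
# recall: 0.75
# f1: 0.75
```

In a screening setting, recall deserves particular attention because each false negative is a missed TB case, which accuracy alone can hide when most patients are healthy.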


Image Analysis for TB Detection

The system also includes image upload functionality where users can upload medical images to assist with tuberculosis detection.

Image analysis workflow:

  1. Upload medical image
  2. Preprocess image data
  3. Extract relevant features
  4. Apply prediction model
  5. Display detection results

This demonstrates how AI and image analysis can assist in medical screening systems.
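The five workflow steps can be sketched end to end as follows. This is an illustrative outline only: the README does not say which features or model the project uses for images, so the intensity-histogram features, the 64x64 target size, the logistic regression model, and the helper names `preprocess`, `extract_features`, and `fake_scan` are all assumptions. Random arrays stand in for real medical images, and the uploaded file is simulated rather than read from disk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

TARGET_SIZE = 64  # hypothetical side length after downsampling


def preprocess(image: np.ndarray, size: int = TARGET_SIZE) -> np.ndarray:
    """Step 2: scale 8-bit pixels to [0, 1] and downsample by index sampling."""
    image = image.astype(np.float64) / 255.0
    rows = np.linspace(0, image.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, image.shape[1] - 1, size).astype(int)
    return image[np.ix_(rows, cols)]


def extract_features(image: np.ndarray, bins: int = 16) -> np.ndarray:
    """Step 3: normalized intensity histogram plus mean and standard deviation."""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    hist = hist / hist.sum()
    return np.concatenate([hist, [image.mean(), image.std()]])


rng = np.random.default_rng(0)


def fake_scan(bright: bool) -> np.ndarray:
    """Synthetic 8-bit grayscale image; brightness is an arbitrary class signal."""
    base = 140 if bright else 100
    return rng.normal(base, 30, size=(128, 128)).clip(0, 255).astype(np.uint8)


# Train a toy model on synthetic images (label 1 = "bright", label 0 = "dark").
labels = np.array([0, 1] * 20)
train_X = np.array(
    [extract_features(preprocess(fake_scan(bool(label)))) for label in labels]
)
model = LogisticRegression(max_iter=1000).fit(train_X, labels)

# Step 1: the "uploaded" image is simulated here.
uploaded = fake_scan(bright=True)

# Steps 2-4: preprocess, extract features, apply the model.
features = extract_features(preprocess(uploaded)).reshape(1, -1)
probability = model.predict_proba(features)[0, 1]

# Step 5: display the result.
print(f"Estimated probability of the positive class: {probability:.2f}")
```

Keeping preprocessing and feature extraction in separate functions means the same code path serves both training images and the single uploaded image at prediction time.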


Project Workflow

  • Data Collection: Obtain healthcare datasets containing tuberculosis-related patient information.
  • Data Cleaning & Preprocessing: Handle missing values and prepare the dataset for machine learning.
  • Exploratory Data Analysis (EDA): Identify patterns between symptoms and TB diagnosis.
  • Feature Engineering: Select important health indicators affecting TB prediction.
  • Machine Learning Modeling: Train classification models including Logistic Regression, Decision Tree, Random Forest, KNN, SVM, and Naive Bayes.
  • Image Processing: Analyze uploaded medical images for TB detection.
  • Model Evaluation: Compare models using accuracy and classification metrics.
  • Insight Generation: Extract meaningful insights from the healthcare dataset.

Key Insights

  • In the dataset analyzed, symptoms such as persistent cough, fever, fatigue, and weight loss correlate strongly with TB cases.
  • Machine learning models can effectively classify TB cases based on patient health indicators and symptom data.
  • Ensemble models like Random Forest provide higher prediction accuracy compared to single models.
  • Data visualization highlights patterns between symptom combinations and TB diagnosis probability.
  • Image-based analysis can assist healthcare systems in faster screening and early detection of tuberculosis.

These insights demonstrate how data science and machine learning can support healthcare analytics and disease prediction systems.

Future Improvements

  • Implement deep learning models for improved image-based TB detection
  • Improve model accuracy with larger healthcare datasets
  • Build an interactive healthcare dashboard
  • Integrate real-time hospital datasets
  • Deploy the system as a web-based healthcare prediction platform

Credits

This project was developed as part of the Master of Data Science Certification Program provided by GUVI – HCL and IIT Madras (IITM). Project guidance, documentation support, and development assistance were provided by program mentors and ChatGPT.


Author

Vatsal Mithal

Aspiring Data / Business Analyst


About

Healthcare analytics project using multiple ML classification models and image analysis to predict tuberculosis risk based on patient health indicators and symptoms.
