The Tuberculosis Prediction System is a healthcare data science project designed to predict the likelihood of tuberculosis (TB) using machine learning techniques and image analysis. The project analyzes patient health data and applies classification algorithms to identify patterns associated with tuberculosis infection. Tuberculosis is a serious infectious disease caused by Mycobacterium tuberculosis that primarily affects the lungs. Early detection is critical to reduce disease spread and ensure timely treatment. This project demonstrates how data science, machine learning, and image-based analysis can assist healthcare professionals by identifying potential TB cases through predictive analytics and automated detection systems.
- Project Description
- Problem Statement
- Features
- Technologies Used
- Installation
- Usage
- Machine Learning Models Used
- Model Evaluation
- Project Workflow
- Key Insights
- Future Improvements
- Credits
- Author
Tuberculosis remains one of the leading infectious diseases worldwide. Early detection can significantly improve treatment outcomes and reduce transmission rates. However, diagnosing TB early can be challenging due to delayed symptom recognition and limited access to diagnostic tools. This project aims to:
- Analyze healthcare datasets related to tuberculosis
- Identify relationships between symptoms and TB diagnosis
- Apply machine learning classification algorithms to predict TB cases
- Provide image upload functionality to assist in detection
- Generate insights that can support healthcare decision-making
- Data cleaning and preprocessing
- Exploratory Data Analysis (EDA)
- Visualization of health indicators
- Machine learning-based TB prediction using multiple classification models
- Image upload functionality for TB detection assistance
- Model comparison and evaluation
- Insight generation from healthcare datasets
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- Streamlit / Web interface (for image upload and prediction)
- Jupyter Notebook
- Step 1 – Clone the repository: `git clone https://github.com/VatsalMithal/TB_Project`
- Step 2 – Navigate to the project folder: `cd TB_Project`
- Step 3 – Install required libraries: `pip install -r requirements.txt`
- Step 4 – Run the application or notebook: `streamlit run app.py`
After running the project, users can:
- Load and explore the tuberculosis dataset
- Perform exploratory data analysis on symptoms and patient information
- Train machine learning models for TB prediction
- Upload medical images to assist TB detection
- View prediction results and analysis outputs

This demonstrates how machine learning models can support healthcare analytics and early disease detection.
This project applies multiple classification algorithms to predict whether a patient is likely to have tuberculosis.
The models implemented include:
- Logistic Regression: Baseline model for predicting TB probability.
- Decision Tree Classifier: Generates decision rules based on symptoms and health indicators.
- Random Forest Classifier: Ensemble learning method that improves prediction accuracy by combining multiple decision trees.
- K-Nearest Neighbors (KNN): Classifies cases based on similarity to nearby data points.
- Support Vector Machine (SVM): Finds the optimal boundary separating TB and non-TB cases.
- Naive Bayes: Probabilistic classification model suitable for medical datasets.
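The list of models above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual training code: the data is a synthetic stand-in generated with `make_classification`, and the hyperparameters (`n_estimators`, `n_neighbors`, `kernel`, the 75/25 split) are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient dataset (assumption: the real data is not shown here).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The six classifiers named in this README, with illustrative settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and record accuracy on the held-out split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

# Print the models from highest to lowest held-out accuracy.
for name, acc in sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```

The scores printed here come from synthetic data and say nothing about real TB prediction; the sketch only shows how the six models can share one training and scoring loop.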
The models were evaluated using classification metrics including:
- Accuracy Score
- Confusion Matrix
- Precision
- Recall
- F1-Score
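The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them on a small set of hypothetical labels (1 = TB, 0 = non-TB); the label values are made up for illustration and are not project results.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions (1 = TB, 0 = non-TB).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # share of predicted TB cases that are real
    "recall": recall_score(y_true, y_pred),        # share of real TB cases that were found
    "f1": f1_score(y_true, y_pred),                # harmonic mean of precision and recall
}

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
print(cm)
```

With these labels there are 4 true positives, 4 true negatives, 1 false positive, and 1 false negative, so all four scores come out at 0.80.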
Example comparison of model performance:
| Model | Accuracy |
|---|---|
| Logistic Regression | ~85% |
| Decision Tree | ~87% |
| Random Forest | ~90% |
| KNN | ~86% |
| SVM | ~88% |
| Naive Bayes | ~84% |
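Among these models, Random Forest achieved the highest accuracy in this comparison (~90%), making it the strongest candidate for TB prediction in this project. Because accuracy alone can hide missed TB cases, the choice should also be checked against recall and the confusion matrix.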
The system also includes image upload functionality: users can upload medical images to assist with tuberculosis detection.

Image analysis workflow:
- Upload medical image
- Preprocess image data
- Extract relevant features
- Apply prediction model
- Display detection results
This demonstrates how AI and image analysis can assist in medical screening systems.
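The workflow steps above (preprocess, extract features, apply model, display result) could look roughly like the sketch below. Everything in it is an assumption for illustration: the helper names `preprocess`, `extract_features`, and `predict_tb_probability` are hypothetical, the features are a plain intensity histogram, and the model is fitted on random noise with arbitrary labels, so its output carries no diagnostic meaning. The uploaded image is assumed to be already decoded into an 8-bit NumPy array.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def preprocess(image: np.ndarray, size: int = 64) -> np.ndarray:
    """Convert to grayscale, resample to size x size (nearest neighbour), scale to [0, 1]."""
    img = image.astype(np.float64)
    if img.ndim == 3:  # colour image: average the first three channels
        img = img[..., :3].mean(axis=2)
    rows = np.linspace(0, img.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size).round().astype(int)
    return img[np.ix_(rows, cols)] / 255.0


def extract_features(img: np.ndarray, bins: int = 16) -> np.ndarray:
    """Summarise the image as a normalised intensity histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()


def predict_tb_probability(image: np.ndarray, model: LogisticRegression) -> float:
    """Run the full chain on one image and return the model's class-1 probability."""
    features = extract_features(preprocess(image)).reshape(1, -1)
    return float(model.predict_proba(features)[0, 1])


# Placeholder training data: random noise images with arbitrary labels (not real scans).
rng = np.random.default_rng(0)
train_images = rng.integers(0, 256, size=(20, 128, 128), dtype=np.uint8)
train_labels = np.array([0, 1] * 10)
X_train = np.stack([extract_features(preprocess(im)) for im in train_images])
model = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Stand-in for an uploaded image, already decoded to an array.
uploaded = rng.integers(0, 256, size=(200, 150, 3), dtype=np.uint8)
probability = predict_tb_probability(uploaded, model)
print(f"Model output for class 1: {probability:.2f}")
```

In a Streamlit app, the array would typically come from decoding the file returned by `st.file_uploader`; that step is left out here to keep the sketch free of extra dependencies.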
- Data Collection: Obtain healthcare datasets containing tuberculosis-related patient information.
- Data Cleaning & Preprocessing: Handle missing values and prepare the dataset for machine learning.
- Exploratory Data Analysis (EDA): Identify patterns between symptoms and TB diagnosis.
- Feature Engineering: Select important health indicators affecting TB prediction.
- Machine Learning Modeling: Train classification models including Logistic Regression, Decision Tree, Random Forest, KNN, SVM, and Naive Bayes.
- Image Processing: Analyze uploaded medical images for TB detection.
- Model Evaluation: Compare models using accuracy and classification metrics.
- Insight Generation: Extract meaningful insights from the healthcare dataset.
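The tabular part of this workflow (cleaning, modeling, evaluation, and a first look at which indicators matter) can be sketched as one scikit-learn pipeline. The dataset below is a hypothetical toy table: the column names `cough_weeks`, `fever`, `weight_loss_kg`, and `tb_positive`, the labeling rule, and the median imputation strategy are all assumptions made for the example, not the project's actual schema or method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Data collection: hypothetical toy table standing in for real patient records.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "cough_weeks": rng.integers(0, 8, n).astype(float),
    "fever": rng.integers(0, 2, n).astype(float),
    "weight_loss_kg": rng.uniform(0, 10, n).round(1),
})
# Made-up labeling rule, used only so the toy data has a learnable pattern.
df["tb_positive"] = ((df["cough_weeks"] >= 3) & (df["weight_loss_kg"] > 4)).astype(int)
# Simulate missing values in 10% of the rows.
df.loc[df.sample(frac=0.1, random_state=0).index, "weight_loss_kg"] = np.nan

X = df.drop(columns="tb_positive")
y = df["tb_positive"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Cleaning and modeling in one object: impute missing values, then classify.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)

# Evaluation on the held-out split.
accuracy = pipeline.score(X_test, y_test)

# Insight generation: which columns the forest relied on most.
importances = pd.Series(
    pipeline.named_steps["model"].feature_importances_, index=X.columns
).sort_values(ascending=False)

print(f"Held-out accuracy: {accuracy:.2f}")
print(importances)
```

Keeping the imputer inside the pipeline means the median is learned from the training split only, which avoids leaking information from the test rows into preprocessing.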
- Symptoms such as persistent cough, fever, fatigue, and weight loss strongly correlate with TB cases.
- Machine learning models can effectively classify TB cases based on patient health indicators and symptom data.
- Ensemble models like Random Forest provide higher prediction accuracy compared to single models.
- Data visualization highlights patterns between symptom combinations and TB diagnosis probability.
- Image-based analysis can assist healthcare systems in faster screening and early detection of tuberculosis.

These insights demonstrate how data science and machine learning can support healthcare analytics and disease prediction systems.
- Implement deep learning models for improved image-based TB detection
- Improve model accuracy with larger healthcare datasets
- Build an interactive healthcare dashboard
- Integrate real-time hospital datasets
- Deploy the system as a web-based healthcare prediction platform
This project was developed as part of the Master of Data Science Certification Program provided by GUVI – HCL and IIT Madras (IITM). Project guidance, documentation support, and development assistance were provided by program mentors and ChatGPT.
Vatsal Mithal
Aspiring Data / Business Analyst