🧬 Explainable HIV Drug Resistance Prediction System


AI-driven clinical decision support system for interpretable HIV drug resistance prediction.

Designed as a computational dry-lab research platform integrating machine learning, genomics, and explainable AI.

🇬🇧 English

🧪 Research Objectives

To develop an explainable AI system for elucidating HIV drug resistance mechanisms and supporting clinical decision-making. This system integrates ensemble machine learning models with a SHAP-based interpretability framework to provide transparent resistance assessments for both individual sequences and large-scale genomic batch processing.

🔬 Methodology

Data Representation

  • k-mer Encoding: Sequences are split into k-mers of length 6 to 13 and represented as frequency vectors
  • Sparse Matrix Format: Efficient feature representation using scipy.sparse.csr_matrix
  • Gene-specific Features: Independent k-mer dictionaries for RT (Reverse Transcriptase) and PR (Protease)
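The encoding described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes overlapping k-mers with a stride of 1 and raw counts as the "frequency", and the function names and the tiny `vocab` dictionary are hypothetical. It uses only the standard library and produces the three arrays that `scipy.sparse.csr_matrix((data, indices, indptr), shape=...)` accepts.

```python
from collections import Counter


def kmer_counts(sequence, k_min=6, k_max=13):
    """Count every overlapping k-mer of length k_min..k_max (stride 1 is an assumption)."""
    seq = sequence.upper()
    counts = Counter()
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def to_csr_parts(sequences, vocabulary, k_min=6, k_max=13):
    """Return (data, indices, indptr) for a CSR matrix with one row per sequence.

    `vocabulary` maps k-mer -> column index; k-mers missing from it are ignored.
    """
    data, indices, indptr = [], [], [0]
    for seq in sequences:
        counts = kmer_counts(seq, k_min, k_max)
        # Keep only known k-mers and order them by column index.
        row = sorted((vocabulary[kmer], n) for kmer, n in counts.items() if kmer in vocabulary)
        for col, n in row:
            indices.append(col)
            data.append(n)
        indptr.append(len(indices))
    return data, indices, indptr


# Hypothetical three-entry dictionary; a real gene-specific dictionary would be far larger.
vocab = {"ATGCGT": 0, "TCGATC": 1, "GGGGGG": 2}
data, indices, indptr = to_csr_parts(["ATGCGTATCGATCGATCG"], vocab)
print(data, indices, indptr)  # [1, 2] [0, 1] [0, 2]
```

Keeping a separate `vocabulary` per gene mirrors the independent RT and PR dictionaries listed above.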

Model Design

  • Ensemble Learning: Random Forest + XGBoost probability averaging
  • Binary Classification: Resistance/susceptibility probability prediction
  • Threshold Classification: >0.75 (Highly Resistant), >0.5 (Medium Resistant), >0.2 (Low Resistant), Susceptible
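The averaging and thresholding rules above can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and it assumes the thresholds are strict inequalities applied to the averaged probability, as the list suggests.

```python
def ensemble_probability(rf_prob, xgb_prob):
    """Average the Random Forest and XGBoost resistance probabilities."""
    return (rf_prob + xgb_prob) / 2.0


def resistance_level(probability):
    """Map an averaged probability to a label using the thresholds listed above."""
    if probability > 0.75:
        return "Highly Resistant"
    if probability > 0.5:
        return "Medium Resistant"
    if probability > 0.2:
        return "Low Resistant"
    return "Susceptible"


avg = ensemble_probability(0.79, 0.85)
print(round(avg, 2), resistance_level(avg))  # 0.82 Highly Resistant
```

Because the comparisons are strict, a probability of exactly 0.75 falls into the "Medium Resistant" band under this reading.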

Explainability Framework

  • SHAP Value Computation: Feature contribution quantification using TreeExplainer
  • Local Interpretation: k-mer level impact analysis for each prediction
  • Visualization: Interactive charts with color-coded positive/negative impacts

🧬 Experimental Pipeline

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Streamlit UI  │────│   FastAPI API    │────│  ML Models +    │
│   (Port 8501)   │    │   (Port 8000)    │    │  SHAP Explainer │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Batch FASTA    │    │  REST Endpoints  │    │  Model Registry │
│  Processing     │    │  /predict        │    │  (v1/)          │
│  & Visualization│    │  /explain        │    │  RT/PR Models   │
│                 │    │  /health         │    │  + K-mers       │
└─────────────────┘    └──────────────────┘    └─────────────────┘

📊 Explainability Framework

k-mer Impact Analysis

  • Resistance Drivers: Positive SHAP values indicating resistance-promoting k-mers
  • Susceptibility Indicators: Negative SHAP values indicating k-mers that push the prediction toward susceptibility
  • Clinical Interpretation: Actionable insights for treatment planning
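As a rough sketch of how per-k-mer SHAP values could be separated into the two groups above, the snippet below ranks features by absolute impact and splits them by sign. The function name, the `top_k` default, and the impact values are made up for illustration; real values would come from the SHAP explainer.

```python
def split_impacts(impacts, top_k=10):
    """Split the top_k k-mers (by absolute SHAP value) into drivers and indicators.

    `impacts` maps k-mer -> SHAP value. Positive values are treated as
    resistance drivers, negative values as susceptibility indicators.
    """
    ranked = sorted(impacts.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
    drivers = [(kmer, value) for kmer, value in ranked if value > 0]
    indicators = [(kmer, value) for kmer, value in ranked if value < 0]
    return drivers, indicators


# Hypothetical SHAP values for a single prediction.
example = {"ATGCGT": 0.42, "TCGATC": -0.17, "CGATCG": 0.05, "GATCGA": -0.30}
drivers, indicators = split_impacts(example, top_k=3)
print(drivers)     # [('ATGCGT', 0.42)]
print(indicators)  # [('GATCGA', -0.3), ('TCGATC', -0.17)]
```

Ranking by absolute value before splitting keeps the strongest effects in view regardless of direction, which matches a chart that colors positive and negative impacts differently.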

Visualization System

  • Interactive Charts: Color-coded red (positive) / blue (negative) impacts
  • Drug-wise Tabs: Individual explanation views per medication
  • Batch Heatmaps: Resistance probability visualization across sequences

⚙️ Reproducibility

Dependencies

  • Python 3.12+
  • Docker & Docker Compose
  • 8GB+ RAM (model loading)

Environment Setup

# Reproducible environment with Docker
docker compose up --build

# Local development environment
pip install -r requirements.txt
uvicorn app.api.main:app --reload
streamlit run ui/app.py

🏥 Clinical Disclaimer

This system is intended for research and educational purposes only. AI predictions serve as decision-support indicators and must not replace clinical judgment.

Final treatment decisions remain the responsibility of qualified clinicians.

⚠️ Research Use Disclaimer

  • Sequence Length Constraint: Gene type is pre-classified by sequence length, RT (>400 bp) or PR (<400 bp)
  • Model Generalizability: Dependency on training data subtypes
  • Computational Cost: SHAP value calculation overhead
  • Clinical Use: Research use only; outputs are decision-support indicators and do not replace clinical judgment
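The length-based pre-classification in the first item can be sketched as below. The function name is hypothetical, and the handling of a sequence of exactly 400 bp is an assumption, since the rule above does not cover that case. When and where the system applies this rule is not stated here.

```python
def preclassify_gene(sequence, cutoff=400):
    """Guess the gene type from sequence length: RT if longer than cutoff, PR if shorter."""
    length = len(sequence)
    if length > cutoff:
        return "rt"
    if length < cutoff:
        return "pr"
    # Assumption: the documented rule leaves exactly `cutoff` bp undefined, so fail loudly.
    raise ValueError(f"sequence length {length} equals the cutoff; gene type is ambiguous")


print(preclassify_gene("A" * 401))  # rt
print(preclassify_gene("A" * 399))  # pr
```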

🧭 Ethical Considerations

  • Data Privacy: Sequence data anonymization
  • Interpretation Responsibility: Clinical interpretation requires expert physician judgment
  • Transparency: Clear communication of model limitations and uncertainties
  • Fairness: Subtype bias validation and mitigation

📚 Citation

@software{hiv_drug_resistance_prediction,
  title={Explainable HIV Drug Resistance Prediction System},
  author={Tushar Garg},
  year={2025},
  url={https://github.com/TusharGarg07/hiv-drug-resistance-prediction},
  version={1.0.0},
  doi={10.5281/zenodo.XXXXXXX}
}

📦 Installation & Deployment

Prerequisites

  • Python 3.12+
  • Docker & Docker Compose (recommended)

Local Development

git clone https://github.com/TusharGarg07/hiv-drug-resistance-prediction.git
cd hiv-drug-resistance-prediction
pip install -r requirements.txt

# Backend
uvicorn app.api.main:app --reload --host 0.0.0.0 --port 8000

# Frontend
streamlit run ui/app.py --server.port 8501

Docker Deployment

docker compose up --build
# Backend: http://localhost:8000/docs
# UI: http://localhost:8501

Cloud Deployment (Render)

  • Backend: FastAPI service with health checks
  • UI: Streamlit with BACKEND_URL environment variable
  • Model registry included in container builds

🔌 API Documentation

Prediction Endpoint

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"sequence": "ATGCGTATCGATCGATCGATCGATCGATCG", "gene_type": "rt"}'

Explanation Endpoint

curl -X POST "http://localhost:8000/explain" \
  -H "Content-Type: application/json" \
  -d '{"sequence": "ATGCGTATCGATCGATCGATCGATCGATCG", "gene_type": "rt", "top_k": 10}'

Health Check

curl http://localhost:8000/health
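The same requests can be issued from Python. The sketch below uses only the standard library and mirrors the curl calls above; the helper names `build_request` and `call` are hypothetical, and the base URL assumes the backend is running locally on port 8000.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes a locally running backend


def build_request(endpoint, sequence, gene_type, top_k=None):
    """Build a JSON POST request for /predict or /explain."""
    payload = {"sequence": sequence, "gene_type": gene_type}
    if top_k is not None:
        payload["top_k"] = top_k  # sent only when given, as in the /explain example
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


req = build_request("explain", "ATGCGTATCGATCGATCGATCGATCGATCG", "rt", top_k=10)
print(req.get_method(), req.full_url)  # POST http://localhost:8000/explain
# result = call(req)  # uncomment once the backend is up
```

Separating request construction from sending makes the payload easy to inspect without a running server.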

📊 Example Outputs

Prediction Response

{
  "model_version": "v1",
  "status": "success",
  "result": {
    "sequence_type": "RT",
    "predictions": [
      {
        "Drug": "AZT",
        "Resistance Level": "Highly Resistant",
        "Average Probability": 0.82,
        "Random Forest Probability": 0.79,
        "XGBoost Probability": 0.85
      }
    ]
  }
}

SHAP Explanation

{
  "model_version": "v1",
  "status": "success",
  "result": [
    {
      "drug": "AZT",
      "gene_type": "rt",
      "top_features": [
        {"feature": "ATGCGT", "impact": 0.42}
      ]
    }
  ]
}

🧬 Batch Genome Analysis

Upload FASTA files for high-throughput resistance prediction:

  • File Format: .fasta, .fa, .txt
  • Progress Tracking: Real-time processing status
  • Results: Sequence-wise resistance table and probability heatmap
  • Export: CSV/JSON for downstream analysis
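A minimal sketch of the batch flow, assuming plain multi-line FASTA input: parse the records, then write one CSV row per sequence. The function names and the `id` and `length` columns are hypothetical placeholders; in the real workflow each row would carry the per-drug predictions instead.

```python
import csv
import io


def parse_fasta(text):
    """Parse FASTA text into a list of (record_id, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line.upper())  # sequence lines may wrap across several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fasta = ">seq1 sample\nATGC\nGTAT\n>seq2\ncgatcg\n"
records = parse_fasta(fasta)
rows = [{"id": rid, "length": len(seq)} for rid, seq in records]
print(rows_to_csv(rows, ["id", "length"]))
```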

🗂️ Repository Structure

hiv-drug-resistance-prediction/
├── README.md                     # This file
├── LICENSE                       # MIT License
├── CITATION.bib                  # Bibliography
├── CHANGELOG.md                  # Version history
├── environment.yml               # Conda environment
├── docs/                         # Research documentation
│   ├── methodology.md            # Detailed methodology
│   ├── system_architecture.md    # System design
│   ├── explainability_framework.md # SHAP framework
│   └── experimental_pipeline.md  # Experiment workflow
├── paper/                        # Publication materials
│   ├── abstract.md               # Paper abstract
│   ├── methods.md                # Methods section
│   ├── results.md                # Results section
│   ├── limitations.md            # Study limitations
│   └── future_work.md            # Future directions
├── experiments/                  # Experiment tracking
│   ├── README.md                 # Experiment overview
│   ├── experiment_v1_baseline.md # Baseline models
│   ├── experiment_v2_ensemble.md # Ensemble models
│   └── shap_analysis.md          # SHAP explainability
├── notebooks/                    # Research notebooks
│   ├── README.md                 # Notebook overview
│   ├── 01_data_analysis.ipynb    # Sequence data exploration
│   ├── 02_feature_engineering.ipynb # k-mer feature analysis
│   ├── 03_model_training.ipynb   # Model development
│   ├── 04_shap_analysis.ipynb    # Explainability analysis
│   └── 05_batch_analysis.ipynb   # Batch processing validation
├── results/                      # Experimental results
│   ├── figures/                  # Generated figures
│   ├── tables/                   # Result tables
│   └── exports/                  # Exported datasets
├── figures/                      # Static figures for papers
├── config/                       # System configuration
├── tests/                        # Test suite
├── models/                       # Trained models
│   └── v1/                       # Version 1 models
├── data_sample/                  # Sample datasets
├── scripts/                      # Utility scripts
├── app/                          # Application code
│   ├── api/                      # FastAPI backend
│   ├── core/                     # Core inference engine
│   ├── models/                   # Model loading utilities
│   ├── preprocessing/            # Data preprocessing
│   └── services/                 # Business logic
├── ui/                           # Streamlit frontend
├── Dockerfile                    # Backend container
├── Dockerfile.ui                 # UI container
├── docker-compose.yml            # Local orchestration
├── render.yaml                   # Render deployment
└── requirements.txt              # Dependencies

🚀 Future Directions

  • Multi-drug resistance prediction for combination therapy
  • Clinical decision support integration
  • Real-time variant tracking with sequence databases
  • Advanced explainability with counterfactual analysis
  • Mobile-responsive interface
  • GPU acceleration for batch processing
  • Prospective clinical validation studies

📜 License

This project is licensed under the MIT License; see the LICENSE file for details.


👨‍🔬 Research Contact

Computational Biology Research Laboratory
Bioinformatics & AI Research Division
[research-contact@example.edu]


This research prototype represents current work in explainable AI for healthcare applications, suitable for academic collaboration and computational biology research environments.
