ImmuneSubtypeClassifier is an R package for robust immune subtype classification of tumor samples using gene expression data. It uses the "Robencla" (Robust Ensemble Classifier) framework (github.com/gibbsdavidl/Robencla), utilizing an ensemble of XGBoost models to assign samples to one of six immune subtypes (C1–C6).
This package requires xgboost version < 2.0.0.
Due to major changes in the XGBoost model serialization format, models trained on version 1.x cannot be loaded by version 2.x or 3.x. The pre-trained models included in this package were built using XGBoost 1.7.x.
To ensure compatibility, this package enforces xgboost (< 2.0.0). Using this package with renv is encouraged,
but if you are setting up your environment manually:
# Install a compatible version of xgboost
remotes::install_version("xgboost", version = "1.7.8.1")
You can install the development version of ImmuneSubtypeClassifier from GitHub:
# Install devtools if you haven't already
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("gibbsdavidl/robencla")
devtools::install_github("gibbsdavidl/ImmuneSubtypeClassifier")
Data format: Genes must be in columns, samples are in rows, and there should be a column of sample IDs.
Make sure you have all the expected genes by using the getFeaturesPairList or getFeaturesGeneTable functions.
The function getFeaturesGeneTable() returns the gene ID table used with the
TCGA PanCancer EB++ expression training data (hg19). Use this function to subset your data with
columns Symbol, Entrez, and Ensembl.
The function getFeaturesPairList() returns a named list, C1-C6, where feature pairs
are defined. Pairs are (1,2), (3,4), (5,6), etc. for each cluster label.
The main function callSubtypes() handles gene matching, data transformation, and prediction.
library(ImmuneSubtypeClassifier)
# Get the list of feature-pair genes used for the model.
# ...there's model_genes_list$model_genes and model_genes_list$gene_map
model_genes_list <- modelGenes(model_path='../models/immune_optimized_99_pairs.rds')
# Confirm the gene_map contains all genes.
length(model_genes$model_genes) == nrow(model_genes$gene_map)
# Call the subtypes, can also pass in a data.frame / tibble etc.
result <- callSubtypes(X_or_path = '../data/gene_expression_rsem_tpm.csv.gz',
model = NULL, # genes in columns and samples in rows.
model_path = '../models/immune_optimized_99_pairs.rds',
geneid = "symbol", # how are the gene IDs encoded
sampleid = 'Barcode', # column name with sample IDs
labelid= 'Label') # column name with sample Labels (optional)
# get the trained robencla model.
model <- result$Model
# get the results table, $BestCall is the predicted subtype
pred <- result$Pred
# confusion matrix
table(pred$BestCall, pred$Label)
# get class prediction metrics.
model$classification_metrics(labels = pred$Label, calls = pred$BestCall)
The output is a data frame containing:
- SampleIDs: The sample identifiers.
- BestCall: The predicted immune subtype (1–6).
- 1–6: Probability scores for each subtype.
| SampleIDs | BestCall | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Sample_01 | 3 | 0.02 | 0.10 | 0.85 | 0.01 | 0.01 | 0.01 |
| Sample_02 | 1 | 0.78 | 0.05 | 0.10 | 0.02 | 0.04 | 0.01 |
If you have a labeled training set and wish to rebuild the model (e.g., after updating XGBoost or changing feature pairs), you can use build_robencla_classifier. I can also send you the PanCancer training set, already formatted.
- CSV file with samples as rows and genes as columns.
- Must contain a Label column (e.g., "C1", "C2") and a Sample ID column.
library(ImmuneSubtypeClassifier)
# Conservative parameters, see robencla & xgboost for parameters
conservative_params <- list(
max_depth = 8,
eta = 0.2,
nrounds = 64,
early_stopping_rounds = 4,
gamma = 0.3,
lambda = 1.9,
alpha = 0.3,
ensemble_size = 11,
sample_prop = 0.8,
feature_prop = 0.8,
subsample = 0.8
)
# then load up a list of gene pairs...
pair_list <- readRDS('../models/pair_list_stratified.rds')
# if there's a gene you want to try removing
this_set <- editPairList(pair_list, 'IGJ')
# Now use the cleaned pair_list
result <- build_robencla_classifier(
data_path='../data/training_expression_data.csv.gz',
test_path='../data/testing_expression_data.csv.gz',
output_path = '../models/immune_optimized_99_pairs.rds',
pair_list = this_set,
sig_list = NULL,
param_list = conservative_params,
data_mode = c("namedpairs"),
train_fraction = NULL, # used if test_path is NULL
seed = 412,
sample_id = "Barcode" # Specify the sample ID column
)
The classifier assigns one of six immune subtypes:
- C1 (Wound Healing): High proliferation, angiogenic gene expression.
- C2 (IFN-gamma Dominant): High M1/M2 macrophage polarization, strong CD8 signal.
- C3 (Inflammatory): Elevated Th17 and Th1 genes, low tumor cell proliferation.
- C4 (Lymphocyte Depleted): Macrophage sequestration, Th2 shift.
- C5 (Immunologically Quiet): Low lymphocyte and macrophage responses.
- C6 (TGF-beta Dominant): High TGF-beta signature, lymphocytic infiltration.
This means you are trying to load the model with a newer version of XGBoost (>= 2.0) than was used to train it.
Fix: Downgrade XGBoost to 1.7.8.1 or re-train the model using build_robencla_classifier.
This project is licensed under the terms of the MIT license.