Brain Region Classifier

Multiclass classification of human brain regions from bulk RNA-seq transcriptomic data. Four machine learning models (KNN, SVM, Decision Tree, MLP) are benchmarked across multiple normalization and feature-selection pipelines to identify the most robust approach for mapping gene expression profiles to one of four macro-regions: Forebrain, Basal Ganglia, Midbrain, and Hindbrain.

Motivation

Accurate identification of brain regional identity from gene expression is critical for validating brain organoid models and studying neurological disease. Traditional approaches rely on a handful of marker genes, which fail to capture the full transcriptomic landscape. This project explores whether standard ML classifiers, trained on genome-wide expression data from 25 sub-regions across 265 adult brain samples, can reliably distinguish macro-regions, a necessary step toward automated organoid characterization.

Dataset

The data originates from bulk RNA-seq profiling of 25 anatomically distinct human brain sub-regions (Dong et al., 2022). Each of the 265 samples is labeled with one of four macro-regions based on neuroanatomical grouping:

Macro-Region	Sub-Regions
Forebrain	PVC, PAC, EC, PMC, VLPFC, PSC, OFC, PVPC, AMY, DLPFC, PSTC, HIPP, ITC, ACC, INS
Basal Ganglia	NAC, CN, GP
Midbrain	ARC, MDT, RMTG, VTA, DRN, HAB
Hindbrain	CRBLM

Preprocessing Pipelines

Eight dataset variants are generated from the raw counts to evaluate the impact of normalization and dimensionality reduction:

Dataset	Description
`data_raw`	Raw read counts
`data_raw_filtered`	Filtered raw counts (low-expression genes removed)
`data_raw_filtered_FS`	Filtered + variance-based feature selection
`data_logCPM`	Log-CPM normalized
`data_logCPM_centered`	Log-CPM, mean-centered
`data_logCPM_zscore`	Log-CPM, z-score standardized
`data_logCPM_FS`	Log-CPM + feature selection
`data_logCPM_centered_FS`	Log-CPM centered + feature selection

Feature selection uses VarianceThreshold, retaining features that explain >20% of the maximum observed variance (threshold adapted per normalization scheme due to differing value ranges).

Methods

All classifiers are evaluated with 10-fold stratified shuffle split (70/30 train/test) and scored on weighted accuracy, precision, recall, and F1-score.

Model	Key Hyperparameters
K-Nearest Neighbors	k = 1-30 (grid search), uniform weights, Euclidean distance
Support Vector Machine	RBF kernel, C in {0.1, 1, 10, 100}, gamma in {0.01, 0.001, 0.0001, 0.00001}
Decision Tree	criterion in {gini, entropy}, max_depth = 3, min_impurity_decrease in {0.001, 0.0005, 0.0025}
Multi-Layer Perceptron	ReLU activation, Adam solver, lr = 0.01, 500 max iterations

Hyperparameter ranking is performed by averaging test-set metrics across all eight datasets and ranking parameter sets per dataset.

Project Structure

.
├── data_preparation.py          # Raw data -> structured CSVs with normalization & feature selection
├── models/
│   ├── knn.py                   # KNN evaluation across k=1..30
│   ├── knn_compare_k.py         # Visualize optimal k per dataset
│   ├── svm.py                   # SVM evaluation (best hyperparams)
│   ├── svm_rank.py              # SVM hyperparameter ranking
│   ├── decision_tree.py         # Decision Tree evaluation (best hyperparams)
│   ├── decision_tree_rank.py    # DT hyperparameter ranking
│   └── mlp.py                   # MLP evaluation
├── utils/
│   ├── evaluation.py            # Cross-validation, metrics, shared dataset runner
│   ├── plotting.py              # Matplotlib theme, bar charts, confusion matrices
│   └── mlblabs.mplstyle         # Custom matplotlib stylesheet
├── requirements.txt
└── README.md

Setup

# Clone the repository
git clone https://github.com/laurabquintas/Brain-Region-Classifier.git
cd Brain-Region-Classifier

# Install dependencies
pip install -r requirements.txt

# Install the custom matplotlib style
cp utils/mlblabs.mplstyle "$(python -c 'import matplotlib; print(matplotlib.get_data_path())')/stylelib/"

Data

The processed datasets are available on Google Drive. Download and place the CSV files in a data/ directory at the project root.

To regenerate from the original RNA-seq files (source):

python data_preparation.py

Usage

Run any classifier against all dataset variants:

python -m models.knn           # K-Nearest Neighbors
python -m models.svm           # Support Vector Machine
python -m models.decision_tree # Decision Tree
python -m models.mlp           # Multi-Layer Perceptron

Run hyperparameter analysis:

python -m models.knn_compare_k      # Compare k values for KNN
python -m models.decision_tree_rank  # Rank DT hyperparameter sets
python -m models.svm_rank            # Rank SVM hyperparameter sets

References

Zheng, H., Feng, Y., Tang, J., & Ma, S. (2022). Interfacing brain organoids with precision medicine and machine learning. Cell Reports Physical Science, 3(7), 100974. doi:10.1016/j.xcrp.2022.100974
Tanaka, Y., Cakir, B., Xiang, Y., Sullivan, G.J., & Park, I.H. (2020). Synthetic Analyses of Single-Cell Transcriptomes from Multiple Brain Organoids and Fetal Brain. Cell Reports, 30(6), 1682-1689.e3. doi:10.1016/j.celrep.2020.01.038
Dong, P., Bendl, J., Misir, R., et al. (2022). Transcriptome and chromatin accessibility landscapes across 25 distinct human brain regions expand the susceptibility gene set for neuropsychiatric disorders. bioRxiv. doi:10.1101/2022.09.02.506419

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Brain Region Classifier

Motivation

Dataset

Preprocessing Pipelines

Methods

Project Structure

Setup

Data

Usage

References

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
models		models
utils		utils
.gitignore		.gitignore
README.md		README.md
data_preparation.py		data_preparation.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Brain Region Classifier

Motivation

Dataset

Preprocessing Pipelines

Methods

Project Structure

Setup

Data

Usage

References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages