The following table summarizes how different data types are handled by default. If a method applies additional transformations, that is captured in the next table.
| Dataset type | Baseline normalization/processing |
|---|---|
| Microarray | refine.bio-processed with Single Channel Array Normalization (SCAN); quantile normalization is skipped |
| Bulk RNA-seq | TPM |
| Smart-seq2 scRNA-seq | TPM |
| 10X scRNA-seq | Counts |
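For reference, TPM (transcripts per million) normalizes each gene's counts by its length in kilobases, then scales each sample so the resulting rates sum to one million. Below is a minimal R sketch of that calculation; the function counts_to_tpm, the counts matrix (genes as rows, samples as columns), and the gene_lengths vector are hypothetical names for illustration, not objects from this repository:

```r
# Minimal TPM sketch; counts_to_tpm, counts, and gene_lengths are
# hypothetical names, not objects from this project.
counts_to_tpm <- function(counts, gene_lengths) {
  # Reads per kilobase: normalize each gene (row) by its length in kb
  rpk <- counts / (gene_lengths / 1000)
  # Scale each sample (column) so its rates sum to one million
  t(t(rpk) / colSums(rpk)) * 1e6
}

# Toy example: 3 genes x 2 samples
counts <- matrix(c(10, 20, 5, 8, 16, 4), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("sample", 1:2)))
counts_to_tpm(counts, gene_lengths = c(1500, 2000, 900))
```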
This table summarizes the models used in this work, the packages from which they originate, and any transformations applied to the input gene expression measures.
| Model | Package | Additional transformations (if applicable) |
|---|---|---|
| k-Top Scoring Pairs (kTSP) | multiclassPairs | N/A |
| Random Forest (RF) | multiclassPairs | N/A |
| MM2S (Gendoo and Haibe-Kains, 2016) | MM2S | N/A |
| medulloPackage (Rathi et al., 2020) | medulloPackage | All RNA-seq data is log2-transformed |
| LASSO Logistic Regression | glmnet | Each sample is scaled to sum to 1 |
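To make the transformations in the last two rows concrete, here is a minimal R sketch, assuming a hypothetical expression matrix expr with genes as rows and samples as columns; the +1 pseudocount in the log2 step is an assumption for illustration, not necessarily what medulloPackage expects:

```r
# Hypothetical expression matrix: genes as rows, samples as columns
expr <- matrix(c(5, 0, 12, 7, 3, 9), nrow = 3)

# medulloPackage-style input: log2-transform RNA-seq measures
# (the +1 pseudocount to avoid log2(0) is an assumption here)
expr_log2 <- log2(expr + 1)

# LASSO logistic regression input: scale each sample (column) to sum to 1
expr_scaled <- sweep(expr, 2, colSums(expr), "/")
colSums(expr_scaled)  # each sample now sums to 1
```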
These guidelines are intended to be used by Data Lab members and collaborators.
We expect development to primarily occur within the project Docker container.
We use renv and conda as part of the build process, so please make use of those approaches when updating the Dockerfile (see sections below).
A GitHub Actions workflow builds and pushes the Docker image to the GitHub Container Registry any time the relevant environment files or Dockerfile are updated.
On pull requests that alter relevant files, it also checks that the image can be built.
To pull the most recent copy of the Docker image, use the following command:
```sh
docker pull ghcr.io/alexslemonade/medulloblastoma-classifier:latest
```

To run the container, use the following command from the root of this repository:
```sh
docker run \
  --mount type=bind,target=/home/rstudio/medulloblastoma-classifier,source=$PWD \
  -e PASSWORD={PASSWORD} \
  -p 8787:8787 \
  ghcr.io/alexslemonade/medulloblastoma-classifier:latest
```

Be sure to replace {PASSWORD}, including the curly braces, with a password of your choice.
You can then access RStudio at http://localhost:8787 using the username rstudio and the password you just set.
We manage R package dependencies using renv.
When you install additional packages, please update the lockfile with the following command:
```r
renv::snapshot()
```

When prompted, respond y to save the new packages in your renv.lock file.
Commit the changes to the renv.lock file.
To pin any packages that are not automatically captured in the lockfile, you can load them in the dependencies.R file in the root of the repository.
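As a sketch, an entry in dependencies.R simply loads the package so renv detects it as a dependency; the packages below are named elsewhere in this document and are only illustrative here:

```r
# dependencies.R
# Load packages here so renv::snapshot() detects them as dependencies
# and records them in renv.lock, even if no script loads them directly.
library(glmnet)
library(multiclassPairs)
```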
We use Conda to manage command-line tools and Python packages.
To create and activate the environment, run the following from the root of the repository (requires conda-lock to be installed):
```sh
conda-lock install --name medulloblastoma-classifier conda-lock.yml
conda activate medulloblastoma-classifier
```

To add new packages to the Conda environment, add them to environment.yml, and then update the conda-lock.yml file:

```sh
conda-lock --file environment.yml
```

We use pre-commit to make sure large files or secrets are not committed to the repository. The Conda environment contains pre-commit.
To set up the pre-commit hooks for this project, run the following from the root of the repository:
```sh
pre-commit install
```

If you would like to add additional hooks to use locally (e.g., to style and lint R files), you can do so by creating and using a .pre-commit-local.yaml file like so:
```sh
# make and activate a local pre-commit configuration
cp .pre-commit-config.yaml .pre-commit-local.yaml
pre-commit install --config .pre-commit-local.yaml
```

.pre-commit-local.yaml is ignored by Git, so you can modify that file without affecting other contributors.
We use an S3 bucket (s3://data-lab-mb-ssp) with versioning enabled to manage the files in the following directories:
- data
- models
- processed_data
- plots/data
- results

All of these directories are listed in the .gitignore file.
To push files to S3, use the following command from the root of the repository:
```sh
aws s3 sync {directory} s3://data-lab-mb-ssp/{directory}
```

Where {directory} should be one of: data, models, processed_data, plots/data, or results.
To pull files locally, use the following command from the root of the repository:
```sh
aws s3 sync s3://data-lab-mb-ssp/{directory} {directory}
```

A non-exhaustive list of aws s3 sync flags that may be useful:
- --delete: Delete files that exist in the destination but not in the source.
- --dryrun: Display the operations that would be performed without actually running them.
- --profile: Use a specific profile from your credential file.
- --exclude: Exclude objects or files that match this pattern.
- --include: Don't exclude objects or files that match this pattern.