The following table summarizes how different data types are handled by default. If a method applies additional transformations, that is captured in the next table.
| Dataset type | Baseline normalization/processing |
|---|---|
| Microarray | refine.bio-processed with Single Channel Array Normalization (SCAN); quantile normalization is skipped |
| Bulk RNA-seq | TPM |
| Smart-seq2 scRNA-seq | TPM |
| 10X scRNA-seq | Counts |
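For reference, TPM (transcripts per million) normalizes each gene's counts by its length in kilobases, then scales each sample so the resulting rates sum to one million. Below is a minimal R sketch of that calculation; the function counts_to_tpm, the counts matrix (genes as rows, samples as columns), and the gene_lengths vector are hypothetical names for illustration, not objects from this repository:

```r
# Minimal TPM sketch; counts_to_tpm, counts, and gene_lengths are
# hypothetical names, not objects from this project.
counts_to_tpm <- function(counts, gene_lengths) {
  # Reads per kilobase: normalize each gene (row) by its length in kb
  rpk <- counts / (gene_lengths / 1000)
  # Scale each sample (column) so its rates sum to one million
  t(t(rpk) / colSums(rpk)) * 1e6
}

# Toy example: 3 genes x 2 samples
counts <- matrix(c(10, 20, 5, 8, 16, 4), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("sample", 1:2)))
counts_to_tpm(counts, gene_lengths = c(1500, 2000, 900))
```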
This table summarizes the models used in this work, the packages from which they originate, and any transformations applied to the input gene expression measures.
| Model | Package | Additional transformations (if applicable) |
|---|---|---|
| k-Top Scoring Pairs (kTSP) | multiclassPairs | N/A |
| Random Forest (RF) | multiclassPairs | N/A |
| MM2S (Gendoo and Haibe-Kains, 2016) | MM2S | N/A |
| medulloPackage (Rathi et al., 2020) | medulloPackage | All RNA-seq data is log2-transformed |
| LASSO Logistic Regression | glmnet | Each sample is scaled to sum to 1 |
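To make the transformations in the last two rows concrete, here is a minimal R sketch, assuming a hypothetical expression matrix expr with genes as rows and samples as columns; the +1 pseudocount in the log2 step is an assumption for illustration, not necessarily what medulloPackage expects:

```r
# Hypothetical expression matrix: genes as rows, samples as columns
expr <- matrix(c(5, 0, 12, 7, 3, 9), nrow = 3)

# medulloPackage-style input: log2-transform RNA-seq measures
# (the +1 pseudocount to avoid log2(0) is an assumption here)
expr_log2 <- log2(expr + 1)

# LASSO logistic regression input: scale each sample (column) to sum to 1
expr_scaled <- sweep(expr, 2, colSums(expr), "/")
colSums(expr_scaled)  # each sample now sums to 1
```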
These guidelines are intended to be used by Data Lab members and collaborators.
We expect development to primarily occur within the project Docker container.
We use renv and conda as part of the build process, so please make use of those approaches when updating the Dockerfile (see sections below).
A GitHub Actions workflow builds and pushes the Docker image to the GitHub Container Registry any time the relevant environment files or Dockerfile are updated.
On pull requests that alter relevant files, it also checks that the image can be built.
To pull the most recent copy of the Docker image, use the following command:
```sh
docker pull ghcr.io/alexslemonade/medulloblastoma-classifier:latest
```

To run the container, use the following command from the root of this repository:
```sh
docker run \
  --mount type=bind,target=/home/rstudio/medulloblastoma-classifier,source=$PWD \
  -e PASSWORD={PASSWORD} \
  -p 8787:8787 \
  ghcr.io/alexslemonade/medulloblastoma-classifier:latest
```

Be sure to replace {PASSWORD}, including the curly braces, with a password of your choice.
You can then access RStudio at http://localhost:8787 using the username rstudio and the password you just set.
We manage R package dependencies using renv.
When you install additional packages, please update the lockfile with the following command:
```r
renv::snapshot()
```

When prompted, respond y to save the new packages in your renv.lock file.
Commit the changes to the renv.lock file.
To pin any packages that are not automatically captured in the lockfile, you can load them in the dependencies.R file in the root of the repository.
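As a sketch, an entry in dependencies.R simply loads the package so renv detects it as a dependency; the packages below are named elsewhere in this document and are only illustrative here:

```r
# dependencies.R
# Load packages here so renv::snapshot() detects them as dependencies
# and records them in renv.lock, even if no script loads them directly.
library(glmnet)
library(multiclassPairs)
```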
We use Conda to manage command-line tools and Python packages.
To create and activate the environment, run the following from the root of the repository (requires conda-lock to be installed):
```sh
conda-lock install --name medulloblastoma-classifier conda-lock.yml
conda activate medulloblastoma-classifier
```

To add new packages to the Conda environment, add them to environment.yml, and then update the conda-lock.yml file:

```sh
conda-lock --file environment.yml
```

We use pre-commit to make sure large files or secrets are not committed to the repository. The Conda environment contains pre-commit.
To set up the pre-commit hooks for this project, run the following from the root of the repository:
```sh
pre-commit install
```

If you would like to add additional hooks to use locally (e.g., to style and lint R files), you can do so by creating and using a .pre-commit-local.yaml file like so:
```sh
# make and activate a local pre-commit configuration
cp .pre-commit-config.yaml .pre-commit-local.yaml
pre-commit install --config .pre-commit-local.yaml
```

.pre-commit-local.yaml is ignored by Git, so you can modify that file without affecting other contributors.
We use an S3 bucket (s3://data-lab-mb-ssp) with versioning enabled to manage the files in the following directories:
- data
- models
- processed_data
- plots/data
- results

All of these directories are listed in the .gitignore file.
To push files to S3, use the following command from the root of the repository:
```sh
aws s3 sync {directory} s3://data-lab-mb-ssp/{directory}
```

Where {directory} should be one of: data, models, processed_data, plots/data, or results.
To pull files locally, use the following command from the root of the repository:
```sh
aws s3 sync s3://data-lab-mb-ssp/{directory} {directory}
```

A non-exhaustive list of aws s3 sync flags that may be useful:
- --delete: Delete files that exist in the destination but not in the source.
- --dryrun: Display the operations that would be performed without actually running them.
- --profile: Use a specific profile from your credential file.
- --exclude: Exclude objects or files that match this pattern.
- --include: Don't exclude objects or files that match this pattern.