Skip to content

Unified Wrapper#20

Merged
Claptar merged 36 commits intomainfrom
dev
Feb 18, 2026
Merged

Unified Wrapper#20
Claptar merged 36 commits intomainfrom
dev

Conversation

@Claptar
Copy link
Copy Markdown
Contributor

@Claptar Claptar commented Feb 18, 2026

STARsolo CLI v4.0 - Unified Wrapper

Major Changes

Unified CLI Architecture

  • Refactored 6 duplicate platform scripts into single bin/starsolo command
  • Modular design: lib/common.sh + platform-specific modules (lib/platform_*.sh)
  • Removed all Singularity dependencies - tools called directly from PATH

Platforms Supported

  • 10x Genomics (auto-detects chemistry v1-v4/multiome, strand, PE/SE)
  • Smart-seq2 (manifest-based)
  • Drop-seq, BD Rhapsody, inDrops, STRT-seq

Configuration System

  • Centralized defaults in etc/defaults.conf
  • Override via environment variables (STARSOLO_*) or CLI flags
  • Species-based reference resolution

Bug Fixes

  • Fixed path resolution: Convert REF and whitelist paths to absolute before cd
  • Fixed subshell exit code propagation

CI/CD & Testing

  • GitHub Actions workflow for multi-platform testing
  • Docker image build/push on tag creation (quay.io/cellgeni/starsolo)
  • Synthetic test data generator for all platforms

Developer Experience

  • install.sh for easy setup
  • Comprehensive help system (starsolo --help, starsolo <platform> --help)
  • Legacy scripts archived in scripts/legacy/

Breaking Changes

  • Command syntax changed from individual scripts to unified CLI
  • Before: starsolo_10x_auto.sh <args>
  • After: starsolo 10x <args>

Claptar and others added 30 commits June 5, 2025 16:06
- Enhanced the find_fastq_files function to improve clarity and maintainability.
- Updated check_compression to streamline the detection of file compression.
- Improved extract_test_reads for better handling of read extraction.
- Refined determine_barcode_whitelist and check_read_lengths for clearer logic and variable management.
- Enhanced determine_strand_specificity to better handle strand determination and output.
- Updated write_config_file to improve logging of configuration parameters.
- Refactored run_starsolo and process_output_files for better structure and readability.
- Improved main function for clearer argument parsing and workflow orchestration.
…script

- Introduced new scripts for STARsolo processing: starsolo_bd_rhapsody.sh, starsolo_dropseq.sh, starsolo_indrops.sh, starsolo_ss2.sh, starsolo_strt.sh.
- Each script is designed to handle specific sequencing protocols with appropriate parameters and configurations.
- Added a standalone QC script (solo_qc.sh) to aggregate and check quality control statistics across STARsolo runs.
- Ensured compatibility with the latest STAR version and included necessary error handling and usage instructions.
@Claptar Claptar self-assigned this Feb 18, 2026
@Claptar Claptar marked this pull request as ready for review February 18, 2026 17:26
Copilot AI review requested due to automatic review settings February 18, 2026 17:26
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d23076a788

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +30 to +31
cat *.R1_head | seqtk sample -s100 - 200000 > test.R1.fastq &
cat *.R2_head | seqtk sample -s100 - 200000 > test.R2.fastq &
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Scope 10x probe files to each sample run

These chemistry-detection intermediates are written with global names (test.R1.fastq, test.R2.fastq) in the current working directory, while run_10x only switches into the sample directory later. When two 10x jobs are launched from the same directory (for example via multiple bsub submissions), they can overwrite each other’s probe FASTQs and test outputs, which can flip whitelist/strand detection and lead to incorrect alignment settings for a sample.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This is a major architectural refactoring that consolidates 6 platform-specific scripts into a unified bin/starsolo CLI with modular design. The PR successfully removes Singularity dependencies, implements a centralized configuration system, and adds comprehensive CI/CD testing with Docker support.

Changes:

  • Unified CLI replacing 6 duplicate scripts with a single bin/starsolo command and modular platform libraries
  • Removed Singularity dependencies - all tools now called directly from PATH
  • Added centralized configuration system (etc/defaults.conf) with environment variable overrides
  • Implemented GitHub Actions workflows for multi-platform testing and Docker image building
  • Fixed critical path resolution bugs by converting paths to absolute before cd operations

Reviewed changes

Copilot reviewed 19 out of 30 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
bin/starsolo Main CLI entrypoint with platform dispatcher and argument parser
lib/common.sh Shared functions for FASTQ discovery, compression detection, and output processing
lib/platform_10x.sh 10x Genomics implementation with auto-detection of chemistry/strand/PE mode
lib/platform_smartseq.sh Smart-seq2 manifest-based processing
lib/platform_dropseq.sh Drop-seq platform without whitelist
lib/platform_rhapsody.sh BD Rhapsody with 3-segment barcodes
lib/platform_indrops.sh inDrops with adapter sequence support
lib/platform_strt.sh STRT-seq with 96-barcode whitelist
etc/defaults.conf Centralized default configuration
scripts/solo_qc.sh QC aggregation across multiple runs
scripts/generate_test_data.py Synthetic test data generator for all platforms
install.sh Installation script with symlink creation
Dockerfile Updated with tool versions and unified CLI installation
.github/workflows/test.yml Multi-platform testing workflow
.github/workflows/quay-on-tag.yml Docker image build and push on tag
README.md Comprehensive documentation update
scripts/starsolo_10x_auto.sh Removed (archived to scripts/legacy/)
scripts/legacy/*.sh Archived legacy scripts for reference

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

local CMD=${7:-""}

local FQDIR_ABS
FQDIR_ABS=$(readlink -f "$FQDIR")
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The archived legacy script converts FQDIR to absolute but then immediately changes directory without first converting REF, WL, and BC paths to absolute. This will cause path resolution issues if relative paths are used. However, since this is in the legacy/archived script and the new modular implementation correctly handles this (lines 219-221 in lib/platform_10x.sh convert paths before cd on line 224), this is only an issue if users try to run the legacy script directly.

Suggested change
FQDIR_ABS=$(readlink -f "$FQDIR")
FQDIR_ABS=$(readlink -f "$FQDIR")
REF=$(readlink -f "$REF")
WL=$(readlink -f "$WL")

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why modify legacy

@@ -3,14 +3,14 @@ FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive

ARG star_version=2.7.10a_alpha_220818
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The samtools version was upgraded from 1.15.1 to 1.21. While this is a minor version upgrade within the 1.x series, it's worth noting in case there are any behavioral changes. Consider documenting this version change in the PR description or release notes, especially since the README states "Tested version: 1.15.1".

Suggested change
ARG star_version=2.7.10a_alpha_220818
ARG star_version=2.7.10a_alpha_220818
# NOTE: samtools was previously pinned to version 1.15.1 (see README: "Tested version: 1.15.1").
# This Dockerfile intentionally uses a newer minor version (1.21). Update documentation as needed
# if behavior differences are observed between these versions.

Copilot uses AI. Check for mistakes.
--soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx \
--outReadsUnmapped Fastx

rm -f "$NEW_MANIFEST"
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After changing directory to "$TAG" on line 33, the attempt to remove "$NEW_MANIFEST" on line 52 will fail because NEW_MANIFEST is set to "$TAG/manifest.tsv" (line 26), making the removal attempt target "$TAG/$TAG/manifest.tsv" which doesn't exist. The file should be removed as "manifest.tsv" (without the $TAG prefix) since we're already inside the $TAG directory, or it should be removed before the cd command.

Suggested change
rm -f "$NEW_MANIFEST"
rm -f "$TSV"

Copilot uses AI. Check for mistakes.
# Index BAM if it exists
if [[ -s "Aligned.sortedByCoord.out.bam" ]]; then
log_info "Indexing BAM file …"
samtools index -@16 Aligned.sortedByCoord.out.bam
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The samtools index command uses a hardcoded thread count of 16 (-@16), but this should ideally use the configurable CPUS value to be consistent with the STAR alignment. While samtools indexing is typically fast and this hardcoding may be intentional for simplicity, consider making it configurable or documenting why it's hardcoded.

Suggested change
samtools index -@16 Aligned.sortedByCoord.out.bam
samtools index -@${CPUS:-16} Aligned.sortedByCoord.out.bam

Copilot uses AI. Check for mistakes.
# Install this repository
COPY . /opt/STARsolo
RUN cd /opt/STARsolo &&\
rm data/test.tar.gz &&\
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Dockerfile attempts to remove data/test.tar.gz before extracting data/whitelists.tar.gz, but there's no evidence that data/test.tar.gz exists in the repository being copied. If this file doesn't exist, the rm command will fail and stop the build due to set -e behavior in Docker RUN commands. Consider using rm -f to ignore non-existent files, or verify that this file should exist.

Suggested change
rm data/test.tar.gz &&\
rm -f data/test.tar.gz &&\

Copilot uses AI. Check for mistakes.
Comment on lines 16 to +27
@@ -20,10 +20,11 @@ RUN wget --no-check-certificate https://github.com/alexdobin/STAR/archive/${star
cd / && rm ${star_version}.tar.gz

#Install seqtk
RUN git clone https://github.com/lh3/seqtk.git && \
mv seqtk /opt && \
cd /opt/seqtk && \
make
RUN wget --no-check-certificate https://github.com/lh3/seqtk/archive/refs/tags/v${seqtk_version}.tar.gz && \
tar -xzf v${seqtk_version}.tar.gz -C /opt && \
cd /opt/seqtk-${seqtk_version} && \
make && \
cd / && rm v${seqtk_version}.tar.gz
Copy link

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wget --no-check-certificate in the Dockerfile disables TLS certificate validation when downloading STAR and seqtk sources, so a network attacker (e.g. on a compromised proxy or mirror) can silently replace these tarballs with malicious code that will be compiled into the image. Because this image is then used to process real datasets, this becomes a supply-chain compromise vector for arbitrary code execution in all environments that build or run it. Use HTTPS with certificate validation and add an integrity check (pinned checksum or signature verification) for each downloaded artifact instead of disabling certificate checks.

Copilot uses AI. Check for mistakes.
@Claptar Claptar merged commit 882a73b into main Feb 18, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants