Conversation
- Enhanced the find_fastq_files function to improve clarity and maintainability. - Updated check_compression to streamline the detection of file compression. - Improved extract_test_reads for better handling of read extraction. - Refined determine_barcode_whitelist and check_read_lengths for clearer logic and variable management. - Enhanced determine_strand_specificity to better handle strand determination and output. - Updated write_config_file to improve logging of configuration parameters. - Refactored run_starsolo and process_output_files for better structure and readability. - Improved main function for clearer argument parsing and workflow orchestration.
…script - Introduced new scripts for STARsolo processing: starsolo_bd_rhapsody.sh, starsolo_dropseq.sh, starsolo_indrops.sh, starsolo_ss2.sh, starsolo_strt.sh. - Each script is designed to handle specific sequencing protocols with appropriate parameters and configurations. - Added a standalone QC script (solo_qc.sh) to aggregate and check quality control statistics across STARsolo runs. - Ensured compatibility with the latest STAR version and included necessary error handling and usage instructions.
…emove fails-fast option
…before writing to output directory
…edundant variables
…treamline installation process
…mproved performance
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d23076a788
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cat *.R1_head | seqtk sample -s100 - 200000 > test.R1.fastq & | ||
| cat *.R2_head | seqtk sample -s100 - 200000 > test.R2.fastq & |
There was a problem hiding this comment.
Scope 10x probe files to each sample run
These chemistry-detection intermediates are written with global names (test.R1.fastq, test.R2.fastq) in the current working directory, while run_10x only switches into the sample directory later. When two 10x jobs are launched from the same directory (for example via multiple bsub submissions), they can overwrite each other’s probe FASTQs and test outputs, which can flip whitelist/strand detection and lead to incorrect alignment settings for a sample.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This is a major architectural refactoring that consolidates 6 platform-specific scripts into a unified bin/starsolo CLI with modular design. The PR successfully removes Singularity dependencies, implements a centralized configuration system, and adds comprehensive CI/CD testing with Docker support.
Changes:
- Unified CLI replacing 6 duplicate scripts with a single
bin/starsolocommand and modular platform libraries - Removed Singularity dependencies - all tools now called directly from PATH
- Added centralized configuration system (
etc/defaults.conf) with environment variable overrides - Implemented GitHub Actions workflows for multi-platform testing and Docker image building
- Fixed critical path resolution bugs by converting paths to absolute before
cdoperations
Reviewed changes
Copilot reviewed 19 out of 30 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| bin/starsolo | Main CLI entrypoint with platform dispatcher and argument parser |
| lib/common.sh | Shared functions for FASTQ discovery, compression detection, and output processing |
| lib/platform_10x.sh | 10x Genomics implementation with auto-detection of chemistry/strand/PE mode |
| lib/platform_smartseq.sh | Smart-seq2 manifest-based processing |
| lib/platform_dropseq.sh | Drop-seq platform without whitelist |
| lib/platform_rhapsody.sh | BD Rhapsody with 3-segment barcodes |
| lib/platform_indrops.sh | inDrops with adapter sequence support |
| lib/platform_strt.sh | STRT-seq with 96-barcode whitelist |
| etc/defaults.conf | Centralized default configuration |
| scripts/solo_qc.sh | QC aggregation across multiple runs |
| scripts/generate_test_data.py | Synthetic test data generator for all platforms |
| install.sh | Installation script with symlink creation |
| Dockerfile | Updated with tool versions and unified CLI installation |
| .github/workflows/test.yml | Multi-platform testing workflow |
| .github/workflows/quay-on-tag.yml | Docker image build and push on tag |
| README.md | Comprehensive documentation update |
| scripts/starsolo_10x_auto.sh | Removed (archived to scripts/legacy/) |
| scripts/legacy/*.sh | Archived legacy scripts for reference |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| local CMD=${7:-""} | ||
|
|
||
| local FQDIR_ABS | ||
| FQDIR_ABS=$(readlink -f "$FQDIR") |
There was a problem hiding this comment.
The archived legacy script converts FQDIR to absolute but then immediately changes directory without first converting REF, WL, and BC paths to absolute. This will cause path resolution issues if relative paths are used. However, since this is in the legacy/archived script and the new modular implementation correctly handles this (lines 219-221 in lib/platform_10x.sh convert paths before cd on line 224), this is only an issue if users try to run the legacy script directly.
| FQDIR_ABS=$(readlink -f "$FQDIR") | |
| FQDIR_ABS=$(readlink -f "$FQDIR") | |
| REF=$(readlink -f "$REF") | |
| WL=$(readlink -f "$WL") |
| @@ -3,14 +3,14 @@ FROM ubuntu:20.04 | |||
| ENV DEBIAN_FRONTEND=noninteractive | |||
|
|
|||
| ARG star_version=2.7.10a_alpha_220818 | |||
There was a problem hiding this comment.
The samtools version was upgraded from 1.15.1 to 1.21. While this is a minor version upgrade within the 1.x series, it's worth noting in case there are any behavioral changes. Consider documenting this version change in the PR description or release notes, especially since the README states "Tested version: 1.15.1".
| ARG star_version=2.7.10a_alpha_220818 | |
| ARG star_version=2.7.10a_alpha_220818 | |
| # NOTE: samtools was previously pinned to version 1.15.1 (see README: "Tested version: 1.15.1"). | |
| # This Dockerfile intentionally uses a newer minor version (1.21). Update documentation as needed | |
| # if behavior differences are observed between these versions. |
| --soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx \ | ||
| --outReadsUnmapped Fastx | ||
|
|
||
| rm -f "$NEW_MANIFEST" |
There was a problem hiding this comment.
After changing directory to "$TAG" on line 33, the attempt to remove "$NEW_MANIFEST" on line 52 will fail because NEW_MANIFEST is set to "$TAG/manifest.tsv" (line 26), making the removal attempt target "$TAG/$TAG/manifest.tsv" which doesn't exist. The file should be removed as "manifest.tsv" (without the $TAG prefix) since we're already inside the $TAG directory, or it should be removed before the cd command.
| rm -f "$NEW_MANIFEST" | |
| rm -f "$TSV" |
| # Index BAM if it exists | ||
| if [[ -s "Aligned.sortedByCoord.out.bam" ]]; then | ||
| log_info "Indexing BAM file …" | ||
| samtools index -@16 Aligned.sortedByCoord.out.bam |
There was a problem hiding this comment.
The samtools index command uses a hardcoded thread count of 16 (-@16), but this should ideally use the configurable CPUS value to be consistent with the STAR alignment. While samtools indexing is typically fast and this hardcoding may be intentional for simplicity, consider making it configurable or documenting why it's hardcoded.
| samtools index -@16 Aligned.sortedByCoord.out.bam | |
| samtools index -@${CPUS:-16} Aligned.sortedByCoord.out.bam |
| # Install this repository | ||
| COPY . /opt/STARsolo | ||
| RUN cd /opt/STARsolo &&\ | ||
| rm data/test.tar.gz &&\ |
There was a problem hiding this comment.
The Dockerfile attempts to remove data/test.tar.gz before extracting data/whitelists.tar.gz, but there's no evidence that data/test.tar.gz exists in the repository being copied. If this file doesn't exist, the rm command will fail and stop the build due to set -e behavior in Docker RUN commands. Consider using rm -f to ignore non-existent files, or verify that this file should exist.
| rm data/test.tar.gz &&\ | |
| rm -f data/test.tar.gz &&\ |
| @@ -20,10 +20,11 @@ RUN wget --no-check-certificate https://github.com/alexdobin/STAR/archive/${star | |||
| cd / && rm ${star_version}.tar.gz | |||
|
|
|||
| #Install seqtk | |||
| RUN git clone https://github.com/lh3/seqtk.git && \ | |||
| mv seqtk /opt && \ | |||
| cd /opt/seqtk && \ | |||
| make | |||
| RUN wget --no-check-certificate https://github.com/lh3/seqtk/archive/refs/tags/v${seqtk_version}.tar.gz && \ | |||
| tar -xzf v${seqtk_version}.tar.gz -C /opt && \ | |||
| cd /opt/seqtk-${seqtk_version} && \ | |||
| make && \ | |||
| cd / && rm v${seqtk_version}.tar.gz | |||
There was a problem hiding this comment.
wget --no-check-certificate in the Dockerfile disables TLS certificate validation when downloading STAR and seqtk sources, so a network attacker (e.g. on a compromised proxy or mirror) can silently replace these tarballs with malicious code that will be compiled into the image. Because this image is then used to process real datasets, this becomes a supply-chain compromise vector for arbitrary code execution in all environments that build or run it. Use HTTPS with certificate validation and add an integrity check (pinned checksum or signature verification) for each downloaded artifact instead of disabling certificate checks.
… output directory creation
STARsolo CLI v4.0 - Unified Wrapper
Major Changes
Unified CLI Architecture
bin/starsolocommandlib/common.sh+ platform-specific modules (lib/platform_*.sh)Platforms Supported
Configuration System
etc/defaults.confSTARSOLO_*) or CLI flagsBug Fixes
REFand whitelist paths to absolute beforecdCI/CD & Testing
Developer Experience
install.shfor easy setupstarsolo --help,starsolo <platform> --help)scripts/legacy/Breaking Changes
starsolo_10x_auto.sh <args>starsolo 10x <args>