
Add configuration system, documentation, VAMB compatibility layer, and test suite#9

Draft
Copilot wants to merge 8 commits into main from copilot/improve-pipeline-documentation

Conversation

Contributor

Copilot AI commented Jan 28, 2026

Pipeline Improvements Complete ✅

This PR implements comprehensive improvements to the atavide_lite pipeline to make it easier to run, debug, and reproduce across HPC systems, without redesigning the architecture.

Phase 1: Documentation + Configuration ✅

  • Create config/paths.env.example - Comprehensive configuration template
  • Create config/samples.tsv.example - Sample sheet examples
  • Create docs/directory_contract.md - Input/output docs for each step
  • Create docs/known_good_versions.md - Software version tracking
  • Create docs/compat.md - Compatibility issue documentation
  • Create docs/dev_notes.md - Developer guidance and code standards
  • Update README.md - Quick start guide and documentation links

Phase 2: Shared Helper Library ✅

  • Create lib/common.sh - Reusable Bash helper functions
  • Create lib/README.md - Usage documentation and migration guide
  • Update .gitignore - Allow lib/ directory
  • Create pawsey_shortread/fastp_enhanced.slurm - Example using helpers

Phase 3: VAMB Compatibility ✅

  • Enhance bin/vamb_create_fasta_clusters.py - Canonical, version-independent script
  • Deprecate bin/vamb_create_fasta.py - Add warnings and version detection
  • Document VAMB compatibility issues in docs/compat.md

Phase 4: Quality Gates ✅

  • Document shellcheck usage and standards in docs/dev_notes.md

Phase 5: Test Suite ✅

  • Create tests/test_common.sh - Test all lib/common.sh functions (25 tests)
  • Create tests/test_vamb_create_fasta_clusters.py - Test VAMB script (17 tests)
  • Create tests/test_config_files.sh - Validate configuration files (19 tests)
  • Create tests/run_tests.sh - Main test runner
  • Create tests/README.md - Test suite documentation
  • Add .github/workflows/test.yml - GitHub Actions CI

Phase 6: Sync with Main ✅

  • Merged latest changes from main branch
  • Verified all tests still pass after merge

Summary ✅

  • Create IMPROVEMENTS_SUMMARY.md - Complete implementation overview

Test Suite Coverage

Total: 61 test cases across 3 test suites

  1. Bash Helper Functions (25 tests)

    • Logging, validation, file processing
    • HPC system detection
    • Configuration loading
  2. VAMB Script (17 tests)

    • Basic clustering functionality
    • Size filtering, verbose mode
    • Error handling
  3. Configuration Files (19 tests)

    • Syntax validation
    • Required variables
    • Format validation

All tests passing ✅ (verified after merge with main)

Key Benefits

  1. Easier Setup - Configuration templates with clear placeholders
  2. Better Debugging - Directory contract shows expected inputs/outputs
  3. More Reproducible - Version tracking and compatibility docs
  4. Robust Scripts - Shared helper library with fail-fast validation
  5. VAMB Stability - Version-independent canonical script
  6. Clear Standards - Developer guidance for code quality
  7. Automated Testing - Comprehensive test suite with CI
  8. Up to Date - Merged with latest main branch changes
  9. Backward Compatible - All existing scripts continue to work

Files Created (19 new files + 3 modified)

  • 2 config examples
  • 5 documentation files
  • 2 lib helper files + 1 enhanced example script
  • 1 implementation summary
  • 5 test files + 1 CI workflow
  • Plus 3 modified files (README, .gitignore, VAMB scripts)

See IMPROVEMENTS_SUMMARY.md and tests/README.md for complete details.

Original prompt

Copilot Instructions — atavide_lite

You are GitHub Copilot working in the atavide_lite repository. Your goal is to make the pipeline easier to run, easier to debug, and more reproducible across HPC systems, without converting it to Snakemake and without large architectural rewrites.

Guiding principles

  1. Do not redesign the pipeline (no Snakemake/Nextflow rewrite).
  2. Prefer small, composable changes that improve:
    • reproducibility (version/DB pinning)
    • portability (shared config format)
    • clarity (directory contracts / expected outputs)
    • robustness (fail-fast checks, helpful errors)
  3. Keep cluster-specific scripts, but standardize their interface via shared config and consistent conventions.
  4. Make changes that are friendly to HPC usage:
    • Bash + Slurm
    • environment modules and/or conda/venv
    • Apptainer/Singularity optional, not mandatory

Phase 1 — Documentation + “front door” config (highest priority)

A) Add a single canonical configuration format

Create a new directory:

  • config/
    • paths.env.example — shell KEY=VALUE entries that can be source’d
    • samples.tsv.example — example samplesheet for paired and single-end runs

Requirements for paths.env.example:

  • must include placeholders for:
    • scratch/work directory root
    • input fastq directory (or allow per-sample absolute paths in samples.tsv)
    • host reference (if used)
    • mmseqs databases (UniRef or other)
    • BV-BRC mapping files/snapshots (if used)
    • threads, memory defaults, tmp directory
  • include comments explaining each variable
  • do not include real paths; use obvious placeholders
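
A minimal sketch of a template that would satisfy these requirements. Every variable name and placeholder value below is an assumption made for illustration; the config/paths.env.example added in this PR may use different names.

```shell
# config/paths.env.example (illustrative sketch; all variable names are hypothetical)
# Copy to config/paths.env and replace every placeholder before running.

# Root of the scratch/work directory for intermediate files
WORK_DIR="/path/to/scratch/atavide_work"

# Directory containing input FASTQ files
# (may be unused when samples.tsv lists absolute paths)
FASTQ_DIR="/path/to/fastq"

# Host reference for host-read removal (leave empty if unused)
HOST_REF=""

# MMseqs2 database (UniRef or other)
MMSEQS_DB="/path/to/mmseqs/db"

# BV-BRC mapping files/snapshot (leave empty if unused)
BVBRC_DIR=""

# Resource defaults
THREADS=8
MEMORY_GB=32
TMP_ROOT="/path/to/tmp"
```

Because the file contains only plain assignments and comments, it can be read with `source` from any Bash script without side effects.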

Requirements for samples.tsv.example:

  • tab-delimited
  • include a header row
  • include both examples:
    • paired-end: sample_id, r1, r2
    • single-end: sample_id, r1
  • allow absolute paths
  • include optional columns (can be empty): host_ref, group, notes
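
As a rough illustration, the sketch below writes a small sample sheet in this layout and classifies each row as paired or single-end by whether r2 is empty. The sample IDs, paths, and the `describe_samples` helper are hypothetical and not taken from the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write an illustrative sample sheet; IDs and paths are placeholders.
samples_tsv="$(mktemp)"
trap 'rm -f "$samples_tsv"' EXIT
printf 'sample_id\tr1\tr2\thost_ref\tgroup\tnotes\n' > "$samples_tsv"
printf 'sampleA\t/path/to/sampleA_R1.fastq.gz\t/path/to/sampleA_R2.fastq.gz\t\tgut\tpaired-end\n' >> "$samples_tsv"
printf 'sampleB\t/path/to/sampleB.fastq.gz\t\t\tsoil\tsingle-end\n' >> "$samples_tsv"

# Print "<sample_id><TAB><layout>" for each data row.
# A row counts as paired-end when its third column (r2) is non-empty.
describe_samples() {
    awk -F'\t' 'NR > 1 && $1 != "" {
        layout = ($3 == "") ? "single" : "paired"
        print $1 "\t" layout
    }' "$1"
}

describe_samples "$samples_tsv"
```

The sketch uses awk rather than `IFS=$'\t' read`, because Bash treats tab as IFS whitespace and collapses consecutive tabs, which would shift columns whenever an optional field is empty.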

B) Add “directory contract” documentation

Create docs/directory_contract.md documenting, for each step:

  • required inputs
  • outputs (filenames + directory layout)
  • how to validate success (e.g., files exist + non-empty + key metrics file)
  • typical resource notes (threads/mem)
  • failure symptoms and where to look (stderr, logs)

Keep it concise but explicit.

C) Add “known-good versions” documentation

Create docs/known_good_versions.md with:

  • a table listing the major tools and tested versions:
    • fastp
    • minimap2
    • samtools
    • mmseqs2
    • megahit
    • vamb
    • python version (for helper scripts)
  • a section for database snapshots (names + how to record version/date)

Include a short note:

  • this is not strict, but serves as a tested baseline for reproducibility.

D) Update README to point to the above

Update README.md:

  • add a “Quick start (conceptual)” section with:
    1. copy example config files
    2. edit paths
    3. run the appropriate cluster script
  • link to:
    • docs/directory_contract.md
    • docs/known_good_versions.md
    • config examples

Phase 2 — Standardize script interfaces (medium priority)

A) Add a shared Bash helper: lib/common.sh

Create lib/common.sh and ensure cluster scripts can source it.

It must provide:

  • die() — print error to stderr and exit non-zero
  • log() — timestamped logging
  • require_cmd() — assert a command exists in PATH
  • require_file() — assert a file exists and is readable
  • require_dir() — assert a directory exists (or create it if appropriate)
  • check_nonempty() — assert output exists and size > 0
  • load_config() — source config/paths.env if present, else error with guidance

Do not introduce external dependencies beyond POSIX/Bash.
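
The helpers listed above could be sketched roughly as follows. This is an illustration of the described interface, not the lib/common.sh shipped in this PR; the `--create` flag on `require_dir` and the choice to log to stderr are assumptions.

```shell
#!/usr/bin/env bash
# Illustrative sketch of lib/common.sh; intended to be sourced by other scripts.

# Print a timestamped message to stderr.
log() {
    printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >&2
}

# Print an error to stderr and exit non-zero.
die() {
    log "ERROR: $*"
    exit 1
}

# Assert that a command is available in PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || die "required command not found in PATH: $1"
}

# Assert that a file exists and is readable.
require_file() {
    [[ -f "$1" && -r "$1" ]] || die "required file missing or unreadable: $1"
}

# Assert that a directory exists; create it only when --create is given (hypothetical flag).
require_dir() {
    local dir="$1" create="${2:-}"
    if [[ ! -d "$dir" ]]; then
        [[ "$create" == "--create" ]] || die "required directory missing: $dir"
        mkdir -p "$dir" || die "could not create directory: $dir"
    fi
}

# Assert that an output file exists and has size > 0.
check_nonempty() {
    [[ -s "$1" ]] || die "expected non-empty output: $1"
}

# Source the config file if present, otherwise fail with guidance.
load_config() {
    local config="${1:-config/paths.env}"
    [[ -f "$config" ]] || die "config not found: $config (copy config/paths.env.example to config/paths.env and edit it)"
    # shellcheck source=/dev/null
    source "$config"
}
```

A cluster script would source this file once near the top and then call, for example, `load_config` followed by `require_cmd fastp` before doing any work, so that missing prerequisites are reported before a job consumes its allocation.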

B) Add consistent safe Bash practices in scripts

Where appropriate:

  • set -euo pipefail
  • trap with line number and command (trap '...' ERR)
  • quote variables
  • use find ... -print0 + while IFS= read -r -d '' patterns for filenames

Do not rewrite every script at once; start with 1–2 representative ones.
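
The practices listed above can be combined in a short script header like the one sketched here. The `count_matching_files` function and the demo filenames are hypothetical; `set -o errtrace` is an addition beyond the list, included so that the ERR trap also fires inside functions.

```shell
#!/usr/bin/env bash
set -euo pipefail
# Let the ERR trap be inherited by functions and subshells.
set -o errtrace

# Report the failing line number and command before the script exits.
trap 'echo "ERROR: line ${LINENO}: ${BASH_COMMAND}" >&2' ERR

# Count regular files matching a pattern, safely handling spaces or
# newlines in filenames by reading NUL-delimited paths from find.
count_matching_files() {
    local dir="$1" pattern="$2" count=0 path
    while IFS= read -r -d '' path; do
        printf 'found: %s\n' "$path" >&2
        count=$((count + 1))
    done < <(find "$dir" -type f -name "$pattern" -print0)
    printf '%s\n' "$count"
}

# Demonstration on a throwaway directory containing an awkward filename.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
touch "$demo_dir/sample one_R1.fastq.gz" "$demo_dir/sample2_R1.fastq.gz" "$demo_dir/notes.txt"
count_matching_files "$demo_dir" '*.fastq.gz'
```

The counter is incremented with `count=$((count + 1))` rather than `((count++))`, because the latter returns a non-zero status when the old value is 0 and would abort the script under `set -e`.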


Phase 3 — Fix/contain VAMB script drift (high priority because it breaks runs)

A) Establish a single canonical VAMB helper script

Identify the two similar scripts (e.g., vamb_create_fasta.py and vamb_create_fasta_clusters.py).

Actions:

  1. Pick one as canonical (prefer the more general one).
  2. Deprecate the other:
    • keep it but add a top-of-file comment: “DEPRECATED — use ”
    • or replace it with a small wrapper calling the canonical script.

B) Add a compatibility layer for VAMB version differences

In the canonical script:

  • detect VAMB version using one of:
    • import vamb; vamb.__version__ (preferred if available)
    • python -c "import vamb; print(vamb.__version__)" in wrappers
  • adapt to known argument changes (e.g., minsize vs removed/renamed args)
  • if version detection fails, fail fast with a helpful message:
    • “Tested with VAMB X.Y.Z; please install that or update compat logic.”
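
A wrapper-side version check along these lines might look like the sketch below. The function names, the `PYTHON` override, and the handling of both string and tuple values for `vamb.__version__` are assumptions; the tested version is left as the X.Y.Z placeholder used above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder for the tested version; fill in from docs/known_good_versions.md.
TESTED_VAMB_VERSION="X.Y.Z"

# Print the installed VAMB version, or fail fast with guidance.
# PYTHON may be set to select a specific interpreter (hypothetical convention).
detect_vamb_version() {
    local python="${PYTHON:-python3}" version
    if ! version="$("$python" -c 'import vamb
v = vamb.__version__
print(v if isinstance(v, str) else ".".join(str(p) for p in v))' 2>/dev/null)"; then
        echo "ERROR: could not detect VAMB version. Tested with VAMB ${TESTED_VAMB_VERSION}; please install that or update compat logic." >&2
        return 1
    fi
    printf '%s\n' "$version"
}

# Extract the major version from a dotted version string,
# e.g. to branch on argument changes between releases.
vamb_major() {
    printf '%s\n' "${1%%.*}"
}
```

A wrapper could then call `version="$(detect_vamb_version)" || exit 1` and branch on `vamb_major "$version"` to choose the arguments appropriate for that release.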

Add a short doc: docs/compat.md capturi...



Copilot AI and others added 5 commits January 28, 2026 05:17
Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Improve pipeline documentation and configuration setup" to "Add configuration system, documentation, and VAMB compatibility layer" Jan 28, 2026
Copilot AI requested a review from linsalrob January 28, 2026 05:22
@linsalrob
Owner

@copilot can you design a test suite for this code?

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 28, 2026

@copilot can you design a test suite for this code?

I've designed and implemented a comprehensive test suite for the pipeline improvements. Commit: 3e9201b

Test Suite Overview

61 test cases across 3 test suites, all passing ✅

1. Bash Helper Tests (tests/test_common.sh)

Tests all 25+ functions in lib/common.sh:

  • Logging (log, log_error, log_warn, die)
  • Validation (require_cmd, require_file, require_dir, check_nonempty, require_var)
  • File processing (count_fastq_reads, count_fasta_sequences, file_size)
  • HPC helpers (get_array_task_id, detect_scheduler, get_fast_storage)
  • Config loading and output functions

2. VAMB Script Tests (tests/test_vamb_create_fasta_clusters.py)

Tests the canonical VAMB binning script (17 tests):

  • Basic clustering, size filtering, verbose mode
  • Error handling for missing/malformed inputs
  • Output validation (compressed FASTA files)

3. Config File Tests (tests/test_config_files.sh)

Validates configuration templates (19 tests):

  • Syntax checking (sourceable without errors)
  • Required variables present
  • Tab-delimited format validation
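
The three kinds of check listed above could be implemented roughly as follows. These helper names are hypothetical and the sketch is not the content of tests/test_config_files.sh.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 if the env file parses and can be sourced in an isolated subshell.
config_is_sourceable() {
    bash -n "$1" 2>/dev/null && ( source "$1" ) >/dev/null 2>&1
}

# Return 0 if sourcing the file defines the named variable (an empty value counts).
# The variable is unset first so an inherited value cannot mask a missing entry.
config_defines_var() {
    ( unset "$2"; source "$1" >/dev/null 2>&1 && declare -p "$2" >/dev/null 2>&1 )
}

# Return 0 if every non-empty line has as many tab-separated fields as the header.
tsv_is_consistent() {
    awk -F'\t' '
        NR == 1 { n = NF }
        NF > 0 && NF != n { bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Each function only returns a status, so a test runner can wrap it in whatever pass/fail reporting it already uses, for example `config_is_sourceable config/paths.env.example || echo "FAIL: not sourceable"`.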

Running Tests

cd tests
./run_tests.sh              # Run all tests
./run_tests.sh --verbose    # Detailed output

Individual suites can be run separately. See tests/README.md for details.

CI Integration

Added .github/workflows/test.yml for automated testing on push/PR.

Copilot AI changed the title from "Add configuration system, documentation, and VAMB compatibility layer" to "Add configuration system, documentation, VAMB compatibility layer, and test suite" Jan 28, 2026
@linsalrob
Owner

@copilot update this branch to match main

… copilot/improve-pipeline-documentation

Co-authored-by: linsalrob <836231+linsalrob@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot April 25, 2026 23:51
Contributor Author

Copilot AI commented Apr 25, 2026

@copilot update this branch to match main

Branch updated to match main. Commit: a86bed5

Merged 29 commits from main including:

  • Sankey plot improvements (ProcessPoolExecutor, threads option)
  • count_subsystems.py enhancements
  • New bin/count_subsystems_selective.py and bin/taxonomy_selected.py scripts
  • Various minor updates and bug fixes

All 61 tests still passing ✅ after merge.


2 participants