Nextflow pipeline for importing GWAS mapping TSV files to a Parquet database.
This pipeline migrates raw mapping files from the NemaScan simulation pipeline into a normalized Parquet database structure optimized for downstream analysis. It provides process-level parallelism suitable for SLURM-managed HPC clusters.
Optionally, the pipeline can also run QTL analysis to flag significant markers and define QTL intervals using both Bonferroni (BF) and EIGEN-based significance thresholds.
```
nextflow run main.nf --input /path/to/mapping_files --output ./results/db
```

```
nextflow run main.nf \
  --input /path/to/mapping_files \
  --output ./results/db \
  --analyze_qtl
```

```
nextflow run main.nf \
  --input /path/to/mapping_files \
  --output ./results/db \
  -profile rockfish
```

| Parameter | Description | Default |
|---|---|---|
| `--input` | Directory containing mapping TSV files (required) | - |
| `--eigen_dir` | Directory containing EIGEN files (optional) | auto-detect |
| `--output` | Output database directory | `./results/db` |
| `--overwrite` | Replace existing mappings | `false` |
| `--batch_name` | Name for this batch (for reports) | input dir name |
| Parameter | Description | Default |
|---|---|---|
| `--analyze_qtl` | Enable QTL analysis after migration | `false` |
| `--alpha` | Significance level for thresholds | 0.05 |
| `--ci_size` | Markers left/right of peak for confidence interval | 150 |
| `--snp_grouping` | Max marker distance to group into same QTL | 1000 |
| `--qtl_output` | QTL output directory | `{output}/qtl` |
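To make the `--ci_size` parameter concrete, here is a minimal sketch of a confidence interval spanning `ci_size` markers on each side of a peak, clipped at the chromosome ends. The function name `qtl_interval` and the list-of-positions layout are illustrative assumptions, not the pipeline's actual code:

```python
from typing import List, Tuple

def qtl_interval(positions: List[int], peak_index: int, ci_size: int = 150) -> Tuple[int, int]:
    """Return (startPOS, endPOS) covering ci_size markers on each side
    of the peak, clipped at the first/last marker of the chromosome."""
    lo = max(0, peak_index - ci_size)
    hi = min(len(positions) - 1, peak_index + ci_size)
    return positions[lo], positions[hi]
```

With the default `ci_size = 150`, an interior peak yields a window of 301 markers; peaks near a chromosome end produce a correspondingly shorter interval.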
| Parameter | Description | Default |
|---|---|---|
| `--mapping_db` | Path to existing mapping database (required for standalone) | - |
| `--population` | Filter to specific population | all |
| `--algorithm` | Filter to `INBRED` or `LOCO` | all |
```
Phase 1: Discovery
    DISCOVER_FILES -> BUILD_EIGEN_LOOKUP -> IDENTIFY_MARKER_SETS
          |                   |                      |
          v                   v                      v
    mapping_files        eigen_lookup           marker_sets

Phase 2: Pre-create marker sets
    WRITE_MARKER_SETS (parallel per marker set)

Phase 3: Process mappings
    PROCESS_MAPPINGS (parallel per file)

Phase 4: Aggregate results
    AGGREGATE_METADATA

Phase 5: QTL Analysis (optional, --analyze_qtl)
    DISCOVER_QTL_BATCHES -> ANALYZE_QTL_BATCH -> AGGREGATE_QTL_RESULTS
                            (parallel per pop/algo)
```
1. **Discovery Phase** (local, fast)
   - Scan input directory for `*_mapping.tsv` files
   - Build EIGEN lookup table from `Genotype_Matrix/` directory
   - Identify unique population+MAF combinations

2. **Marker Set Creation** (parallel per marker set)
   - Create marker sets before processing mappings
   - Include EIGEN values for threshold calculations

3. **Processing Phase** (SLURM, parallel per file)
   - Each mapping file is processed independently
   - Outputs an individual partition Parquet file plus a status JSON

4. **Aggregation Phase** (single job)
   - Combine status files
   - Write consolidated metadata table

5. **QTL Analysis Phase** (optional, parallel per population/algorithm)
   - Discover unique (population, algorithm) batches
   - For each batch: compute BF and EIGEN thresholds, define QTL intervals
   - Aggregate results into summary files
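The threshold and grouping logic of the QTL phase can be sketched as follows. This is only an illustration: the function names are made up, and reading `snp_grouping` as an index distance between consecutive significant markers is one plausible interpretation, not a statement about the pipeline's actual scripts.

```python
import numpy as np
import pandas as pd

def bf_threshold(n_markers: int, alpha: float = 0.05) -> float:
    # Bonferroni (BF): divide alpha by the total number of markers tested.
    return -np.log10(alpha / n_markers)

def eigen_threshold(n_independent_tests: float, alpha: float = 0.05) -> float:
    # EIGEN: divide alpha by the estimated number of independent tests,
    # derived from the eigendecomposition of the genotype matrix.
    return -np.log10(alpha / n_independent_tests)

def group_significant_markers(sig: pd.DataFrame, snp_grouping: int = 1000) -> pd.DataFrame:
    # Start a new QTL whenever the chromosome changes or the gap between
    # consecutive significant markers exceeds snp_grouping.
    sig = sig.sort_values(["CHROM", "marker_index"]).copy()
    new_qtl = (sig["CHROM"] != sig["CHROM"].shift()) | (
        sig["marker_index"].diff() > snp_grouping
    )
    sig["qtl_id"] = new_qtl.cumsum()
    return sig
```

Because the EIGEN denominator (independent tests) is smaller than the raw marker count, the EIGEN threshold is less stringent than BF and will typically flag more markers.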
| Profile | Description |
|---|---|
| `standard` | Local execution (no container) |
| `docker` | Local execution with Docker container |
| `rockfish` | JHU Rockfish cluster (SLURM + Singularity) |
| `test` | Use test data |
```
nextflow run main.nf \
  --input /path/to/data \
  --output ./results/db \
  -profile docker
```

```
nextflow run main.nf \
  --input /path/to/data \
  --output ./results/db \
  -profile rockfish
```

```
results/db/
├── markers/
│   └── {population}_{maf}_markers.parquet
├── mappings/
│   └── population={pop}/
│       └── mapping_id={id}/
│           └── data.parquet
├── mappings_metadata.parquet
├── marker_set_metadata.parquet
├── qtl/                       # (when --analyze_qtl enabled)
│   ├── qtl_regions/
│   │   └── {population}_{algorithm}_qtl_regions.parquet
│   ├── analysis_summary.parquet
│   ├── analysis_metadata.parquet
│   └── qtl_analysis_summary.txt
└── pipeline_info/
    ├── execution_timeline_*.html
    ├── execution_report_*.html
    └── execution_trace_*.txt
```
`qtl_regions` (one row per QTL interval):
`mapping_id`, `threshold_method` (BF/EIGEN), `peak_id`, `CHROM`, `startPOS`, `peakPOS`, `endPOS`, `interval_size`, `n_sig_markers`, `max_log10p`, `peak_marker`, `sig_threshold_value`, `ci_size`, `snp_grouping`, `algorithm`, `population`, `maf`

`analysis_summary` (one row per mapping × threshold):
`mapping_id`, `threshold_method`, `population`, `maf`, `algorithm`, `n_markers`, `n_significant`, `pct_significant`, `n_qtl`, `max_log10p`, `threshold_value`, `ci_size`, `snp_grouping`
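On the consumer side, the `qtl_regions` columns above make BF-vs-EIGEN comparisons a simple groupby. A pandas sketch (the helper name is hypothetical; column names are taken from the schema above):

```python
import pandas as pd

def qtl_counts_by_method(qtl_regions: pd.DataFrame) -> pd.DataFrame:
    # One row per (mapping_id, threshold_method), with the number of QTL
    # and the widest interval, for a quick BF vs EIGEN comparison.
    return (
        qtl_regions
        .groupby(["mapping_id", "threshold_method"])
        .agg(n_qtl=("peak_id", "count"),
             widest_interval=("interval_size", "max"))
        .reset_index()
    )
```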
After migration, you can re-run QTL analysis with different parameters using the standalone workflow:
```
# Re-run with different CI size and SNP grouping
nextflow run workflows/analyze_qtl.nf \
  --mapping_db results/db \
  --ci_size 200 \
  --snp_grouping 500

# Filter to specific population
nextflow run workflows/analyze_qtl.nf \
  --mapping_db results/db \
  --population ct.fullpop.20210901

# Filter to specific algorithm
nextflow run workflows/analyze_qtl.nf \
  --mapping_db results/db \
  --algorithm INBRED

# Custom output directory
nextflow run workflows/analyze_qtl.nf \
  --mapping_db results/db \
  --qtl_output results/qtl_rerun
```

Re-running QTL analysis will overwrite existing QTL results for the same parameters.
Build the Docker container:
```
cd docker
docker build -t migrate-to-db:latest .
docker push yourusername/migrate-to-db:latest
```

On HPC, Singularity will auto-convert the Docker image:

```
singularity pull docker://yourusername/migrate-to-db:latest
```

To run the pipeline against the bundled test data:

```
nextflow run main.nf -profile test
```