Convert Nirvana/Illumina Connected Annotations JSON output to VCF 4.2 format.
- Pure Python, zero required runtime dependencies (pysam is optional, only for FASTA-based normalization and bgzip/tabix output)
- Streaming pipeline: processes one position at a time, no full-file load into memory
- Reads .json and .json.gz input; writes plain VCF or bgzipped .vcf.gz (with optional tabix index)
- Supports GRCh37 and GRCh38 assemblies (auto-detected from Nirvana header)
- Allele normalization: trims shared prefix/suffix to minimal VCF representation (enabled by default); with --reference, additionally left-shifts indels through repeats (matches bcftools norm)
- Multi-allelic decomposition: splits multi-allelic sites into biallelic rows, like bcftools norm -m- (--decompose)
- VEP-style CSQ field with per-transcript annotations
- Annotations: gnomAD, ClinVar, SpliceAI, REVEL, DANN, GERP, phyloP, 1000 Genomes, TOPMed
pip install -e . # core (orjson only)
pip install -e ".[full]" # adds pysam for --reference / .vcf.gz / --tabix
# Basic conversion
nirvana2vcf -i input.json.gz -o output.vcf
# Output to stdout (pipe to bcftools, etc.)
nirvana2vcf -i input.json.gz | bcftools view -f PASS
# VEP-style CSQ only (no flat INFO fields)
nirvana2vcf -i input.json -o output.vcf --csq-only
# Omit sample/genotype columns
nirvana2vcf -i input.json.gz -o output.vcf --no-samples
# Override genome assembly (instead of auto-detecting from header)
nirvana2vcf -i input.json.gz -o output.vcf --assembly GRCh37
# Disable allele normalization (keep raw Nirvana alleles)
nirvana2vcf -i input.json.gz -o output.vcf --no-normalize
# Decompose multi-allelic sites into biallelic rows (normalization is applied by default)
nirvana2vcf -i input.json.gz -o output.vcf --decompose
# Decompose without normalization (keep raw Nirvana alleles)
nirvana2vcf -i input.json.gz -o output.vcf --decompose --no-normalize
# Reference-based left-alignment of indels (bcftools-norm parity, requires pysam + indexed FASTA)
nirvana2vcf -i input.json.gz -o output.vcf --reference GRCh38.fa
# Bgzipped output with a tabix index (requires pysam)
nirvana2vcf -i input.json.gz -o output.vcf.gz --tabix
# Show progress on long runs (every 10,000 positions, plus a final summary)
nirvana2vcf -i input.json.gz -o output.vcf --verbose
Allele normalization is enabled by default. It trims shared prefix and suffix bases from REF and ALT
alleles to produce the minimal VCF representation, adjusting POS accordingly. This matches the allele
trimming done by tools like bcftools norm and vt normalize. Pass --reference path/to/genome.fa
(requires the [full] extra and an indexed FASTA) to additionally left-shift indels through
homopolymer/STR repeats; the result is then equivalent to bcftools norm -f genome.fa.
Nirvana sometimes emits redundant flanking bases — for example, when representing an insertion or deletion relative to a longer context sequence.
Example — SNV emitted with flanking context bases:
# Raw Nirvana output (--no-normalize):
# Nirvana emitted C→T substitution with flanking A…GT context:
chr1 1000 . ACGT ATGT . . ...
# Normalized (default):
# Phase 1 (right-trim): strip shared T → ACGT→ACG, ATGT→ATG
# strip shared G → ACG→AC, ATG→AT
# AC[-1]=C ≠ AT[-1]=T → stop
# Phase 2 (left-trim): strip shared A → AC→C, AT→T, POS advances to 1001
# len(C)=1 → stop
chr1 1001 . C T . . ...
Example — deletion with right-anchor padding:
# Raw Nirvana output (--no-normalize):
# Deletion of A, represented with trailing GT context:
chr1 1000 . ACGT CGT . . ...
# Normalized (default):
# Phase 1 (right-trim): strip T → ACGT→ACG, CGT→CG
# strip G → ACG→AC, CG→C
# len(C)=1 → stop
# Phase 2 (left-trim): len(C)=1 → stop (REF=AC, ALT=C is already minimal)
chr1 1000 . AC C . . ...
Example — SNV that needs no trimming:
# Both modes produce the same output for a clean SNV:
chr7 117548628 . A G . . ...
Symbolic alleles (<DEL>, <DUP>, etc.), reference-only ALTs (.), and spanning deletions (*)
are never modified by normalization.
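The two trimming phases walked through in the examples above can be sketched in a few lines of Python. This is a minimal illustration, not the code in nirvana2vcf/mapper.py: the function name trim_alleles is hypothetical, and it covers only the default trimming, not the reference-based left-shift enabled by --reference.

```python
from typing import Tuple


def trim_alleles(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Trim shared suffix bases, then shared prefix bases (hypothetical helper).

    Each allele always keeps at least one base, as VCF requires.
    """
    # Symbolic alleles (<DEL>, <DUP>, ...), missing ALT (.), and spanning
    # deletions (*) are returned unchanged.
    if alt.startswith("<") or alt in (".", "*"):
        return pos, ref, alt

    # Phase 1 (right-trim): drop identical trailing bases; POS is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Phase 2 (left-trim): drop identical leading bases; POS advances by one
    # for every base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


print(trim_alleles(1000, "ACGT", "ATGT"))  # (1001, 'C', 'T')
print(trim_alleles(1000, "ACGT", "CGT"))   # (1000, 'AC', 'C')
```

Right-trimming before left-trimming is what keeps POS at 1000 in the deletion example: the shared trailing bases are consumed first, so no leading base needs to be removed.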
Multi-allelic decomposition (--decompose) is disabled by default. When enabled, a position with multiple ALT alleles is split into one VCF row
per ALT allele (like bcftools norm -m-). Variant annotations and per-allele INFO fields are scoped
to each row. Sample genotypes, allele depths (AD), and variant frequencies (VF) are remapped:
- GT: alleles matching this ALT → 1; other ALTs → . (missing); REF stays 0
- AD: [ref_depth, this_alt_depth]
- VF: [this_alt_frequency]
Example — tri-allelic site:
# Without --decompose (one row, multi-allelic):
chr1 925952 . GCACA ACACA,G . . gnomAD_AF=0.001,0.0005 GT:AD 1/2:10,5,3
# With --decompose (two rows, biallelic):
chr1 925952 . GCACA ACACA . . gnomAD_AF=0.001 GT:AD 1/.:10,5
chr1 925952 . GCACA G . . gnomAD_AF=0.0005 GT:AD ./1:10,3
When combined with normalization (the default), decomposition runs first, then each biallelic row
is normalized independently. This means rows may end up with different POS values if their alleles
trim differently.
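The GT and AD remapping rules can be illustrated with a short Python sketch. The function name remap_sample and its signature are assumptions made for this example, not the project's actual API; VF would be sliced the same way as AD, minus the reference entry.

```python
import re
from typing import List, Tuple


def remap_sample(gt: str, ad: List[int], alt_index: int) -> Tuple[str, List[int]]:
    """Rewrite GT and AD for the biallelic row of one ALT (hypothetical helper).

    alt_index is the 1-based index of the ALT allele in the original row.
    """
    # Split on the phasing separator but keep it, so "1|2" stays phased.
    parts = re.split(r"([/|])", gt)
    for i in range(0, len(parts), 2):  # alleles sit at even indices
        allele = parts[i]
        if allele in ("0", "."):
            continue  # REF stays 0; missing stays missing
        # This row's ALT becomes 1; every other ALT becomes missing.
        parts[i] = "1" if int(allele) == alt_index else "."
    # AD keeps the reference depth plus the depth of this row's ALT only.
    return "".join(parts), [ad[0], ad[alt_index]]


print(remap_sample("1/2", [10, 5, 3], 1))  # ('1/.', [10, 5])
print(remap_sample("1/2", [10, 5, 3], 2))  # ('./1', [10, 3])
```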
By default, nirvana2vcf writes flat INFO fields for each annotation type (gnomAD_AF, ClinVar_SIG,
SpliceAI_DS_AG, etc.). With --csq-only, all transcript-level annotations are packed into a
single CSQ INFO field in the same pipe-delimited format used by Ensembl VEP, and the flat fields
are omitted.
Use --csq-only when your downstream tool (e.g. a variant database loader) expects VEP-annotated
VCFs and parses the CSQ field directly.
# Default flat fields:
INFO=gnomAD_AF=0.0032;ClinVar_SIG=Pathogenic;SpliceAI_DS_AG=0.85;REVEL=0.92
# With --csq-only:
INFO=CSQ=ENST00000357654|missense_variant|MODERATE|BRCA1|...|0.92|...
The CSQ format string is written to the VCF header (##INFO=<ID=CSQ,...,Format="Allele|...">).
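For downstream code that reads the CSQ field, a minimal parsing sketch follows. It assumes the usual VEP layout (one comma-separated entry per transcript, pipe-delimited fields within each entry). The field names below are illustrative placeholders; take the real list from the Format string in the VCF header.

```python
from typing import Dict, List


def parse_csq(csq_value: str, field_names: List[str]) -> List[Dict[str, str]]:
    """Split a CSQ INFO value into one dict per transcript entry.

    field_names must come from the CSQ Format string in the VCF header;
    the list used in the demo below is only a placeholder.
    """
    entries = []
    for entry in csq_value.split(","):  # one entry per transcript
        values = entry.split("|")       # fields within an entry
        entries.append(dict(zip(field_names, values)))
    return entries


# Placeholder field order and values, for illustration only.
fields = ["Allele", "Consequence", "IMPACT", "SYMBOL"]
value = "G|missense_variant|MODERATE|BRCA1,G|intron_variant|MODIFIER|BRCA1"
for record in parse_csq(value, fields):
    print(record["Consequence"], record["IMPACT"])
```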
By default, all samples from the Nirvana JSON are written as genotype columns in the VCF. Use
--no-samples to produce a sites-only VCF with no FORMAT or sample columns — useful for annotation
databases or tools that do not expect genotype data.
# Default (with samples):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
chr1 925952 . A G . . gnomAD_AF=0.001 GT:DP 0/1:30 0/0:25
# With --no-samples:
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 925952 . A G . . gnomAD_AF=0.001
Assembly is auto-detected from the Nirvana JSON header (the genomeAssembly field). Use
--assembly GRCh37 or --assembly GRCh38 to override this — for example, if the header is
missing or wrong.
The assembly determines the ##contig header lines and chromosome naming:
- GRCh38: chr1, chr2, ..., chrX, chrY, chrM
- GRCh37: 1, 2, ..., X, Y, MT (no chr prefix)
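The naming difference between the two assemblies can be sketched as follows. The function contig_names is hypothetical and covers only the primary chromosomes listed above; the actual contig maps live in nirvana2vcf/constants.py and may include more entries.

```python
from typing import List


def contig_names(assembly: str) -> List[str]:
    """Return primary contig names for an assembly (hypothetical helper)."""
    autosomes = [str(n) for n in range(1, 23)]
    if assembly == "GRCh38":
        # GRCh38 uses the chr prefix and chrM for the mitochondrion.
        return ["chr" + name for name in autosomes + ["X", "Y", "M"]]
    if assembly == "GRCh37":
        # GRCh37 uses bare names and MT for the mitochondrion.
        return autosomes + ["X", "Y", "MT"]
    raise ValueError(f"unsupported assembly: {assembly}")


print(contig_names("GRCh38")[-3:])  # ['chrX', 'chrY', 'chrM']
print(contig_names("GRCh37")[-3:])  # ['X', 'Y', 'MT']
```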
pip install -e ".[dev]"
python3 -m pytest -v # run all tests
python3 -m pytest -v -k csq # run tests matching keyword
Streaming pipeline: parse → map → write
- nirvana2vcf/parser.py: Streams Nirvana's line-based JSON format, yielding (NirvanaHeader, Position) tuples
- nirvana2vcf/mapper.py: Transforms positions into VCF record dicts (per-allele fields, CSQ, INFO escaping, normalization, decomposition)
- nirvana2vcf/vcf_writer.py: Writes VCF 4.2 plain text
- nirvana2vcf/models.py: Dataclass contracts between parser and mapper
- nirvana2vcf/constants.py: VCF header definitions, contig maps, CSQ field names
Nirvana emits a line-based streaming format (not standard JSON):
- Line 1: {"header":{...},"positions":[
- Lines 2–N: one position object per line (comma-separated)
- Last line: ],"genes":[...]}
Each position line maps directly to one VCF row (or multiple rows when using --decompose).
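A minimal sketch of reading this layout one line at a time, using only the standard library, is shown below. It assumes the exact line structure described above and yields plain dicts rather than the NirvanaHeader and Position dataclasses the real parser uses. The function name iter_positions and the sample data are illustrative.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO, Tuple


def iter_positions(handle: TextIO) -> Iterator[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Yield (header, position) pairs without loading the whole file."""
    first = handle.readline().rstrip()
    # Line 1 ends with an open "positions":[ array; closing it with "]}"
    # makes the line valid JSON on its own.
    header = json.loads(first + "]}")["header"]
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("]"):
            break  # end of the positions array; the genes section follows
        # Position lines are comma-separated, so drop the trailing comma.
        yield header, json.loads(line.rstrip(","))


sample = (
    '{"header":{"genomeAssembly":"GRCh38"},"positions":[\n'
    '{"chromosome":"chr1","position":925952},\n'
    '{"chromosome":"chr1","position":925953}\n'
    '],"genes":[]}\n'
)
for header, position in iter_positions(io.StringIO(sample)):
    print(header["genomeAssembly"], position["position"])
```

For .json.gz input, the same function works on a handle opened with gzip.open(path, "rt").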
The gnomAD_EUR_AF INFO field is sourced from Nirvana's gnomad.nfeAf
(European non-Finnish) — not gnomad.eurAf, which is a 1000 Genomes field
that never appears inside Nirvana's gnomad block. The INFO header
description reflects this ("gnomAD allele frequency (European non-Finnish)").
The name gnomAD_EUR_AF (rather than gnomAD_NFE_AF) is kept deliberately
for stability with downstream consumers that expect an EUR suffix. When
comparing against another VCF, map this field against the other source's
NFE column, not an EUR column.
Nirvana's gnomad block combines gnomAD genomes + exomes from the
version bundled with your Nirvana data files (v2.1 for the release used
in validation/docs/phases/phase05.md).
When cross-checking against a standalone gnomAD VCF, match both the
version and the genomes/exomes scope — otherwise concordance numbers
will be misleading. See phase05.md
for a worked example.
Nirvana's polyPhenPrediction is trained on HumanVar (HVAR), not
HumanDiv (HDIV). When comparing PolyPhen (from the CSQ field) against
dbNSFP, use dbNSFP_Polyphen2_HVAR_pred, not _HDIV_pred. HDIV leans
more damaging and HVAR leans more benign, so mismatching the flavor
produces a strongly asymmetric B vs D/P disagreement pattern.
Open-source validation using only public data and open-source tools
lives under validation/. See
validation/docs/strategy.md for the
five-phase strategy and validation/results/ for
concordance reports against bcftools, VEP, and SnpEff on Nirvana's
bundled 10,000-variant HiSeq test VCF.
Issues and pull requests are welcome — see CONTRIBUTING.md for development setup and guidelines.
MIT — see LICENSE.