Releases: tanaylab/misha
5.6.23
Input-format ergonomics for BED/GFF/VCF
- Improved error messages: "start exceeds or equals to end" now mentions misha's 0-based half-open convention and the GFF/VCF 1-based hint; "chromosome does not exist" lists known chromosomes and points to CHROM_ALIAS.
- C++ converter now emits an R warning ("N intervals had start == end and were extended by 1bp") when zero-length intervals from a loaded file are auto-bumped; previously this happened silently.
- Added gintervals.import_bed(), gintervals.import_gff(), gintervals.import_vcf() for direct import from common interval file formats. All three normalize chromosome names via the existing CHROM_ALIAS mechanism (so chr1 ↔ 1 works), apply misha's 0-based half-open convention (subtracting 1 from start for the 1-based GFF/GTF/VCF inputs), and preserve common metadata columns (name/score/strand for BED; type/source/score/attrs for GFF; id/ref/alt/qual/filter/info for VCF).
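
The coordinate shift described above can be sketched in a few lines. This is an illustration of the convention only, not misha's importer; the function name `to_half_open` is hypothetical.

```python
from typing import Tuple


def to_half_open(start_1based: int, end_1based: int) -> Tuple[int, int]:
    """Convert a 1-based closed interval [start, end] (GFF/GTF/VCF style)
    to a 0-based half-open interval [start, end)."""
    if start_1based < 1:
        raise ValueError("1-based coordinates start at 1")
    if end_1based < start_1based:
        raise ValueError("end must not precede start")
    # Only the start moves; the end is numerically unchanged because the
    # closed 1-based end and the open 0-based end coincide.
    return start_1based - 1, end_1based


# A GFF feature spanning bases 1..100 keeps its length of 100.
start, end = to_half_open(1, 100)
print(start, end, end - start)
```

A single-base record (start equal to end in 1-based terms) becomes an interval of length 1, so it never collapses to a zero-length interval.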
Character strand input (also bundled, was 5.6.22)
- Intervals' strand column now accepts character ("+", "-", ".", "*", "") or factor input in addition to numeric 1/-1/0. Strings are normalized to the numeric convention at the R→C++ boundary; output stays numeric.
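
A minimal sketch of this kind of normalization follows. It assumes that ".", "*", and "" all map to 0 (unstranded); the notes list the accepted strings but do not spell out the mapping, and `normalize_strand` is a hypothetical name, not a misha function.

```python
# Assumed mapping: the release notes name the accepted strings but not
# their numeric targets, so the three "unstranded" spellings going to 0
# is an assumption made for this sketch.
_STRAND_CODES = {"+": 1, "-": -1, ".": 0, "*": 0, "": 0}


def normalize_strand(values):
    """Return strand values as integers in {1, -1, 0}."""
    out = []
    for v in values:
        if isinstance(v, str):
            if v not in _STRAND_CODES:
                raise ValueError(f"unrecognized strand string: {v!r}")
            out.append(_STRAND_CODES[v])
        elif v in (1, -1, 0):
            out.append(int(v))
        else:
            raise ValueError(f"unrecognized strand value: {v!r}")
    return out


print(normalize_strand(["+", "-", ".", 1, 0]))
```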
5.6.19
- Fixed gintervals.load failing with "invalid columns definition" after gintervals.save of a bigset whose input had character chrom (e.g. a tibble from dplyr). On-disk per-chromosome files and the .meta zeroline now both store chrom/chrom1/chrom2 as factor with full ALLGENOME levels, and the on-disk frame is normalized to plain data.frame. (#102)
5.6.18
- Added getOption("gmultitasking.strategy") for gextract (default "auto"). When the workload is large and many-track, auto routes to a track-parallel mode (each parallel::mclapply worker handles a track subset across all tiles) instead of the legacy tile-parallel mode (each fork-kid handles a tile range across all tracks). On a realistic workload of 3,110 motif tracks × 2.19M tiled_peaks, the run measured 57.6 min vs ~3.4 h projected for tile-parallel, a 3.5× per-track speedup. Override per-call via options(gmultitasking.strategy = "tracks" | "tiles" | "auto"). The heuristic stays on "tiles" for streaming iterators (numeric / NULL / 2D rect / track-name), single-track or fewer than 8 tracks, file/intervals.set.out output, or 2D band, so nothing else regresses (validated by a 36-cell matrix bench across iterator types × track counts × cache states).
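
The routing rules listed above can be read as a small decision function. The sketch below is a paraphrase of those rules, not the package's code: `ExtractCall`, `choose_strategy`, and the `large_workload` flag are hypothetical, and the notes do not say how "large" is measured.

```python
from dataclasses import dataclass


@dataclass
class ExtractCall:
    n_tracks: int
    streaming_iterator: bool  # numeric / NULL / 2D rect / track-name iterator
    writes_to_file: bool      # file or intervals.set.out output
    has_2d_band: bool
    large_workload: bool      # hypothetical flag; the real size test is not documented here


def choose_strategy(call: ExtractCall, option: str = "auto") -> str:
    """Pick "tracks" or "tiles" following the rules stated in the notes."""
    if option in ("tracks", "tiles"):
        return option  # explicit override wins
    if option != "auto":
        raise ValueError(f"unknown strategy: {option!r}")
    # Any of these conditions keeps the legacy tile-parallel mode.
    if (call.streaming_iterator or call.n_tracks < 8
            or call.writes_to_file or call.has_2d_band):
        return "tiles"
    return "tracks" if call.large_workload else "tiles"


call = ExtractCall(n_tracks=3110, streaming_iterator=False,
                   writes_to_file=False, has_2d_band=False, large_workload=True)
print(choose_strategy(call))
```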
5.6.17
Performance regression fix (vs v5.6.11–v5.6.16)
gextract calls touching many dense tracks (e.g. ~50 motif tracks) became 10–20× slower starting in v5.6.11. Two compounding causes:
- MmapFile used MAP_POPULATE, eagerly paging in every mapped track at every chromosome transition (already covered by MADV_SEQUENTIAL).
- The two track-validation loops in create_expr_iterator and TrackExpressionVars::init were calling GenomeTrackFixedBin::init_read() once per chromosome per track on every gextract call, paying open + mmap + madvise + close + munmap each time even though they only needed bin size and file size. Replaced with a metadata-only path that stat()s for size and reads bin_size only once per track.
Net effect on a realistic workload (51 LSE motif vtracks × 7000 tiles × 5 chroms): 22s → 0.4s (~55× speedup, also faster than pre-audit baseline).
Added an opt-in performance regression test (MISHA_PERF_TESTS=true R -e "devtools::test(filter='perf-regression')") gated out of the parallel test suite.
5.6.15
Bug fixes
- Fixed gsynth.train(), gsynth.sample(), and gsynth.random_seqs() silently reading sequences from the wrong chromosome when the intervals argument covered a subset of the genome that omitted one or more earlier chromosomes in the chromkey. For every chromosome in the input that came after a missing one, the C++ side opened the wrong chromosome's sequence (shifted by the number of earlier missing chromosomes), producing invalid models and corrupted sampled genomes without any error. Calls that passed intervals = gintervals.all() or left intervals at its default (which is gintervals.all()) were not affected. Users who ran these functions on custom interval subsets should re-run them with this version.
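
To make the failure mode concrete, here is a small illustration of how an index shift of this kind can arise. It is a hypothetical reconstruction of this class of bug, not the actual C++ code; `CHROMKEY`, `buggy_lookup`, and `fixed_lookup` are made-up names.

```python
CHROMKEY = ["chr1", "chr2", "chr3", "chr4"]  # full genome order (illustrative)


def buggy_lookup(subset):
    """Use each chromosome's position within the subset as its genome index."""
    return {chrom: CHROMKEY[i] for i, chrom in enumerate(subset)}


def fixed_lookup(subset):
    """Resolve each chromosome by name against the full chromkey."""
    index = {name: i for i, name in enumerate(CHROMKEY)}
    return {chrom: CHROMKEY[index[chrom]] for chrom in subset}


subset = ["chr1", "chr3", "chr4"]  # chr2 is missing
# In the buggy version every chromosome after the gap is shifted by one,
# while the name-based lookup maps each chromosome to itself.
print(buggy_lookup(subset))
print(fixed_lookup(subset))
```

With the full chromosome list as input, the two lookups agree, which matches the note that calls using gintervals.all() were unaffected.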
5.6.11
What's New
- Added ggenome.implant() for replacing intervals in a reference genome with donor sequences and writing a new FASTA. Supports literal donor sequences or extraction from a misha database, with optional trackdb creation.
- Added ggenome.transplant() as sugar for cross-genome sequence swaps: it extracts from a source genome and implants into a target genome in a single call.
5.6.7
- PWM edit distance virtual track functions (pwm.edit_distance, pwm.edit_distance.pos, pwm.max.edit_distance, pwm.edit_distance.lse, pwm.edit_distance.lse.pos)
- gseq.pwm_edits() for detailed per-edit information
- Pigeonhole pre-filter for PWM edit distance genome-wide scans
- Sub-chromosome range splitting for gscreen, gextract, gsummary, gdist, and gcor (Pearson)
- Fixed gscreen returning split intervals at sub-chromosome parallel boundaries
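
For readers unfamiliar with the pigeonhole idea, the sketch below shows it on plain strings with substitution-only (Hamming) distance. This is a swapped-in, simplified setting chosen for illustration; the notes do not describe how misha applies the filter to PWM scores, and all function names here are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_pieces(pattern: str, k: int):
    """Split pattern into k + 1 contiguous non-empty pieces with their offsets."""
    n, parts = len(pattern), k + 1
    bounds = [i * n // parts for i in range(parts + 1)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def search(text: str, pattern: str, k: int):
    """Start positions where pattern matches text with at most k substitutions."""
    m = len(pattern)
    if m <= k:  # every window trivially qualifies
        return list(range(len(text) - m + 1))
    candidates = set()
    # Pigeonhole: k mismatches cannot touch all k + 1 pieces, so any true
    # hit contains at least one piece as an exact substring.
    for offset, piece in split_pieces(pattern, k):
        pos = text.find(piece)
        while pos != -1:
            start = pos - offset
            if 0 <= start <= len(text) - m:
                candidates.add(start)
            pos = text.find(piece, pos + 1)
    # Only the surviving candidates get the full comparison.
    return sorted(s for s in candidates if hamming(text[s:s + m], pattern) <= k)


print(search("ACGTACGTTTGACGAACGT", "ACGA", 1))
```

The benefit is that the expensive full comparison runs only at positions where some exact piece already matched, which is typically a small fraction of a genome-wide scan.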
5.6.6
5.6.1
Changes
- gsynth.save() and gsynth.load() now use the cross-platform .gsm format (YAML metadata + binary arrays) instead of R-specific RDS. Models saved with pymisha can now be loaded in R and vice versa. Legacy RDS files are still supported for backward compatibility.
- Added compress parameter to gsynth.save() to optionally save as a ZIP archive.
- Added gsynth.convert() to convert legacy RDS model files to the new .gsm format.
- Fixed gdb.create_genome() example to use \dontrun instead of \donttest to prevent R CMD check failures when S3 download times out.
5.6.0
- Added gintervals.attr.get(), gintervals.attr.set(), gintervals.attr.export(), and gintervals.attr.import() for managing interval set attributes. Attributes are stored as .iattr binary files (null-separated key/value pairs) next to .interv files for small interval sets, or inside the directory for big interval sets.
- gintervals.rm() now cleans up companion .iattr attribute files when deleting interval sets.
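
As a rough picture of what "null-separated key/value pairs" can look like on disk, here is a round-trip sketch. The exact .iattr layout is not given in these notes; the layout below (key, NUL, value, NUL, repeated, UTF-8 encoded) is an assumption, and `encode_attrs` / `decode_attrs` are hypothetical names.

```python
def encode_attrs(attrs: dict) -> bytes:
    """Serialize string pairs as key NUL value NUL ... (assumed layout)."""
    out = bytearray()
    for key, value in attrs.items():
        if "\0" in key or "\0" in value:
            raise ValueError("keys and values must not contain NUL")
        out += key.encode("utf-8") + b"\0" + value.encode("utf-8") + b"\0"
    return bytes(out)


def decode_attrs(blob: bytes) -> dict:
    """Parse the assumed layout back into a dict."""
    if not blob:
        return {}
    if not blob.endswith(b"\0"):
        raise ValueError("truncated attribute data")
    fields = blob[:-1].split(b"\0")  # drop the final terminator, then split
    if len(fields) % 2:
        raise ValueError("dangling key without a value")
    return {fields[i].decode("utf-8"): fields[i + 1].decode("utf-8")
            for i in range(0, len(fields), 2)}


attrs = {"description": "ChIP peaks", "created.by": "example"}
assert decode_attrs(encode_attrs(attrs)) == attrs
```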