Releases: tanaylab/misha

5.6.23

29 Apr 07:47
da59284

Input-format ergonomics for BED/GFF/VCF

  • Improved error messages: "start exceeds or equals to end" now mentions misha's 0-based half-open convention and the GFF/VCF 1-based hint; "chromosome does not exist" lists known chromosomes and points to CHROM_ALIAS.
  • C++ converter now emits an R warning ("N intervals had start == end and were extended by 1bp") when zero-length intervals from a loaded file are auto-bumped — previously this happened silently.
  • Added gintervals.import_bed(), gintervals.import_gff(), gintervals.import_vcf() for direct import from common interval file formats. All three normalize chromosome names via the existing CHROM_ALIAS mechanism (so chr11 works), apply misha's 0-based half-open convention (subtracting 1 from start for the 1-based GFF/GTF/VCF inputs), and preserve common metadata columns (name/score/strand for BED; type/source/score/attrs for GFF; id/ref/alt/qual/filter/info for VCF).
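
The coordinate handling described above can be illustrated with a small Python sketch. It is not misha's converter: the function names are hypothetical, and extending the end coordinate (rather than the start) of a zero-length interval is an assumption, since the notes only say such intervals are "extended by 1bp".

```python
import warnings

def gff_to_half_open(start, end):
    """Convert a 1-based closed interval (GFF/GTF style) to 0-based half-open."""
    return start - 1, end

def bump_zero_length(intervals):
    """Extend start == end intervals by 1 bp and warn once with the count.

    Assumption: the end coordinate is the one that gets extended.
    """
    fixed, bumped = [], 0
    for start, end in intervals:
        if start == end:
            end += 1
            bumped += 1
        fixed.append((start, end))
    if bumped:
        warnings.warn(f"{bumped} intervals had start == end and were extended by 1bp")
    return fixed, bumped
```

For example, a GFF feature spanning bases 100 to 200 inclusive becomes the half-open interval (99, 200), which still covers 101 bases.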

Character strand input (also bundled; originally 5.6.22)

  • Intervals' strand column now accepts character ("+", "-", ".", "*", "") or factor input in addition to numeric 1/-1/0. Strings are normalized to the numeric convention at the R→C++ boundary; output stays numeric.
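
A minimal Python sketch of this kind of normalization follows. The mapping of ".", "*", and "" to 0 is an assumption inferred from the numeric 1/-1/0 convention, and `normalize_strand` is a hypothetical name, not part of misha.

```python
# Assumed mapping: unstranded markers (".", "*", "") all collapse to 0.
STRAND_MAP = {"+": 1, "-": -1, ".": 0, "*": 0, "": 0}

def normalize_strand(value):
    """Return the numeric strand (1, -1, or 0) for a string or numeric input."""
    if isinstance(value, str):
        if value not in STRAND_MAP:
            raise ValueError(f"unrecognized strand: {value!r}")
        return STRAND_MAP[value]
    if value in (1, -1, 0):
        return int(value)
    raise ValueError(f"unrecognized strand: {value!r}")
```

Doing the conversion once at the boundary means downstream code only ever sees the numeric form, which matches the note that output stays numeric.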

5.6.19

28 Apr 09:58
ba88e19

  • Fixed gintervals.load failing with "invalid columns definition" after gintervals.save of a bigset whose input had character chrom (e.g. a tibble from dplyr). On-disk per-chromosome files and the .meta zeroline now both store chrom/chrom1/chrom2 as factor with full ALLGENOME levels, and the on-disk frame is normalized to plain data.frame. (#102)

5.6.18

28 Apr 05:59
6402aac

  • Added the gmultitasking.strategy option for gextract (default "auto"; read via getOption("gmultitasking.strategy")). When the workload is large and spans many tracks, "auto" routes to a track-parallel mode (each parallel::mclapply worker handles a subset of tracks across all tiles) instead of the legacy tile-parallel mode (each forked child handles a range of tiles across all tracks). On a realistic workload of 3,110 motif tracks × 2.19M tiled_peaks, track-parallel mode took 57.6 min versus a projected ~3.4 h for tile-parallel, a 3.5× per-track speedup. Override per call via options(gmultitasking.strategy = "tracks" | "tiles" | "auto"). The heuristic stays on "tiles" for streaming iterators (numeric / NULL / 2D rect / track-name), for single-track calls or calls with fewer than 8 tracks, for file/intervals.set.out output, and for 2D band, so nothing else regresses (validated by a 36-cell benchmark matrix across iterator types × track counts × cache states).
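
The routing rules above can be summarized in a short Python sketch. This is an illustration of the stated conditions, not the actual implementation: `choose_strategy`, `split_tracks`, and the `large_workload` flag are hypothetical, and the notes do not say how "large" is measured or how tracks are divided among workers.

```python
def choose_strategy(n_tracks, streaming_iterator=False, file_output=False,
                    band_2d=False, large_workload=True, min_tracks=8):
    """Pick "tracks" or "tiles" following the conditions listed in the notes."""
    if streaming_iterator or file_output or band_2d:
        return "tiles"
    if n_tracks < min_tracks:
        return "tiles"
    if not large_workload:  # hypothetical flag; the real size threshold is not stated
        return "tiles"
    return "tracks"

def split_tracks(tracks, n_workers):
    """Round-robin split of tracks into non-empty per-worker subsets (assumed scheme)."""
    subsets = [tracks[i::n_workers] for i in range(n_workers)]
    return [s for s in subsets if s]
```

In track-parallel mode each worker would then process its subset across every tile, whereas tile-parallel mode gives every worker all tracks but only a slice of the tiles.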

5.6.17

26 Apr 16:51
eb30be9

Performance regression fix (vs v5.6.11–v5.6.16)

gextract calls touching many dense tracks (e.g. ~50 motif tracks) became 10–20× slower starting in v5.6.11. Two compounding causes:

  • MmapFile used MAP_POPULATE, which eagerly paged in every mapped track at every chromosome transition, even though read-ahead was already covered by MADV_SEQUENTIAL.
  • The two track-validation loops in create_expr_iterator and TrackExpressionVars::init were calling GenomeTrackFixedBin::init_read() once per chromosome per track on every gextract call, paying open + mmap + madvise + close + munmap each time even though they only needed bin size and file size. Replaced with a metadata-only path that stat()s for size and reads bin_size only once per track.

Net effect on a realistic workload (51 LSE motif vtracks × 7000 tiles × 5 chroms): 22s → 0.4s (~55× speedup, also faster than pre-audit baseline).

Added an opt-in performance regression test (MISHA_PERF_TESTS=true R -e "devtools::test(filter='perf-regression')") gated out of the parallel test suite.

5.6.15

19 Apr 11:23

Bug fixes

  • Fixed gsynth.train(), gsynth.sample(), and gsynth.random_seqs() silently reading sequences from the wrong chromosome when the intervals argument covered a subset of the genome that omitted one or more earlier chromosomes in the chromkey. For every chromosome in the input that came after a missing one, the C++ side opened the wrong chromosome's sequence (shifted by the number of earlier missing chromosomes), producing invalid models and corrupted sampled genomes without any error. Calls that passed intervals = gintervals.all() or left intervals at its default (which is gintervals.all()) were not affected. Users who ran these functions on custom interval subsets should re-run them with this version.
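
A toy Python model shows how this class of bug arises. It is not the actual C++ code: it only assumes, as the description suggests, that a chromosome's position within the input subset was used where its chromkey index was needed. The chromosome names and function names are hypothetical.

```python
CHROMKEY = ["chr1", "chr2", "chr3", "chr4"]  # hypothetical genome chromkey

def buggy_chrom_opened(subset, chrom):
    """Bug model: position within the subset is mistaken for the chromkey index."""
    return CHROMKEY[subset.index(chrom)]

def fixed_chrom_opened(subset, chrom):
    """Fix model: always resolve the chromosome through the chromkey itself."""
    return CHROMKEY[CHROMKEY.index(chrom)]

subset = ["chr2", "chr4"]  # chr1 and chr3 are missing from the intervals
buggy = {c: buggy_chrom_opened(subset, c) for c in subset}
fixed = {c: fixed_chrom_opened(subset, c) for c in subset}
```

Here chr2 is shifted by one (one earlier chromosome is missing) and chr4 by two, while a subset equal to the full chromkey gives identical results in both versions, consistent with gintervals.all() being unaffected.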

5.6.11

15 Apr 12:19
e35b6da

What's New

  • Added ggenome.implant() for replacing intervals in a reference genome with donor sequences and writing a new FASTA. Supports literal donor sequences or extraction from a misha database, with optional trackdb creation.
  • Added ggenome.transplant() as sugar for cross-genome sequence swaps — extracts from a source genome and implants into a target genome in a single call.
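
Conceptually, implanting replaces each target interval of a sequence with a donor string. The Python sketch below works on a single in-memory string with 0-based half-open coordinates. It is only an illustration: `implant` is a hypothetical helper, non-overlapping intervals are assumed, and nothing here reflects how ggenome.implant() handles FASTA files, donor lengths, or databases.

```python
def implant(sequence, edits):
    """Replace each [start, end) span of `sequence` with its donor string.

    `edits` is a list of (start, end, donor) tuples; spans must not overlap.
    Applying edits from right to left keeps earlier coordinates valid.
    """
    out = sequence
    for start, end, donor in sorted(edits, key=lambda e: e[0], reverse=True):
        out = out[:start] + donor + out[end:]
    return out
```

For instance, `implant("AAAAAAAAAA", [(2, 4, "CC"), (6, 9, "GGG")])` gives `"AACCAAGGGA"`. A transplant would be the same operation with the donor strings first extracted from another genome.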

5.6.7

25 Mar 11:42

  • Added PWM edit distance virtual track functions (pwm.edit_distance, pwm.edit_distance.pos, pwm.max.edit_distance, pwm.edit_distance.lse, pwm.edit_distance.lse.pos)
  • Added gseq.pwm_edits() for detailed per-edit information
  • Added a pigeonhole pre-filter for PWM edit distance genome-wide scans
  • Added sub-chromosome range splitting for gscreen, gextract, gsummary, gdist, and gcor (Pearson)
  • Fixed gscreen returning split intervals at sub-chromosome parallel boundaries
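
For readers unfamiliar with pigeonhole pre-filtering, the Python sketch below shows the general technique on plain strings with Hamming distance: if a window differs from the pattern in at most k positions, then at least one of k+1 disjoint pattern pieces must match exactly, so windows with no exact piece match can be skipped cheaply. This is the textbook string version, not misha's PWM-specific filter, whose details are not described in these notes.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def pieces(pattern, k):
    """Split pattern into k+1 contiguous pieces; return (offset, piece) pairs."""
    n = k + 1
    base, extra = divmod(len(pattern), n)
    out, pos = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        out.append((pos, pattern[pos:pos + size]))
        pos += size
    return out

def search(text, pattern, k):
    """Return start positions where pattern matches text with <= k mismatches."""
    m, hits, parts = len(pattern), [], pieces(pattern, k)
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        # Pre-filter: with <= k mismatches, some piece must match exactly.
        if not any(window[off:off + len(p)] == p for off, p in parts):
            continue
        if hamming(window, pattern) <= k:
            hits.append(start)
    return hits
```

The filter never discards a true hit; it only avoids the full comparison for windows that cannot qualify.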

5.6.6

19 Mar 22:32
08c4602

  • Replaced non-API C entry point Rf_findVar with R_getVar/R_getVarEx for R 4.6.0 compatibility.
  • Fixed CRAN check NOTE about non-standard top-level files.
  • Fixed CRAN check WARNING about pipe.Rd documentation mismatch.

5.6.1

13 Mar 10:33

Changes

  • gsynth.save() and gsynth.load() now use the cross-platform .gsm format (YAML metadata + binary arrays) instead of R-specific RDS. Models saved with pymisha can now be loaded in R and vice versa. Legacy RDS files are still supported for backward compatibility.

  • Added compress parameter to gsynth.save() to optionally save as a ZIP archive.

  • Added gsynth.convert() to convert legacy RDS model files to the new .gsm format.

  • Fixed gdb.create_genome() example to use \dontrun instead of \donttest to prevent R CMD check failures when S3 download times out.

5.6.0

09 Mar 15:27

  • Added gintervals.attr.get(), gintervals.attr.set(), gintervals.attr.export(), and gintervals.attr.import() for managing interval set attributes. Attributes are stored as .iattr binary files (null-separated key/value pairs) next to .interv files for small interval sets, or inside the directory for big interval sets.

  • gintervals.rm() now cleans up companion .iattr attribute files when deleting interval sets.
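
To make "null-separated key/value pairs" concrete, here is a Python sketch of one plausible encoding: each key and each value is followed by a NUL byte. The exact .iattr byte layout (terminators, encoding, ordering) is not specified in these notes, so this layout and both function names are assumptions for illustration only.

```python
def encode_attrs(attrs):
    """Serialize a dict as key NUL value NUL ... (assumed layout, UTF-8 text)."""
    out = bytearray()
    for key, value in attrs.items():
        out += key.encode("utf-8") + b"\0" + value.encode("utf-8") + b"\0"
    return bytes(out)

def decode_attrs(blob):
    """Parse the assumed layout back into a dict of strings."""
    fields = blob.split(b"\0")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty field after the final NUL terminator
    if len(fields) % 2:
        raise ValueError("odd number of fields: missing value for last key")
    return {fields[i].decode("utf-8"): fields[i + 1].decode("utf-8")
            for i in range(0, len(fields), 2)}
```

A round trip such as `decode_attrs(encode_attrs({"source": "ENCODE"}))` returns the original dictionary under this assumed layout.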