Local automatic audio section slicer. Takes an audio file, finds the section
boundaries, groups repeated sections under shared labels (A, B, C, …),
exports each section as a WAV slice, and writes sections.json plus
sections.csv. Runs entirely on your machine — no cloud, no upload.
Requires Python 3.11+ and pip.
python -m pip install "git+https://github.com/VanKyle00/SongSlice.git"
This installs the songslice CLI directly from this repo. No clone required.
git clone https://github.com/VanKyle00/SongSlice.git
cd SongSlice
python -m pip install -e ".[dev]"
The [dev] extra pulls in test and benchmarking dependencies (pytest, mir_eval). Use plain python -m pip install -e . if you only want the runtime.
songslice --help
Slice an audio file into labeled sections.
songslice analyze .\song.wav --out .\exports
Options:
| Flag | Default | Description |
|---|---|---|
| --out, -o | exports | Output directory for slices + metadata. |
| --min-section-seconds | 8.0 | Minimum section duration. Larger values force coarser segmentation. |
| --max-sections | 16 | Hard cap on the number of sections. |
| --label-groups | auto | Number of distinct section labels. The default auto-selects from the affinity eigengap. A forced value is capped at 8 (and at the number of detected sections). |
A run prints a short summary:
Detected structure: Intro B C D B E D F G H D
Exported 11 slices to .\exports
Overall confidence: 0.63
Run the local web app for upload + browser-based workflow.
songslice serve --port 8000
Then open http://127.0.0.1:8000.
Score the analyzer against ground-truth annotations (SALAMI / Isophonics formats). Useful when iterating on the analyzer.
songslice bench --manifest .\bench_manifest.toml --report .\bench-report.json
The manifest is TOML:
[[tracks]]
id = "SALAMI_10"
audio = "C:/Music/salami/10.mp3"
annotations = "C:/datasets/salami-data-public/annotations/10/parsed/textfile1_uppercase.txt"
format = "salami"
Reports per-track boundary F-measure at ±0.5 s and ±3 s plus pairwise label F-measure, then a corpus mean.
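To make the boundary metric concrete, here is a minimal sketch of a hit-rate F-measure: an estimated boundary counts as a hit when it falls within the tolerance window of a reference boundary that has not been matched yet. This is not the project's scorer. The function name `boundary_f_measure`, the greedy matching, and the toy boundary times are all assumptions made for illustration.

```python
def boundary_f_measure(reference, estimated, window):
    """Return (precision, recall, F) for boundary times given in seconds.

    Illustrative only: a greedy two-pointer match over sorted times,
    where each boundary can be matched at most once.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    hits = 0
    i = j = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= window:
            # Within tolerance: count a hit and consume both boundaries.
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # estimate is too early to match this or any later reference
        else:
            i += 1  # reference is too early to match this or any later estimate
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical boundary times for one track.
reference = [0.0, 19.2, 33.0, 61.5]
estimated = [0.0, 19.5, 35.0, 61.4, 80.0]
for window in (0.5, 3.0):
    p, r, f = boundary_f_measure(reference, estimated, window)
    print(f"+/-{window} s: P={p:.2f} R={r:.2f} F={f:.2f}")
```

In this toy case the boundary estimated at 35.0 s misses the reference at 33.0 s under the 0.5 s window but counts as a hit under the 3 s window, which is why the two tolerances are reported side by side.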
For an input song.wav, the output directory contains:
- Ordered WAV slices: 01_Intro_0m00s-0m19s.wav, 02_B_0m19s-0m33s.wav, …
- sections.json: source file, duration, overall confidence, detected structure string, and per-section detail (label, start/end, durations, boundary + group confidences, beat-snap flag).
- sections.csv: one row per section, same fields.
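The slice names follow a visible pattern: a two-digit index, the label, and the start and end offsets. The sketch below reproduces that pattern from the two example names only. The helper names `format_offset` and `slice_filename` are hypothetical, and truncating to whole seconds (rather than rounding) is an assumption.

```python
def format_offset(seconds):
    """Render an offset such as 19.4 as '0m19s' (whole seconds, truncated)."""
    total = int(seconds)
    return f"{total // 60}m{total % 60:02d}s"


def slice_filename(index, label, start, end):
    """Build a name like 01_Intro_0m00s-0m19s.wav from a 1-based index."""
    return f"{index:02d}_{label}_{format_offset(start)}-{format_offset(end)}.wav"


print(slice_filename(1, "Intro", 0.0, 19.4))
print(slice_filename(2, "B", 19.4, 33.0))
```

The zero-padded index keeps the slices in playback order when a file browser sorts them by name.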
Labels are letters A, B, C, … assigned in temporal order, with the
special label Intro when the first section is detected to be instrumental
(see below).
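As a rough illustration of lettering in temporal order, the sketch below maps arbitrary cluster ids to letters by first appearance and then renames the first section. It assumes the grouping stage yields one integer cluster id per section. The function `letter_labels` and its `first_is_intro` flag are hypothetical names, not part of the songslice API.

```python
from string import ascii_uppercase


def letter_labels(cluster_ids, first_is_intro=False):
    """Assign A, B, C, ... to cluster ids in order of first appearance."""
    mapping = {}
    labels = []
    for cid in cluster_ids:
        if cid not in mapping:
            # The next unused letter goes to each newly seen cluster.
            mapping[cid] = ascii_uppercase[len(mapping)]
        labels.append(mapping[cid])
    if first_is_intro and labels:
        # Only the first section is renamed; its letter stays consumed.
        labels[0] = "Intro"
    return labels


# Hypothetical cluster ids for seven consecutive sections.
print(" ".join(letter_labels([4, 2, 0, 1, 2, 3, 1], first_is_intro=True)))
```

Because the first section consumes the letter A before it is renamed, a structure string that starts with Intro continues at B, as in the sample run output.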
- Decode + resample. Audio is loaded at 22 050 Hz mono.
- Feature extraction. HPSS-separated harmonic + percussive components; chroma CENS (key/chord profile), MFCC (timbre), RMS, spectral centroid, and harmonic-RMS (voice-activity proxy) at hop-rate.
- Beat tracking. Frame features are synced to detected beats; downstream work happens on beat-segments.
- Boundary detection. Foote checkerboard novelty on a combined chroma + MFCC self-similarity matrix; boundary candidates are picked from novelty peaks. Candidates are re-ranked by a vocal-change score: peaks where the harmonic component's energy shifts (voice entering/leaving) get boosted.
- Beat snapping. Each surviving boundary snaps to the nearest beat when close enough.
- Section grouping. Each segment gets two feature blocks:
  - Harmonic (chroma mean + std, 24 dims)
  - Timbral (MFCC + RMS + centroid means and stds, 28 dims)

  Pairwise Pearson correlation on each axis yields a harmonic and a timbral affinity. Their element-wise product, high only when BOTH axes agree, is the affinity for spectral clustering: segments are partitioned by normalized-cut clustering of the affinity's Laplacian, with the number of label groups taken from the eigengap (capped internally at 8). Pass --label-groups N to force a specific count. This follows McFee & Ellis, "Analyzing song structure with spectral clustering" (ISMIR 2014).
- Adjacent merge. Consecutive sections with the same label collapse into one.
- Intro labeling. If the first section has notably lower MFCC mid-coefficient variability than the next section (voices wiggle the spectral envelope more than sustained instruments) AND the two sections are feature-different, the first section's label becomes Intro.
- Export. WAV slices, JSON, CSV.
If the default analysis under- or over-segments a particular track:
- Too many short sections within what's perceptually one part? Raise --min-section-seconds (try 15 or 20). This forces detected boundaries further apart.
- Two sections wrongly grouped under the same label? Force a finer split: set --label-groups one higher than the detected label count.
- Repeated sections wrongly given different labels? Force a coarser grouping: set --label-groups one lower.
- Too many sections overall? Lower --max-sections.
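One way to picture why a larger minimum section duration gives coarser segmentation is a greedy spacing rule: keep the strongest boundary candidates first and drop any candidate that lands too close to one already kept. The sketch below is an assumption about how such a constraint could work, not a description of how songslice enforces it. The function `enforce_min_gap` and the candidate data are hypothetical, and the track start and end are ignored.

```python
def enforce_min_gap(candidates, min_gap):
    """Keep the strongest candidates that are at least min_gap seconds apart.

    candidates: list of (time_seconds, strength) pairs.
    Returns the kept boundary times in ascending order.
    """
    kept = []
    # Visit candidates from strongest to weakest novelty.
    for time, _strength in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(abs(time - other) >= min_gap for other in kept):
            kept.append(time)
    return sorted(kept)


# Hypothetical novelty peaks: (time in seconds, strength).
candidates = [(19.0, 0.9), (26.0, 0.4), (33.0, 0.8), (41.0, 0.3), (52.0, 0.7)]
print(enforce_min_gap(candidates, 8.0))
print(enforce_min_gap(candidates, 15.0))
```

With these toy peaks the 8-second setting keeps four boundaries and the 15-second setting keeps two: the weaker peaks near a strong one are the first to go.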
The analyzer uses classical DSP features (chroma, MFCC, RMS, spectral centroid, HPSS-derived voice proxy) plus structural clustering — not a trained section-classification model. Concrete consequences:
- Subtle verse-to-chorus transitions can be missed, especially in genres with continuous dynamics (shoegaze, dream-pop, ambient pop). Songs where the chorus enters by gradually layering instruments often produce a weaker novelty peak at the section entry than at sharper mid-section events (drum entries, chord-progression cadences). Tuning --min-section-seconds can help; sometimes no automatic setting matches human perception exactly.
- The Intro heuristic uses MFCC mid-coefficient variability as a proxy for vocal vs. instrumental content. It's not real voice-activity detection: a section with a busy lead instrument (saxophone solo, fast guitar) may register as "vocal-like" and skip the Intro rebrand. A section with a steady synth and no vocals will register as instrumental and get the rebrand. This works for the common case but isn't infallible.
- Labels are structural, not semantic. A B section is "everything grouped under letter B," not specifically "verse" or "chorus." Two sections sharing a label mean only that the analyzer found them musically similar; the meaning is whatever the song's actual structure is.
- No support for stems or multi-track input; analysis is on the mixed audio.
- Long sections (multiple minutes) of homogeneous content may produce fewer boundaries than a human listener expects; conversely, arrangement-heavy sections with internal variation may be over-split.
The grouping step is classical spectral clustering. A learned model could do
better, especially at distinguishing functional sections (verse vs chorus)
rather than only structural similarity. The natural next step is an optional
trained analyzer behind the existing Analyzer interface, following:
Morgan Buisson, Brian McFee, Slim Essid. "Using Pairwise Link Prediction and Graph Attention Networks for Music Structure Analysis." ISMIR 2024. Code: https://github.com/morgan76/LinkSeg
That method is lightweight (<330K parameters, runs locally) and is strongest at structural grouping and section labeling — the same axis this release improves with DSP. Boundary detection would stay with the current novelty-based stage.
- Section grouping uses spectral clustering after Brian McFee and Daniel P. W. Ellis, "Analyzing song structure with spectral clustering," ISMIR 2014.
python -m pytest tests/
The suite covers feature extraction edge cases, boundary detection fallbacks, two-axis discrimination (same pitch / different timbre and vice versa), adjacent-duplicate merging, intro labeling, CLI invocation, metadata writing, and the bench scorer.