83 changes: 80 additions & 3 deletions SKILL.md
@@ -1,11 +1,21 @@
---
name: clipify
description: Find the funniest moments in a video, cut them as standalone clips, optionally reformat 16:9 → 9:16 (face-pan or split-screen), and burn opus-style word-by-word captions. Use when the user mentions "clipify," "cut clips from this video," "make shorts from this," "find funny moments," "reframe to 9:16," "vertical clips," or pastes a video file path and wants social-ready cuts.
description: Find compelling moments in a video — funny dialogue OR repeated impact actions like axe chops, hits, throws, drumbeats — cut them as standalone clips, optionally reformat 16:9 → 9:16, time-warp pans for tighter reveals, and burn opus-style word-by-word captions. Use when the user mentions "clipify," "cut clips from this video," "make shorts from this," "find funny moments," "action montage," "cut on each chop/hit/punch," "reframe to 9:16," "vertical clips," or pastes a video file path and wants social-ready cuts.
---

# Clipify

Find the funniest moments in a video, cut them as standalone clips, optionally reformat 16:9 → 9:16 (face-pan or split-screen), and burn opus-style word-by-word captions.
Find compelling moments in a video, cut them as standalone clips, optionally reformat 16:9 → 9:16 (face-pan or split-screen), time-warp pans for tighter reveals, and burn opus-style word-by-word captions.

## Modes

Pick one before Step 1. The choice changes how you find moments and whether captions apply.

- **Dialogue mode** (default for podcasts, interviews, talking-heads): Whisper transcribes, you scan for punchlines / reactions / awkward pauses. → Step 1A. Captions in Step 5.
- **Action mode** (woodchopping, sports, percussion, drumming, anything with repeated impact sounds): skip Whisper entirely, run `detect_transients.py` to find each strike, cut on each one. → Step 1B. Skip Step 5; the strike sounds *are* the rhythm.
- **Hybrid**: dialogue mode for talking sections, action mode for the rest, intercut. Use both 1A and 1B.

If the user says "I'm not talking in this," you're in action mode. If they say "cut on each X" (chop, hit, punch, drumbeat, swing, footstep), you're in action mode. Default to dialogue mode otherwise.
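The selection rule above can be sketched as a small heuristic. This is an illustration only: `pick_mode` and the `ACTION_NOUNS` list are hypothetical names built from the examples in the text, and hybrid requests still need a human judgment call.

```python
import re

# Hypothetical trigger words, taken from the examples in the paragraph above.
ACTION_NOUNS = ("chop", "hit", "punch", "drumbeat", "swing", "footstep")


def pick_mode(request: str) -> str:
    """Return "action" or "dialogue" for a user request (illustrative heuristic)."""
    text = request.lower()
    # "I'm not talking in this" means there is nothing for Whisper to transcribe.
    if "not talking" in text:
        return "action"
    # Matches phrases such as "cut on each chop" or "cut on every hit".
    pattern = r"cut on (each|every) (" + "|".join(ACTION_NOUNS) + r")"
    if re.search(pattern, text):
        return "action"
    # Everything else falls back to the default.
    return "dialogue"
```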

## Inputs

@@ -23,14 +33,15 @@ Find the funniest moments in a video, cut them as standalone clips, optionally r
- `build_pan.py` — ffmpeg crop x-expression with hard cuts
- `build_ass.py` — opus-style ASS captions from whisper JSON
- `audio_align.py` — find offset of a sub-clip in a longer source
- `detect_transients.py` — find sharp impact sounds (action mode); see `--help`

Working dir: `/tmp/clipify/` (mkdir at start, leave artifacts for debugging).

---

## Workflow

### Step 1 — Find the funniest parts
### Step 1A — Find dialogue moments (dialogue mode)

```bash
mkdir -p /tmp/clipify
@@ -48,6 +59,45 @@ Read the resulting JSON (or `.txt`) and pick 3–5 candidate clips. Funny signal

For each candidate, propose: `[start, end, why-it's-funny, suggested title]`. Aim for 10–25s clips. Show the list and let the user confirm/pick.

### Step 1B — Find each strike (action mode)

Replace Whisper with audio-transient detection. The detector finds sharp impacts (axe-on-wood, fist-on-pad, drumhead, ball-on-bat, wood landing on a pile) by computing spectral flux on the 1–6 kHz band. An impact produces a broadband impulse in that band; ambient noise and wind do not.

```bash
ffmpeg -y -hwaccel videotoolbox -i "$VIDEO" -vn -ac 1 -ar 16000 /tmp/clipify/audio.wav
python3 <skill-dir>/scripts/detect_transients.py /tmp/clipify/audio.wav \
  --min-flux 30 --min-gap 0.6 > /tmp/clipify/strikes.json
```

Tuning:

- `--min-flux 30` is the sane default. Real impacts register 50–300; ambient/wind sits at 5–15. Too many false positives → raise to 40–50. Too few hits → drop to 20.
- `--min-gap 0.6` (seconds) is the closest two strikes can be. Fast drumming may need 0.2; chopping is fine at 0.6–1.0.
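- `--band 1000:6000` covers wood/metal/glass impacts. Heavy thuds (kick drum, body shots) live lower (`--band 200:2000`); whistles/clinks/snare cracks higher (`--band 4000:10000`). At the 16 kHz sample rate used above the audio holds nothing above 8 kHz, so the script clamps the upper edge there; extract with a higher `-ar` (e.g. 44100) if you need the full band, and expect to retune `--min-flux` afterwards.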
- For multi-clip sources (a folder of phone clips), run the detector on each clip and pick a varied set across angles. Aim for 25–40 strikes for a 45–60s montage at ~0.9s per shot.
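To make the `--min-gap` behaviour concrete, here is a minimal sketch of the suppression rule on already-detected peaks. It mirrors the logic in `detect_transients.py` in simplified form; `enforce_min_gap` is a hypothetical helper name, not part of the script.

```python
def enforce_min_gap(peaks, min_gap=0.6):
    """Keep at most one peak per min_gap window, preferring the stronger one.

    peaks: list of {"time": seconds, "flux": strength}, sorted by time.
    """
    kept = []
    for p in peaks:
        if kept and p["time"] - kept[-1]["time"] < min_gap:
            # Too close to the previous accepted peak: keep whichever is stronger.
            if p["flux"] > kept[-1]["flux"]:
                kept[-1] = p
            continue
        kept.append(p)
    return kept
```

With a 0.6 s gap, a weak pre-hit at 1.0 s followed by the real strike at 1.3 s collapses into the single stronger peak at 1.3 s.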

Each "shot" should start ~0.4s before the strike (showing wind-up) and end ~0.5s after (strike + brief follow-through). That's ~0.9s per beat.
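The window arithmetic above could be applied to the detector output roughly like this. It is a sketch that assumes the `time` field from `strikes.json`; `shot_windows` and its `pre`/`post` parameters are illustrative names.

```python
def shot_windows(strikes, pre=0.4, post=0.5):
    """Turn strike times into trim windows: wind-up before, follow-through after."""
    windows = []
    for s in strikes:
        start = max(0.0, s["time"] - pre)  # clamp so a very early strike stays >= 0
        end = s["time"] + post
        windows.append({
            "start": round(start, 3),
            "duration": round(end - start, 3),
        })
    return windows
```

Each window's `start` and `duration` map onto `-ss` and `-t` when trimming the shot.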

### Step 1.5 — Verify each candidate is in frame

**Always do this before rendering**, in either mode. Audio detection finds the *sound* of an event; the camera might not have captured it. Skipping this is the #1 way to get user feedback like "you cut on chops where I'm not even visible."

```bash
# Sample one frame at each candidate strike time T:
for T in $STRIKE_TIMES; do
  ffmpeg -y -ss "$T" -i "$VIDEO" -frames:v 1 -vf scale=320:-1 \
    /tmp/clipify/verify_${T}.jpg
done
```

Read each verify image and drop the candidate if:

- The subject is bent down, off-frame, or behind an obstacle
- A thumb is on the lens (yes, this happens — and you only catch it visually)
- It's clearly a different sound source (door slam, dropped tool) that triggered the detector

For ~30 candidates this is one fast `ffmpeg` call each plus a single batch image read. Cheap insurance.
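If the strike times come from `strikes.json`, the same frame sampling can be driven from Python. The sketch below only builds the argument lists, using the flags from the shell loop above; `verify_commands` is a hypothetical helper and the zero-padded file naming is an assumption.

```python
def verify_commands(video, strike_times, out_dir="/tmp/clipify"):
    """Build one ffmpeg argv per strike time, grabbing a 320px-wide frame."""
    cmds = []
    for t in strike_times:
        stamp = f"{t:.3f}"
        out = f"{out_dir}/verify_{stamp}.jpg"
        cmds.append([
            "ffmpeg", "-y", "-ss", stamp, "-i", video,
            "-frames:v", "1", "-vf", "scale=320:-1", out,
        ])
    return cmds
```

Run each list with `subprocess.run(cmd, check=True)`, then read the resulting images in one batch.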

### Step 2 — Trim each chosen clip

```bash
@@ -125,8 +175,32 @@ Two stacked tiles, 1080×960 each. The active speaker's tile is on top — overl

Build `<RIGHT_SPEAKER_ENABLE>` from `segments.json` as `between(t,a,b)+between(t,a,b)+...` over the right-speaker segments. Tile crops should target ~720×640 around each face (1.125:1 to match 1080×960).
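A minimal sketch of building that expression follows. The schema of `segments.json` is not shown here, so the `start`, `end`, and `speaker` keys are assumptions, and `enable_expr` is a hypothetical helper name.

```python
def enable_expr(segments, speaker="right"):
    """Join one between(t,a,b) term per segment spoken by `speaker`.

    Assumes each segment looks like {"start": 0.0, "end": 2.5, "speaker": "right"}.
    """
    parts = [
        f"between(t,{s['start']:.2f},{s['end']:.2f})"
        for s in segments
        if s["speaker"] == speaker
    ]
    # "0" keeps the filter valid (never enabled) if the speaker has no segments.
    return "+".join(parts) if parts else "0"
```

Commas inside a filter option value generally need quoting or escaping, so wrap the result in single quotes when substituting it into the filtergraph.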

### Step 4c — Time-warping reveals (optional)

When the user wants a slow scenic / pan / reveal shot tightened ("cut this in half") without losing the arc, speed-ramp instead of trimming. Trimming forces you to drop part of the arc; speed-ramping preserves all the beats at higher tempo.

```bash
# 2x speed: 12s arc → 6s output, audio pitch preserved via atempo
ffmpeg -y -ss "$START" -i "$CLIP" -t "$ORIG_DUR" \
  -vf "setpts=PTS/2,scale=1080:1920:flags=lanczos,fps=30" \
  -af "atempo=2.0" \
  -c:v libx264 -preset fast -crf 21 -pix_fmt yuv420p \
  -c:a aac -b:a 128k /tmp/clipify/clip_2x.mp4
```

Rules of thumb:

- **1.5x** — human movement that should feel "slightly brisk" without looking sped up
- **2x** — punchier reveal; still reads as natural
- **3x–4x** — time-lapse vibe (chain `atempo=2.0,atempo=2.0` for 4x; `atempo` accepts only 0.5–2.0 per filter instance)
- Drop `-af atempo` and mute the segment (`-an`) if the audio is just ambient/wind and the sped-up sound would be distracting

Useful when the rest of the cut is rhythmic (chop montage, chat back-and-forth) and the reveal would otherwise feel like a dead spot.
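For speeds other than 2x, the filter strings can be generated instead of hand-written. This sketch assumes the 0.5–2.0 per-instance `atempo` limit stated above; `atempo_chain` and `speed_filters` are hypothetical helper names.

```python
def atempo_chain(speed):
    """Split a speed factor into chained atempo filters, each within 0.5-2.0."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    factors = []
    remaining = float(speed)
    while remaining > 2.0:  # peel off 2x stages for fast playback
        factors.append(2.0)
        remaining /= 2.0
    while remaining < 0.5:  # peel off 0.5x stages for slow playback
        factors.append(0.5)
        remaining /= 0.5
    factors.append(remaining)
    return ",".join(f"atempo={f:g}" for f in factors)


def speed_filters(speed):
    """Return (video_filter, audio_filter) for one constant speed factor."""
    return f"setpts=PTS/{speed:g}", atempo_chain(speed)
```

The video half goes in front of the existing `scale`/`fps` chain in `-vf`; the audio half goes in `-af`.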

### Step 5 — Add subtitles

Skip this step in action mode — the strike sounds are the rhythm and captions just clutter the visual.

Ask once (only if user hasn't already specified a style):

> "Three subtitle styles: **opus** (big bold white, yellow active-word highlight), **karaoke** (4-word chunks, green highlight), **minimal** (clean Helvetica, no highlight). Or paste an example you like."
@@ -159,6 +233,9 @@ ffmpeg -y -i /tmp/clipify/clip_panned.mp4 -vf "subtitles=/tmp/clipify/captions.a

## Pitfalls (lessons from prior runs — don't repeat)

- **Audio detection ≠ visible event.** Always run Step 1.5 (verify each candidate frame) before rendering. The detector finds the *sound* of a chop, not whether the chopper is in frame. Hits where the subject is bent over, off-camera, or where a thumb is on the lens still trigger the audio detector. Catch them before rendering.
- **Spectral-flux band matters.** Default `--band 1000:6000` covers wood/metal/glass impacts. Heavy low thuds (kick drum, body shots) need `--band 200:2000`. Whistles, clinks, snare cracks need `--band 4000:10000`. If `--min-flux 30` returns nothing, try lowering to 15 first; if it returns thousands, try a different band before raising the threshold.
- **Speed-ramp audio in pairs.** `atempo` accepts only 0.5–2.0 in a single filter; for 4x chain `atempo=2.0,atempo=2.0`. Without `atempo`, `setpts` speeds up only the video: the audio keeps its original length and drifts out of sync.
- **Don't over-tune ROIs.** Two iterations max. Motion-diff is forgiving — wider ROIs covering mouth+chin work fine even if not perfectly mouth-centered.
- **Watch out for scene cuts inside a clip.** Run `ffmpeg -i "$CLIP" -filter:v "select='gt(scene,0.3)',showinfo" -f null -` to count cuts. If a 16:9→9:16 clip has many cuts, the fixed face ROIs only work for the dominant scene; warn the user, and offer to either pick a single-take clip or accept off-center framing during cuts.
- **Source resolution matters.** If source is 4K, either downscale to 1920×1080 first (faster, fine for 9:16 output) or multiply all ROI/pan coordinates by 2.
93 changes: 93 additions & 0 deletions scripts/detect_transients.py
@@ -0,0 +1,93 @@
#!/usr/bin/env python3
"""Detect sharp audio transients (impacts, hits, claps) via spectral flux on a chosen freq band.

Use for action montages where you want to cut on each strike: axe-on-wood, drum
hit, fist-on-pad, ball-on-bat, wood landing on a pile, etc. The detector finds
the broadband impulse that an impact creates, which ambient noise/wind doesn't.

Usage:
    python3 detect_transients.py audio.wav [--band 1000:6000] \
        [--min-flux 30] [--min-gap 0.6] [--top N]

Output: JSON list of {time, flux, db}, sorted by time, to stdout.

Tuning notes:
  - min-flux 30 is a sane default. Real impacts register 50–300; ambient ~5–15.
    Too many false positives → raise to 40–50. Too few → drop to 20.
  - min-gap is min seconds between accepted peaks. Drumming: ~0.2; chopping: 0.6–1.0.
  - band: 1000:6000 covers axe/wood/metal impacts. Heavy thuds (kick, body shot)
    live lower (200:2000). Whistles, clinks, snare cracks live higher (4000:10000).
"""
import sys, json, wave, argparse
import numpy as np


def detect(path, band_lo=1000, band_hi=6000, min_flux=30.0, min_gap=0.6):
    w = wave.open(path, 'rb')
    sr = w.getframerate()
    n = w.getnframes()
    data = np.frombuffer(w.readframes(n), dtype=np.int16).astype(np.float32) / 32768.0
    w.close()

    win_size = int(sr * 0.025)  # 25 ms windows
    hop = int(sr * 0.01)  # 10 ms hop
    n_frames = (len(data) - win_size) // hop
    if n_frames < 4:
        return []
    frames = np.lib.stride_tricks.as_strided(
        data,
        shape=(n_frames, win_size),
        strides=(data.strides[0] * hop, data.strides[0]),
    )
    spec = np.abs(np.fft.rfft(frames * np.hanning(win_size), axis=1))

    bin_lo = max(1, int(band_lo * win_size / sr))
    bin_hi = min(spec.shape[1], int(band_hi * win_size / sr))
    diff = np.diff(spec[:, bin_lo:bin_hi], axis=0)
    flux = np.sum(np.maximum(diff, 0), axis=1)
    times = (np.arange(len(flux)) + 1) * hop / sr

    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-10)
    db = 20 * np.log10(rms + 1e-10)

    min_gap_frames = max(1, int(min_gap * sr / hop))
    peaks = []
    last = -min_gap_frames
    for i in range(2, len(flux) - 2):
        if flux[i] < min_flux:
            continue
        if flux[i] != np.max(flux[max(0, i - 3):i + 4]):
            continue
        rec = {
            'time': round(float(times[i]), 3),
            'flux': float(flux[i]),
            'db': round(float(db[i]), 1) if i < len(db) else 0.0,
        }
        if i - last < min_gap_frames:
            if peaks and rec['flux'] > peaks[-1]['flux']:
                peaks[-1] = rec
                last = i
            continue
        peaks.append(rec)
        last = i
    return peaks


if __name__ == '__main__':
    p = argparse.ArgumentParser()
    p.add_argument('audio_wav')
    p.add_argument('--band', default='1000:6000', help='freq band Hz, "lo:hi"')
    p.add_argument('--min-flux', type=float, default=30.0)
    p.add_argument('--min-gap', type=float, default=0.6,
                   help='min seconds between accepted peaks')
    p.add_argument('--top', type=int, default=0,
                   help='keep only N strongest peaks (0 = all)')
    args = p.parse_args()

    lo, hi = (int(x) for x in args.band.split(':'))
    peaks = detect(args.audio_wav, lo, hi, args.min_flux, args.min_gap)
    if args.top:
        peaks = sorted(peaks, key=lambda x: -x['flux'])[:args.top]
        peaks.sort(key=lambda x: x['time'])
    json.dump(peaks, sys.stdout, indent=2)
    sys.stdout.write('\n')