ClipAI - Browser NLE Editor Plan

The Problem

Current architecture has a fatal loop: every change requires a full server-side re-render. User changes caption color → waits 2 minutes for Modal to re-render → sees result → wants to tweak → waits 2 more minutes. This kills the editing experience.

Additionally, there are three critical bugs in the current pipeline:

Bug 1: Caption Drift (Captions Fall Behind)

Root cause: The render pipeline applies cuts in step 2 (removing dead air), which shifts the video timeline. But captions in step 5 still reference the ORIGINAL word timestamps from Deepgram. If 5 seconds of silence are removed before t=30, all captions after that point are 5 seconds late. The drift accumulates with each cut - more cuts = worse drift.

Fix: After applying cuts, remap all word timestamps to the new post-cut timeline. Build a time mapping function: originalTime → postCutTime.
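The remapping can be sketched as a small pure function. This is only an illustration, assuming cuts are non-overlapping and that a word keeps its original duration; the `Cut` and `Word` shapes and the names `toPostCutTime` and `remapWords` are hypothetical, not taken from the existing pipeline.

```typescript
// Hypothetical shapes for illustration; real transcript words carry more fields.
interface Cut { start: number; end: number }
interface Word { text: string; start: number; end: number }

// Map a source-timeline time to the post-cut timeline.
// Returns null when the time falls inside a removed region.
function toPostCutTime(t: number, cuts: Cut[]): number | null {
  let removed = 0
  const sorted = [...cuts].sort((a, b) => a.start - b.start)
  for (const cut of sorted) {
    if (t >= cut.end) {
      removed += cut.end - cut.start // cut lies entirely before t
    } else if (t >= cut.start) {
      return null // t is inside this cut
    } else {
      break // remaining cuts start after t
    }
  }
  return t - removed
}

// Shift every word onto the post-cut timeline, dropping words that start inside a cut.
function remapWords(words: Word[], cuts: Cut[]): Word[] {
  const out: Word[] = []
  for (const w of words) {
    const start = toPostCutTime(w.start, cuts)
    if (start === null) continue
    out.push({ text: w.text, start, end: start + (w.end - w.start) })
  }
  return out
}
```

With cuts at 5-7s and 12-14s, a word spoken at source time 14s lands at 10s in the output, and a word that starts at 6s is dropped.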

Bug 2: Zooms Don't Work

Root cause: The zoom step in renderer.py is treated as non-critical: it is wrapped in a try/except that silently continues, so failures never surface. The FFmpeg scale+crop expression is likely failing due to complex time-based conditionals. Zooms also share the timeline remapping issue described for captions: zoom timestamps reference the pre-cut timeline.

Fix: Debug the FFmpeg expression. Apply the same post-cut time remapping to zoom keyframes. Remove the silent failure - log errors properly.

Bug 3: Abrupt Cuts (No J-Cuts)

Root cause: The cut system just hard-splices video segments together using FFmpeg trim + concat. There's no audio crossfade, no brief overlap, no smoothing. The audio cuts mid-waveform creating pops and jarring transitions.

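Fix: Add a small audio crossfade (50-100ms) between segments. Use the FFmpeg acrossfade filter, or overlap segments slightly with fade in/out. Note that acrossfade overlaps its two inputs, so each join shortens the audio by the crossfade duration; the video concat and the time mapping need to account for this, or a small drift creeps back in.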
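As a rough illustration of the acrossfade option, the helper below builds a filter graph string that chains `acrossfade` across several audio streams. It is written in TypeScript only to match the other sketches in this plan (the renderer itself is Python). The input labels are assumed to exist already in the graph, for example as outputs of per-segment trim filters, and `buildCrossfadeChain` is a hypothetical name.

```typescript
// Chain FFmpeg's acrossfade filter across N audio streams.
// `labels` are stream labels assumed to be defined earlier in the filter graph.
function buildCrossfadeChain(
  labels: string[],
  seconds: number,
): { graph: string; out: string } {
  const [first, ...rest] = labels
  if (first === undefined) throw new Error("need at least one audio stream")

  let current = first
  const parts: string[] = []
  rest.forEach((next, i) => {
    const out = `xf${i}`
    // d = crossfade duration in seconds, c1/c2 = fade curves (tri = linear)
    parts.push(`[${current}][${next}]acrossfade=d=${seconds}:c1=tri:c2=tri[${out}]`)
    current = out
  })
  // With a single input there is nothing to fade: graph is empty, out is the input.
  return { graph: parts.join(";"), out: current }
}
```

The returned `graph` would be appended to the existing filter_complex, and `out` names the final audio label to map into the output.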

The Vision

Transform ClipAI from "upload → wait → download" into an interactive editing experience:

CURRENT FLOW (painful):
Upload → AI processes (3-5 min) → See rendered result → Change caption color
→ Re-render (2 min) → See result → Change zoom → Re-render (2 min) → ...

NEW FLOW (instant):
Upload → AI processes (3-5 min) → See REAL-TIME PREVIEW in browser
→ Drag timeline to adjust cuts → instant preview
→ Change caption style → instant preview
→ Adjust zoom keyframes → instant preview
→ Edit caption text → instant preview
→ Satisfied? Click "Export" → Final render (2 min) → Download

The key insight: preview in the browser using the original source video + HTML/CSS overlays. Only render once, as the final export step.

Architecture: Browser-Based Playback Engine

Core Concept

Play the ORIGINAL source video in an HTML5 <video> element. Apply all editConfig effects as real-time overlays:

Effect	Browser Preview Method	Final Render Method
Cuts	Custom playback controller skips cut segments via `currentTime` seeks	FFmpeg trim + concat (existing)
Zooms	CSS `transform: scale()` + `transform-origin` on video container	FFmpeg scale + crop (existing, fixed)
Captions	HTML elements absolutely positioned over video, synced to playback time	FFmpeg drawtext (existing, with time remapping fix)
Aspect ratio	CSS `object-fit` + `object-position` + container sizing	FFmpeg crop (existing)
Transitions	CSS animations (opacity, transform) between segments	FFmpeg filters (existing)
Reframing	CSS `object-position` shifts to track face	FFmpeg crop (existing)

Why HTML5 Video + CSS (Not Remotion, Canvas, or WebCodecs)

Remotion - Adds significant complexity. Requires maintaining React compositions that mirror the FFmpeg pipeline. The dual-render consistency problem (browser vs server must match pixel-perfect) is hard. Overkill for our needs.

Canvas/WebGL - More power than we need. HTML/CSS handles text overlays, transforms, and positioning perfectly. Can upgrade to Canvas later for advanced effects (filters, blend modes, particles).

WebCodecs - Too low-level. Great for frame-by-frame processing but we don't need that for preview. Browser support still limited.

HTML5 + CSS - Simplest, fastest to implement, GPU-accelerated transforms, native text rendering for captions, works on all browsers. Good enough for 95% of editing preview. The final render (FFmpeg on server) handles the remaining 5% (exact frame-accurate cuts, encoding quality).

Playback Engine Design

class PlaybackEngine {
  private video: HTMLVideoElement       // Original source video
  private editConfig: EditConfig        // Current edit state
  private segments: Segment[]           // Computed keep-segments (inverse of cuts)
  private timeMap: TimeMap              // originalTime ↔ outputTime mapping
  private currentSegmentIndex: number
  private isPlaying: boolean
  private animationFrameId: number

  // Core loop - runs every frame via requestAnimationFrame
  tick() {
    // 1. Cuts live on the SOURCE timeline, so test the video's own clock.
    //    If playback has entered a cut region, seek to the next kept segment.
    if (this.isInCutRegion(this.video.currentTime)) {
      this.seekToNextSegment()
    } else {
      const outputTime = this.getOutputTime()

      // 2. Apply zoom for current time
      const zoom = this.getActiveZoom(outputTime)
      this.applyZoomCSS(zoom)

      // 3. Update captions for current time
      const caption = this.getActiveCaption(outputTime)
      this.updateCaptionOverlay(caption)

      // 4. Fire time update events (for timeline cursor)
      this.emit('timeupdate', outputTime)
    }

    // Reschedule in both branches so a seek does not stop the loop
    if (this.isPlaying) {
      this.animationFrameId = requestAnimationFrame(() => this.tick())
    }
  }
}
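The cut-skipping decision inside tick() can be isolated as a pure function over the video's source-timeline position, which makes it testable without a `<video>` element. This is a sketch under the assumption that keep-segments are sorted by sourceStart and do not overlap; `KeepSegment`, `PlaybackDecision` and `resolveSourceTime` are hypothetical names.

```typescript
interface KeepSegment { sourceStart: number; sourceEnd: number }

type PlaybackDecision =
  | { kind: "play"; index: number }             // inside a kept segment
  | { kind: "seek"; index: number; to: number } // inside a cut: jump to the next segment
  | { kind: "end" }                             // past the last kept segment

// Decide what the engine should do for the given source time.
// Assumes `segments` is sorted by sourceStart and non-overlapping.
function resolveSourceTime(t: number, segments: KeepSegment[]): PlaybackDecision {
  for (let index = 0; index < segments.length; index++) {
    const seg = segments[index]
    if (!seg) continue
    if (t < seg.sourceStart) return { kind: "seek", index, to: seg.sourceStart }
    if (t < seg.sourceEnd) return { kind: "play", index }
  }
  return { kind: "end" }
}
```

A tick would call this with `video.currentTime`, set `video.currentTime = decision.to` on "seek", and pause on "end".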

Time Mapping (Critical Component)

The TimeMap handles conversion between two timelines, source and output:

SOURCE TIMELINE:     |---A---|##CUT##|---B---|##CUT##|---C---|
                     0       5       7       12      14      30

OUTPUT TIMELINE:     |---A---|---B---|---C---|
                     0       5       10      26

(A = 5s, B = 5s, C = 16s → total output = 26s instead of 30s)

class TimeMap {
  private segments: { sourceStart: number; sourceEnd: number; outputStart: number }[]

  // Convert output time (what user sees) to source time (where to seek video)
  outputToSource(outputTime: number): number

  // Convert source time (from transcript) to output time (for display)
  sourceToOutput(sourceTime: number): number

  // Get total output duration (after cuts)
  get totalDuration(): number
}

This same TimeMap is used to:

Remap caption timestamps for preview AND server render
Remap zoom timestamps for preview AND server render
Position the timeline cursor correctly
Calculate segment positions in the timeline UI
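A minimal implementation of the mapping might look like the following. It is a sketch, not the final API: it assumes segments are given in output order, returns null from sourceToOutput for times that were cut (the skeleton above does not specify that case), and uses the hypothetical name `TimeMapSketch`.

```typescript
interface SourceSegment { sourceStart: number; sourceEnd: number }
interface MappedSegment extends SourceSegment { outputStart: number }

class TimeMapSketch {
  private readonly segments: MappedSegment[] = []
  readonly totalDuration: number

  constructor(segments: SourceSegment[]) {
    // Lay segments end to end on the output timeline.
    let outputStart = 0
    for (const s of segments) {
      this.segments.push({ ...s, outputStart })
      outputStart += s.sourceEnd - s.sourceStart
    }
    this.totalDuration = outputStart
  }

  // Output time (what the user sees) -> source time (where to seek the video).
  // A time exactly on a boundary resolves to the start of the following segment.
  outputToSource(outputTime: number): number {
    const t = Math.min(Math.max(outputTime, 0), this.totalDuration)
    let result = 0
    for (const s of this.segments) {
      if (t < s.outputStart) break
      result = s.sourceStart + Math.min(t - s.outputStart, s.sourceEnd - s.sourceStart)
    }
    return result
  }

  // Source time (from transcript) -> output time, or null if that moment was cut.
  sourceToOutput(sourceTime: number): number | null {
    for (const s of this.segments) {
      if (sourceTime >= s.sourceStart && sourceTime < s.sourceEnd) {
        return s.outputStart + (sourceTime - s.sourceStart)
      }
    }
    return null
  }
}
```

For the diagram above (A = 0-5, B = 7-12, C = 14-30), output time 12 maps to source time 16, and source time 6 maps to null because it sits inside the first cut.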

editConfig Evolution

Current: "Cuts" (Segments to REMOVE)

cuts: [
  { id: "c1", start: 5.0, end: 7.0, reason: "silence" },
  { id: "c2", start: 12.0, end: 14.0, reason: "filler" },
]

New: "Segments" (Parts to KEEP, in order)

segments: [
  {
    id: "seg_1",
    sourceStart: 0.0,       // where this segment starts in original video
    sourceEnd: 5.0,          // where it ends in original video
    transition: "none",      // transition INTO this segment
  },
  {
    id: "seg_2",
    sourceStart: 7.0,
    sourceEnd: 12.0,
    transition: "crossfade", // crossfade from previous segment
  },
  {
    id: "seg_3",
    sourceStart: 14.0,
    sourceEnd: 30.0,
    transition: "hard",
  },
]

Why Segments > Cuts

Capability	Cuts Model	Segments Model
Remove silence	Yes	Yes (gap between segments)
Adjust cut points	Modify cut start/end	Drag segment edges
Reorder sections	No	Yes (reorder array)
Split a segment	Complex	Natural (split into two)
Per-segment transitions	No	Yes (transition per segment)
Visual timeline	Awkward (show negatives)	Natural (show positives)
Speed per segment	No	Yes (add speed field later)

Backward Compatibility

Simple conversion:

// cuts → segments
function cutsToSegments(cuts: Cut[], videoDuration: number): Segment[] {
  // Sort cuts by start time
  // Calculate keep regions between cuts
  // Return as segments
}

// segments → cuts (for server render compatibility)
function segmentsToCuts(segments: Segment[], videoDuration: number): Cut[] {
  // Calculate gaps between segments
  // Return as cuts
}

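The editConfig keeps both representations during the transition. The NLE UI works with segments. The server renderer can consume either. One caveat: cuts can only describe removed ranges in source order, so a reordered segments array cannot be round-tripped through cuts; reordering depends on the segments-based renderer (Phase 5).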
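The two conversions could be filled in roughly as follows. This is a sketch that assumes the field names from the examples above; the generated ids and the "manual" reason assigned to reconstructed cuts are placeholders, since the original cut reason is not stored on segments.

```typescript
interface Cut { id: string; start: number; end: number; reason: string }
interface Segment { id: string; sourceStart: number; sourceEnd: number; transition: string }

// cuts -> segments: keep everything that lies between the cuts.
function cutsToSegments(cuts: Cut[], videoDuration: number): Segment[] {
  const sorted = [...cuts].sort((a, b) => a.start - b.start)
  const segments: Segment[] = []
  const keep = (start: number, end: number): void => {
    if (end > start) {
      const id = `seg_${segments.length + 1}`
      segments.push({ id, sourceStart: start, sourceEnd: end, transition: "none" })
    }
  }
  let cursor = 0
  for (const cut of sorted) {
    keep(cursor, Math.min(cut.start, videoDuration))
    cursor = Math.max(cursor, cut.end) // tolerates overlapping cuts
  }
  keep(cursor, videoDuration)
  return segments
}

// segments -> cuts: every gap between kept regions becomes a cut.
// Segment order is lost here, because cuts cannot express reordering.
function segmentsToCuts(segments: Segment[], videoDuration: number): Cut[] {
  const sorted = [...segments].sort((a, b) => a.sourceStart - b.sourceStart)
  const cuts: Cut[] = []
  const gap = (start: number, end: number): void => {
    if (end > start) {
      cuts.push({ id: `c${cuts.length + 1}`, start, end, reason: "manual" })
    }
  }
  let cursor = 0
  for (const seg of sorted) {
    gap(cursor, seg.sourceStart)
    cursor = Math.max(cursor, seg.sourceEnd)
  }
  gap(cursor, videoDuration)
  return cuts
}
```

Running the example cuts (5-7 and 12-14) through cutsToSegments with a 30s source yields the three segments shown earlier, and segmentsToCuts returns the same two ranges.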

Updated editConfig Schema

interface EditConfig {
  // NEW - primary representation for NLE
  segments: SegmentConfig[]

  // KEPT - for backward compat with server renderer (auto-computed from segments)
  cuts: CutConfig[]

  // UNCHANGED
  zooms: ZoomConfig[]
  reframing: ReframingConfig
  captions: CaptionConfig
  transitions: TransitionConfig  // global defaults, segments override per-segment
  audio: AudioConfig
  overlays: OverlayConfig

  // NEW - caption text overrides (user edits to AI transcription)
  captionOverrides: {
    [wordIndex: number]: {
      text?: string           // corrected text
      hidden?: boolean        // hide this word
      highlight?: boolean     // force highlight
    }
  }
}
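To illustrate how captionOverrides might be applied before rendering or preview, the sketch below merges the overrides into the transcript words. The `Word` and `DisplayWord` shapes and the function name are assumptions; only the override fields (text, hidden, highlight) come from the schema above.

```typescript
interface Word { text: string; start: number; end: number }
interface WordOverride { text?: string; hidden?: boolean; highlight?: boolean }
interface DisplayWord extends Word { highlight: boolean }

// Merge user overrides (keyed by word index in the transcript) into the word list.
// Hidden words are dropped; timing is never changed by an override.
function applyCaptionOverrides(
  words: Word[],
  overrides: { [wordIndex: number]: WordOverride },
): DisplayWord[] {
  const out: DisplayWord[] = []
  words.forEach((word, index) => {
    const o: WordOverride = overrides[index] ?? {}
    if (o.hidden) return
    out.push({
      text: o.text ?? word.text,
      start: word.start,
      end: word.end,
      highlight: o.highlight ?? false,
    })
  })
  return out
}
```

Because overrides are keyed by the original word index, the same function can run in the browser preview and (ported) in the server renderer, which helps keep the two outputs consistent.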

NLE UI Components

Layout

┌─────────────────────────────────────────────────────────┐
│  Toolbar: Undo/Redo │ Zoom In/Out │ Play/Pause │ Export │
├──────────────────────────┬──────────────────────────────┤
│                          │                              │
│     VIDEO PREVIEW        │     PROPERTIES PANEL         │
│     (9:16 aspect)        │                              │
│                          │     [Context-sensitive]       │
│     ┌──────────┐         │     - Caption style/text     │
│     │          │         │     - Zoom settings           │
│     │  Video   │         │     - Segment properties      │
│     │  Player  │         │     - Audio controls          │
│     │  + CSS   │         │     - Effect settings          │
│     │ overlays │         │                              │
│     │          │         │                              │
│     └──────────┘         │                              │
│                          │                              │
├──────────────────────────┴──────────────────────────────┤
│  TIMELINE                                                │
│  ┌────────────────────────────────────────────────────┐  │
│  │ ▶ Video:  [===seg1===]  [===seg2===]  [==seg3==]  │  │
│  │ ♪ Audio:  [~~~waveform across segments~~~~~~~~~~~~]│  │
│  │ T Caption:[hello world] [this is]  [awesome video] │  │
│  │ ⊕ Zooms:  ◆           ◆      ◆                    │  │
│  └────────────────────────────────────────────────────┘  │
│  |0:00    |0:05    |0:10    |0:15    |0:20    |0:25     │
└─────────────────────────────────────────────────────────┘

Component Breakdown

1. PlaybackEngine (Core - No UI)

File: lib/editor/PlaybackEngine.ts

The brain. Manages video playback, time mapping, and effect synchronization. All other components subscribe to its state.

interface PlaybackEngineState {
  isPlaying: boolean
  outputTime: number          // current time in the output timeline
  sourceTime: number          // corresponding time in source video
  totalDuration: number       // total output duration
  currentSegmentIndex: number
  activeZoom: ZoomConfig | null
  activeCaptions: CaptionGroup[]
}

2. VideoPreview (Upgraded)

File: components/editor/VideoPreview.tsx

Shows the source video with CSS overlays. Receives zoom/caption state from PlaybackEngine.

┌─────────────────┐
│  <video> element │ ← CSS transform for zoom
│  ┌─────────────┐│
│  │  Caption     ││ ← Absolutely positioned HTML
│  │  overlay     ││
│  └─────────────┘│
└─────────────────┘

Video plays the original source file
Container div applies zoom via CSS transform
Caption div overlays text synced to playback time
Aspect ratio handled by container CSS

3. Timeline

File: components/editor/Timeline.tsx

Multi-track timeline with drag interactions.

Tracks:

Video Track: Shows segments as colored blocks. Gaps between segments = cuts. Drag left/right edges to adjust in/out points. Click gap to restore cut region. Right-click segment to split.
Caption Track: Shows caption groups as text blocks. Click to edit text. Drag edges to adjust timing.
Zoom Track: Shows zoom keyframes as diamond markers on a horizontal line. Drag to reposition. Click to select and edit in properties panel.
Audio Track: (Stretch goal) Waveform visualization. Shows where sound effects and music are placed.

Interactions:

Horizontal scroll/pinch to zoom timeline
Playhead (vertical red line) shows current position, draggable for scrubbing
Hover on segment edges shows resize cursor
Drag segment edges for ripple edit (adjusts cut points)
Keyboard: Space = play/pause, J/K/L = shuttle, Left/Right = frame step, I/O = set in/out

4. PropertiesPanel

File: components/editor/PropertiesPanel.tsx

Context-sensitive panel that changes based on what's selected:

Nothing selected: Global settings (aspect ratio, audio normalize, export settings)
Segment selected: Segment properties (source in/out points, transition type, speed)
Caption selected: Caption text editor, style picker, colors, position, font size
Zoom selected: Zoom scale, duration, easing, anchor point

5. CaptionEditor (Inline)

File: components/editor/CaptionEditor.tsx

When a caption group is selected in the timeline or properties panel:

Show the transcript text for that time range
Allow direct text editing (fix typos, rewrite)
Per-word controls: highlight toggle, hide toggle
Preview updates instantly as user types

6. Toolbar

File: components/editor/Toolbar.tsx

Undo/Redo stack (all editConfig changes are undoable)
Timeline zoom controls (zoom in/out the timeline view)
Play/Pause/Stop
Current time display
Export button (triggers final render)
Aspect ratio quick-switch (9:16 / 1:1 / 4:5)

State Management

Zustand Store

interface EditorStore {
  // Source data (immutable after load)
  clip: Clip
  project: Project
  sourceVideoUrl: string
  transcript: Transcript

  // Edit state (mutable, undo/redo stack)
  editConfig: EditConfig
  editHistory: EditConfig[]     // undo stack
  editFuture: EditConfig[]      // redo stack

  // UI state
  selectedElement: SelectedElement | null  // segment, caption, zoom, or null
  timelineZoom: number          // pixels per second
  timelineScroll: number        // horizontal scroll position

  // Playback state (from PlaybackEngine)
  isPlaying: boolean
  currentTime: number

  // Actions
  updateEditConfig: (partial: Partial<EditConfig>) => void
  undo: () => void
  redo: () => void
  selectElement: (element: SelectedElement | null) => void

  // Segment actions
  splitSegment: (segmentId: string, atTime: number) => void
  deleteSegment: (segmentId: string) => void
  adjustSegmentEdge: (segmentId: string, edge: 'start' | 'end', newTime: number) => void
  reorderSegments: (fromIndex: number, toIndex: number) => void

  // Zoom actions
  addZoom: (time: number) => void
  updateZoom: (zoomId: string, changes: Partial<ZoomConfig>) => void
  deleteZoom: (zoomId: string) => void

  // Caption actions
  updateCaptionText: (wordIndex: number, newText: string) => void
  toggleWordHighlight: (wordIndex: number) => void
  updateCaptionStyle: (changes: Partial<CaptionConfig>) => void
}

Undo/Redo

Every editConfig change pushes the previous state onto the undo stack. This gives free undo/redo for ALL editing operations (cut adjustments, caption edits, zoom changes, style changes).

updateEditConfig: (partial) => {
  const current = get().editConfig
  set({
    editHistory: [...get().editHistory, current],
    editFuture: [],  // clear redo stack on new change
    editConfig: { ...current, ...partial },
  })
}
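The snippet above covers pushing onto the undo stack; undo, redo and the history limit from Phase 4.1 can be sketched as pure functions that a Zustand action would call. `EditHistory`, `commit`, `undo` and `redo` are hypothetical helper names, and the limit is assumed to be a positive number.

```typescript
interface EditHistory<T> { past: T[]; present: T; future: T[] }

const HISTORY_LIMIT = 50 // matches the limit proposed in Phase 4.1

// Record a new state: the current one moves to the undo stack, redo is cleared.
function commit<T>(h: EditHistory<T>, next: T, limit: number = HISTORY_LIMIT): EditHistory<T> {
  const past = [...h.past, h.present].slice(-limit) // keep only the newest `limit` entries
  return { past, present: next, future: [] }
}

// Step back one state; a no-op when there is nothing to undo.
function undo<T>(h: EditHistory<T>): EditHistory<T> {
  if (h.past.length === 0) return h
  const previous = h.past[h.past.length - 1] as T
  return { past: h.past.slice(0, -1), present: previous, future: [h.present, ...h.future] }
}

// Step forward one state; a no-op when there is nothing to redo.
function redo<T>(h: EditHistory<T>): EditHistory<T> {
  if (h.future.length === 0) return h
  const next = h.future[0] as T
  return { past: [...h.past, h.present], present: next, future: h.future.slice(1) }
}
```

Storing whole editConfig snapshots keeps the logic simple; if snapshots turn out to be large, the same interface could hold patches instead.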

Data Flow

Loading the Editor

1. User navigates to /editor/[clipId]
2. Fetch clip document from Firestore (contains editConfig, source times)
3. Fetch project document (contains transcript, raw video URL)
4. Get signed URL for source video from Firebase Storage
5. Initialize PlaybackEngine with source video URL + editConfig
6. Initialize Zustand store with all data
7. Render NLE UI

Real-Time Preview Loop

User changes editConfig (e.g., drags segment edge)
         ↓
Zustand store updates editConfig
         ↓
PlaybackEngine recomputes TimeMap (segments changed → new time mapping)
         ↓
If playing:
  - PlaybackEngine adjusts video.currentTime if needed
  - CSS zoom updates on next tick()
  - Caption overlay updates on next tick()
If paused:
  - Seek to equivalent position in new timeline
  - Update zoom + captions for current frame
         ↓
Timeline UI re-renders with new segment positions
Properties panel updates if selected element changed

Saving (Auto-Save)

editConfig changes (debounced 2 seconds)
         ↓
Convert segments → cuts (for backward compat)
         ↓
Save editConfig to Firestore (clip document)
         ↓
User's changes persist across sessions
No render triggered - just data save

Exporting (Final Render)

User clicks "Export"
         ↓
Validate editConfig (segments exist, valid times, etc.)
         ↓
Save final editConfig to Firestore
         ↓
Call Cloud Function → triggers Modal render
         ↓
Modal downloads source video + reads editConfig from Firestore
         ↓
Server-side FFmpeg render pipeline (FIXED version):
  1. Extract segments (not clip + cuts, but direct segment extraction)
  2. Apply zoom keyframes (with correct post-cut timestamps)
  3. Render captions (with correct post-cut timestamps + user text overrides)
  4. Audio processing (normalize, crossfade between segments)
  5. Final encode (1080x1920 H.264)
         ↓
Upload to Firebase Storage
         ↓
Update clip.status = "done", clip.rendered.videoUrl
         ↓
Frontend shows download button
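The validation step in the export flow could start from a check like this one. It is a sketch limited to segment bounds; the error strings and the name `validateSegments` are illustrative, and a real validator would also cover zooms and captions.

```typescript
interface Segment { id: string; sourceStart: number; sourceEnd: number }

// Return a list of human-readable problems; an empty list means the segments are exportable.
function validateSegments(segments: Segment[], videoDuration: number): string[] {
  const errors: string[] = []
  if (segments.length === 0) errors.push("editConfig has no segments")
  for (const s of segments) {
    // Negated comparisons so that NaN values are rejected as well.
    if (!(s.sourceStart >= 0 && s.sourceEnd <= videoDuration)) {
      errors.push(`${s.id}: outside source video`)
    }
    if (!(s.sourceEnd > s.sourceStart)) {
      errors.push(`${s.id}: empty or inverted`)
    }
  }
  return errors
}
```

The Export button would stay disabled, or show these messages, while the returned list is non-empty.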

Implementation Phases

Phase 0: Fix Critical Bugs (2-3 days)

Fix the three bugs in the current pipeline before building the NLE.

0.1 Build TimeMap utility (Python + TypeScript)

Python version in processing/pipeline/time_map.py for server render
TypeScript version in apps/web/src/lib/editor/TimeMap.ts for browser
Both must produce identical mappings given the same cuts/segments
Unit tests with known inputs/outputs

0.2 Fix caption drift

In renderer.py: after apply_cuts(), compute TimeMap from cuts
Before render_captions(), remap all word timestamps using TimeMap
Verify: captions match spoken words throughout entire clip

0.3 Fix zoom failures

Debug the FFmpeg scale+crop expression in apply_zooms()
Apply same TimeMap remapping to zoom timestamps
Remove silent failure - log errors, return previous step's output explicitly
Verify: zooms visibly happen at key moments

0.4 Fix abrupt cuts

Add 50-100ms audio crossfade between segments in apply_cuts()
Use FFmpeg acrossfade filter or manual fade out/in on segment boundaries
Verify: cuts sound smooth, no audio pops

Phase 1: Playback Engine + Basic Preview (5-7 days)

Replace the "play rendered video" preview with a real-time engine.

1.1 Source video loading

Fetch original source video URL (not rendered) from Firebase Storage
Load into HTML5 <video> element with preload="auto"
Handle CORS, signed URLs, caching

1.2 PlaybackEngine core

lib/editor/PlaybackEngine.ts
Implements tick loop with requestAnimationFrame
Manages play/pause/seek
Skips over cut regions by seeking video.currentTime
Fires timeupdate events on output timeline

1.3 TimeMap (TypeScript)

lib/editor/TimeMap.ts
Bidirectional mapping: outputTime ↔ sourceTime
Computed from segments array
Used by all preview components

1.4 Zoom preview via CSS

Apply transform: scale(X) to video container
Apply transform-origin: X% Y% for anchor point
Interpolate scale value based on zoom easing over duration
Smooth 60fps transitions
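The interpolation in 1.4 might be computed as below. ZoomConfig is not defined in this plan, so the `ZoomKeyframe` fields are assumptions: the zoom is taken to ramp from 1.0 to `scale` over `duration` seconds starting at `time` (already remapped to the output timeline) and to hold afterwards.

```typescript
type Easing = "linear" | "easeIn" | "easeInOut" | "snap"

// Hypothetical keyframe shape; anchor values are percentages of the container.
interface ZoomKeyframe {
  time: number; duration: number; scale: number
  anchorX: number; anchorY: number; easing: Easing
}

// Map linear progress p in [0, 1] to eased progress in [0, 1].
function ease(p: number, easing: Easing): number {
  if (easing === "snap") return 1 // jump straight to the target scale
  if (easing === "easeIn") return p * p
  if (easing === "easeInOut") return p < 0.5 ? 2 * p * p : 1 - 2 * (1 - p) * (1 - p)
  return p // linear
}

// Scale factor at a given output time: 1 before the keyframe, target scale after the ramp.
function scaleAt(zoom: ZoomKeyframe, outputTime: number): number {
  if (outputTime < zoom.time) return 1
  const p = zoom.duration > 0 ? Math.min((outputTime - zoom.time) / zoom.duration, 1) : 1
  return 1 + (zoom.scale - 1) * ease(p, zoom.easing)
}

// CSS values to assign to the video container's style on each tick.
function zoomStyle(zoom: ZoomKeyframe, outputTime: number): { transform: string; transformOrigin: string } {
  return {
    transform: `scale(${scaleAt(zoom, outputTime).toFixed(3)})`,
    transformOrigin: `${zoom.anchorX}% ${zoom.anchorY}%`,
  }
}
```

Computing the scale from playback time on every tick, rather than using a CSS transition, keeps the zoom in sync when the user scrubs or pauses.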

1.5 Caption preview via HTML overlay

Absolutely positioned div over video
Group words by maxWordsPerLine + timing
Show/hide caption groups based on current output time
Apply caption styles (color, size, position, font) via CSS
Word-by-word highlight for hormozi style
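Grouping and lookup for the caption overlay could be sketched as follows. Word times are assumed to be on the output timeline already, and the 0.6s pause threshold (`maxGapSeconds`) is an illustrative default rather than a value from the current pipeline.

```typescript
interface Word { text: string; start: number; end: number }
interface CaptionGroup { words: Word[]; start: number; end: number }

// Start a new group when the line is full or when the speaker pauses.
function groupWords(words: Word[], maxWordsPerLine: number, maxGapSeconds: number = 0.6): CaptionGroup[] {
  const groups: CaptionGroup[] = []
  let current: Word[] = []
  const flush = (): void => {
    const first = current[0]
    const last = current[current.length - 1]
    if (first && last) groups.push({ words: current, start: first.start, end: last.end })
    current = []
  }
  for (const word of words) {
    const prev = current[current.length - 1]
    if (prev && (current.length >= maxWordsPerLine || word.start - prev.end > maxGapSeconds)) {
      flush()
    }
    current.push(word)
  }
  flush()
  return groups
}

// The group to display at output time t, or null between groups.
function activeGroup(groups: CaptionGroup[], t: number): CaptionGroup | null {
  for (const g of groups) {
    if (t >= g.start && t < g.end) return g
  }
  return null
}
```

For word-by-word highlighting, the overlay can compare the current time against each word's start and end inside the active group.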

1.6 Aspect ratio preview

Container sizing for 9:16, 1:1, 4:5
object-fit: cover + object-position for reframing
Instant switch between ratios

1.7 Replace current VideoPreview

Swap out the rendered video player with the new engine
Side by side: new preview engine + existing controls panel
Verify it looks and feels like the rendered output

Phase 2: Timeline Component (5-7 days)

Build the multi-track timeline for visual editing.

2.1 Timeline foundation

components/editor/Timeline.tsx
Horizontal scrollable container
Time ruler at bottom (shows seconds/minutes)
Playhead (vertical line at current time, draggable)
Zoom controls (pixels per second)

2.2 Video/Segment track

Show segments as colored blocks proportional to duration
Gaps between segments represent cuts
Waveform thumbnail inside segments (stretch goal)
Segment labels (could show first few words of transcript)

2.3 Segment interactions

Drag left edge = adjust segment start (sourceStart)
Drag right edge = adjust segment end (sourceEnd)
Click on gap = restore that region (merge adjacent segments)
Right-click segment = split at playhead position
Delete key = remove segment
Drag to reorder (stretch goal)
All changes update editConfig → PlaybackEngine recomputes → preview updates instantly
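Two of these interactions, split and edge drag, can be expressed as pure helpers that the store actions would wrap. This is a sketch: times are source-timeline seconds, the 0.1s minimum segment length and the `_b` id suffix are illustrative choices, and clamping against neighbouring segments is left out.

```typescript
interface Segment { id: string; sourceStart: number; sourceEnd: number; transition: string }

const MIN_SEGMENT_SECONDS = 0.1 // hypothetical lower bound so a segment never collapses

// Split one segment in two at a source time strictly inside it; otherwise return it unchanged.
function splitSegment(segments: Segment[], segmentId: string, atSourceTime: number): Segment[] {
  const out: Segment[] = []
  for (const seg of segments) {
    const inside =
      seg.id === segmentId && atSourceTime > seg.sourceStart && atSourceTime < seg.sourceEnd
    if (!inside) {
      out.push(seg)
      continue
    }
    out.push({ ...seg, sourceEnd: atSourceTime })
    out.push({ ...seg, id: `${seg.id}_b`, sourceStart: atSourceTime, transition: "hard" })
  }
  return out
}

// Move a segment's in or out point, clamped to the source video and the minimum length.
function adjustSegmentEdge(
  segments: Segment[], segmentId: string, edge: "start" | "end",
  newTime: number, videoDuration: number,
): Segment[] {
  return segments.map((seg) => {
    if (seg.id !== segmentId) return seg
    if (edge === "start") {
      const sourceStart = Math.min(Math.max(newTime, 0), seg.sourceEnd - MIN_SEGMENT_SECONDS)
      return { ...seg, sourceStart }
    }
    const sourceEnd = Math.max(Math.min(newTime, videoDuration), seg.sourceStart + MIN_SEGMENT_SECONDS)
    return { ...seg, sourceEnd }
  })
}
```

Both helpers return new arrays and leave their input untouched, which fits the snapshot-based undo stack.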

2.4 Caption track

Show caption groups as text blocks
Positioned according to output timeline
Click caption block = select it (shows in properties panel)
Drag edges to adjust timing

2.5 Zoom track

Show zoom keyframes as diamond markers
Horizontal line representing zoom intensity (height = scale)
Drag marker left/right to adjust timing
Drag marker up/down to adjust scale
Click to select, edit in properties panel

2.6 Keyboard shortcuts

Space: play/pause
J/K/L: reverse/pause/forward (standard NLE)
Left/Right arrow: step 1 frame (1/30s)
Shift+Left/Right: step 1 second
Cmd+Z / Cmd+Shift+Z: undo/redo
I: set segment in-point at playhead
O: set segment out-point at playhead
S: split segment at playhead
Delete: remove selected element

Phase 3: Properties Panel + Caption Editor (3-5 days)

Context-sensitive controls for editing selected elements.

3.1 Properties panel framework

components/editor/PropertiesPanel.tsx
Renders different content based on selectedElement type
Collapsible sections within each context

3.2 Segment properties

Source in/out time (editable number inputs)
Transition type selector (hard cut, crossfade, fade)
Playback speed (1x default, 0.5x-2x range) - stretch goal

3.3 Caption style controls (upgraded from current)

Style preset picker (hormozi, minimal, karaoke, bold)
Color pickers (primary, highlight, background)
Font family selector
Font size (slider, not just S/M/L)
Position (top/center/bottom, with Y offset slider)
Words per line
Animation type

3.4 Inline caption text editor

Shows transcript words for selected time range
Each word editable (click to edit)
Strike-through to hide a word
Star/highlight toggle per word
Changes stored in editConfig.captionOverrides
Preview updates instantly

3.5 Zoom properties

Scale slider (1.0 - 1.5)
Duration slider (0.1s - 2.0s)
Easing selector (ease in/out, ease in, linear, snap)
Anchor point picker (visual crosshair on video preview)
"Add zoom at playhead" button

3.6 Global settings

Aspect ratio selector
Audio normalization toggle
Export quality settings

Phase 4: Undo/Redo + Auto-Save + Polish (2-3 days)

4.1 Undo/redo system

Zustand middleware that tracks editConfig history
Cmd+Z / Cmd+Shift+Z
Visual undo/redo buttons in toolbar
History limit (50 states to avoid memory issues)

4.2 Auto-save

Debounced save to Firestore on editConfig change (2 second delay)
Save indicator in toolbar ("Saved" / "Saving..." / "Unsaved changes")
On page unload: warn if unsaved changes

4.3 Export flow

"Export" button in toolbar
Shows export settings modal (quality, format)
Triggers server-side render with final editConfig
Progress indicator
Download button when done

4.4 Polish

Loading states for all async operations
Error boundaries and recovery
Responsive layout (minimum 1024px width for editor)
Performance optimization (memoization, virtualized timeline for long videos)
Smooth animations and transitions in UI

Phase 5: Server Renderer Update (2-3 days)

Update the Modal pipeline to consume the new editConfig format.

5.1 Segments-based rendering

Update renderer.py to work with segments directly (not compute from cuts)
Extract each segment individually, then concat
Apply per-segment transitions

5.2 TimeMap in Python

Extend the Python TimeMap from Phase 0.1 to build from segments as well as cuts
Use for remapping zoom and caption timestamps
Shared test vectors between TypeScript and Python versions

5.3 Caption overrides

Read captionOverrides from editConfig
Apply text corrections, hidden words, forced highlights
Ensure rendered captions match preview exactly

5.4 Audio crossfades

Apply crossfade between segments (configurable duration)
Sound effects at transitions (optional)
Volume normalization as final step

Source Video Access Strategy

For the browser preview to work, the original source video must be streamable in the browser.

Current: Source video is in Firebase Storage at raw/{uid}/{projectId}/video.mp4. Already accessible via signed URLs. The browser can play it directly.

Optimization: For long videos, create an HLS (m3u8) or DASH manifest for adaptive streaming. For MVP, direct MP4 streaming is fine (Firebase Storage supports range requests).

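Caching: The <video> element handles buffering natively, and the browser's HTTP cache can reuse ranges it has already fetched. Because a signed URL usually differs from one session to the next, the HTTP cache may miss on repeat visits; for repeat sessions, consider using the Cache API to store the video locally.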

Performance Considerations

Video Seeking (Cut Playback)

When the PlaybackEngine hits a cut boundary, it needs to seek the video element. HTML5 video seeking is not instant - there's a brief delay (50-200ms) for keyframe-based seeking. Mitigations:

Pre-buffer upcoming segments by seeking ahead during playback
Use fastSeek() for approximate seeks where it is available (faster but less accurate; not every browser implements it, so fall back to setting currentTime)
For very short segments (< 0.5s), consider pre-decoding with WebCodecs (stretch)

CSS Transform Performance

CSS transforms (scale, translate) are typically composited on the GPU, so the zoom preview should stay smooth at 60fps on desktop hardware. Key: use will-change: transform on the video container.

Caption Rendering

HTML text rendering is fast. Even with 100+ caption groups, DOM updates are negligible. Use display: none instead of removing elements for show/hide.

Timeline Rendering

For videos > 5 minutes with many segments, the timeline could have hundreds of elements. Use virtualization (only render visible segments) and canvas for the waveform (stretch goal).
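The virtualization idea reduces to a range filter over the blocks on a track. The sketch below assumes each block already carries output-timeline start and end times; `TimelineBlock`, `visibleBlocks` and the 200px overscan default are illustrative.

```typescript
interface TimelineBlock { id: string; outputStart: number; outputEnd: number }

// Return only the blocks that intersect the visible window (plus some overscan),
// so the track renders a handful of elements instead of hundreds.
function visibleBlocks(
  blocks: TimelineBlock[],
  scrollLeftPx: number,
  viewportWidthPx: number,
  pixelsPerSecond: number,
  overscanPx: number = 200,
): TimelineBlock[] {
  const from = (scrollLeftPx - overscanPx) / pixelsPerSecond
  const to = (scrollLeftPx + viewportWidthPx + overscanPx) / pixelsPerSecond
  return blocks.filter((b) => b.outputEnd >= from && b.outputStart <= to)
}
```

The inputs correspond to timelineScroll and timelineZoom in the editor store, so the filter can run inside a memoized selector.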

Memory

Loading a full source video (e.g., 200MB for 10 minutes) uses significant memory. The browser handles this via range requests and buffering, but be aware of mobile limitations. For MVP, assume desktop-only editing (require minimum 1024px viewport).

File Structure (New/Modified)

apps/web/src/
├── lib/
│   └── editor/
│       ├── PlaybackEngine.ts        # NEW - core playback engine
│       ├── TimeMap.ts               # NEW - timeline mapping utility
│       ├── segments.ts              # NEW - segment manipulation helpers
│       └── keyboardShortcuts.ts     # NEW - keyboard shortcut handler
├── stores/
│   └── editorStore.ts               # NEW - Zustand store for editor state
├── components/
│   └── editor/
│       ├── VideoPreview.tsx          # MODIFIED - uses PlaybackEngine instead of <video src={rendered}>
│       ├── CaptionOverlay.tsx        # NEW - HTML caption overlay synced to playback
│       ├── Timeline.tsx              # MODIFIED - complete rewrite as multi-track NLE timeline
│       ├── TimelineTrack.tsx         # NEW - generic track component
│       ├── SegmentTrack.tsx          # NEW - video segments track
│       ├── CaptionTrack.tsx          # NEW - caption blocks track
│       ├── ZoomTrack.tsx             # NEW - zoom keyframes track
│       ├── Playhead.tsx              # NEW - draggable playhead
│       ├── TimeRuler.tsx            # NEW - time ruler with markings
│       ├── PropertiesPanel.tsx       # NEW - context-sensitive properties
│       ├── CaptionEditor.tsx         # NEW - inline caption text editor
│       ├── Toolbar.tsx               # NEW - top toolbar (undo, play, export)
│       ├── ExportModal.tsx           # NEW - export settings + progress
│       ├── CaptionStylePicker.tsx    # KEPT - moved into PropertiesPanel
│       ├── ZoomControls.tsx          # KEPT - moved into PropertiesPanel
│       └── EditConfigPanel.tsx       # REMOVED - replaced by PropertiesPanel + Timeline
├── app/
│   └── (dashboard)/
│       └── editor/
│           └── [clipId]/
│               └── page.tsx          # MODIFIED - new NLE layout

processing/pipeline/
├── time_map.py                      # NEW - Python TimeMap (mirrors TypeScript)
├── renderer.py                      # MODIFIED - segments-based, fixed timestamps
├── captions.py                      # MODIFIED - supports captionOverrides
└── utils.py                         # MODIFIED - segment extraction helpers

Migration Strategy

The NLE editor is an evolution, not a replacement. Approach:

Keep the current editor working while building the NLE
Feature flag: useNLEEditor in user settings or URL param ?nle=true
Gradual rollout: NLE for new clips, old editor as fallback
Data compatible: New segments format auto-converts to/from old cuts format
Server renderer accepts both: Old editConfig (with cuts) and new editConfig (with segments)

Success Metrics

Preview latency: < 100ms from editConfig change to visual update (vs 2+ minutes currently)
Caption sync: Captions match audio within 50ms throughout entire clip
Zoom accuracy: All zoom keyframes fire at intended moments
Cut smoothness: No audio pops or visual glitches at cut points
Export parity: Final rendered video matches browser preview (within acceptable tolerance)
Edit speed: User can make 10+ adjustments and export in under 5 minutes (vs 20+ minutes with re-render loop)

Estimated Timeline

Phase	Duration	Deliverable
Phase 0: Bug fixes	2-3 days	Working cuts, zooms, captions in current pipeline
Phase 1: Playback engine	5-7 days	Real-time preview in browser, no re-rendering needed
Phase 2: Timeline	5-7 days	Multi-track timeline with drag editing
Phase 3: Properties + captions	3-5 days	Full editing controls for all elements
Phase 4: Polish	2-3 days	Undo/redo, auto-save, export flow
Phase 5: Server update	2-3 days	Updated renderer matching new editConfig
Total	~4-5 weeks	Full browser-based NLE editor

Phase 0 and Phase 5 can partially overlap with other phases.

Risks & Mitigations

Risk	Impact	Mitigation
Video seeking latency causes gaps during cut playback	Medium	Pre-buffer next segment, use fastSeek(), add brief crossfade
Browser preview doesn't match final render	High	Use same TimeMap logic in both, visual regression tests
Large source videos cause memory issues	Medium	Desktop-only for editing, lazy loading, range requests
Complex FFmpeg expressions break on edge cases	Medium	Extensive test suite with real videos, fallback rendering paths
Timeline performance with many segments	Low	Virtualized rendering, canvas-based waveform
Caption styling differs HTML vs FFmpeg drawtext	Medium	Design caption styles to be achievable in both renderers

What This Enables (Future)

Once the NLE foundation exists, these features become natural extensions:

B-roll insertion: Add image/video overlays on the timeline (new track type)
Multi-camera editing: Multiple video tracks, switch between angles
Text overlays: Title cards, lower thirds (HTML overlay, same as captions)
Audio tracks: Background music, sound effects as draggable clips on audio track
Keyframe animation: Animate any property over time (position, opacity, scale)
Templates: Save and apply editConfig templates across clips
Collaborative editing: Real-time sync via Firestore listeners (multiple users editing same clip)
AI re-analysis: "Re-analyze this segment" to get new zoom/cut suggestions for a portion
