Opinionated sparse visual extraction for lifelog-style video before upload.
The current goal is not to preserve smooth playback-quality video. The goal is to preserve useful visual information for downstream memory, search, coverage, and analysis workflows while making local preprocessing and upload much faster.
This repo is a reusable utility that can be embedded into or called from other programs before upload.
The tool is intentionally opinionated:
- one input video file
- one output visual bundle
- preserve intrinsic media metadata from that video
- very few public knobs
The strongest practical default found so far is:
- sample at
1 fps - extract images, not low-fps video
- use Apple-native
AVAssetImageGeneratoron macOS - use requested time tolerance of
+/- 0.5s - scale to fit inside
1920x1080 - pad with black bars to a fixed
16:9canvas - write JPEG frames
- store a manifest of requested timestamps and actual timestamps
This preserves a regular sparse timeline while staying very fast on Apple Silicon.
The public surface is intentionally tiny:
- one input video file
- one output bundle folder
- one opinionated extraction recipe
Commands:
cargo run -- spec
cargo run -- extract /path/to/video.mp4 /path/to/output-bundle
cargo run -- batch /path/to/video-folder /path/to/output-rootOn macOS, the Rust crate compiles and invokes a tiny native Swift helper built on AVAssetImageGenerator. Rust owns the package shape and bundle emission; the native helper owns the actual frame extraction.
Library entry points:
extract_to_dir(input, output_dir)extract_directory_to_dir(input_root, output_root)load_bundle_metadata(bundle_dir)load_manifest(bundle_dir)
Each bundle also keeps raw source metadata artifacts alongside the extracted frames:
metadata/source-metadata.jsonmetadata/source-probe.jsonmetadata/source-mdls.txtmetadata/source-xattrs.txt
For Guardian-style use cases, sparse visual checkpoints are more useful than preserving a conventional video container.
That means the canonical derivative should be:
- timestamped JPEG frames
- plus metadata / manifest
instead of:
- HEVC or H.264 low-fps video
Using a compiled native macOS utility built on AVAssetImageGenerator was dramatically faster than the earlier ffmpeg-based pipelines for this use case.
Important note:
- early slow results were distorted by running the benchmark through the
swiftinterpreter - once compiled as a native binary, the same extraction approach became much faster
The current preferred output shape is:
- fixed
1920x1080 - scale-to-fit
- black padding bars instead of cropping
Reasons:
- preserves all source pixels
- makes downstream backend processing simpler
- avoids source-specific layout branching
- keeps the representation tool-friendly
Some cameras provide low-resolution companion media such as .lrv or similar preview files.
Those are excellent acceleration paths when available, but the core method should still work well on originals so the system is not dependent on vendor-specific support.
These are the most important benchmark results so far.
Source:
1920x144024 fps- HEVC
Compiled Apple-native sparse extraction:
1080p,+/-0.5s: about27.0x realtime720p,+/-0.5s: about31.0x realtime540p,+/-0.5s: about36.0x realtime
Source:
2704x2028120 fps- HEVC
Compiled Apple-native sparse extraction:
1080p,+/-0.5s: about34.9x realtime720p,+/-0.5s: about44.0x realtime540p,+/-0.5s: about53.2x realtime
For the heavy clip, timestamp drift remained small:
- max absolute error: about
0.112s - mean absolute error: about
0.056s
On a matched proxy / preview file:
- Apple-native extraction at
1080p,+/-0.5s: about20.0x realtime
Proxy-first is still valuable, but the main discovery is that the general-purpose original-file path is already fast enough.
Re-encoding to low-fps video was not the best representation or the best speed path for this problem.
These were often much slower than the compiled Apple-native sparse extraction path.
Testing 0 / +1.0s and 0 / +1.5s tolerance windows showed:
- faster extraction on some light clips
- but too many duplicated or collapsed actual timestamps
So the best current default remains:
- symmetric
+/- 0.5s
video-sampler was useful as an exploratory reference, especially for keyframe/dedup ideas, but it was not the raw speed winner for the core sparse extraction task.
Current recommendation:
- Run a quick local probe on each source file.
- If a trusted preview / proxy companion exists, optionally use it.
- Otherwise, use the general-purpose Apple-native sparse extraction path on the original.
- Produce:
- padded
1920x1080JPEG frames - manifest with requested and actual timestamps
- padded
- Upload the sparse visual bundle rather than a recompressed video derivative.
The embedding app or service can decide how to combine or upload many bundles, but that is outside the core tool itself.
Closed-loop validation against source is now in the repo via:
scripts/validate_against_source.py
Current native-package validation:
- light clip fidelity:
- PSNR mean:
30.37 dB - SSIM mean:
0.9803
- PSNR mean:
- heavy GoPro fidelity, first 30 frames:
- PSNR mean:
33.83 dB - SSIM mean:
0.9747
- PSNR mean:
Current release-binary extraction timings:
- light clip (
15ssource):1.23swall time, about12.2x realtime - heavy GoPro clip (
112.7ssource):5.44swall time, about20.7x realtime
- add a small library API on top of the current CLI path
- benchmark folder-scale extraction on larger real datasets
- decide whether
1080pshould remain the default or whether720pis the better product choice - optionally add proxy discovery rules for common camera ecosystems
Detailed notes live in docs/technical-findings.md. The current real-workload benchmark lives in docs/benchmark-2026-03-30-real-workload.md.
The first concrete interchange spec lives in docs/visual-bundle-v1.md.
The repo includes a Rust crate plus CLI:
cargo run -- specCurrent public commands are intentionally minimal:
extractspechelp
The CLI should stay small rather than becoming highly configurable.