feat(nodes): add video embedding node for semantic search and RAG pipelines #516
charliegillet wants to merge 12 commits into rocketride-org:develop from
Conversation
…elines Introduces embedding_video node that extracts frames from video input at configurable intervals and generates vector embeddings using CLIP/ViT models. Supports MP4, AVI, MOV, and WebM formats with configurable frame interval, max frames, start time, and duration parameters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can change this behavior in your configuration.
Use the following commands to manage reviews, or the checkboxes below for quick actions:
📝 Walkthrough

Adds a new GPU-backed video embedding node with lifecycle management, dependency installation, streaming video ingestion, frame extraction and per-frame embedding generation, plus service registration and test configuration.

Changes

Sequence Diagram(s)

sequenceDiagram
participant Client as Client
participant IGlobal as IGlobal
participant IInstance as IInstance
participant OpenCV as OpenCV
participant EmbedModel as EmbeddingModel
participant Output as Documents
Client->>IGlobal: beginGlobal()
IGlobal->>IGlobal: install requirements\nimport ai.common.torch\ncreate Embedding\nload extraction params\ninit device_lock
Client->>IInstance: writeVideo(BEGIN, mimeType)
IInstance->>IInstance: init buffer, store mimeType
loop chunked writes
Client->>IInstance: writeVideo(WRITE, chunk)
IInstance->>IInstance: append -> enforce max_video_size_bytes
end
Client->>IInstance: writeVideo(END)
IInstance->>IInstance: _process_video(video_bytes)
IInstance->>OpenCV: VideoCapture(temp_file)
OpenCV-->>IInstance: frames / fps / count
loop per extracted frame
IInstance->>OpenCV: read/seek frame
IInstance->>IInstance: encode PNG -> image
IInstance->>EmbedModel: request embedding (with device_lock)
EmbedModel-->>IInstance: embedding vector
IInstance->>IInstance: create Doc with metadata
end
IInstance->>Output: writeDocuments(batch)
IInstance->>IInstance: cleanup temp file
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/embedding_video/IGlobal.py`:
- Around line 64-67: The Embedding constructor is being called with the
already-processed config causing double processing; change the call in
IGlobal.__init__ to pass the raw connection dict (self.glb.connConfig) as the
second argument to Embedding instead of the processed config, and stop using the
local config variable for construction—either remove the Config.getNodeConfig()
assignment or only call Config.getNodeConfig() once for extracting
frame_interval/max_frames/etc. and do not pass that processed value into
Embedding(self.glb.logicalType, ...); use Embedding to return or expose any
processed config needed for later use.
In `@nodes/src/nodes/embedding_video/IInstance.py`:
- Line 90: The _process_video method in IInstance.py is too large; extract the
frame extraction setup logic into a new helper (e.g., _setup_frame_extraction)
and extract per-frame logic into another helper (e.g., _process_frame) so
_process_video becomes a high-level orchestrator calling these helpers; move the
code that initializes capture, codec/format checks and any config/loop setup
into _setup_frame_extraction, move the code that decodes/processes a single
frame, computes embeddings and handles per-frame state into _process_frame, and
update _process_video to call these helpers passing required parameters (frame
index, frame bytes, capture handle, config) while preserving exceptions, return
values and variable scope.
- Line 107: The temp file is always created with suffix='.mp4' in the
tempfile.mkstemp call which ignores the actual MIME type; update the code to
derive the suffix from the SUPPORTED_VIDEO_TYPES mapping (falling back to '.mp4'
if unknown) before calling tempfile.mkstemp so the created temporary file uses
the correct extension. Locate the tempfile.mkstemp(...) usage and replace the
hardcoded '.mp4' with a lookup like SUPPORTED_VIDEO_TYPES.get(mime_type, '.mp4')
(where mime_type is the variable holding the incoming MIME/content type),
ensuring the rest of the function (e.g., VideoCapture handling) remains
unchanged.
- Around line 33-38: SUPPORTED_VIDEO_TYPES is declared but never used; update
the _process_video function to validate incoming MIME types against
SUPPORTED_VIDEO_TYPES and use the mapped extension when creating temp files (or
remove the constant if you opt not to validate). Specifically, in
_process_video, read the incoming video's MIME/type header or inferred mime
variable, check membership in SUPPORTED_VIDEO_TYPES, throw or return a clear
error when unsupported, and when creating the temporary filename use
SUPPORTED_VIDEO_TYPES[mime] to set the file extension so temp files have the
correct suffix; reference SUPPORTED_VIDEO_TYPES and the _process_video function
to make the change.
In `@nodes/src/nodes/embedding_video/services.json`:
- Around line 111-117: The JSON schema for "embedding_video.model" currently
restricts Hugging Face model identifiers with "maxLength": 32 which is too small
for many real model paths; update the "maxLength" value for
"embedding_video.model" (and adjust "minLength" if needed) to a larger limit
such as 128 or 256 so long model identifiers like
"sentence-transformers/clip-ViT-B-32-multilingual-v1" are accepted; ensure the
"description" remains accurate and run schema validation/tests after the change.
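The temp-file suffix fix described in the IInstance.py comments above can be sketched as a small stand-alone function. This is a minimal sketch, assuming the mapping from the PR; `make_temp_video_file` is an illustrative name, not the node's actual helper:

```python
import os
import tempfile

# Mapping from the PR; unknown MIME types fall back to '.mp4'.
SUPPORTED_VIDEO_TYPES = {
    'video/mp4': '.mp4',
    'video/x-msvideo': '.avi',
    'video/quicktime': '.mov',
    'video/webm': '.webm',
}


def make_temp_video_file(video_bytes: bytes, mime_type: str) -> str:
    """Write video bytes to a temp file whose suffix matches the MIME type."""
    suffix = SUPPORTED_VIDEO_TYPES.get(mime_type, '.mp4')
    tmp_fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(tmp_fd, 'wb') as f:  # closes the fd even if write raises
        f.write(video_bytes)
    return tmp_path


path = make_temp_video_file(b'\x00', 'video/quicktime')
assert path.endswith('.mov')
os.unlink(path)
```

The same `SUPPORTED_VIDEO_TYPES.get(...)` lookup can drive MIME validation: membership failure means an unsupported container and a clear error for the caller.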
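To make the services.json comment concrete: the model identifier the reviewer cites is well over the current 32-character cap. The snippet below is illustrative only; the real constraint lives in the JSON schema, not in Python:

```python
# The Hugging Face model id from the review comment.
model_id = 'sentence-transformers/clip-ViT-B-32-multilingual-v1'

# Rejected by the current "maxLength": 32 ...
assert len(model_id) > 32
# ... but accepted once maxLength is raised to 128 (or 256).
assert len(model_id) <= 128
```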
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5658f252-e174-478d-94c1-21b1bc6c492b
📒 Files selected for processing (5)
- nodes/src/nodes/embedding_video/IGlobal.py
- nodes/src/nodes/embedding_video/IInstance.py
- nodes/src/nodes/embedding_video/__init__.py
- nodes/src/nodes/embedding_video/requirements.txt
- nodes/src/nodes/embedding_video/services.json
SUPPORTED_VIDEO_TYPES = {
    'video/mp4': '.mp4',
    'video/x-msvideo': '.avi',
    'video/quicktime': '.mov',
    'video/webm': '.webm',
}
🧹 Nitpick | 🔵 Trivial
SUPPORTED_VIDEO_TYPES is defined but never used.
The mapping of MIME types to file extensions is defined but not utilized for validation or for determining the temp file extension. Consider either:
- Using it to validate incoming video types and select the appropriate temp file extension
- Removing it if validation is not required
♻️ Suggested usage in _process_video

def _process_video(self, video_bytes: bytes):
    ...
    import tempfile
    import os
-   tmp_fd, tmp_path = tempfile.mkstemp(suffix='.mp4')
+   # Use the correct extension based on MIME type, default to .mp4
+   extension = SUPPORTED_VIDEO_TYPES.get(self._mime_type, '.mp4')
+   tmp_fd, tmp_path = tempfile.mkstemp(suffix=extension)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/embedding_video/IInstance.py` around lines 33 - 38,
SUPPORTED_VIDEO_TYPES is declared but never used; update the _process_video
function to validate incoming MIME types against SUPPORTED_VIDEO_TYPES and use
the mapped extension when creating temp files (or remove the constant if you opt
not to validate). Specifically, in _process_video, read the incoming video's
MIME/type header or inferred mime variable, check membership in
SUPPORTED_VIDEO_TYPES, throw or return a clear error when unsupported, and when
creating the temporary filename use SUPPORTED_VIDEO_TYPES[mime] to set the file
extension so temp files have the correct suffix; reference SUPPORTED_VIDEO_TYPES
and the _process_video function to make the change.
SUPPORTED_VIDEO_TYPES is used at line 115 to determine the correct temp file extension based on MIME type: suffix = SUPPORTED_VIDEO_TYPES.get(self._mime_type, '.mp4'). It was added in a prior fix commit (453efd7) precisely to address the hardcoded .mp4 concern.
self._process_video(bytes(self._video_data))
self._video_data = None

def _process_video(self, video_bytes: bytes):
🧹 Nitpick | 🔵 Trivial
Consider splitting _process_video for maintainability.
Static analysis flags this method as having too many statements (64 > 50). While the current implementation is readable and follows a logical flow, consider extracting helper methods for:
- Frame extraction setup (lines 112-143)
- Single frame processing (lines 161-199)
This is optional given the method's clear structure.
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 90-90: Too many statements (64 > 50)
(PLR0915)
[warning] 90-90: Missing return type annotation for private function _process_video
Add return type annotation: None
(ANN202)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/embedding_video/IInstance.py` at line 90, The _process_video
method in IInstance.py is too large; extract the frame extraction setup logic
into a new helper (e.g., _setup_frame_extraction) and extract per-frame logic
into another helper (e.g., _process_frame) so _process_video becomes a
high-level orchestrator calling these helpers; move the code that initializes
capture, codec/format checks and any config/loop setup into
_setup_frame_extraction, move the code that decodes/processes a single frame,
computes embeddings and handles per-frame state into _process_frame, and update
_process_video to call these helpers passing required parameters (frame index,
frame bytes, capture handle, config) while preserving exceptions, return values
and variable scope.
Acknowledged. The method is inherently complex due to OpenCV frame extraction, embedding generation, and document creation all in one pass. I've kept it as a single method since extracting sub-methods would fragment the logic without improving clarity. The try/finally blocks ensure proper cleanup of temp files and VideoCapture resources.
…ls.list The validateConfig method was calling client.audio.speech.create() which generates audio and costs money on every config validation. Replace with client.models.list() which only verifies the API key is valid. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… memory Check accumulated video size during writeVideo streaming and reject with a clear error if the configurable limit (default 500 MB) is exceeded. Reads maxVideoSizeMB from node config added in services.json. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… resource leak If embedding fails mid-processing, the OpenCV VideoCapture was never released. Wrap the capture processing in a try/finally to guarantee cap.release() is always called. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace os.write/os.close pattern with 'with os.fdopen(tmp_fd, 'wb')' context manager to guarantee the file descriptor is closed even if writing raises an exception. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…coded .mp4 Look up the file extension from the SUPPORTED_VIDEO_TYPES dict based on the incoming MIME type, falling back to .mp4 only when the type is unrecognized. This ensures OpenCV can correctly identify the container format for non-MP4 videos. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Import PIL.Image once at the top of _process_video instead of re-importing on every frame iteration. This avoids redundant import overhead during frame extraction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/embedding_video/IInstance.py`:
- Around line 154-168: The loop currently calls cap.set(cv2.CAP_PROP_POS_FRAMES,
current_frame_pos) every iteration; change it to only seek when necessary by
tracking the last-read frame index (e.g., last_frame_read) and only calling
cap.set when current_frame_pos != last_frame_read + 1 or when
frame_interval_frames > 1 and the next target frame is not sequential; otherwise
use cap.read() without seeking to avoid the redundant cap.set overhead. Update
variables referenced in this loop (cap, current_frame_pos, fps,
frame_interval_frames, frames_extracted, max_frames, end_time) so
last_frame_read is updated after a successful read and the conditional seek
logic replaces the unconditional cap.set call.
- Around line 96-223: The _process_video method is too large; extract the setup
and per-frame logic into helpers to reduce complexity: pull the "Frame
extraction setup" block that computes fps, total_frames, video_duration,
start_time/end_time, interval, max_frames, start_frame, frame_interval_frames
and current_frame_pos into a new method named _setup_extraction(cap) that
returns (fps, end_time, frame_interval_frames, current_frame_pos, max_frames,
interval) and use it from _process_video; pull the per-frame steps that convert
BGR->RGB, build PIL image, call self.IGlobal.device_lock and
self.IGlobal.embedding.create_image_embedding, encode PNG, create DocMetadata
and Doc, set embedding and embedding_model, and call
self.instance.writeDocuments into a new method named _process_frame(frame,
current_frame_pos, fps) that returns the created Doc (or None on failure), then
simplify the loop in _process_video to call _process_frame and handle
frames_extracted, _frame_chunk_id, and logging exactly as before so behavior
(use of DocMetadata, Doc, self.IGlobal.device_lock, embedding.model_name, and
self.instance.writeDocuments) remains unchanged.
- Around line 59-63: The __init__ currently initializes _frame_chunk_id and
_video_data but not _mime_type which is set later in writeVideo; to prevent
potential AttributeError if _process_video is called before
writeVideo(BEGIN,...), add initialization of self._mime_type = None inside the
__init__ of the class (the same constructor that sets _frame_chunk_id and
_video_data) so subsequent calls to _process_video or any other method can
safely check or set _mime_type; update the __init__ method where _frame_chunk_id
and _video_data are set and ensure any code that uses _mime_type treats None
appropriately.
- Around line 183-185: The PNG is encoded from the original BGR variable `frame`
which is inconsistent with the earlier RGB conversion (`rgb_frame`) used for the
embedding model; update the call to cv2.imencode and subsequent base64 encoding
to use the RGB-converted variable (`rgb_frame`) (or consistently rename
variables so the image used for embeddings is the same one saved as
`frame_base64`) by replacing uses of `frame` in the cv2.imencode and
`frame_base64` construction with the RGB variable so saved PNGs match the
embedded image colors.
In `@nodes/src/nodes/embedding_video/services.json`:
- Around line 94-96: Update the "title" for the openai-patch32 entry to correct
the misleading description: change the title value for the "openai-patch32" JSON
object (the "title" field) to reflect that patch32 trades accuracy for speed,
e.g., "OpenAI CLIP - 32x32 - faster inference, lower accuracy" so it no longer
claims "better recognition."
- Around line 136-146: The schema field names in the embedding_video service use
a prefix (e.g., "embedding_video.interval", "embedding_video.max_frames") but
IGlobal.py/getNodeConfig() reads unprefixed keys (e.g., config.get('interval',
5)), so rename the schema keys to match the config lookups (remove the
"embedding_video." prefix) for all frame extraction settings (interval,
max_frames, etc.), mirroring the working pattern used by embedding_image so the
node will read user-provided values instead of falling back to defaults.
In `@nodes/src/nodes/tts_openai/IGlobal.py`:
- Around line 55-56: Update validateConfig to parse and normalize the speed the
same way beginGlobal does: call the same normalization logic used by beginGlobal
(or extract it into a helper like normalize_speed) and attempt float conversion
inside validateConfig, catching ValueError/TypeError and raising/returning a
validation error if conversion fails; ensure both validateConfig and beginGlobal
use that single normalization helper (or shared code path) so malformed speed
values are rejected during validation rather than crashing at startup.
- Around line 74-79: The f-strings in the TTS validation block use a
double-quoted separator (", ") inside the join; change these to use
single-quoted separators or hoist the joined lists into variables to comply with
the single-quote literal convention. Specifically, in IGlobal.py update the
invalid-model and invalid-voice messages used in the validation code (the
warning calls following the "Validate voice" and model check) to either use ',
'.join(sorted(VALID_MODELS)) and ', '.join(sorted(VALID_VOICES)) or assign
models_list = ', '.join(sorted(VALID_MODELS)) and voices_list = ',
'.join(sorted(VALID_VOICES)) and reference those in the f-strings so all string
literals use single quotes.
- Around line 82-86: Replace the current probe that constructs
OpenAI(api_key=apikey) and calls client.models.list() with a bounded-time,
audio-endpoint check: create the OpenAI client with an explicit timeout (e.g.,
timeout=5_000–10_000 ms) and perform a minimal, non-billable or lightweight
request against the audio endpoint (e.g., a small metadata or HEAD-style call to
/v1/audio/speech or a tiny test request) instead of models.list(); if such a
test is impossible, instead perform a best-effort probe to the audio endpoint
with the timeout set and treat network/timeout errors accordingly. Ensure you
update the code locations referencing OpenAI(api_key=apikey) and models.list()
in IGlobal.py to use the timeout parameter and the audio-endpoint probe.
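The conditional-seek suggestion in the first comment above (only call cap.set when the next target frame is not the one cap.read() would return anyway) can be sketched with a tiny stand-in for cv2.VideoCapture. `CaptureStub` and `extract_frames` are illustrative names, not project code:

```python
class CaptureStub:
    """Minimal stand-in for cv2.VideoCapture: sequential read() plus an explicit seek()."""

    def __init__(self, total_frames: int):
        self.pos = 0
        self.total = total_frames
        self.seeks = 0  # counts explicit seeks, to show the saving

    def seek(self, frame_index: int) -> None:
        # Analogous to cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index).
        self.seeks += 1
        self.pos = frame_index

    def read(self):
        # Analogous to cap.read(): returns (ok, frame) and advances by one.
        if self.pos >= self.total:
            return False, None
        frame = self.pos
        self.pos += 1
        return True, frame


def extract_frames(cap, start: int, step: int, limit: int):
    """Read every `step`-th frame, seeking only when the target frame
    is not the one cap.read() would return next."""
    frames = []
    target = start
    last_read = None
    while len(frames) < limit:
        if last_read is None or target != last_read + 1:
            cap.seek(target)  # non-sequential target: an explicit seek is required
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        last_read = target
        target += step
    return frames
```

With step=1 only the initial seek is issued and every subsequent frame is read sequentially; with a larger step each target still costs one seek, which matches the condition the reviewer describes.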
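The shared-helper idea for speed validation in the tts_openai comments might look like the sketch below. `normalize_speed` is a hypothetical helper, and the 0.25 to 4.0 clamp assumes OpenAI's documented TTS speed range:

```python
def normalize_speed(raw, default: float = 1.0) -> float:
    """Parse a user-supplied speed value.

    Intended to be called from both validateConfig and beginGlobal so
    malformed values are rejected at validation time instead of crashing
    at startup.
    """
    if raw is None or raw == '':
        return default
    try:
        speed = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f'Invalid TTS speed: {raw!r}')
    # Clamp to the assumed OpenAI TTS range (0.25 to 4.0).
    return min(max(speed, 0.25), 4.0)
```

validateConfig would call this inside a try/except and surface a validation error; beginGlobal would call the same helper and store the result, so the two code paths cannot drift apart.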
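The bounded-time probe suggested in the last comment can be expressed as a generic pattern using only the standard library. This is a sketch, not the node's implementation: the endpoint URL and helper names are illustrative, and the real node would instead pass a `timeout` to the OpenAI client.

```python
import urllib.request
import urllib.error


def key_may_be_valid(exc: urllib.error.URLError) -> bool:
    """Classify a probe failure: 401/403 mean the key is bad; other HTTP
    errors (e.g., 429) or network errors do not prove the key invalid."""
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code not in (401, 403)
    return False  # timeout / connection error: treat the probe as failed


def probe_endpoint(url: str, api_key: str, timeout_s: float = 10.0) -> bool:
    """Bounded-time probe of an API endpoint."""
    req = urllib.request.Request(url, headers={'Authorization': f'Bearer {api_key}'})
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status < 500
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        return key_may_be_valid(exc)
```

The key design point is the explicit `timeout_s` bound and the error classification: only an authentication failure should fail validation, while transient network problems are reported without claiming the key is wrong.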
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f8d5ada1-2dd5-4391-b02a-bf90a998d70c
📒 Files selected for processing (4)
- nodes/src/nodes/embedding_video/IGlobal.py
- nodes/src/nodes/embedding_video/IInstance.py
- nodes/src/nodes/embedding_video/services.json
- nodes/src/nodes/tts_openai/IGlobal.py
def _process_video(self, video_bytes: bytes):
    """
    Extract frames from video bytes and generate embeddings for each frame.

    Uses OpenCV to decode the video from an in-memory buffer, extracts
    frames at the configured interval, and creates embedding documents
    for each extracted frame.

    Args:
        video_bytes (bytes): The complete video file content.
    """
    import cv2
    import tempfile
    import os
    from PIL import Image as PILImage

    # Write video bytes to a temporary file for OpenCV to read.
    # OpenCV's VideoCapture requires a file path or device index.
    suffix = SUPPORTED_VIDEO_TYPES.get(self._mime_type, '.mp4')
    tmp_fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(tmp_fd, 'wb') as f:
            f.write(video_bytes)

        cap = cv2.VideoCapture(tmp_path)
        try:
            if not cap.isOpened():
                logger.error('Failed to open video file for frame extraction')
                return

            fps = cap.get(cv2.CAP_PROP_FPS)
            if fps <= 0:
                fps = 30.0  # Fallback to a reasonable default

            total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            video_duration = total_frames / fps

            # Determine extraction boundaries.
            start_time = self.IGlobal.start_time
            duration = self.IGlobal.duration
            if duration <= 0:
                end_time = video_duration
            else:
                end_time = min(start_time + duration, video_duration)

            interval = self.IGlobal.frame_interval
            max_frames = self.IGlobal.max_frames

            # Seek to start position if needed.
            if start_time > 0:
                start_frame = int(start_time * fps)
                cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)

            frames_extracted = 0
            frame_interval_frames = max(1, int(interval * fps))

            current_frame_pos = int(start_time * fps) if start_time > 0 else 0

            while True:
                # Check max frames limit.
                if max_frames > 0 and frames_extracted >= max_frames:
                    break

                # Check if we've gone past end time.
                current_time = current_frame_pos / fps
                if current_time >= end_time:
                    break

                # Set position and read the frame.
                cap.set(cv2.CAP_PROP_POS_FRAMES, current_frame_pos)
                ret, frame = cap.read()
                if not ret:
                    break

                # Calculate the timestamp for this frame.
                time_stamp = current_frame_pos / fps

                # Convert BGR (OpenCV default) to RGB for the embedding model.
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

                # Convert numpy array to PIL Image for the embedding model.
                pil_image = PILImage.fromarray(frame_rgb)

                # Generate the embedding with device lock for thread safety.
                with self.IGlobal.device_lock:
                    embedding = self.IGlobal.embedding.create_image_embedding(pil_image)

                # Encode the frame as base64 PNG for document storage.
                _, png_buffer = cv2.imencode('.png', frame)
                frame_base64 = base64.b64encode(png_buffer.tobytes()).decode('utf-8')

                # Create metadata with frame number and timestamp.
                metadata = DocMetadata(
                    self,
                    chunkId=self._frame_chunk_id,
                    isTable=False,
                    tableId=0,
                    isDeleted=False,
                )
                metadata.time_stamp = time_stamp
                metadata.frame_number = frames_extracted

                # Create the document with the frame image and embedding.
                doc = Doc(type='Image', page_content=frame_base64, metadata=metadata)
                doc.embedding = embedding if isinstance(embedding, list) else embedding.tolist()
                doc.embedding_model = self.IGlobal.embedding.model_name

                # Pass the document to the next pipeline stage.
                self.instance.writeDocuments([doc])

                self._frame_chunk_id += 1
                frames_extracted += 1

                # Advance to the next frame position.
                current_frame_pos += frame_interval_frames

            logger.info(
                'Video embedding complete: extracted %d frames, interval=%.1fs',
                frames_extracted,
                interval,
            )
        finally:
            cap.release()

    finally:
        # Clean up the temporary file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
🧹 Nitpick | 🔵 Trivial
Method complexity exceeds recommended threshold (68 > 50 statements).
Static analysis flags this method as too complex. While the logic is clear and well-commented, consider extracting:
- Frame extraction setup (lines 120-152) into _setup_extraction(cap) -> tuple
- Per-frame processing (lines 170-206) into _process_frame(frame, current_frame_pos, fps) -> Doc
This is noted in a past review comment and remains a recommended refactor for maintainability.
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 96-96: Too many statements (68 > 50)
(PLR0915)
[warning] 96-96: Missing return type annotation for private function _process_video
Add return type annotation: None
(ANN202)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/embedding_video/IInstance.py` around lines 96 - 223, The
_process_video method is too large; extract the setup and per-frame logic into
helpers to reduce complexity: pull the "Frame extraction setup" block that
computes fps, total_frames, video_duration, start_time/end_time, interval,
max_frames, start_frame, frame_interval_frames and current_frame_pos into a new
method named _setup_extraction(cap) that returns (fps, end_time,
frame_interval_frames, current_frame_pos, max_frames, interval) and use it from
_process_video; pull the per-frame steps that convert BGR->RGB, build PIL image,
call self.IGlobal.device_lock and self.IGlobal.embedding.create_image_embedding,
encode PNG, create DocMetadata and Doc, set embedding and embedding_model, and
call self.instance.writeDocuments into a new method named _process_frame(frame,
current_frame_pos, fps) that returns the created Doc (or None on failure), then
simplify the loop in _process_video to call _process_frame and handle
frames_extracted, _frame_chunk_id, and logging exactly as before so behavior
(use of DocMetadata, Doc, self.IGlobal.device_lock, embedding.model_name, and
self.instance.writeDocuments) remains unchanged.
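The orchestrator shape this prompt describes can be sketched in miniature. Everything below is hypothetical scaffolding (no OpenCV, no real Doc class); it only shows the setup/per-frame/orchestrator split, not the node's actual code:

```python
class VideoEmbedderSketch:
    """Shape of the proposed refactor: a thin orchestrator calling two helpers."""

    def _setup_extraction(self, total_frames: int, fps: float,
                          interval_s: float, max_frames: int):
        # All loop parameters computed in one place.
        step = max(1, int(interval_s * fps))
        planned = (total_frames + step - 1) // step  # frames the loop could yield
        return step, min(max_frames, planned)

    def _process_frame(self, frame_index: int, fps: float) -> dict:
        # Per-frame work: the real node would embed the frame and build a Doc;
        # here a plain dict stands in for the document.
        return {'frame': frame_index, 'timestamp': frame_index / fps}

    def process_video(self, total_frames: int, fps: float,
                      interval_s: float, max_frames: int) -> list:
        step, limit = self._setup_extraction(total_frames, fps, interval_s, max_frames)
        docs = []
        pos = 0
        while len(docs) < limit and pos < total_frames:
            docs.append(self._process_frame(pos, fps))
            pos += step
        return docs
```

The orchestrator stays under a dozen statements; exceptions and cleanup (try/finally around the capture and temp file) would wrap the `process_video` body exactly as in the current implementation.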
See reply on the other process_video complexity comment. Keeping it as one method for now since the logic is inherently sequential (extract frame -> embed -> create doc) with shared state.
"openai-patch32": {
    "title": "OpenAI CLIP - 32x32 - lower performance, better recognition",
    "model": "openai/clip-vit-base-patch32"
🧩 Analysis chain
🌐 Web query:
What is the difference between CLIP ViT patch16 and patch32 models in terms of performance and accuracy?
💡 Result:
The CLIP ViT models with different patch sizes differ primarily in their image encoding granularity, model architecture details, computational efficiency, and resulting performance/accuracy. Key differences:
- Patch32 models (e.g., ViT-B/32, openai/clip-vit-base-patch32): use larger 32x32 pixel patches on 224x224 input images, resulting in fewer tokens (49 spatial tokens + CLS). This leads to lower computational cost (fewer tokens for attention) but coarser spatial resolution. Model size ~151M params. Zero-shot ImageNet top-1 accuracy ~63.2% (original CLIP paper summary via DeepWiki), ~59-60% with basic prompts on the released model.
- Patch16 models (e.g., ViT-B/16, openai/clip-vit-base-patch16): use smaller 16x16 patches, yielding more tokens (196 spatial + CLS). Similar param count (~149M) but higher compute due to more tokens. Better captures fine details. Zero-shot ImageNet top-1 ~68.3%.
Smaller patch size (16 vs 32) generally improves accuracy by providing finer-grained image representations, at the cost of increased inference time and memory usage (quadratic attention scaling). Patch16 is faster than even larger models like ViT-L/14 in some contexts but slower than patch32. Both outperform RN50 equivalents but trade off speed vs accuracy.
Citations:
- 1: https://deepwiki.com/openai/CLIP/3.3-model-variants-and-parameters
- 2: https://proceedings.mlr.press/v139/radford21a/radford21a.pdf
- 3: https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf
- 4: About the ImageNet zero-shot performance with the released models openai/CLIP#24
- 5: https://huggingface.co/openai/clip-vit-base-patch16
Fix inaccurate model description for openai-patch32.
The description "lower performance, better recognition" is incorrect. According to CLIP benchmarks, patch32 achieves ~59-60% zero-shot ImageNet accuracy, while patch16 achieves ~68.3%—patch32 has lower accuracy, not better recognition. Update the title to accurately reflect that patch32 prioritizes inference speed over recognition accuracy, e.g., "OpenAI CLIP - 32x32 - faster inference, lower accuracy" or similar.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nodes/src/nodes/embedding_video/services.json` around lines 94 - 96, Update
the "title" for the openai-patch32 entry to correct the misleading description:
change the title value for the "openai-patch32" JSON object (the "title" field)
to reflect that patch32 trades accuracy for speed, e.g., "OpenAI CLIP - 32x32 -
faster inference, lower accuracy" so it no longer claims "better recognition."
The model choices (CLIP patch16/patch32 and Google ViT) are intentionally the same as embedding_image since they're the proven vision models in the codebase. The Embedding class from embedding_image handles both model families. Users can also specify a custom model via the 'custom' profile option.
asclearuc left a comment
Thank you for this contribution — reusing embedding_image.Embedding for frame-level
inference is the right architectural decision, and the AVI streaming protocol is
correctly implemented.
Inline comments have been left on specific lines.
One structural note: tts_openai/IGlobal.py is included in this PR but is a
completely separate node with no relation to video embedding. Please move it to its
own PR.
- Remove unrelated tts_openai/IGlobal.py (belongs in its own PR)
- Use rocketlib debug/warning instead of Python logging module
- Use os.path.join instead of string concatenation for requirements path
- Add OPEN_MODE.CONFIG guard to skip model loading in config UI
- Initialize self._mime_type in __init__ to prevent AttributeError
- Use ai.common.opencv instead of direct cv2 import
- Use ai.common.image.ImageProcessor instead of PIL directly
- Batch writeDocuments into a single call instead of per-frame
- Remove opencv-python-headless from requirements.txt (ai.common.opencv handles it)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
♻️ Duplicate comments (2)
nodes/src/nodes/embedding_video/IInstance.py (2)
93-93: 🧹 Nitpick | 🔵 Trivial

Consider adding a return type annotation for consistency.
Static analysis flags the missing return type annotation. While not critical, adding -> None improves type safety and IDE support.

- def _process_video(self, video_bytes: bytes):
+ def _process_video(self, video_bytes: bytes) -> None:

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/embedding_video/IInstance.py` at line 93, add an explicit return type annotation -> None to the _process_video method (def _process_video(self, video_bytes: bytes)) so the signature becomes def _process_video(self, video_bytes: bytes) -> None; this satisfies static analysis and improves type safety/IDE support without changing behavior.
Lines 78-91: ⚠️ Potential issue | 🟡 Minor

Consider raising an error when the video size limit is exceeded.

When the video exceeds `max_video_size_bytes`, a warning is logged (line 83) but the pipeline item completes with zero documents on `AVI_ACTION.END`. This silent completion might mask issues in production pipelines where users expect either documents or an explicit failure.

Option: raise an error instead of completing silently:

```diff
 elif action == AVI_ACTION.WRITE:
     if self._video_data is not None:
         max_size = self.IGlobal.max_video_size_bytes
         if len(self._video_data) + len(buffer) > max_size:
             max_mb = max_size / (1024 * 1024)
-            warning(f'Video exceeds maximum allowed size of {max_mb:.0f} MB, rejecting')
-            self._video_data = None
-            return
+            self._video_data = None
+            raise ValueError(f'Video exceeds maximum allowed size of {max_mb:.0f} MB')
         self._video_data += buffer
```

Alternatively, if silent completion is the intended behavior, consider documenting this in the method docstring to clarify the design decision.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/embedding_video/IInstance.py` around lines 78 - 91, The current logic in the AVI_ACTION.WRITE branch silently drops video data when len(self._video_data) + len(buffer) > self.IGlobal.max_video_size_bytes; update this to raise a clear exception (e.g., ValueError or a custom VideoSizeExceededError) instead of just warning and clearing self._video_data so callers see a failure, and ensure the exception is raised before clearing self._video_data; reference AVI_ACTION.WRITE, self._video_data, self.IGlobal.max_video_size_bytes and preserve existing behavior for other branches (including calling self._process_video in AVI_ACTION.END) so failures propagate to the pipeline rather than producing zero documents.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nodes/src/nodes/embedding_video/IInstance.py`:
- Line 93: Add an explicit return type annotation -> None to the _process_video
method in IInstance (def _process_video(self, video_bytes: bytes)) so the
signature becomes def _process_video(self, video_bytes: bytes) -> None; this
satisfies static analysis and improves type safety/IDE support without changing
behavior.
- Around line 78-91: The current logic in the AVI_ACTION.WRITE branch silently
drops video data when len(self._video_data) + len(buffer) >
self.IGlobal.max_video_size_bytes; update this to raise a clear exception (e.g.,
ValueError or a custom VideoSizeExceededError) instead of just warning and
clearing self._video_data so callers see a failure, and ensure the exception is
raised before clearing self._video_data; reference AVI_ACTION.WRITE,
self._video_data, self.IGlobal.max_video_size_bytes and preserve existing
behavior for other branches (including calling self._process_video in
AVI_ACTION.END) so failures propagate to the pipeline rather than producing zero
documents.
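The size-limit behavior discussed above can be isolated as a small helper. This is an illustrative sketch only: the `append_chunk` helper and `VideoSizeExceededError` name are hypothetical, not part of the node's API.

```python
class VideoSizeExceededError(ValueError):
    """Raised when an incoming video exceeds the configured size limit."""


def append_chunk(video_data: bytearray, chunk: bytes, max_size: int) -> bytearray:
    # Reject the upload with an explicit error instead of silently dropping it,
    # so the pipeline surfaces a failure rather than completing with zero documents.
    if len(video_data) + len(chunk) > max_size:
        raise VideoSizeExceededError(
            f'Video exceeds maximum allowed size of {max_size / (1024 * 1024):.0f} MB')
    video_data += chunk
    return video_data
```

Raising here propagates the failure to the WRITE caller, which matches the "explicit failure" option in the review comment.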
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ea0d17fb-99d7-43c4-9c59-45f9a3d9c46d
📒 Files selected for processing (3)
- nodes/src/nodes/embedding_video/IGlobal.py
- nodes/src/nodes/embedding_video/IInstance.py
- nodes/src/nodes/embedding_video/requirements.txt
Addresses review feedback on PR rocketride-org#516 — when the video size limit is exceeded, the END action now logs a warning instead of silently producing zero documents. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addressed all review feedback:
Note: Thanks for the thorough review!
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/embedding_video/IInstance.py`:
- Around line 185-194: The metadata currently sets metadata.frame_number =
frames_extracted (the sample ordinal); instead it must store the original source
frame index—replace frames_extracted with the variable that represents the video
frame index produced by the reader (the original frame
index/frame_id/frame_index used when extracting frames) when creating
DocMetadata (DocMetadata, metadata.frame_number, time_stamp); if such a variable
isn't available in this scope obtain it from the frame extraction call or
compute it from time_stamp and the video's fps and assign that value to
metadata.frame_number.
- Around line 88-93: The END branch in the AVI action handler copies the
bytearray into a bytes object and only clears self._video_data after
_process_video, doubling memory; fix by moving the buffer to a local variable
(e.g., assign video_bytes = bytes(self._video_data) or better assign buffer =
self._video_data), set self._video_data = None immediately, then call
self._process_video using the local variable (or have _process_video consume the
buffer) to avoid simultaneous duplicates; for large uploads consider spooling
chunks to disk during the WRITE handling as an alternative.
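The buffer-handoff pattern recommended above can be sketched in isolation. `VideoBuffer` here is a hypothetical stand-in for the node's `IInstance`, showing only the END handling:

```python
class VideoBuffer:
    """Toy model of the recommended END handling: move the accumulated
    bytearray into a local variable and drop the instance reference first,
    so two copies of a large video never coexist longer than necessary."""

    def __init__(self):
        self._video_data = bytearray()
        self.processed_size = None

    def end(self):
        video_data = self._video_data   # take ownership locally
        self._video_data = None         # instance no longer pins the buffer
        if video_data:
            self._process_video(bytes(video_data))

    def _process_video(self, video_bytes: bytes):
        # In the real node this would write to a temp file, extract frames,
        # and embed them; here we only record the payload size.
        self.processed_size = len(video_bytes)
```

Because the instance attribute is cleared before `bytes(video_data)` is built, the garbage collector can reclaim the bytearray as soon as processing finishes with it.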
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f68bee6b-5731-440a-bf17-d89324752523
📒 Files selected for processing (1)
nodes/src/nodes/embedding_video/IInstance.py
Please pay attention to CodeRabbit comments.
- Avoid 2x memory usage on AVI_ACTION.END by moving buffer to local before processing and clearing the instance variable first
- Use source frame index (current_frame_pos) for metadata.frame_number instead of extraction ordinal (frames_extracted)
- Increase model name maxLength from 32 to 256 to accommodate longer HuggingFace model identifiers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
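When the reader's position is not tracked directly, the earlier review prompt notes the source frame index can also be recovered from the timestamp and the video's frame rate. A minimal sketch; the `source_frame_index` helper name is hypothetical:

```python
def source_frame_index(time_stamp: float, fps: float) -> int:
    """Recover the original video frame index from a timestamp in seconds.

    The source index is the timestamp multiplied by the frame rate,
    rounded to the nearest whole frame.
    """
    return round(time_stamp * fps)
```

This keeps `metadata.frame_number` tied to the source video rather than the extraction ordinal, even for non-integer frame rates like 29.97.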
@asclearuc Addressed all CodeRabbit feedback from both review rounds: Latest round (post-push):
Also fixed:
The remaining CodeRabbit comments from the first round (nitpicks about method complexity, redundant |
Actionable comments posted: 1
♻️ Duplicate comments (2)
nodes/src/nodes/embedding_video/services.json (2)
Lines 136-173: ⚠️ Potential issue | 🟠 Major

Schema field names mismatch config lookups in IGlobal.py.

The field names use the `embedding_video.` prefix (e.g., `embedding_video.interval`, `embedding_video.max_frames`), but based on past review analysis, `IGlobal.py` retrieves values with unprefixed keys like `config.get('interval', 5)`. Since `getNodeConfig()` does not strip prefixes, all frame extraction settings will use hardcoded defaults instead of user configuration.

```bash
#!/bin/bash
# Verify the config key mismatch between services.json and IGlobal.py
echo "=== Field names in services.json ==="
rg '"embedding_video\.' nodes/src/nodes/embedding_video/services.json | head -20
echo ""
echo "=== Config lookups in IGlobal.py ==="
rg "config\.get\(" nodes/src/nodes/embedding_video/IGlobal.py -B1 -A1
echo ""
echo "=== Compare with working embedding_image pattern ==="
rg "config\.get\(" nodes/src/nodes/embedding_image/IGlobal.py -B1 -A1 2>/dev/null || echo "No IGlobal.py found for embedding_image"
```
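One way to tolerate both prefixed and bare keys while the schema and code are reconciled is a small fallback lookup. This is a hypothetical helper for illustration, not the actual `getNodeConfig()` behaviour:

```python
def get_node_setting(config: dict, prefix: str, key: str, default):
    """Prefer the schema-prefixed key (e.g. 'embedding_video.interval'),
    fall back to the bare key, then to the hardcoded default."""
    return config.get(f'{prefix}.{key}', config.get(key, default))
```

With this pattern, user-configured values stored under `embedding_video.*` are found first, so frame extraction settings no longer silently collapse to defaults.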
Lines 94-96: ⚠️ Potential issue | 🟡 Minor

Fix inaccurate `openai-patch32` profile description.

The title "lower performance, better recognition" is misleading. CLIP patch32 achieves ~59-60% zero-shot ImageNet accuracy vs. ~68.3% for patch16—it has lower accuracy, not "better recognition." The tradeoff is faster inference speed.

🔧 Suggested fix

```diff
 "openai-patch32": {
-    "title": "OpenAI CLIP - 32x32 - lower performance, better recognition",
+    "title": "OpenAI CLIP - 32x32 - faster inference, lower accuracy",
     "model": "openai/clip-vit-base-patch32"
 },
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nodes/src/nodes/embedding_video/services.json` around lines 94 - 96, The "openai-patch32" profile title is misleading; update the "title" field for the "openai-patch32" object in services.json to accurately reflect that CLIP patch32 has lower accuracy but faster inference (e.g., "OpenAI CLIP - 32x32 - lower accuracy, faster inference"); locate the object by the key "openai-patch32" and replace the current title string accordingly while leaving the "model" value ("openai/clip-vit-base-patch32") unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nodes/src/nodes/embedding_video/IInstance.py`:
- Around line 88-95: The AVI_ACTION.END branch clears self._video_data into
local video_data but then mistakenly checks and uses self._video_data (always
None), preventing _process_video from ever running; fix by using the local
variable video_data for the null/length checks and when calling
self._process_video(bytes(video_data)), and only set self._video_data = None
after processing (or leave the initial capture and final clear as currently
intended) so valid video payloads are processed and the warning is only emitted
when video_data is None or empty.
---
Duplicate comments:
In `@nodes/src/nodes/embedding_video/services.json`:
- Around line 94-96: The "openai-patch32" profile title is misleading; update
the "title" field for the "openai-patch32" object in services.json to accurately
reflect that CLIP patch32 has lower accuracy but faster inference (e.g., "OpenAI
CLIP - 32x32 - lower accuracy, faster inference"); locate the object by the key
"openai-patch32" and replace the current title string accordingly while leaving
the "model" value ("openai/clip-vit-base-patch32") unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ba836264-ec47-47aa-a51d-fcf043f9db07
📒 Files selected for processing (2)
- nodes/src/nodes/embedding_video/IInstance.py
- nodes/src/nodes/embedding_video/services.json
Senior Review — PR #516: feat(nodes): add video embedding node

@charliegillet, useful addition — CLIP/ViT integration for video embeddings opens up good downstream use cases. The core approach is sound. There are several issues to address, roughly in priority order:

Thread safety (high priority)

Embedding model calls (CLIP/ViT inference) are not thread-safe on shared GPU/CPU devices. You need a `with self._device_lock:` guard around inference:

```python
with self._device_lock:
    embeddings = model.encode(frames, ...)
```

Temp file cleanup (high priority)

Temporary files created during video frame extraction must be cleaned up in a `try`/`finally` block:

```python
try:
    # extract frames to temp files
    # process frames
finally:
    # clean up temp files
```

Memory implications for large videos

Processing large videos can consume significant memory when extracting and holding many frames simultaneously. Consider:

Config double-processing

There appears to be an issue where configuration is processed twice during initialization. Trace through the init path and ensure config parsing happens exactly once.

The foundation is good. Thread safety and temp file cleanup are the most important fixes — address those first.
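The cleanup pattern the reviewer asks for can be made concrete with Python's `tempfile` module. A minimal sketch, assuming the node spools the payload to disk for OpenCV; the `with_temp_video` helper name is illustrative, not code from the node:

```python
import os
import tempfile


def with_temp_video(video_bytes: bytes, process):
    """Write the video payload to a temp file, run process(path), and
    always delete the file, even if frame extraction raises."""
    fd, path = tempfile.mkstemp(suffix='.mp4')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(video_bytes)
        return process(path)
    finally:
        # Runs on both success and exception, so no temp files accumulate.
        if os.path.exists(path):
            os.remove(path)
```

In the node, `process` would be the OpenCV `VideoCapture` based frame extraction; the `finally` guarantees cleanup on error paths.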
🚀 Merge Request

Useful CLIP/ViT integration for video semantic search. Before merge:

Thread safety is the critical item.
…iew feedback Fix the critical bug where self._video_data references after clearing caused _process_video to never be called. Also fix Embedding constructor receiving double-processed config, and align services.json field prefixes with the node's "embedding" prefix so config values are properly resolved. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi @asclearuc, thanks for the thorough review. Here's a summary of what's been addressed: New fixes in 8b922fd:
Previously fixed (confirmed working):
Not applicable to this PR:
All Python files pass |
Latest fixes
…eads Tracks the last successfully-read frame position so cap.set() is only called when the next requested frame isn't already the one VideoCapture would return next. Addresses CodeRabbit nitpick about per-iteration seeks adding overhead when frame_interval_frames is small relative to the video FPS. Also clarifies via comment that the existing BGR->PNG pipeline is correct: cv2.imencode treats its input as BGR and writes a standard PNG, which ImageProcessor then loads as a correct RGB PIL image for the embedding model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
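The seek-avoidance logic this commit describes can be modelled without OpenCV: decide per sampled frame whether a `cap.set()` seek is actually needed, or whether a sequential `cap.read()` already points at the wanted frame. A hedged sketch; `plan_reads` is a hypothetical helper, not code from the node:

```python
def plan_reads(total_frames: int, step: int, start: int = 0):
    """Yield (frame_index, needs_seek) for sampling every `step` frames.

    A seek is only needed when the wanted frame differs from the one a
    sequential read would return next; after reading frame i, the reader
    naturally sits at frame i + 1.
    """
    next_sequential = 0
    for idx in range(start, total_frames, step):
        yield idx, idx != next_sequential
        next_sequential = idx + 1
```

With `step == 1` no seeks are ever issued; larger intervals seek only between samples, which mirrors the "only call cap.set() when needed" optimization above.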
Thanks for the review. All your inline points have been addressed:
Ready for re-review. |
Addressed in commit |
No change — the current code is correct and I've added a clarifying comment in commit |
Acknowledged but not applied. The method reads top-to-bottom as a clear pipeline (open → configure → loop → emit → cleanup), and the proposed helpers would need to share |
No change — this is a false positive. The services.json fields use an |
Summary
- Adds an `embedding_video` pipeline node that extracts frames from video input at configurable intervals and generates vector embeddings using CLIP or ViT vision models
- Accepts video via the `writeVideo` AVI streaming interface (MP4, AVI, MOV, WebM) and outputs documents with frame embeddings, timestamps, and frame metadata compatible with all existing embedding store nodes
- Reuses the `embedding_image.Embedding` class for model inference, ensuring consistency with the image embedding node and avoiding code duplication

Type
Feature
Why this feature fits this codebase
RocketRide's pipeline architecture already supports image embedding via
`nodes/src/nodes/embedding_image/`, which defines an `Embedding` class for CLIP/ViT model inference. The codebase uses a standard `IGlobal`/`IInstance`/`services.json` node pattern where `IGlobal.beginGlobal()` loads dependencies and models, `IInstance` handles streaming data, and `services.json` declares lanes, preconfig profiles, and fields. Video is a natural extension of this pattern: the new `embedding_video` node slots into the same pipeline topology (producing `documents` from a `video` lane in `services.json`) and reuses `embedding_image.embedding.Embedding` directly for frame-level inference rather than duplicating model-loading logic. The AVI streaming protocol (`AVI_ACTION.BEGIN`/`WRITE`/`END`) is an existing RocketRide convention already used by other media nodes, so the `IInstance.writeVideo()` method follows an established contract. The output `Doc` objects with `embedding`, `embedding_model`, and `DocMetadata` fields are the same schema consumed by downstream store nodes (Pinecone, Weaviate, Chroma), requiring zero changes to the rest of the pipeline.

What changed
- `nodes/src/nodes/embedding_video/IGlobal.py`: creates the `embedding_image.Embedding`, stores frame extraction config (`frame_interval`, `max_frames`, `start_time`, `duration`, `max_video_size_bytes`), creates `device_lock` for thread safety
- `nodes/src/nodes/embedding_video/IInstance.py`: implements `writeVideo()` (BEGIN/WRITE/END protocol), extracts frames with OpenCV at configured intervals in `_process_video()`, converts BGR to RGB PIL images, generates embeddings via `IGlobal.embedding.create_image_embedding()`, outputs `Doc` objects with base64 PNG content, embedding vectors, and `time_stamp`/`frame_number` metadata
- `nodes/src/nodes/embedding_video/__init__.py`: exports `IGlobal` and `IInstance`
- `nodes/src/nodes/embedding_video/requirements.txt`: `transformers`, `accelerate`, `opencv-python-headless`
- `nodes/src/nodes/embedding_video/services.json`: `embedding_video://` protocol, `video -> documents` lane, preconfig profiles for `openai-patch16`, `openai-patch32`, `google16x224`, and `custom`, field definitions for interval/max_frames/start_time/duration/maxVideoSizeMB, and test case config

Validation
- `openai-patch16` profile and send an MP4 video through the `writeVideo` interface
- `minLength: 384`
- `max_frames` limit and `start_time`/`duration` boundaries
- `time_stamp` and `frame_number`
- `ruff check` and `ruff format` -- should pass clean

How this could be extended
The same frame-extraction-then-embed pattern could be adapted for audio embedding (extracting spectrogram frames at intervals), GIF/animated image embedding, or multi-modal embedding where frame embeddings are combined with transcript embeddings from the same video. The `services.json` preconfig system makes it straightforward to add new vision models as profiles.

Closes #410
#Hack-with-bay-2
Summary by CodeRabbit
`embedding_video://` registration for easy use.