Advanced user's guide
The GraphAI endpoints in /video, /image, /voice, and /translation are mainly designed for processing videos and translating text, and they share a common token-based caching logic. This page describes the logic of tokens and caching, and provides an example of how these endpoints can be used.
The video processing endpoints, contained in the /video, /image, and /voice endpoint groups, perform the tasks of video retrieval, audio and slide extraction, audio transcription, and slide text extraction.
The /translation endpoints currently perform en-fr, fr-en, it-en, and de-en translation.
Each of the aforementioned API endpoints returns either a token (or a list of tokens) or a final result.
A token is the identifier of a file. Tokens are the input to the following endpoints:
- All /image endpoints
- All /voice endpoints
- All /video endpoints except /video/retrieve_url
The input to the /video/retrieve_url endpoint is a URL, and all /translation endpoints receive a text (usually along with a source and a target language) as input.
Tokens are the output of the following endpoints:
- /video/retrieve_url (single video file token)
- /video/detect_slides (list of slide file tokens)
- /video/extract_audio (single audio file token)
- All .../calculate_fingerprint endpoints, which return the fingerprint of the input token and the token of the closest match for the input token (or input text in the case of /translation/calculate_fingerprint)
All the other endpoints return a final result as their output.
- Retrieving the video file
  - POST /video/retrieve_url with video URL → task_id
  - GET /video/retrieve_url/status/{task_id} → video_token
- Detecting slides
  - POST /video/detect_slides with video_token → task_id
  - GET /video/detect_slides/status/{task_id} → list of slide_tokens
- Extracting audio
  - POST /video/extract_audio with video_token → task_id
  - GET /video/extract_audio/status/{task_id} → audio_token
- Performing OCR on a slide
  - POST /image/extract_text with slide_token → task_id
  - GET /image/extract_text/status/{task_id} → text extracted from slide
- Transcribing audio (creates full transcript + timestamped subtitles)
  - POST /voice/transcribe with audio_token → task_id
  - GET /voice/transcribe/status/{task_id} → transcript and subtitles for audio
- Detecting the language of some text (OCR results or transcript)
  - POST /translation/detect_language with slide or audio text → task_id
  - GET /translation/detect_language/status/{task_id} → text language
- Translating text
  - POST /translation/translate with the text (+ src and tgt lang) → task_id
  - GET /translation/translate/status/{task_id} → translated text
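The POST-then-poll pattern above can be sketched as a small client helper. The base URL, the JSON field names (task_id), and the Celery-style task_status values are assumptions for illustration only; check the API's actual response schemas before relying on them.

```python
import time
from typing import Any, Callable, Dict

BASE_URL = "http://localhost:8000"  # assumed deployment URL; adjust as needed


def status_url(group: str, endpoint: str, task_id: str) -> str:
    """Build the GET status URL, e.g. /video/retrieve_url/status/{task_id}."""
    return f"{BASE_URL}/{group}/{endpoint}/status/{task_id}"


def poll(fetch_status: Callable[[], Dict[str, Any]],
         interval: float = 2.0, timeout: float = 600.0) -> Dict[str, Any]:
    """Call fetch_status() until the task leaves a pending state.

    The 'task_status' field and its values ('PENDING', 'STARTED', ...) are
    assumptions about the response shape, not confirmed by the API docs.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = fetch_status()
        if payload.get("task_status") not in ("PENDING", "STARTED"):
            return payload
        time.sleep(interval)
    raise TimeoutError("task did not finish within the timeout")


# Usage sketch (requires a running server; 'url' is an assumed field name):
#   import requests
#   resp = requests.post(f"{BASE_URL}/video/retrieve_url",
#                        json={"url": "https://example.com/lecture.mp4"})
#   task_id = resp.json()["task_id"]
#   result = poll(lambda: requests.get(
#       status_url("video", "retrieve_url", task_id)).json())
```

The same poll helper works for every step in the pipeline, since all the endpoints share the task_id/status convention.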
All the results of the aforementioned endpoints are cached using a two-layer caching system in order to avoid unnecessary recomputation. The caching tables live in the cache_graphai schema.
The first, basic layer of caching is based on the tokens themselves: if the results of the requested operation for this particular token already exist in the corresponding cache table, they are returned instantly without recomputation.
The second, more interesting layer is based on fingerprinting. All the endpoints that generate a token as their output, plus the /translation/translate endpoint, perform fingerprinting automatically before computing their own results. Fingerprinting allows the caching system to detect when a new token has the same content as an existing token, and thus to return the already-computed results for such a token. The fingerprinting algorithms used for images, audio files, and text are perceptual hashes, making them robust to slight changes that do not meaningfully alter content, e.g. the transcoding of media files or the omission of punctuation marks in text. In other words, they are capable of detecting non-exact matches. The video fingerprinting algorithm, however, is a simple MD5 hash, which can only detect exact matches.
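A toy illustration of the difference between exact and perceptual matching (the actual fingerprinting algorithms are more sophisticated than this): an MD5 hash changes with any edit, while a fingerprint computed on normalized text ignores trivial differences such as punctuation and casing.

```python
import hashlib
import string


def md5_fingerprint(text: str) -> str:
    """Exact fingerprint: any change to the input changes the hash."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def toy_text_fingerprint(text: str) -> str:
    """Toy perceptual-style fingerprint: hash the text after stripping
    punctuation, casing, and extra whitespace, so trivial edits map to
    the same value. Illustrative only, not the real algorithm."""
    normalized = "".join(c for c in text.lower() if c not in string.punctuation)
    normalized = " ".join(normalized.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

With these definitions, "Hello, world!" and "hello world" get different MD5 hashes but the same toy fingerprint, which is exactly the property that lets the cache treat lightly edited text as the same content.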
The fingerprinting pipeline for tokens follows a common logic. It creates a "closest-match" graph for each group, which we can think of as a directed acyclic graph where each token is a node, and an edge from node B to node A indicates that the closest match for token B is token A, and thus the results already computed for A will be returned from the cache for B. An additional constraint is that the out-degree of each node is at most 1, while there is no limit on the in-degree. The fingerprint lookup algorithm, which searches among existing fingerprints for the closest match to a new token, chooses the oldest existing token if there are multiple matches; this helps ensure consistency. The voice and image closest-match graphs inherit relationships from the video closest-match graph. As a result, voice and image fingerprint lookups first resolve any chains of tokens inherited from video.
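The lookup logic described above can be modeled in a few lines. The data structures and names here are illustrative, not the actual implementation: edges are a mapping from each token to its single closest match (out-degree at most 1), and a lookup follows the chain until it reaches a token with no outgoing edge.

```python
from typing import Dict, List


def resolve_closest_match(edges: Dict[str, str], token: str) -> str:
    """Follow closest-match edges (out-degree at most 1 per node) until
    reaching a token with no outgoing edge; that token's cached results
    are the ones returned for the input token."""
    seen = set()
    while token in edges:
        if token in seen:  # the graph is acyclic, but guard anyway
            raise ValueError("cycle in closest-match graph")
        seen.add(token)
        token = edges[token]
    return token


def oldest_match(candidates: List[str], created_at: Dict[str, int]) -> str:
    """When several existing fingerprints match a new token, prefer the
    oldest one, so repeated lookups resolve to the same node."""
    return min(candidates, key=lambda t: created_at[t])
```

For example, if C's closest match is B and B's closest match is A, a lookup for C resolves through the chain to A, mirroring how voice and image lookups first resolve chains inherited from the video graph.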
Most endpoints have a force flag that, if set to true (and if the token has not expired), will force a recomputation of the results and will skip the cache lookup. This flag is false by default and should be left that way for almost all cases, since we want cache hits, and the endpoints will all perform the computation anyway in case of a cache miss.
Tokens persist permanently in the cache database. However, their corresponding files may be deleted, either through a deliberate cleanup or through a deletion of the server where they reside (which is the case for the RCP server). The active flag in a token's token_status value (when calling endpoints that return a token: /retrieve_url, /detect_slides, and /extract_audio) indicates whether the file for this token is available. For an inactive token, new computations are impossible and only cached results may be returned.
Of course, sometimes you may want to perform computations on a cached but inactive token. Here is what you can do to 'reactivate' the token:
- Video token: Call /video/retrieve_url with the token's original URL, with force=True.
- Slide tokens: Call /video/detect_slides with the video token and with recalculate_cached=True.
- Audio token: Call /video/extract_audio with the video token and with recalculate_cached=True.
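As a sketch, the three reactivation calls might be built like this. The base URL and the JSON field names (url, token, force, recalculate_cached) are assumptions based on the parameter names above; verify them against the endpoint schemas before use.

```python
from typing import Dict, Tuple

BASE_URL = "http://localhost:8000"  # assumed deployment URL; adjust as needed


def reactivation_request(kind: str, value: str) -> Tuple[str, Dict]:
    """Return the (endpoint, JSON payload) pair for reactivating a token.

    kind is 'video' (value = the token's original URL), or 'slides' /
    'audio' (value = the video token). Field names are illustrative.
    """
    if kind == "video":
        return f"{BASE_URL}/video/retrieve_url", {"url": value, "force": True}
    if kind == "slides":
        return f"{BASE_URL}/video/detect_slides", {"token": value, "recalculate_cached": True}
    if kind == "audio":
        return f"{BASE_URL}/video/extract_audio", {"token": value, "recalculate_cached": True}
    raise ValueError(f"unknown token kind: {kind}")


# Usage sketch (requires a running server):
#   import requests
#   endpoint, payload = reactivation_request("audio", video_token)
#   task_id = requests.post(endpoint, json=payload).json()["task_id"]
```

Note that only the video token is reactivated from its original URL; slide and audio tokens are regenerated from the (reactivated) video token.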