Advanced user's guide

Ramtin Yazdanian edited this page Apr 29, 2024 · 1 revision

The GraphAI endpoints in /video, /image, /voice, and /translation are mainly designed for processing videos and translating text, and share a common token-based caching logic. This page describes the logic of tokens and caching, and provides an example of how these endpoints can be used.

Functionalities

Video processing

The video processing endpoints, contained in the /video, /image, and /voice endpoint groups, perform video retrieval, audio and slide extraction, audio transcription, and slide text extraction.

Translation

The /translation endpoints currently perform en-fr, fr-en, it-en, and de-en translation.

Tokens and caching

Tokens and final results

Each of the aforementioned API endpoints returns either a token (or a list of tokens) or a final result.

A token is the identifier of a file. Tokens are the input to the following endpoints:

  • All /image endpoints
  • All /voice endpoints
  • All /video endpoints except /video/retrieve_url

The input to the /video/retrieve_url endpoint is a URL, and all /translation endpoints receive a text (usually along with a source and a target language) as input.

Tokens are the output of the following endpoints:

  • /video/retrieve_url (single token of the video file)
  • /video/detect_slides (list of slide file tokens)
  • /video/extract_audio (single audio file token)
  • All .../calculate_fingerprint endpoints, which return the fingerprint of the input token and the token of the closest match for the input token (or input text in case of /translation/calculate_fingerprint)

All the other endpoints return a final result as their output.

Example usage

Fully processing a video

  1. Retrieving the video file
    • POST /video/retrieve_url with video url → task_id
    • GET /video/retrieve_url/status/{task_id} → video_token
  2. Detecting slides
    • POST /video/detect_slides with video_token → task_id
    • GET /video/detect_slides/status/{task_id} → list of slide_tokens
  3. Extracting audio
    • POST /video/extract_audio with video_token → task_id
    • GET /video/extract_audio/status/{task_id} → audio_token
  4. Performing OCR on a slide
    • POST /image/extract_text with slide_token → task_id
    • GET /image/extract_text/status/{task_id} → Text extracted from slide
  5. Transcribing audio (creates full transcript + timestamped subtitles)
    • POST /voice/transcribe with audio_token → task_id
    • GET /voice/transcribe/status/{task_id} → Transcript and subtitles for audio
  6. Detecting the language of some text (OCR results or transcript)
    • POST /translation/detect_language with slide or audio text → task_id
    • GET /translation/detect_language/status/{task_id} → Text language
  7. Translating text
    • POST /translation/translate with the text (+ src and tgt lang) → task_id
    • GET /translation/translate/status/{task_id} → Translated text

Advanced guide

Caching logic

All the results of the aforementioned endpoints are cached using a two-layer caching system in order to avoid unnecessary recomputation. The caching tables live in the cache_graphai schema.

The first, basic layer of caching is based on the tokens themselves: if the results of the requested operation for this particular token already exist in the corresponding cache table, they are returned instantly without recomputation.
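This first layer amounts to a keyed lookup, sketched below with an in-memory dict standing in for the cache table (the real service uses database tables in the cache_graphai schema):

```python
cache = {}  # (operation, token) -> cached result; stands in for a cache table

def cached_compute(operation, token, compute):
    """Return the cached result for (operation, token), computing only on a miss."""
    key = (operation, token)
    if key not in cache:
        cache[key] = compute(token)  # cache miss: compute and store
    return cache[key]                # cache hit: return instantly
```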

The second, more interesting layer is based on fingerprinting. All the endpoints that generate a token as their output, plus the /translation/translate endpoint, perform fingerprinting automatically before computing their own results. Fingerprinting allows the caching system to detect when a new token has the same content as an existing token, and thus to return the already-computed results for that token. The fingerprinting algorithms used for images, audio files, and text are perceptual hashes, making them robust to slight changes that do not meaningfully alter content, e.g. the transcoding of media files or the omission of punctuation marks in text. In other words, they are capable of detecting non-exact matches. The video fingerprinting algorithm, however, is a simple MD5 hash, which can only detect exact matches.
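The contrast between exact and perceptual matching can be illustrated as follows. The MD5 hash for video is stated above; the text fingerprint here is only a crude stand-in (normalise case, punctuation, and whitespace before hashing), since the real perceptual algorithms are more sophisticated.

```python
import hashlib

def video_fingerprint(data: bytes) -> str:
    """Exact fingerprint: MD5 of the raw bytes, so only identical files match."""
    return hashlib.md5(data).hexdigest()

def text_fingerprint(text: str) -> str:
    """Illustrative stand-in for a perceptual text fingerprint: normalise
    case, punctuation, and whitespace before hashing, so trivially different
    versions of a text collide on purpose."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return hashlib.md5(" ".join(kept.split()).encode()).hexdigest()
```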

The fingerprinting pipeline for tokens follows a common logic. It creates a "closest-match" graph for each group, which we can think of as a directed acyclic graph where each token is a node, and an edge from node B to node A indicates that the closest match for token B is token A, and thus the results already computed for A will be returned from the cache for B. An additional constraint is that the out-degree of each node is at most 1, while there is no limit on the in-degree. The fingerprint lookup algorithm, which searches the existing fingerprints for the closest match to a new token, chooses the oldest existing token when there are multiple matches; this helps ensure consistency. The voice and image closest-match graphs inherit relationships from the video closest-match graph. As a result, voice and image fingerprint lookups first resolve any chains of tokens inherited from video.
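Because each node has out-degree at most 1, the closest-match graph can be represented as a simple mapping, and resolving a chain of matches is a walk to the end of the chain. This is a minimal sketch of that structure, not the actual implementation:

```python
closest_match = {}  # token -> its closest match (out-degree at most 1)

def set_closest_match(token, match):
    """Record that `match` is the closest existing token for `token`."""
    closest_match[token] = match

def resolve(token):
    """Follow the chain of closest matches to the token whose cached results
    will actually be served, e.g. resolving relationships that voice and
    image tokens inherit from the video closest-match graph."""
    while token in closest_match:
        token = closest_match[token]
    return token
```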

Forced recomputations

Most endpoints have a force flag that, if set to true (and if the token has not expired), forces a recomputation of the results, skipping the cache lookup. This flag is false by default and should be left that way in almost all cases: cache hits are desirable, and the endpoints perform the computation anyway in case of a cache miss.

Token expiration and reactivating tokens

Tokens persist permanently in the cache database. However, their corresponding files may be deleted, either through a deliberate cleanup or through the deletion of the server where they reside (which is the case for the RCP server). The active flag in a token's token_status value, returned by the endpoints that generate tokens (/video/retrieve_url, /video/detect_slides, and /video/extract_audio), indicates whether the file for this token is still available. For an inactive token, new computations are impossible and only cached results may be returned.

Of course, sometimes you may want to perform computations on a cached but inactive token. Here is what you can do to 'reactivate' the token:

  • Video token: Call /video/retrieve_url with the token's original URL, with force=True.
  • Slide tokens: Call /video/detect_slides with the video token and with recalculate_cached=True.
  • Audio token: Call /video/extract_audio with the video token and with recalculate_cached=True.
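The reactivation calls above might look like the following. The endpoint paths and the force/recalculate_cached flags come from the text; the `call(method, path, payload)` client helper and the payload field names are illustrative assumptions.

```python
def reactivate_video(call, original_url):
    """Re-fetch the video file for an inactive video token, bypassing the cache lookup."""
    return call("POST", "/video/retrieve_url",
                {"url": original_url, "force": True})

def reactivate_slides(call, video_token):
    """Re-extract the slide files behind already-cached slide tokens."""
    return call("POST", "/video/detect_slides",
                {"token": video_token, "recalculate_cached": True})

def reactivate_audio(call, video_token):
    """Re-extract the audio file behind an already-cached audio token."""
    return call("POST", "/video/extract_audio",
                {"token": video_token, "recalculate_cached": True})
```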
