YouTube converter: short URLs (youtu.be/…) silently skipped, and get_transcript() removed in youtube-transcript-api 1.x #1704

@hailanlan0577

Summary

markitdown 0.0.2 cannot extract transcripts from most YouTube URLs in the wild because of three compounding bugs in the YouTube handler in _markitdown.py. I hit all three while trying to convert https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7: the output contained only metadata (title, keywords, description), with no transcript section and no error shown to the user.

Related to, but distinct from, #1232 (user-side code using the new API) and #1291 (just a note about upgrading the library).

Environment

  • markitdown 0.0.2 (installed via pipx install 'markitdown[all]')
  • youtube-transcript-api 1.2.4 (pulled in transitively by markitdown[all])
  • Python 3.14.3 / macOS 26.3.1 arm64
  • Test video: https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7 (has auto-generated English captions, confirmed working via direct API call)

Reproduction

pipx install 'markitdown[all]'
markitdown "https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7" -o out.md

Expected: out.md contains a ### Transcript section with the video's captions.

Actual: out.md contains only metadata — no transcript, no error, no hint that anything failed:

# YouTube

## What is Claude Managed Agents?

### Video Metadata
- **Keywords:** video, sharing, camera phone, video phone, free, upload
- **Runtime:** PT3M53S

### Description
Claude Managed Agents is a suite of APIs...

Direct API call confirms the transcript is available:

from youtube_transcript_api import YouTubeTranscriptApi
api = YouTubeTranscriptApi()
ft = api.fetch("NLWiIj47IdI", languages=["en"])
print(len(list(ft)))  # → 97 segments, 3:52 total

Root causes (three distinct bugs in packages/markitdown/src/markitdown/_markitdown.py)

Bug 1 — Short URLs youtu.be/<id> are never matched

Line ~519:

parsed_url = urlparse(url)
params = parse_qs(parsed_url.query)
if "v" in params:
    video_id = str(params["v"][0])

For https://youtu.be/NLWiIj47IdI?si=xxx, the query string is si=xxx, so "v" is not in params and the entire transcript branch is silently skipped. youtu.be/<id> is the canonical share URL that YouTube itself hands out, so for that host the video ID has to be read from parsed_url.path, not parsed_url.query.

Suggested fix:

video_id = None
if parsed_url.hostname in ("youtu.be",):
    video_id = parsed_url.path.lstrip("/").split("/", 1)[0] or None
elif "v" in params:
    video_id = str(params["v"][0])
# also worth handling youtube.com/shorts/<id>, /embed/<id>, /live/<id>
if video_id:
    ...
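
For the additional URL shapes mentioned in the comment above, a fuller helper could look roughly like the following. This is only a sketch, not markitdown code: the function name extract_video_id, the _PATH_PREFIXES tuple, and the list of accepted hostnames are all assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Path prefixes whose next segment is the video ID.
# Assumption: this covers the /shorts/, /embed/ and /live/ shapes named above.
_PATH_PREFIXES = ("shorts", "embed", "live")

# Assumption: hostnames treated as "YouTube" once a leading "www." is stripped.
_YOUTUBE_HOSTS = ("youtube.com", "m.youtube.com", "music.youtube.com")


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID contained in `url`, or None if none is found."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    segments = [s for s in parsed.path.split("/") if s]

    # Short share links: the ID is the first path segment.
    if host == "youtu.be":
        return segments[0] if segments else None

    if host in _YOUTUBE_HOSTS:
        # Classic watch URLs: the ID is the "v" query parameter.
        params = parse_qs(parsed.query)
        if "v" in params:
            return params["v"][0]
        # /shorts/<id>, /embed/<id>, /live/<id>
        if len(segments) >= 2 and segments[0] in _PATH_PREFIXES:
            return segments[1]

    return None
```

Unlike the snippet above, this sketch also checks the hostname before trusting a "v" parameter. That is a design choice rather than a requirement, since the converter may already guarantee it is only handed YouTube URLs.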

Bug 2 — Uses YouTubeTranscriptApi.get_transcript(...), which was removed in youtube-transcript-api 1.0

Line ~527:

transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=youtube_transcript_languages)
transcript_text = " ".join([part["text"] for part in transcript])

In youtube-transcript-api 1.x (markitdown[all] currently resolves to 1.2.4, see Environment):

  • get_transcript is gone (as far as I can tell, the static methods were deprecated in 1.0 and dropped in 1.2), so there is no static/classmethod entry point.
  • The new API requires instantiation: api = YouTubeTranscriptApi(); ft = api.fetch(video_id, languages=[...]).
  • The result is a FetchedTranscript object, iterable of FetchedTranscriptSnippet objects with attributes, not dicts: snip.text, snip.start, snip.duration — so part["text"] also breaks.

Suggested fix:

api = YouTubeTranscriptApi()
ft = api.fetch(video_id, languages=youtube_transcript_languages)
transcript_text = " ".join(snip.text for snip in ft)

Alternatively, pin youtube-transcript-api<1.0 in markitdown[all] — but that punts the problem and locks out future fixes from jdepoix/youtube-transcript-api.
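
If supporting both library generations turns out to be preferable to either pinning or a hard switch, a small shim could paper over the two shapes. This is a minimal sketch, assuming the only differences that matter here are the entry point (static get_transcript versus instance fetch) and the snippet shape (dict versus object with a text attribute). The names fetch_transcript_text, join_transcript, and snippet_text are hypothetical, and api_cls stands in for YouTubeTranscriptApi.

```python
from typing import Any, Iterable


def snippet_text(part: Any) -> str:
    """Return caption text from a dict-style or attribute-style snippet."""
    if isinstance(part, dict):
        return str(part["text"])  # older releases: list of dicts
    return str(part.text)         # 1.x: objects with a .text attribute


def join_transcript(parts: Iterable[Any]) -> str:
    """Join all snippet texts into one space-separated string."""
    return " ".join(snippet_text(p) for p in parts)


def fetch_transcript_text(api_cls: Any, video_id: str, languages: Iterable[str]) -> str:
    """Fetch a transcript through whichever entry point `api_cls` exposes.

    `api_cls` is passed in (rather than imported here) so the sketch
    runs without the library installed; in real code it would be
    youtube_transcript_api.YouTubeTranscriptApi.
    """
    langs = list(languages)
    if hasattr(api_cls, "fetch"):
        # Newer style: instantiate, then call fetch().
        parts = api_cls().fetch(video_id, languages=langs)
    else:
        # Older style: static get_transcript().
        parts = api_cls.get_transcript(video_id, languages=langs)
    return join_transcript(parts)
```

The trade-off is a little extra branching in exchange for not having to constrain the dependency. Whether that is worth it is a maintainer call.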

Bug 3 — except Exception: pass silently eats every failure

Line ~532:

try:
    transcript = YouTubeTranscriptApi.get_transcript(...)
    transcript_text = " ".join(...)
except Exception:
    pass

With bugs 1 and 2, users see no transcript and no reason why. The failure should at minimum be logged (to stderr or the logging module), so downstream users can figure out that captions were attempted and failed, versus never attempted.

Suggested fix:

import logging
log = logging.getLogger(__name__)
try:
    ...
except Exception as exc:
    log.warning("YouTube transcript fetch failed for %s: %s", video_id, exc)

Impact

Every short-link YouTube URL (which is what the YouTube share button produces, i.e. the default URL format users paste) silently loses its transcript. This affects most real-world use of markitdown for YouTube content.

Minimal patch (all three bugs together)

# _markitdown.py, module level
import logging
log = logging.getLogger(__name__)

# inside _convert for YouTube
if IS_YOUTUBE_TRANSCRIPT_CAPABLE:
    transcript_text = ""
    parsed_url = urlparse(url)
    params = parse_qs(parsed_url.query)

    video_id = None
    if parsed_url.hostname in ("youtu.be",):
        video_id = parsed_url.path.lstrip("/").split("/", 1)[0] or None
    elif "v" in params:
        video_id = str(params["v"][0])
    # TODO: also handle /shorts/<id>, /embed/<id>, /live/<id>

    if video_id:
        try:
            languages = kwargs.get("youtube_transcript_languages", ("en",))
            api = YouTubeTranscriptApi()
            ft = api.fetch(video_id, languages=languages)
            transcript_text = " ".join(snip.text for snip in ft)
        except Exception as exc:
            log.warning("YouTube transcript fetch failed for %s: %s", video_id, exc)

    if transcript_text:
        webpage_text += f"\n### Transcript\n{transcript_text}\n"

Happy to open a PR if maintainers confirm this is the right direction.
