Summary
markitdown 0.0.2 cannot extract the transcript of the majority of YouTube URLs in the wild, because of three compounding bugs in _markitdown.py's YouTube handler. I hit all three trying to convert https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7, which returned only metadata (title, keywords, description) and no transcript section at all, with no error shown to the user.
Related but distinct from #1232 (user-side code using new API) and #1291 (just a note about upgrading the library).
Environment
- markitdown 0.0.2 (installed via pipx install 'markitdown[all]')
- youtube-transcript-api 1.2.4 (pulled in transitively by markitdown[all])
- Python 3.14.3 / macOS 26.3.1 arm64
- Test video: https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7 (has auto-generated English captions, confirmed working via direct API call)
Reproduction
pipx install 'markitdown[all]'
markitdown "https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7" -o out.md
Expected: out.md contains a ### Transcript section with the video's captions.
Actual: out.md contains only metadata — no transcript, no error, no hint that anything failed:
# YouTube
## What is Claude Managed Agents?
### Video Metadata
- **Keywords:** video, sharing, camera phone, video phone, free, upload
- **Runtime:** PT3M53S
### Description
Claude Managed Agents is a suite of APIs...
Direct API call confirms the transcript is available:
from youtube_transcript_api import YouTubeTranscriptApi
api = YouTubeTranscriptApi()
ft = api.fetch("NLWiIj47IdI", languages=["en"])
print(len(list(ft))) # → 97 segments, 3:52 total
Root causes (three distinct bugs in packages/markitdown/src/markitdown/_markitdown.py)
Bug 1 — Short URLs youtu.be/<id> are never matched
Line ~519:
parsed_url = urlparse(url)
params = parse_qs(parsed_url.query)
if "v" in params:
    video_id = str(params["v"][0])
For https://youtu.be/NLWiIj47IdI?si=xxx, the query string is si=xxx, so "v" not in params and the entire transcript branch is silently skipped. youtu.be/<id> is the canonical share URL that YouTube itself hands out — it needs to be parsed from parsed_url.path, not parsed_url.query.
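The mismatch can be reproduced with the standard library alone. This is a minimal sketch using only `urllib.parse`; the `watch?v=` URL is an assumed long-form equivalent of the same video, included for comparison.

```python
from urllib.parse import parse_qs, urlparse

# Short share link: the video ID lives in the path, not the query string.
short = urlparse("https://youtu.be/NLWiIj47IdI?si=xxx")
# Assumed long-form equivalent: the video ID lives in the "v" query parameter.
long_form = urlparse("https://www.youtube.com/watch?v=NLWiIj47IdI")

short_params = parse_qs(short.query)
long_params = parse_qs(long_form.query)

print("v" in short_params)   # False
print(short.path)            # /NLWiIj47IdI
print(long_params["v"][0])   # NLWiIj47IdI
```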
Suggested fix:
video_id = None
if parsed_url.hostname in ("youtu.be",):
    video_id = parsed_url.path.lstrip("/").split("/", 1)[0] or None
elif "v" in params:
    video_id = str(params["v"][0])
# also worth handling youtube.com/shorts/<id>, /embed/<id>, /live/<id>
if video_id:
    ...
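For the additional URL shapes mentioned in the comment, one possible approach is a small standalone helper. The sketch below is illustrative only: `extract_video_id` and `_ID_PATH_PREFIXES` are hypothetical names that do not exist in markitdown, and the list of path prefixes is an assumption based on the shapes named above. Other hosts, such as youtube-nocookie.com, are not covered.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Path prefixes whose next segment is the video ID (assumed list, not exhaustive).
_ID_PATH_PREFIXES = ("shorts", "embed", "live")


def extract_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: return the video ID for common YouTube URL shapes, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]

    # Short share links: https://youtu.be/<id>
    if host == "youtu.be":
        return segments[0] if segments else None

    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Classic links: https://www.youtube.com/watch?v=<id>
        params = parse_qs(parsed.query)
        if "v" in params:
            return params["v"][0]
        # Path-based links: /shorts/<id>, /embed/<id>, /live/<id>
        if len(segments) >= 2 and segments[0] in _ID_PATH_PREFIXES:
            return segments[1]

    return None


print(extract_video_id("https://youtu.be/NLWiIj47IdI?si=xxx"))  # NLWiIj47IdI
```

Keeping the parsing in one function would also make it easy to unit-test without any network access.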
Bug 2 — Uses YouTubeTranscriptApi.get_transcript(...), which was removed in youtube-transcript-api 1.0
Line ~527:
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=youtube_transcript_languages)
transcript_text = " ".join([part["text"] for part in transcript])
In youtube-transcript-api 1.x (markitdown[all] currently pins 1.2.4):
- get_transcript is gone; there is no static/classmethod entry point.
- The new API requires instantiation: api = YouTubeTranscriptApi(); ft = api.fetch(video_id, languages=[...]).
- The result is a FetchedTranscript object, an iterable of FetchedTranscriptSnippet objects with attributes rather than dicts (snip.text, snip.start, snip.duration), so part["text"] also breaks.
Suggested fix:
api = YouTubeTranscriptApi()
ft = api.fetch(video_id, languages=youtube_transcript_languages)
transcript_text = " ".join(snip.text for snip in ft)
Alternatively, pin youtube-transcript-api<1.0 in markitdown[all] — but that punts the problem and locks out future fixes from jdepoix/youtube-transcript-api.
Bug 3 — except Exception: pass silently eats every failure
Line ~532:
try:
    transcript = YouTubeTranscriptApi.get_transcript(...)
    transcript_text = " ".join(...)
except Exception:
    pass
With bugs 1 and 2, users see no transcript and no reason why. The failure should at minimum be logged (to stderr or the logging module), so downstream users can figure out that captions were attempted and failed, versus never attempted.
Suggested fix:
import logging
log = logging.getLogger(__name__)
try:
    ...
except Exception as exc:
    log.warning("YouTube transcript fetch failed for %s: %s", video_id, exc)
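To show what the user would gain, here is a self-contained sketch of the same pattern with a deliberately failing fetch. The names `transcript_or_empty` and `always_fails` and the logger name are hypothetical and chosen only for illustration; the simulated failure stands in for the kind of error Bug 2 raises.

```python
import logging
from typing import Callable

log = logging.getLogger("markitdown.youtube")  # illustrative logger name


def transcript_or_empty(fetch: Callable[[str], str], video_id: str) -> str:
    """Hypothetical wrapper: return the transcript text, or "" after logging the failure."""
    try:
        return fetch(video_id)
    except Exception as exc:
        # The failure is recorded instead of being discarded with a bare `pass`.
        log.warning("YouTube transcript fetch failed for %s: %s", video_id, exc)
        return ""


def always_fails(video_id: str) -> str:
    # Simulates a missing get_transcript attribute.
    raise AttributeError("get_transcript")


print(repr(transcript_or_empty(always_fails, "NLWiIj47IdI")))  # ''
```

With no logging configuration at all, Python's last-resort handler still writes WARNING-level messages to stderr, so the user would see the failure line even when the CLI sets up no handlers.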
Impact
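Every short-link YouTube URL (the format the YouTube share button produces, and so the default URL users paste) silently loses its transcript. With youtube-transcript-api 1.x installed, Bug 2 also breaks transcript extraction for watch?v=<id> URLs, because the get_transcript call fails for any video ID. This affects most real-world use of markitdown for YouTube content.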
Minimal patch (all three bugs together)
# _markitdown.py, inside _convert for YouTube
import logging
log = logging.getLogger(__name__)

if IS_YOUTUBE_TRANSCRIPT_CAPABLE:
    transcript_text = ""
    parsed_url = urlparse(url)
    params = parse_qs(parsed_url.query)
    video_id = None
    if parsed_url.hostname in ("youtu.be",):
        video_id = parsed_url.path.lstrip("/").split("/", 1)[0] or None
    elif "v" in params:
        video_id = str(params["v"][0])
    # TODO: also handle /shorts/<id>, /embed/<id>, /live/<id>
    if video_id:
        try:
            languages = kwargs.get("youtube_transcript_languages", ("en",))
            api = YouTubeTranscriptApi()
            ft = api.fetch(video_id, languages=languages)
            transcript_text = " ".join(snip.text for snip in ft)
        except Exception as exc:
            log.warning("YouTube transcript fetch failed for %s: %s", video_id, exc)
    if transcript_text:
        webpage_text += f"\n### Transcript\n{transcript_text}\n"
Happy to open a PR if maintainers confirm this is the right direction.