YouTube converter: short URLs (youtu.be/…) silently skipped, and get_transcript() removed in youtube-transcript-api 1.x

## Summary

`markitdown` 0.0.2 fails to extract the transcript for most YouTube URLs seen in practice, because of three compounding bugs in the YouTube handler in `_markitdown.py`. I hit all three while converting `https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7`: the output contained only metadata (title, keywords, description), with no transcript section and no error shown to the user.

Related to, but distinct from, #1232 (user-side code using the new API) and #1291 (just a note about upgrading the library).

## Environment

- `markitdown 0.0.2` (installed via `pipx install 'markitdown[all]'`)
- `youtube-transcript-api 1.2.4` (pulled in transitively by `markitdown[all]`)
- Python 3.14.3 / macOS 26.3.1 arm64
- Test video: `https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7` (has auto-generated English captions, confirmed working via direct API call)

## Reproduction

```bash
pipx install 'markitdown[all]'
markitdown "https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7" -o out.md
```

**Expected:** `out.md` contains a `### Transcript` section with the video's captions.

**Actual:** `out.md` contains only metadata — no transcript, no error, no hint that anything failed:

```markdown
# YouTube

## What is Claude Managed Agents?

### Video Metadata
- **Keywords:** video, sharing, camera phone, video phone, free, upload
- **Runtime:** PT3M53S

### Description
Claude Managed Agents is a suite of APIs...
```

A direct API call confirms the transcript is available:

```python
from youtube_transcript_api import YouTubeTranscriptApi
api = YouTubeTranscriptApi()
ft = api.fetch("NLWiIj47IdI", languages=["en"])
print(len(list(ft)))  # → 97 segments, 3:52 total
```

## Root causes (three distinct bugs in `packages/markitdown/src/markitdown/_markitdown.py`)

### Bug 1 — Short URLs `youtu.be/<id>` are never matched

Line ~519:

```python
parsed_url = urlparse(url)
params = parse_qs(parsed_url.query)
if "v" in params:
    video_id = str(params["v"][0])
```

For `https://youtu.be/NLWiIj47IdI?si=xxx`, the query string is `si=xxx`, so `"v" not in params` and the entire transcript branch is silently skipped. `youtu.be/<id>` is the canonical share URL that YouTube itself hands out — it needs to be parsed from `parsed_url.path`, not `parsed_url.query`.
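The mismatch can be confirmed with the standard library alone. A small check using the test URL from the Environment section, with no `markitdown` code involved:

```python
from urllib.parse import parse_qs, urlparse

url = "https://youtu.be/NLWiIj47IdI?si=m5-5FQ63O7kbvKz7"
parsed_url = urlparse(url)
params = parse_qs(parsed_url.query)

# The query string only carries the share-tracking parameter.
print(params)            # {'si': ['m5-5FQ63O7kbvKz7']}
print("v" in params)     # False
# The video ID lives in the path instead.
print(parsed_url.path)   # /NLWiIj47IdI
```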

Suggested fix:

```python
video_id = None
if parsed_url.hostname in ("youtu.be",):
    video_id = parsed_url.path.lstrip("/").split("/", 1)[0] or None
elif "v" in params:
    video_id = str(params["v"][0])
# also worth handling youtube.com/shorts/<id>, /embed/<id>, /live/<id>
if video_id:
    ...
```
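For the other URL shapes mentioned in the comment, a standalone helper might look roughly like the sketch below. The name `extract_video_id`, the `_ID_PATH_PREFIXES` tuple, and the host allowlist are my own assumptions for illustration, not existing `markitdown` code, and the list of URL shapes is probably not exhaustive.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Path prefixes that carry the video ID as the next path segment (assumed list).
_ID_PATH_PREFIXES = ("shorts", "embed", "live")


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]

    # Short share links: https://youtu.be/<id>?si=...
    if host == "youtu.be":
        return segments[0] if segments else None

    # Only accept youtube.com and its subdomains (www., m., music.).
    if host != "youtube.com" and not host.endswith(".youtube.com"):
        return None

    # Classic watch URLs: https://www.youtube.com/watch?v=<id>
    params = parse_qs(parsed.query)
    if params.get("v"):
        return params["v"][0]

    # /shorts/<id>, /embed/<id>, /live/<id>
    if len(segments) >= 2 and segments[0] in _ID_PATH_PREFIXES:
        return segments[1]

    return None
```

Keeping the parsing in one pure function would also make it easy to unit-test each URL shape without network access.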

### Bug 2 — Uses `YouTubeTranscriptApi.get_transcript(...)`, which was removed in `youtube-transcript-api` 1.0

Line ~527:

```python
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=youtube_transcript_languages)
transcript_text = " ".join([part["text"] for part in transcript])
```

In `youtube-transcript-api` 1.x (`markitdown[all]` currently resolves to 1.2.4):

- `get_transcript` is **gone** — there is no static/classmethod entry point.
- The new API requires **instantiation**: `api = YouTubeTranscriptApi(); ft = api.fetch(video_id, languages=[...])`.
- The result is a `FetchedTranscript` object, iterable of `FetchedTranscriptSnippet` **objects with attributes**, not dicts: `snip.text`, `snip.start`, `snip.duration` — so `part["text"]` also breaks.

Suggested fix:

```python
api = YouTubeTranscriptApi()
ft = api.fetch(video_id, languages=youtube_transcript_languages)
transcript_text = " ".join(snip.text for snip in ft)
```

Alternatively, pin `youtube-transcript-api<1.0` in `markitdown[all]` — but that punts the problem and locks out future fixes from jdepoix/youtube-transcript-api.
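A middle ground, if older installs of the library need to keep working for a while, could be a small shim that prefers the instance-based `fetch` and falls back to the class-level `get_transcript`. This is only a sketch: `fetch_transcript_text` and `_snippet_text` are hypothetical names, and the API class is passed in as a parameter so the function can be exercised without the real library.

```python
from typing import Any, Iterable, Sequence


def _snippet_text(part: Any) -> str:
    # Old API yields dicts; new API yields objects with a .text attribute.
    return part["text"] if isinstance(part, dict) else part.text


def fetch_transcript_text(
    api_cls: Any, video_id: str, languages: Sequence[str] = ("en",)
) -> str:
    """Join transcript snippets into one string for either API shape."""
    if hasattr(api_cls, "fetch"):
        # Instance-based API: YouTubeTranscriptApi().fetch(...)
        parts: Iterable[Any] = api_cls().fetch(video_id, languages=list(languages))
    else:
        # Older class-level entry point: YouTubeTranscriptApi.get_transcript(...)
        parts = api_cls.get_transcript(video_id, languages=list(languages))
    return " ".join(_snippet_text(p) for p in parts)
```

In the converter this would be called as `fetch_transcript_text(YouTubeTranscriptApi, video_id, youtube_transcript_languages)`. Checking for `fetch` first means the new code path wins whenever both entry points exist.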

### Bug 3 — `except Exception: pass` silently eats every failure

Line ~532:

```python
try:
    transcript = YouTubeTranscriptApi.get_transcript(...)
    transcript_text = " ".join(...)
except Exception:
    pass
```

With bugs 1 and 2, users see no transcript and no reason why. The failure should at minimum be logged (to stderr or the `logging` module), so downstream users can figure out that captions were attempted and failed, versus never attempted.

Suggested fix:

```python
import logging
log = logging.getLogger(__name__)
try:
    ...
except Exception as exc:
    log.warning("YouTube transcript fetch failed for %s: %s", video_id, exc)
```
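To illustrate what a user would then see, here is a self-contained sketch of the same pattern with the fetch step injected as a callable. The helper name `transcript_or_empty` and the logger name are hypothetical; including the exception type in the message is my own suggestion, since an `AttributeError` (Bug 2) and a "no captions" error call for different reactions.

```python
import logging
from typing import Callable

log = logging.getLogger("markitdown.youtube")  # hypothetical logger name


def transcript_or_empty(fetch: Callable[[str], str], video_id: str) -> str:
    """Call fetch(video_id); on failure, log a warning and return ''."""
    try:
        return fetch(video_id)
    except Exception as exc:
        # Keep the conversion alive, but leave a trace of what went wrong.
        log.warning(
            "YouTube transcript fetch failed for %s: %s: %s",
            video_id,
            type(exc).__name__,
            exc,
        )
        return ""
```

With no logging configuration at all, Python's last-resort handler still prints warnings to stderr, so CLI users would get the message without any extra setup.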

## Impact

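Every short-link YouTube URL (the format the YouTube share button produces, and therefore the *default* that users paste) silently loses its transcript. With `youtube-transcript-api` 1.x installed, Bug 2 also breaks transcript extraction for `watch?v=` URLs, and Bug 3 hides both failures. Together this affects most real-world use of `markitdown` for YouTube content.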

## Minimal patch (all three bugs together)

```python
# _markitdown.py, module level
import logging
log = logging.getLogger(__name__)

# inside _convert for YouTube

if IS_YOUTUBE_TRANSCRIPT_CAPABLE:
    transcript_text = ""
    parsed_url = urlparse(url)
    params = parse_qs(parsed_url.query)

    video_id = None
    if parsed_url.hostname in ("youtu.be",):
        video_id = parsed_url.path.lstrip("/").split("/", 1)[0] or None
    elif "v" in params:
        video_id = str(params["v"][0])
    # TODO: also handle /shorts/<id>, /embed/<id>, /live/<id>

    if video_id:
        try:
            languages = kwargs.get("youtube_transcript_languages", ("en",))
            api = YouTubeTranscriptApi()
            ft = api.fetch(video_id, languages=languages)
            transcript_text = " ".join(snip.text for snip in ft)
        except Exception as exc:
            log.warning("YouTube transcript fetch failed for %s: %s", video_id, exc)

    if transcript_text:
        webpage_text += f"\n### Transcript\n{transcript_text}\n"
```

Happy to open a PR if maintainers confirm this is the right direction.
