baoyu-youtube-transcript
jimliu/baoyu-skills
Download YouTube transcripts, subtitles, and cover images with multi-language support and speaker identification.
What is baoyu-youtube-transcript?
Extracts transcripts, captions, and metadata from YouTube videos by URL or video ID. Supports multiple languages, translation, chapters, and speaker identification with intelligent caching for fast re-formatting. Use when a user asks for YouTube transcripts, subtitles, captions, or cover images.
- Downloads transcripts from manually created or auto-generated YouTube captions
- Fetches video metadata and cover images, cached for fast re-formatting
- Supports multiple output formats: markdown with timestamps, SRT subtitle files, or plain text
- Segments transcripts by chapters from video description and identifies speaker boundaries
- Translates transcripts to specified languages and lists available caption options
- Falls back to yt-dlp when YouTube blocks direct API access
How to install baoyu-youtube-transcript
npx skills add https://github.com/jimliu/baoyu-skills --skill baoyu-youtube-transcript
- bun or npx installed (for running TypeScript scripts)
- yt-dlp available as fallback when YouTube blocks direct API access (optional but recommended)
How to use baoyu-youtube-transcript
1. Run with --list first to see available transcripts: ${BUN_X} {baseDir}/scripts/main.ts '<youtube-url>' --list
2. Execute the main command with desired options: ${BUN_X} {baseDir}/scripts/main.ts '<youtube-url>' --chapters --speakers
3. For multiple languages, specify priority order: ${BUN_X} {baseDir}/scripts/main.ts '<youtube-url>' --languages zh,en,ja
4. Use --format srt to generate subtitle files instead of markdown
5. For speaker identification, run with the --speakers flag and post-process the output with AI if needed
6. Cached data is automatically reused for subsequent runs on the same video; use --refresh to force re-fetch
Use cases
- Extract a transcript from a YouTube video URL with timestamps and chapter markers
- Download subtitles in SRT format for use in video editing or players
- Get a video's cover image and metadata without downloading the full video
- Translate a YouTube transcript to a different language for accessibility
- Identify speakers in a multi-speaker video transcript for better readability
- Content creators and researchers analyzing video content
- Accessibility specialists creating captions and transcripts
- Video editors and producers working with subtitle files
- Multilingual teams needing translated transcripts
- Developers building tools that consume YouTube transcript data
baoyu-youtube-transcript FAQ
Full URLs (youtube.com/watch?v=ID), short URLs (youtu.be/ID), embed URLs, Shorts URLs, and bare video IDs are all accepted.
No. The script uses YouTube's InnerTube API directly and falls back to yt-dlp if blocked, requiring no authentication.
On first fetch, transcripts and metadata are cached in youtube-transcript/ directory. Subsequent runs reuse cached data unless --refresh is used or a different language is requested.
The --speakers flag outputs raw transcript with speaker boundaries. Full speaker identification requires AI post-processing to extract and label speaker names from the video description and transcript.
Any language code supported by YouTube's caption system. Use --list to see available transcripts for a specific video, or --translate <code> to translate to a target language.
Full instructions (SKILL.md)
Source of truth, from jimliu/baoyu-skills.
name: baoyu-youtube-transcript
description: Downloads YouTube video transcripts/subtitles and cover images by URL or video ID. Supports multiple languages, translation, chapters, and speaker identification. Caches raw data for fast re-formatting. Use when user asks to "get YouTube transcript", "download subtitles", "get captions", "YouTube字幕", "YouTube封面", "视频封面", "video thumbnail", "video cover image", or provides a YouTube URL and wants the transcript/subtitle text or cover image extracted.
version: 1.1.0
metadata:
  openclaw:
    homepage: https://github.com/JimLiu/baoyu-skills#baoyu-youtube-transcript
    requires:
      anyBins:
        - bun
        - npx
YouTube Transcript
Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly and automatically falls back to yt-dlp when YouTube blocks the direct API path.
Fetches video metadata and cover image on first run, caches raw data for fast re-formatting.
Script Directory
Scripts in scripts/ subdirectory. {baseDir} = this SKILL.md's directory path. Resolve ${BUN_X} runtime: if bun installed → bun; if npx available → npx -y bun; else suggest installing bun. Replace {baseDir} and ${BUN_X} with actual values.
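The runtime resolution rule above can be sketched as a small function. This is only an illustration: `resolveBunX` and the injected `hasBinary` check are hypothetical names, not part of the skill's scripts.

```typescript
// Hypothetical helper: picks the command prefix that stands in for ${BUN_X}.
// `hasBinary` is injected so the rule can be exercised without touching PATH.
type BinaryCheck = (name: string) => boolean;

function resolveBunX(hasBinary: BinaryCheck): string[] | null {
  if (hasBinary("bun")) return ["bun"];              // bun installed: use it directly
  if (hasBinary("npx")) return ["npx", "-y", "bun"]; // otherwise let npx run bun
  return null;                                       // neither found: suggest installing bun
}

// Example: a machine where only npx is available.
const onlyNpx: BinaryCheck = (name) => name === "npx";
console.log(resolveBunX(onlyNpx)?.join(" ") ?? "install bun"); // prints "npx -y bun"
```

Returning an argument array rather than a single string keeps the result safe to pass to a process spawner without shell quoting.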
| Script | Purpose |
|---|---|
| scripts/main.ts | Transcript download CLI |
Usage
# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>
# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja
# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps
# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters
# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers
# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt
# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans
# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list
# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh
Options
| Option | Description | Default |
|---|---|---|
| <url-or-id> | YouTube URL or video ID (multiple allowed) | Required |
| --languages <codes> | Language codes, comma-separated, in priority order | en |
| --format <fmt> | Output format: text, srt | text |
| --translate <code> | Translate to specified language code | |
| --list | List available transcripts instead of fetching | |
| --timestamps | Include [HH:MM:SS → HH:MM:SS] timestamps per paragraph | on |
| --no-timestamps | Disable timestamps | |
| --chapters | Chapter segmentation from video description | |
| --speakers | Raw transcript with metadata for speaker identification | |
| --exclude-generated | Skip auto-generated transcripts | |
| --exclude-manually-created | Skip manually created transcripts | |
| --refresh | Force re-fetch, ignore cached data | |
| -o, --output <path> | Save to specific file path | auto-generated |
| --output-dir <dir> | Base output directory | youtube-transcript |
Optional Environment Variables
| Variable | Description |
|---|---|
| YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER | Passed to yt-dlp --cookies-from-browser during fallback, e.g. chrome, safari, firefox, or chrome:Profile 1 |
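A minimal sketch of how such a variable could be turned into yt-dlp arguments. The helper name `cookieArgs` is hypothetical, and the rest of the script's yt-dlp argument list is not shown in the source, so only the cookie flag is modeled here.

```typescript
// Hypothetical helper: maps the environment variable to yt-dlp's
// --cookies-from-browser flag. Returns no arguments when the variable is unset.
function cookieArgs(env: Record<string, string | undefined>): string[] {
  const browser = env["YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER"]?.trim();
  // The value stays a single argv entry, so "chrome:Profile 1" keeps its space.
  return browser ? ["--cookies-from-browser", browser] : [];
}

console.log(cookieArgs({ YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER: "chrome:Profile 1" }));
console.log(cookieArgs({}));
```

Building an argument array, instead of concatenating a command string, avoids shell-quoting problems with profile names that contain spaces.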
Input Formats
Accepts any of these as video input:
- Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
- Short URL: https://youtu.be/dQw4w9WgXcQ
- Embed URL: https://www.youtube.com/embed/dQw4w9WgXcQ
- Shorts URL: https://www.youtube.com/shorts/dQw4w9WgXcQ
- Video ID: dQw4w9WgXcQ
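The accepted input forms can be normalized to a video ID along these lines. This is a sketch, not the script's actual parser: `extractVideoId` is a hypothetical name, and the 11-character ID pattern is an assumption based on how YouTube IDs commonly look.

```typescript
// Hypothetical normalizer for the input forms listed above.
// Assumption: video IDs are 11 characters drawn from [A-Za-z0-9_-].
function extractVideoId(input: string): string | null {
  const idPattern = /^[A-Za-z0-9_-]{11}$/;
  const trimmed = input.trim();
  if (idPattern.test(trimmed)) return trimmed; // bare video ID

  let url: URL;
  try {
    url = new URL(trimmed);
  } catch {
    return null; // neither an ID nor a parseable URL
  }

  const host = url.hostname.replace(/^www\./, "");
  let candidate: string | null = null;
  if (host === "youtu.be") {
    candidate = url.pathname.split("/")[1] ?? null; // youtu.be/ID
  } else if (host === "youtube.com" || host.endsWith(".youtube.com")) {
    const parts = url.pathname.split("/").filter(Boolean);
    if (parts[0] === "watch") candidate = url.searchParams.get("v"); // watch?v=ID
    else if (parts[0] === "embed" || parts[0] === "shorts") candidate = parts[1] ?? null;
  }
  return candidate !== null && idPattern.test(candidate) ? candidate : null;
}

console.log(extractVideoId("https://youtu.be/dQw4w9WgXcQ")); // prints "dQw4w9WgXcQ"
```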
Output Formats
| Format | Extension | Description |
|---|---|---|
| text | .md | Markdown with frontmatter (incl. description), title heading, summary, optional TOC/cover/timestamps/chapters/speakers |
| srt | .srt | SubRip subtitle format for video players |
Output Directory
youtube-transcript/
├── .index.json # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
    ├── meta.json                  # Video metadata (title, channel, description, duration, chapters, etc.)
    ├── transcript-raw.json        # Raw transcript snippets from YouTube API (cached)
    ├── transcript-sentences.json  # Sentence-segmented transcript (split by punctuation, merged across snippets)
    ├── imgs/
    │   └── cover.jpg              # Video thumbnail
    ├── transcript.md              # Markdown transcript (generated from sentences)
    └── transcript.srt             # SRT subtitle (generated from raw snippets, if --format srt)
- {channel-slug}: Channel name in kebab-case
- {title-full-slug}: Full video title in kebab-case
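To make the directory naming concrete, here is a rough kebab-case sketch. The names `toKebabSlug` and `outputDir` are hypothetical, and the script's real slug rules are not documented in the source. In particular, this ASCII-only version would drop CJK characters, which the real script presumably handles differently.

```typescript
// Hypothetical ASCII-only slugger; the script's actual rules may differ,
// especially for non-Latin titles, which this simplified version strips out.
function toKebabSlug(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // each run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, "");    // trim leading and trailing hyphens
}

// Joins the base output directory with the channel and title slugs.
function outputDir(baseDir: string, channel: string, title: string): string {
  return [baseDir, toKebabSlug(channel), toKebabSlug(title)].join("/");
}

console.log(outputDir("youtube-transcript", "Example Channel", "How It Works: Part 2!"));
// prints "youtube-transcript/example-channel/how-it-works-part-2"
```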
The --list mode outputs to stdout only (no file saved).
Caching
On first fetch, the script saves:
- meta.json: video metadata, chapters, cover image path, language info
- transcript-raw.json: raw transcript snippets from YouTube API ({ text, start, duration }[])
- transcript-sentences.json: sentence-segmented transcript ({ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]), split by sentence-ending punctuation (.?!…。?! etc.), timestamps proportionally allocated by character length, CJK-aware text merging
- imgs/cover.jpg: video thumbnail
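Proportional timestamp allocation can be illustrated on a single snippet. This is a simplified sketch: the real script merges text across snippets and is CJK-aware, neither of which is modeled here, and `splitSnippet` and `formatClock` are hypothetical names. The full-width `？` and `！` in the punctuation set are an assumption.

```typescript
interface Snippet { text: string; start: number; duration: number } // seconds, as in transcript-raw.json
interface Sentence { text: string; start: string; end: string }     // "HH:mm:ss" strings

// Formats whole seconds as HH:mm:ss.
function formatClock(seconds: number): string {
  const total = Math.max(0, Math.floor(seconds));
  const pad = (n: number) => String(n).padStart(2, "0");
  return `${pad(Math.floor(total / 3600))}:${pad(Math.floor((total % 3600) / 60))}:${pad(total % 60)}`;
}

// Splits one snippet at sentence-ending punctuation and gives each sentence a
// share of the snippet's duration proportional to its character count.
function splitSnippet(snippet: Snippet): Sentence[] {
  const parts = snippet.text.match(/[^.?!…。？！]+[.?!…。？！]*/g) ?? [];
  const sentences = parts.map((p) => p.trim()).filter((p) => p.length > 0);
  const totalChars = sentences.reduce((sum, s) => sum + s.length, 0);
  let cursor = snippet.start;
  return sentences.map((text) => {
    const share = (text.length / totalChars) * snippet.duration;
    const sentence = { text, start: formatClock(cursor), end: formatClock(cursor + share) };
    cursor += share;
    return sentence;
  });
}

console.log(splitSnippet({ text: "Abc. Def.", start: 60, duration: 10 }));
```

In the example, both sentences have four characters, so each receives half of the 10-second span.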
Subsequent runs for the same video use cached data (no network calls). Use --refresh to force re-fetch. If a different language is requested, the cache is automatically refreshed.
When YouTube returns anti-bot / blocked responses on the direct InnerTube path, the script retries with alternate client identities and then falls back to yt-dlp if available. If fallback is needed but yt-dlp is unavailable, the agent should decide how to make yt-dlp available and continue rather than pushing the installation decision to the user.
SRT output (--format srt) is generated from transcript-raw.json. Text/markdown output uses transcript-sentences.json for natural sentence boundaries.
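As an illustration of how raw snippets map onto SubRip cues, the sketch below converts `{ text, start, duration }` entries into SRT text. The function names are hypothetical, and the real script may adjust cue end times (for example, to avoid overlaps) in ways not described in the source.

```typescript
interface Snippet { text: string; start: number; duration: number } // seconds

// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.max(0, Math.round(seconds * 1000));
  const pad = (n: number, width = 2) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Emits one numbered cue per snippet, separated by blank lines.
function toSrt(snippets: Snippet[]): string {
  return snippets
    .map((snip, i) =>
      `${i + 1}\n${srtTime(snip.start)} --> ${srtTime(snip.start + snip.duration)}\n${snip.text}\n`)
    .join("\n");
}

console.log(toSrt([
  { text: "Hi", start: 1.5, duration: 2 },
  { text: "Welcome back", start: 3.5, duration: 1.5 },
]));
```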
Workflow
When user provides a YouTube URL and wants the transcript:
- Run with --list first if the user hasn't specified a language, to show available options
- Always single-quote the URL when running the script: zsh treats ? as a glob wildcard, so an unquoted YouTube URL causes "no matches found". Use 'https://www.youtube.com/watch?v=ID'
- Default: run with --chapters --speakers for the richest output (chapters + speaker identification)
- The script auto-saves cached data + output file and prints the file path
- For --speakers mode: after the script saves the raw file, follow the speaker identification workflow below to post-process with speaker labels
When user only wants a cover image or metadata, running the script with any option will also cache meta.json and imgs/cover.jpg.
When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.
Chapter & Speaker Workflow
Chapters (--chapters)
The script parses chapter timestamps from the video description (e.g., 0:00 Introduction), segments the transcript by chapter boundaries, groups snippets into readable paragraphs, and saves as .md with a Table of Contents. No further processing needed.
If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.
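A minimal sketch of chapter parsing, assuming each chapter sits on its own description line in the form `M:SS Title` or `H:MM:SS Title`. The names `parseChapters` and `chapterIndexAt` are hypothetical, and the script's actual parser may accept more variants.

```typescript
interface Chapter { start: number; title: string } // start in seconds

// Collects lines that begin with a timestamp followed by a title.
function parseChapters(description: string): Chapter[] {
  const chapters: Chapter[] = [];
  for (const line of description.split(/\r?\n/)) {
    const m = /^\s*(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s+(.+?)\s*$/.exec(line);
    if (!m) continue; // not a chapter line
    const hours = m[1] ? Number(m[1]) : 0;
    chapters.push({
      start: hours * 3600 + Number(m[2]) * 60 + Number(m[3]),
      title: m[4] ?? "",
    });
  }
  return chapters;
}

// Index of the chapter containing time `t` (the last one starting at or before it),
// or -1 if `t` precedes every chapter. Assumes chapters are in ascending order.
function chapterIndexAt(chapters: Chapter[], t: number): number {
  let idx = -1;
  chapters.forEach((c, i) => { if (c.start <= t) idx = i; });
  return idx;
}

const chapters = parseChapters("Great talk!\n0:00 Introduction\n12:34 Main topic\n1:02:03 Q&A");
console.log(chapters, chapterIndexAt(chapters, 800));
```

A transcript sentence starting at 800 seconds falls in the second chapter here, since 12:34 is 754 seconds and the next chapter begins at 3723 seconds.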
Speaker Identification (--speakers)
Speaker identification requires AI processing. The script outputs a raw .md file containing:
- YAML frontmatter with video metadata (title, channel, date, cover, description, language)
- Video description (for speaker name extraction)
- Chapter list from description (if available)
- Raw transcript in SRT format (pre-computed start/end timestamps, token-efficient)
After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:
- Read the saved .md file
- Read the prompt template at {baseDir}/prompts/speaker-transcript.md
- Process the raw transcript following the prompt:
  - Identify speakers using video metadata (title → guest, channel → host, description → names)
  - Detect speaker turns from conversation flow, question-answer patterns, and contextual cues
  - Segment into chapters (use description chapters if available, else create from topic shifts)
  - Format with **Speaker Name:** labels, paragraph grouping (2-4 sentences), and [HH:MM:SS → HH:MM:SS] timestamps
- Overwrite the .md file with the processed transcript (keep the YAML frontmatter)
When --speakers is used, --chapters is implied — the processed output always includes chapter segmentation.
Error Cases
| Error | Meaning |
|---|---|
| Transcripts disabled | Video has no captions at all |
| No transcript found | Requested language not available |
| Video unavailable | Video deleted, private, or region-locked |
| IP blocked | Too many requests, try again later |
| Age restricted | Video requires login for age verification |
| bot detected | The script retries alternate clients and then yt-dlp; if fallback tooling is missing, the agent should resolve that itself, otherwise if it still fails try YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER=safari (or your browser) |