hyperframes-media
heygen-com/hyperframes
Audio and media engine for HyperFrames: TTS, background music, sound effects, transcription, captions, and background removal.
What is hyperframes-media?
Unified audio engine (`scripts/audio.mjs`) that generates voiceovers, background music, sound effects, and captions for HyperFrames compositions. Supports multi-provider TTS (HeyGen/ElevenLabs/Kokoro), music retrieval or local generation, sound effect library access, and caption authoring. Degrades gracefully from cloud services to local fallbacks based on credential availability.
- Generate voiceovers via HeyGen Starfish TTS, ElevenLabs, or local Kokoro (with native word timestamps from HeyGen)
- Retrieve background music from HeyGen audio library or generate locally via Lyria/MusicGen
- Retrieve sound effects from HeyGen library or use bundled 21-file fallback library
- Transcribe audio with Whisper and extract per-word timing data
- Author captions, subtitles, lyrics, and karaoke styling with per-word control
- Remove backgrounds from media assets
How to install hyperframes-media
npx skills add https://github.com/heygen-com/hyperframes --skill hyperframes-media
- HeyGen account (optional but recommended; sign in via `npx hyperframes auth login`)
- For local TTS fallback: Kokoro-82M (no additional setup required)
- For local BGM generation: Lyria or MusicGen (Python dependencies; optional)
- Node.js and npm to run the audio engine script
How to use hyperframes-media
1. Run `npx hyperframes auth status` to check credential status and show sign-in guidance before generating audio
2. Create an `audio_request.json` file with lines (text + optional SFX cues) and BGM mode/query/prompt
3. Execute `node <MEDIA_DIR>/scripts/audio.mjs --request ./audio_request.json --hyperframes . --out ./audio_meta.json` to generate assets
4. If BGM is set to generate mode, run `scripts/wait-bgm.mjs` before assembly to wait for detached generation to complete
5. Use `--only tts,bgm,sfx` flag to run subsets and merge results into existing output (e.g., TTS+BGM first, then SFX)
6. Reference the output `audio_meta.json` (id-keyed metadata with paths, durations, word timings) in your composition assembly
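The request file from step 2 could be assembled along the lines of the sketch below. The field names (`lines[].id`, `lines[].text`, optional `lines[].sfx`, and `bgm.mode` / `bgm.query` / `bgm.prompt`) follow the request shape given in the SKILL.md section of this page. The helper name `buildAudioRequest`, the sample text, and the validation rules are illustrative assumptions and not part of the engine.

```javascript
// Hypothetical helper: assembles an object shaped like audio_request.json.
// Only the field names come from the skill docs; the checks are assumptions.
const BGM_MODES = ["retrieve", "generate", "none"];

function buildAudioRequest(lines, bgm = {}) {
  if (!Array.isArray(lines) || lines.length === 0) {
    throw new Error("lines must be a non-empty array");
  }
  const seen = new Set();
  for (const line of lines) {
    if (line.id === undefined || typeof line.text !== "string") {
      throw new Error("each line needs an id and a text string");
    }
    if (seen.has(line.id)) throw new Error(`duplicate line id: ${line.id}`);
    seen.add(line.id);
  }
  // Omitting bgm.mode leaves the engine in its documented auto mode.
  if (bgm.mode !== undefined && !BGM_MODES.includes(bgm.mode)) {
    throw new Error(`unknown bgm.mode: ${bgm.mode}`);
  }
  return { lines, bgm };
}

const request = buildAudioRequest(
  [
    { id: "scene-1", text: "Welcome to the demo.", sfx: ["glass shatter"] },
    { id: "scene-2", text: "Here is how it works." },
  ],
  { query: "upbeat corporate" }
);

// Write this string to ./audio_request.json before calling the engine.
const requestJson = JSON.stringify(request, null, 2);
console.log(requestJson);
```

Keeping `id` unique per line matters because the output metadata is keyed by the same ids.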
Use cases
- Create narrated video compositions with precise word-level timing for animation sync
- Generate background music matching mood prompts or retrieve licensed tracks from HeyGen library
- Add sound effects to scenes using semantic search or fallback to local bundled library
- Build subtitle/karaoke tracks with per-word styling and timing information
- Produce multi-language voiceovers by switching TTS providers and languages
- Video composition developers building HyperFrames projects
- Content creators needing automated voiceover and music generation
- Developers building interactive or animated video experiences
- Teams working with multi-language or multi-voice narration
- Workflows requiring precise audio timing and caption synchronization
hyperframes-media FAQ
The engine degrades gracefully: TTS falls back to ElevenLabs (if `$ELEVENLABS_API_KEY` is set) then Kokoro local; BGM switches from retrieval to local Lyria/MusicGen generation; SFX uses the bundled 21-file library. Always run `npx hyperframes auth status` first to show the user their options and let them decide whether to sign in.
HeyGen TTS provides native word timestamps. For other providers (ElevenLabs, Kokoro), chain the `transcribe` step after TTS generation to extract per-word timing. Pass `--words narration.words.json` to `scripts/heygen-tts.mjs` to capture HeyGen word data.
Yes. Set `bgm.mode` to `generate` in your request. The engine will use local Lyria or MusicGen. BGM generation is spawned detached, so run `scripts/wait-bgm.mjs` before assembling to ensure it completes.
The engine retrieves SFX from the HeyGen audio library by default (with HeyGen credential) or falls back to the bundled 21-file library. Custom SFX integration is not covered in the core engine; reference `references/sfx.md` for details on the bundled manifest and retrieval behavior.
Use `--only tts,bgm,sfx` to run a subset of capabilities and merge results into an existing `--out` file. For example, generate TTS+BGM early, then add SFX once cues are finalized, without regenerating earlier assets.
Full instructions (SKILL.md)
Source of truth, from heygen-com/hyperframes.
name: hyperframes-media
description: Audio and media assets for HyperFrames compositions, produced by one shared audio engine (scripts/audio.mjs) — multi-provider TTS (HeyGen / ElevenLabs / Kokoro local), background music + sound effects (HeyGen audio-library retrieval by default, with local Lyria / MusicGen BGM generation and a bundled SFX library as the no-credential fallback), Whisper transcription, background removal, and caption authoring. Use for voiceover / TTS, BGM, SFX / sound effects, transcription, captions / subtitles / lyrics / karaoke / per-word styling, voice + provider selection, and music-mood prompting.
HyperFrames Media
Create the audio and media assets a composition needs — voiceover (TTS), background music + sound effects, transcription, captions, background removal — then consume and animate that data in HTML. For placing assets into compositions, see hyperframes-core.
The audio engine — one source for TTS · BGM · SFX
Workflows do NOT hand-roll audio or vendor a copy. There is one engine — scripts/audio.mjs — that takes a neutral audio_request.json and writes audio_meta.json (plus assets under assets/voice|bgm|sfx):
# <MEDIA_DIR> = this skill's directory
node <MEDIA_DIR>/scripts/audio.mjs --request ./audio_request.json --hyperframes . --out ./audio_meta.json
All three capabilities degrade on ONE switch — whether a HeyGen credential is present (resolved from $HEYGEN_API_KEY / $HYPERFRAMES_API_KEY / ~/.heygen, not the CLI):
| Capability | HeyGen credential present | absent |
|---|---|---|
| TTS | HeyGen Starfish REST (native word timestamps) | → ElevenLabs → Kokoro (chain transcribe for words) |
| BGM | HeyGen music retrieval | Lyria → MusicGen local generation (detached) |
| SFX | HeyGen sound-effects retrieval (min_score 0.4) | bundled 21-file library (assets/sfx/) |
- Request (`audio_request.json`): `{ provider?, lang?, speed?, lines: [{ id, text, sfx?: [names] }], bgm: { mode?, query?, prompt? } }`. `id` joins each line back to the caller's model (a frame number, a scene id, …). `bgm.mode` = `retrieve | generate | none`; omit for auto (retrieve when credentialed, else generate). An explicit `retrieve` is strict: it skips rather than starting a detached generate (for callers with no `wait-bgm` step).
- Output (`audio_meta.json`, id-keyed): `{ tts_provider, voice_id, bgm, bgm_pending, …, voices: [{ id, path, duration_s, words }], sfx: [{ id, name, file, source, offset_s, duration_s, volume }], total_duration_s }`.
- `--only tts,bgm,sfx` runs a subset and merges into an existing `--out` (e.g. TTS+BGM early, SFX once cues exist).
- BGM generate is spawned detached (`bgm_pending: true`); run `scripts/wait-bgm.mjs` before assembling.
- `scripts/heygen-tts.mjs` is a single-shot CLI over the same code (one text → wav + words) for when you just need HeyGen TTS without a request file.
Full flag list + the audio_meta.json schema live in the header of scripts/audio.mjs. The references below cover the provider details and edge cases behind each capability.
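As a rough illustration of consuming the output, the sketch below joins `voices` and `sfx` entries back to scenes by `id`. The sample object only mimics the documented `audio_meta.json` field names; its values, the file names, the helper `joinMetaToScenes`, the back-to-back layout of voices, and the reading of `sfx[].id` as the originating line id are all assumptions.

```javascript
// Hypothetical sample shaped like audio_meta.json (all values are made up).
const audioMeta = {
  bgm_pending: false,
  voices: [
    { id: "scene-1", path: "assets/voice/scene-1.wav", duration_s: 2.5, words: [] },
    { id: "scene-2", path: "assets/voice/scene-2.wav", duration_s: 4.0, words: [] },
  ],
  sfx: [
    { id: "scene-1", name: "glass shatter", file: "assets/sfx/glass-shatter.mp3", offset_s: 0.5 },
  ],
};

// Join voices and SFX cues back to the caller's scenes by id.
// Assumption: voices play end to end in array order.
function joinMetaToScenes(meta) {
  if (meta.bgm_pending) {
    throw new Error("BGM still generating: run scripts/wait-bgm.mjs first");
  }
  const scenes = new Map();
  let cursor = 0;
  for (const voice of meta.voices) {
    scenes.set(voice.id, { start_s: cursor, voice, sfx: [] });
    cursor += voice.duration_s;
  }
  for (const cue of meta.sfx) {
    const scene = scenes.get(cue.id);
    if (scene) scene.sfx.push(cue); // cues with an unknown id are ignored
  }
  return scenes;
}

const scenes = joinMetaToScenes(audioMeta);
console.log(scenes.get("scene-2").start_s); // 2.5
```

Refusing to continue while `bgm_pending` is true mirrors the rule that detached BGM generation must finish before assembly.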
Preflight — show sign-in status before any audio
Always run this before generating voice or BGM — inside a full workflow or a one-off "generate me a BGM/voiceover" request. No HeyGen credential is not a reason to silently fall back to local engines: first recommend signing in and let the user decide. Run the shared preflight and relay its output verbatim — don't improvise your own "missing key" prompt, and don't offer to write keys into a per-repo .env:
npx hyperframes auth status
- Signed in → it prints the account; proceed.
- Not signed in (`exit 1` is expected here; "not signed in" is a normal state, not a failure) → it prints registration-first guidance. Recommend signing in: `npx hyperframes auth login` is browser OAuth; it signs in and creates an account (always available through this repo's CLI). To use an existing HeyGen API key (from app.heygen.com/settings/api), run `npx hyperframes auth login --api-key`; it saves to the shared `~/.heygen` (no per-repo `.env`). The output also lists the local engines voice/BGM will fall back to and a `pip` hint when deps are missing. Relay this output as-is; don't paraphrase it into your own wording. Then STOP and wait for the user to choose (sign in, or say "go" / "local" to continue offline) before generating anything. This is a real decision point, not a passing note: don't fold it into another question, and don't proceed past it on your own. (Exception: in autonomous / non-interactive mode, note the status and continue offline.)
- `npx hyperframes auth status --json` returns `{ configured, recommended_action, offline_engines }` for deterministic branching.
- If the CLI can't run (not on PATH and `npx` can't fetch it) → still recommend signing in (`npx hyperframes auth login`) and STOP for the user's choice; don't treat "no credential" as a silent green light for local generation.
Credential resolution, full key priority, and the local-dependency list are in references/requirements.md.
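The branching described in this preflight section might look like the sketch below, which takes the already-parsed output of `npx hyperframes auth status --json`. The field names come from the text above; their types (a boolean `configured`, an array `offline_engines`), the sample engine names, the helper name `decidePreflight`, and the returned action labels are assumptions made for illustration.

```javascript
// Hypothetical decision helper for the preflight step.
// `status` is the parsed JSON from `npx hyperframes auth status --json`.
// Assumed types: configured is a boolean, offline_engines is an array.
function decidePreflight(status, { interactive = true } = {}) {
  if (status.configured) {
    return { action: "proceed", engines: [] };
  }
  if (!interactive) {
    // Documented exception: autonomous runs note the status and continue offline.
    return { action: "continue_offline", engines: status.offline_engines ?? [] };
  }
  // Interactive runs relay the CLI output verbatim, then wait for the user.
  return { action: "stop_and_ask", engines: status.offline_engines ?? [] };
}

const signedOut = {
  configured: false,
  recommended_action: "login",
  offline_engines: ["kokoro", "musicgen"],
};

console.log(decidePreflight(signedOut).action); // stop_and_ask
console.log(decidePreflight(signedOut, { interactive: false }).action); // continue_offline
```

The helper only decides what to do next; relaying the CLI's own text to the user remains a separate step.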
Provider chains (the detail behind the engine)
TTS — first available provider wins (the engine, or npx hyperframes tts "..."):
| Order | Provider | Detected when | Word timestamps |
|---|---|---|---|
| 1 | HeyGen (Starfish) | $HEYGEN_API_KEY / hyperframes auth login | Yes, native — pass --words narration.words.json to capture |
| 2 | ElevenLabs | $ELEVENLABS_API_KEY set | No — chain transcribe after |
| 3 | Kokoro-82M (local, 54 voices) | always (no key required) | No — chain transcribe after |
The published `hyperframes tts` CLI is often the local-only build (its `--help` says "Kokoro-82M", no `--provider` / `--words`) and silently falls back to Kokoro even with `$HEYGEN_API_KEY` set. That is why the engine's HeyGen path is the self-contained `scripts/heygen-tts.mjs` (REST), NOT the CLI; the CLI is used only for the Kokoro path. See `references/tts.md`.
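The "first available provider wins" order in the TTS table can be sketched as a small pure function. The inputs are plain booleans so the example has no side effects; the real engine resolves the HeyGen credential itself. The helper name `pickTtsProvider` and the provider labels it returns are assumptions, not confirmed `--provider` values.

```javascript
// Sketch of the TTS provider order: HeyGen, then ElevenLabs, then Kokoro.
// The provider labels are illustrative strings chosen for this example.
function pickTtsProvider({ hasHeygenCredential, hasElevenLabsKey }) {
  if (hasHeygenCredential) return { provider: "heygen", nativeWords: true };
  if (hasElevenLabsKey) return { provider: "elevenlabs", nativeWords: false };
  // Kokoro is local and needs no key, so it is always available.
  return { provider: "kokoro", nativeWords: false };
}

const choice = pickTtsProvider({ hasHeygenCredential: false, hasElevenLabsKey: true });

// Providers without native word timestamps need a transcribe pass afterwards.
const needsTranscribe = !choice.nativeWords;
console.log(choice.provider, needsTranscribe); // elevenlabs true
```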
BGM & SFX — by default retrieved from the HeyGen audio library (/v3/audio/sounds), same credential as HeyGen TTS, with the no-credential fallback from the switch above:
| Asset | HeyGen type | Lands in | Fallback (no credential) |
|---|---|---|---|
| BGM | music | assets/bgm/track.mp3 (retrieve) · track.wav (generate) | Lyria / MusicGen generation |
| SFX | sound_effects (min_score 0.4) | assets/sfx/<slug>.mp3 | bundled 21-file library (assets/sfx/* + manifest.json) |
See references/bgm.md and references/sfx.md.
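How `bgm.mode` interacts with the credential switch can be illustrated with the sketch below. It encodes only what the request description states: omitting the mode means auto (retrieve when credentialed, else generate), and an explicit `retrieve` skips instead of starting a detached generate. The helper name `resolveBgmAction` and the `skip` label are assumptions.

```javascript
// Hypothetical resolver for what the engine would do with a given bgm.mode.
function resolveBgmAction(mode, hasHeygenCredential) {
  let action;
  switch (mode) {
    case "none":
      action = "none";
      break;
    case "generate":
      action = "generate";
      break;
    case "retrieve":
      // Explicit retrieve is strict: no fallback to detached generation.
      action = hasHeygenCredential ? "retrieve" : "skip";
      break;
    case undefined:
      // Auto mode: retrieve when credentialed, else generate locally.
      action = hasHeygenCredential ? "retrieve" : "generate";
      break;
    default:
      throw new Error(`unknown bgm.mode: ${mode}`);
  }
  // Generation is detached, so callers must run scripts/wait-bgm.mjs afterwards.
  return { action, needsWait: action === "generate" };
}

console.log(resolveBgmAction(undefined, false)); // { action: 'generate', needsWait: true }
console.log(resolveBgmAction("retrieve", false)); // { action: 'skip', needsWait: false }
```

A caller with no `wait-bgm` step can therefore pass `retrieve` explicitly and never end up with a pending track.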
Routing
| Task | Read |
|---|---|
| The audio engine: request/meta schema, `--only`, the switch | `scripts/audio.mjs` (header comment) |
| `npx hyperframes tts` / `heygen-tts.mjs`: providers, voices, words | `references/tts.md` |
| BGM: HeyGen retrieval + local Lyria / MusicGen generation | `references/bgm.md` |
| SFX: HeyGen retrieval (min_score 0.4) + bundled local library | `references/sfx.md` |
| `npx hyperframes transcribe`: Whisper, model rules, output shape | `references/transcribe.md` |
| `npx hyperframes remove-background`: transparent cutouts | `references/remove-background.md` |
| TTS → transcription → captions (no recorded voiceover) | `references/tts-to-captions.md` |
| Caption authoring: style detection, layout, word grouping, exit | `references/captions/authoring.md` |
| Transcript handling: input formats, quality gates, cleanup, APIs | `references/captions/transcript-handling.md` |
| Caption motion: karaoke, marker effects, audio-reactive | `references/captions/motion.md` |
| Model caches, system dependencies, troubleshooting | `references/requirements.md` |
Non-negotiable rules
- One engine, no vendored copies. Produce audio via `scripts/audio.mjs` (or `heygen-tts.mjs` for one-shot HeyGen TTS). Don't re-implement TTS/BGM/SFX inside a workflow; write an `audio_request.json` adapter and call the engine.
- "HeyGen available" = a resolvable credential, not the CLI. The whole switch keys off `heygenCredential()`; the published `hyperframes tts` may be Kokoro-only, and there is no `hyperframes bgm` / `hyperframes sfx` command at all.
- Voice IDs are provider-specific. `am_michael` is Kokoro-only; HeyGen UUIDs don't work on Kokoro. If you pass `--voice`, also pin `--provider` to avoid silent provider drift when the user's env changes.
- Always pass `--model` to `transcribe`. The CLI default `small.en` silently translates non-English audio. See `references/transcribe.md` → "Language Rule".
- HeyGen returns word timestamps; ElevenLabs / Kokoro do not. The engine chains `transcribe` automatically for the latter two; standalone, pass `--words` to HeyGen or run `transcribe` against the audio file.
- Captions consume the flat word-array format with `{ id, text, start, end }`. See `references/transcribe.md` → "Output Shape".
- `remove-background --background-output` is hole-cut, not inpainted. For "scene without the person", a different tool is needed. See `references/remove-background.md` → "When NOT the right tool".
- BGM/SFX default to HeyGen retrieval; the no-credential fallback is generation (BGM) or the bundled library (SFX). `/audio/sounds` ranks by a text query, so name effects concretely (glass shatter, not dramatic sound); a no-match skips, never blocks the render. SFX sit at volume ~0.35 under voice + BGM. See `references/sfx.md` / `references/bgm.md`.
- Treat workflow caption HTML as generated output. For preset-backed videos, the reusable skin source lives at `.hyperframes/caption-skin.html` and the workflow script writes `compositions/captions.html`; do not edit generated `compositions/captions.html` to fix the skin. Rebuild via the workflow's `captions.mjs`, or use that workflow's explicit overrides mechanism when present.
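To make the flat word-array rule concrete, the sketch below groups `{ id, text, start, end }` words into caption chunks. It is not the grouping logic from `references/captions/authoring.md`; the helper name `groupWords`, the thresholds, the sample words, and the assumption that `start` / `end` are in seconds are all illustrative.

```javascript
// Hypothetical grouping of a flat word array into caption chunks.
// Assumes start/end are seconds and words arrive in time order.
function groupWords(words, { maxWords = 4, maxGapS = 0.6 } = {}) {
  const groups = [];
  let current = [];
  for (const word of words) {
    const prev = current[current.length - 1];
    const gapTooLong = prev !== undefined && word.start - prev.end > maxGapS;
    // Start a new chunk when the current one is full or a long pause occurs.
    if (current.length >= maxWords || gapTooLong) {
      groups.push(current);
      current = [];
    }
    current.push(word);
  }
  if (current.length > 0) groups.push(current);
  return groups.map((group) => ({
    text: group.map((w) => w.text).join(" "),
    start: group[0].start,
    end: group[group.length - 1].end,
    words: group, // kept for per-word styling such as karaoke highlights
  }));
}

const sampleWords = [
  { id: 0, text: "Hello", start: 0.0, end: 0.4 },
  { id: 1, text: "world", start: 0.45, end: 0.9 },
  { id: 2, text: "this", start: 2.0, end: 2.3 },
  { id: 3, text: "works", start: 2.35, end: 2.8 },
];

const captions = groupWords(sampleWords);
console.log(captions.map((c) => c.text)); // [ 'Hello world', 'this works' ]
```

Each chunk keeps its original word objects, so per-word timing stays available to whatever animates the caption.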
Related skills
More from heygen-com/hyperframes and the wider catalog.
hyperframes
Router and entry skill for video authoring—renders video from HTML with intent-based workflow selection.
hyperframes-cli
CLI for HyperFrames video composition: scaffold, lint, validate, render locally or on AWS Lambda.
hyperframes-registry
Install and wire reusable blocks and components into HyperFrames compositions via registry.
remotion-to-hyperframes
Port existing Remotion (React) video compositions to HyperFrames HTML—one-way translation only.
gsap
GSAP animation reference for HyperFrames compositions with timelines, easing, and performance optimization.
website-to-hyperframes
Capture a website and create professional HyperFrames videos from it.