image-to-video

agentspace-so/runcomfy-agent-skills

Animate still images with the right model for your intent—HappyHorse, Wan, or Seedance on RunComfy.

What is image-to-video?

image-to-video is a skill that routes your request to the best-suited model in the RunComfy catalog. Use it to animate portraits with identity preservation, create custom-voiceover lip-sync videos, or compose multi-modal animations from an image plus reference video and audio.

Routes user intent to HappyHorse 1.0 I2V (best all-round, native audio), Wan 2.7 (custom voiceover lip-sync), or Seedance 2.0 Pro (multi-modal with refs)
Handles portrait animation with facial fidelity and identity preservation
Supports custom audio track lip-sync for voiceover and multi-language dubs
Enables multi-modal composition combining subject image, reference video, and reference audio
Includes model-specific prompting patterns to reduce iteration and improve output quality
Calls RunComfy CLI endpoints locally via `runcomfy run` commands

How to install image-to-video

npx skills add https://github.com/agentspace-so/runcomfy-agent-skills --skill image-to-video

Prerequisites

RunComfy CLI installed: `npm i -g @runcomfy/cli`
RunComfy account with `runcomfy login` authentication (or `RUNCOMFY_TOKEN` env var for CI/containers)
Source image URL (JPEG/PNG/WebP, min 300px, ≤10MB, aspect ratio 1:2.5 to 2.5:1)
For Wan 2.7 with audio: custom audio file (WAV/MP3, 3–30s, ≤15MB)
For Seedance 2.0 Pro: optional reference video (MP4/MOV, 2–15s) and/or reference audio (WAV/MP3, 2–15s, <15MB)

How to use image-to-video

1. Install the skill: `npx skills add https://github.com/agentspace-so/runcomfy-agent-skills --skill image-to-video`
2. Prepare your source image URL and decide on your intent: general animation, custom-voiceover lip-sync, or multi-modal composition.
3. For general portrait or product animation, call `runcomfy run happyhorse/happyhorse-1-0/image-to-video` with `image_url` and a motion-focused `prompt`.
4. For custom-voiceover lip-sync, call `runcomfy run wan-ai/wan-2-7/text-to-video` with `prompt`, `audio_url`, and a `duration` that matches the audio.
5. For multi-modal animation, call `runcomfy run bytedance/seedance-v2/pro` with `image_url`, optional `video_url` and `audio_url` arrays, and `prompt`.
6. Set the output directory with `--output-dir` and retrieve the generated video file from it.

Use cases

Good for

Animate a product reveal or 360-degree view with smooth camera motion and geometry preservation
Create a talking-head video by lip-syncing a custom voiceover track to a generated character
Generate multi-language dub variants of the same scene by swapping audio tracks while keeping visuals consistent
Build a brand narrative combining a character image, scene reference video, and voice reference audio
Produce portrait animations with subtle breathing and camera drift while maintaining facial identity

Who it's for

Content creators and video producers
Marketing and advertising teams building multi-language campaigns
Product teams creating animated demos and reveals
Developers building video generation workflows into applications
Anyone needing to turn still images into short video clips

image-to-video FAQ

Which model should I use if I'm not sure?

Use HappyHorse 1.0 I2V by default—it's #1 on the Artificial Analysis Arena (Elo 1392), handles general animations well, and includes native audio synthesis in a single pass.

Can I lip-sync a custom voiceover to a video?

Yes, use Wan 2.7 with the `audio_url` parameter. It accepts WAV/MP3 files (3–30s, ≤15MB) and drives lip-sync to your custom audio track.

How do I create multi-language dub variants?

Use Wan 2.7 with the same prompt and seed but swap the `audio_url` per language call. This keeps visuals consistent while changing the voiceover.

What image formats and sizes are supported?

JPEG, JPG, PNG, or WebP; minimum 300px; maximum 10MB; aspect ratio 1:2.5 to 2.5:1 (HappyHorse). Other models have similar specs.

Can I combine a character image, scene reference, and voice reference in one video?

Yes, use Seedance 2.0 Pro. It accepts up to 9 images (first is primary), 3 reference videos (2–15s each), and 3 reference audio files to compose a multi-modal animation.

Full instructions (SKILL.md)

Source of truth, from agentspace-so/runcomfy-agent-skills.

name: image-to-video
displayName: "Image-to-Video — Pro Pack on RunComfy"
description: >
  Animate any still image on RunComfy — this skill is a smart router that
  matches the user's intent to the right i2v model in the RunComfy catalog.
  Picks HappyHorse 1.0 I2V (Arena #1, native audio, identity preservation)
  for general animations, Wan 2.7 with `audio_url` for custom-voiceover
  lip-sync, or Seedance 2.0 Pro for multi-modal animation from image +
  reference video + reference audio. Bundles each model's documented
  prompting patterns so the caller gets sharper output without burning
  iterations on the wrong model. Calls
  `runcomfy run <vendor>/<model>/image-to-video` (or endpoint variant)
  through the local RunComfy CLI. Triggers on "image to video",
  "image-to-video", "i2v", "animate image", "make this move", or any
  explicit ask to turn a still into video.
homepage: https://www.runcomfy.com
license: MIT

Image-to-Video — Pro Pack on RunComfy

Image-to-video, intent-routed. This skill doesn't lock you to one model — it picks the right i2v model in the RunComfy catalog based on what the user actually wants: portrait animation, custom-voiceover lip-sync, or multi-modal composition.

npx skills add agentspace-so/runcomfy-agent-skills --skill image-to-video -g

Pick the right model for the user's intent

User intent	Model	Why
Animate a portrait — keep identity stable	HappyHorse 1.0 I2V	#1 on Artificial Analysis Arena (Elo 1392); strong facial fidelity
Product reveal / 360 / macro motion	HappyHorse 1.0 I2V	Geometry preservation + smooth camera moves
Native synchronized ambient audio in one pass	HappyHorse 1.0 I2V	In-pass audio synthesis
Animate and lip-sync to a custom voiceover track	Wan 2.7 + `audio_url`	Accepts your own MP3/WAV (3–30s, ≤15MB) and drives lip-sync to it
Multi-language dub variants (same image, different audio per call)	Wan 2.7 + `audio_url`	Same shot, swap `audio_url` per language
Multi-modal — image + reference video + reference audio together	Seedance 2.0 Pro	Up to 9 image refs, 3 video refs (2–15s each), 3 audio refs
Brand-consistent narrative with character ref + scene ref + voice ref	Seedance 2.0 Pro	Image holds identity, video holds scene, audio holds voice
Default if unspecified	HappyHorse 1.0 I2V	Best all-round quality + native audio

The agent reads this table, classifies the user's intent, and picks the matching subsection below.
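The routing table above can be read as a plain lookup. A minimal sketch in Python, assuming the request has already been classified into one of a few intent labels. The labels and the `pick_model` helper are hypothetical and not part of the skill; the model IDs are the ones named in the routes of this document.

```python
from typing import Optional

HAPPYHORSE = "happyhorse/happyhorse-1-0/image-to-video"
WAN_T2V = "wan-ai/wan-2-7/text-to-video"
SEEDANCE_PRO = "bytedance/seedance-v2/pro"

# Hypothetical intent labels, one per row of the routing table.
ROUTES = {
    "portrait": HAPPYHORSE,
    "product_reveal": HAPPYHORSE,
    "native_audio": HAPPYHORSE,
    "lip_sync": WAN_T2V,
    "dub_variants": WAN_T2V,
    "multi_modal": SEEDANCE_PRO,
    "brand_narrative": SEEDANCE_PRO,
}


def pick_model(intent: Optional[str]) -> str:
    """Return the model ID for an intent label, defaulting to HappyHorse."""
    if intent is None:
        return HAPPYHORSE
    return ROUTES.get(intent, HAPPYHORSE)
```

Unknown or missing intents fall through to HappyHorse, mirroring the "Default if unspecified" row.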

Prerequisites

RunComfy CLI: `npm i -g @runcomfy/cli`
RunComfy account: `runcomfy login` opens a browser device-code flow.
CI / containers: set `RUNCOMFY_TOKEN=<token>`.
A source image URL: JPEG/PNG/WebP, min 300px, ≤10MB, aspect ratio 1:2.5 to 2.5:1 (HappyHorse limits; other models have similar specs).
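The source-image limits can be checked locally before a call is made. A rough sketch, assuming the image dimensions and file size are already known. Two readings here are assumptions rather than documented behavior: that "min 300px" applies to the shorter side, and that "10MB" means 10 × 1024 × 1024 bytes. The `check_source_image` helper is hypothetical.

```python
from typing import List

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB
MIN_SIDE_PX = 300             # assumption: limit applies to the shorter side


def check_source_image(width: int, height: int, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the image looks usable."""
    problems = []
    if min(width, height) < MIN_SIDE_PX:
        problems.append("shorter side is under 300px")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 10MB")
    # Aspect ratio (width:height) must lie between 1:2.5 and 2.5:1.
    # Integer form of 0.4 <= width / height <= 2.5, avoiding float rounding.
    if 5 * width < 2 * height or 2 * width > 5 * height:
        problems.append("aspect ratio is outside 1:2.5 to 2.5:1")
    return problems
```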

Route 1: HappyHorse 1.0 I2V — default for portrait / product / general animation

Model: happyhorse/happyhorse-1-0/image-to-video · Arena rank: #1 (Elo 1392)

Schema

Field	Type	Required	Default	Notes
`image_url`	string	yes	—	JPEG/JPG/PNG/WEBP. Min 300px. Aspect 1:2.5–2.5:1. ≤10MB.
`prompt`	string	yes	—	≤5000 non-CJK or 2500 CJK chars. Motion / camera / lighting description.
`resolution`	enum	no	`1080P`	`720P` or `1080P`.
`duration`	int	no	5	3–15 seconds.
`seed`	int	no	0	Reuse for variant comparisons.
`watermark`	bool	no	true	Provider watermark toggle.

Output aspect = input aspect. No independent reframing.
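The schema above maps directly onto a small payload builder. This is a sketch rather than an official client: the `happyhorse_payload` helper is hypothetical, and it enforces only the limits listed in the table (the separate 2500-character CJK prompt limit is not checked).

```python
from typing import Any, Dict

RESOLUTIONS = ("720P", "1080P")


def happyhorse_payload(
    image_url: str,
    prompt: str,
    resolution: str = "1080P",
    duration: int = 5,
    seed: int = 0,
    watermark: bool = True,
) -> Dict[str, Any]:
    """Build the --input JSON body, rejecting values outside the schema."""
    if not image_url:
        raise ValueError("image_url is required")
    if not prompt:
        raise ValueError("prompt is required")
    # Only the non-CJK limit is checked here.
    if len(prompt) > 5000:
        raise ValueError("prompt exceeds 5000 characters")
    if resolution not in RESOLUTIONS:
        raise ValueError("resolution must be 720P or 1080P")
    if not 3 <= duration <= 15:
        raise ValueError("duration must be 3-15 seconds")
    return {
        "image_url": image_url,
        "prompt": prompt,
        "resolution": resolution,
        "duration": duration,
        "seed": seed,
        "watermark": watermark,
    }
```

Failing early on a bad `duration` or `resolution` avoids spending a remote call on a request the schema would reject.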

Invoke

runcomfy run happyhorse/happyhorse-1-0/image-to-video \
  --input '{
    "image_url": "https://.../portrait.jpg",
    "prompt": "Gentle camera drift around the subject'\''s face, subtle breathing motion, identity-stable features, soft natural light."
  }' \
  --output-dir <absolute/path>

Prompting tips

Lead with motion verbs: "drift", "dolly in", "orbit", "tilt up", "reveal", "blink", "breathe". Front-load what's MOVING.
Don't restate the image — the model sees it. Focus tokens on what changes.
State preservation goals explicitly: "identity-stable features", "packaging unchanged", "background geometry stable".
Describe lighting evolution: "rim light intensifying", "shadows shortening as camera rises".
One beat per clip: a single primary motion (orbit OR dolly OR tilt OR character action).

Route 2: Wan 2.7 + `audio_url` — when the user has a custom voiceover

Model: wan-ai/wan-2-7/text-to-video (NOT /image-to-video — Wan 2.7's t2v endpoint accepts an audio_url that drives lip-sync)

Note on i2v with Wan 2.7: this skill does not route pure image-to-video animation to a dedicated Wan 2.7 endpoint. For pure i2v (an image animated by a motion prompt only), prefer HappyHorse I2V. Use Wan 2.7 specifically when the user has a custom audio track to lip-sync to a generated talking-head clip.

Schema (Wan 2.7 t2v with audio)

Field	Type	Required	Default	Notes
`prompt`	string	yes	—	Up to ~5000 chars. Describe the talking-head shot: framing, lighting, motion.
`audio_url`	string	yes (for lip-sync)	—	WAV/MP3, 3–30s, ≤15MB. Drives lip-sync.
`aspect_ratio`	enum	no	`16:9`	`16:9`, `9:16`, `1:1`, `4:3`, `3:4`.
`resolution`	enum	no	`1080p`	`720p` or `1080p`.
`duration`	enum	no	`5`	2–15 (whole seconds). Match your audio length.
`negative_prompt`	string	no	—	Concrete issues to avoid (e.g. "no subtitles, no flicker").
`seed`	int	no	—	Reproducibility.
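Because `duration` takes whole seconds from 2 to 15 while the audio may run 3–30s, the duration has to be derived from the audio length. A minimal sketch with a hypothetical `wan_duration_for_audio` helper. Rounding up is a choice made here, not documented behavior, and the source does not say what happens to audio longer than the 15-second cap.

```python
import math


def wan_duration_for_audio(audio_seconds: float) -> int:
    """Pick a whole-second duration that covers the audio, capped at 15."""
    if not 3 <= audio_seconds <= 30:
        raise ValueError("audio must be 3-30 seconds long")
    # Round up so the clip is never shorter than the audio,
    # then clamp to the 15-second ceiling from the schema.
    return min(15, math.ceil(audio_seconds))
```

For an 11.4-second voiceover this yields 12, which leaves well under a second of silent tail.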

Invoke

runcomfy run wan-ai/wan-2-7/text-to-video \
  --input '{
    "prompt": "Medium close-up of a confident spokesperson in a softly-lit recording booth, leaning slightly toward the camera, locked tripod, shallow DOF, warm key light from camera-left.",
    "audio_url": "https://.../voiceover-en.mp3",
    "duration": 12,
    "aspect_ratio": "9:16"
  }' \
  --output-dir <absolute/path>

Prompting tips

Describe the talking-head shot: framing, lighting, lens feel. The audio drives the lip-sync; the prompt builds the visual frame around it.
Match `duration` to the audio length: if the clip is longer than the audio, the remainder plays silent.
Use `negative_prompt` for concrete issues: "no subtitles, no flicker, no distorted hands".
For multi-language dubs, keep the same prompt and swap `audio_url` per call. Lock `seed` for visual consistency across languages.
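The multi-language dub pattern amounts to one payload per language in which only `audio_url` changes. A short sketch, assuming a hypothetical `dub_payloads` helper and placeholder `example.com` URLs:

```python
from typing import Any, Dict


def dub_payloads(
    prompt: str,
    audio_by_language: Dict[str, str],
    duration: int,
    seed: int,
) -> Dict[str, Dict[str, Any]]:
    """Return one Wan 2.7 payload per language, sharing prompt and seed."""
    return {
        language: {
            "prompt": prompt,
            "audio_url": audio_url,  # the only field that varies
            "duration": duration,
            "seed": seed,            # locked for visual consistency
        }
        for language, audio_url in audio_by_language.items()
    }


# Placeholder URLs for illustration only.
variants = dub_payloads(
    prompt="Medium close-up of a spokesperson, locked tripod, warm key light.",
    audio_by_language={
        "en": "https://example.com/voiceover-en.mp3",
        "es": "https://example.com/voiceover-es.mp3",
    },
    duration=12,
    seed=42,
)
```

This sketch uses a single `duration` for all languages. If the translated tracks differ in length, the duration would need to be set per language instead.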

Route 3: Seedance 2.0 Pro — multi-modal animation (image + ref video + ref audio)

Model: bytedance/seedance-v2/pro

Use this route when the user wants a single clip that combines a subject image, a scene from a reference video, and voice tone from a reference audio file.

Schema (Seedance 2.0 Pro, i2v-relevant fields)

Field	Type	Required	Default	Notes
`prompt`	string	yes	—	CN ≤500 chars OR EN ≤1000 words.
`image_url`	array	yes (for i2v)	`[]`	0–9 images. First is the primary subject.
`video_url`	array	no	`[]`	0–3 reference clips (MP4/MOV), 2–15s each.
`audio_url`	array	no	`[]`	0–3 reference audio (WAV/MP3), 2–15s, < 15MB each.
`aspect_ratio`	enum	no	`adaptive`	`adaptive`, `16:9`, `9:16`, `4:3`, `3:4`, `1:1`, `21:9`.
`duration`	int	no	5	4–15 (whole seconds).
`resolution`	enum	no	`720p`	`480p` or `720p`.
`generate_audio`	bool	no	true	In-pass synchronized speech / SFX / music.
`seed`	int	no	—	Reproducibility.
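The reference-count limits in this schema are easy to exceed when refs are assembled programmatically. A sketch of a guard, using a hypothetical `seedance_payload` helper. It checks only counts and duration; the per-file length and size limits would need the media itself.

```python
from typing import Any, Dict, Sequence


def seedance_payload(
    prompt: str,
    image_urls: Sequence[str],
    video_urls: Sequence[str] = (),
    audio_urls: Sequence[str] = (),
    duration: int = 5,
) -> Dict[str, Any]:
    """Build a Seedance 2.0 Pro i2v body, enforcing the reference counts."""
    if not prompt:
        raise ValueError("prompt is required")
    # The schema allows 0-9 images, but i2v needs at least the primary subject.
    if not 1 <= len(image_urls) <= 9:
        raise ValueError("i2v needs 1-9 images")
    if len(video_urls) > 3:
        raise ValueError("at most 3 reference videos")
    if len(audio_urls) > 3:
        raise ValueError("at most 3 reference audio files")
    if not 4 <= duration <= 15:
        raise ValueError("duration must be 4-15 seconds")
    return {
        "prompt": prompt,
        "image_url": list(image_urls),  # first entry is the primary subject
        "video_url": list(video_urls),
        "audio_url": list(audio_urls),
        "duration": duration,
    }
```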

Invoke

runcomfy run bytedance/seedance-v2/pro \
  --input '{
    "prompt": "Subject from image 1 walks through the café in video 1, voice tone matches audio 1. Medium close-up, slow push-in, warm light, gentle ambience.",
    "image_url": ["https://.../subject.jpg"],
    "video_url": ["https://.../cafe-locked-shot.mp4"],
    "audio_url": ["https://.../voice-tone.mp3"],
    "duration": 8
  }' \
  --output-dir <absolute/path>

Prompting tips

Image vs text division: use `image_url` for what must stay stable (face, costume, brand) and `prompt` for what should evolve (action, mood, lighting).
Number the refs in the prompt: "subject from image 1, lighting from video 1, voice from audio 1". Numbering lets Seedance route each cue to the right reference.
Reference media specs: videos and audio must be 2–15s each; audio must be under 15MB.
Don't mix radically different aesthetics: if image 1 is a watercolor and video 1 is photoreal, the output drifts.

Limitations

Each route inherits its model's limits. HappyHorse: 15s cap, output aspect = input aspect. Wan 2.7: 15s cap, audio 3–30s/15MB. Seedance: 720p ceiling on this template, 15s cap.
No multi-route blending. This skill picks one model per call. If the user wants HappyHorse animation + Wan-style lip-sync in the same clip, that's two calls + a stitch (out of scope here).
Brand-specific overrides — if the user named a specific model variant not listed (e.g. Wan 2.6, Seedance 1.5), route to the corresponding brand skill (wan-2-7, seedance-v2) instead of forcing it through here.

Exit codes

code	meaning
0	success
64	bad CLI args
65	bad input JSON / schema mismatch
69	upstream 5xx
75	retryable: timeout / 429
77	not signed in or token rejected

Full reference: docs.runcomfy.com/cli/troubleshooting.
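A caller can branch on these exit codes instead of parsing output. A minimal sketch; the action names and the `next_action` and `should_retry` helpers are hypothetical. Only code 75 is treated as retryable, since it is the only one the table labels that way, and the attempt limit of 3 is an arbitrary choice.

```python
# Action names are illustrative; the codes come from the table above.
ACTIONS = {
    0: "done",
    64: "fix_cli_args",
    65: "fix_input_json",
    69: "report_upstream_error",
    75: "retry",
    77: "sign_in",
}


def next_action(exit_code: int) -> str:
    """Map a runcomfy exit code to a coarse follow-up action."""
    return ACTIONS.get(exit_code, "unknown")


def should_retry(exit_code: int, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only on code 75 (timeout / 429) while attempts remain."""
    return exit_code == 75 and attempt < max_attempts
```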

How it works

The skill picks one of HappyHorse 1.0 I2V / Wan 2.7 t2v+audio / Seedance 2.0 Pro based on user intent and invokes runcomfy run <model_id> with the matching JSON body. The CLI POSTs to the Model API, polls the request, fetches the result, and downloads any .runcomfy.net/.runcomfy.com URL into --output-dir. Ctrl-C cancels the remote request before exit.
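From a script, the same invocation can be assembled as an argument list. The sketch below uses only the `runcomfy run <model_id> --input <json> --output-dir <path>` form shown in this document; the `build_command` and `run_model` helpers are hypothetical. Passing a list to `subprocess.run` bypasses the shell, so apostrophes in the prompt need no escaping.

```python
import json
import subprocess
from typing import Any, Dict, List


def build_command(model_id: str, payload: Dict[str, Any], output_dir: str) -> List[str]:
    """Assemble the argv for `runcomfy run`, with the payload as one JSON arg."""
    return [
        "runcomfy", "run", model_id,
        "--input", json.dumps(payload),
        "--output-dir", output_dir,
    ]


def run_model(model_id: str, payload: Dict[str, Any], output_dir: str) -> int:
    """Run the CLI and return its exit code (requires `runcomfy` on PATH)."""
    completed = subprocess.run(build_command(model_id, payload, output_dir))
    return completed.returncode
```

`run_model` blocks until the CLI exits, so the submit, poll, and download steps described above all complete before the exit code comes back.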

Security & Privacy

Token storage: runcomfy login writes the API token to ~/.config/runcomfy/token.json with mode 0600 (owner-only read/write). Set RUNCOMFY_TOKEN env var to bypass the file entirely in CI / containers.
Input boundary: the user prompt is passed as a JSON string to the CLI via --input. The CLI does NOT shell-expand the prompt; it transmits the JSON body directly to the Model API over HTTPS. No shell injection surface from prompt content.
Third-party content: image / mask / video URLs you pass are fetched by the RunComfy model server, not by the CLI on your machine. Treat external URLs as untrusted; image-based prompt injection is a known risk for any image-edit / video-edit model.
Outbound endpoints: only model-api.runcomfy.net (request submission) and *.runcomfy.net / *.runcomfy.com (download whitelist for generated outputs). No telemetry, no callbacks.
Generated-file size cap: the CLI aborts any single download > 2 GiB to prevent disk-fill from a malicious or runaway model output.

Related skills

More from agentspace-so/runcomfy-agent-skills and the wider catalog.

video-edit

nano-banana-2

image-edit

nano-banana-edit

flux-kontext

wan-2-7
