Audit score 70

image-to-video

agentspace-so/runcomfy-agent-skills

Animate still images with the right model for your intent—HappyHorse, Wan, or Seedance on RunComfy.

What is image-to-video?

Image-to-video skill that intelligently routes your request to the best model in the RunComfy catalog. Use it to animate portraits with identity preservation, create custom-voiceover lip-sync videos, or compose multi-modal animations from image + reference video + audio.

  • Routes user intent to HappyHorse 1.0 I2V (best all-round, native audio), Wan 2.7 (custom voiceover lip-sync), or Seedance 2.0 Pro (multi-modal with refs)
  • Handles portrait animation with facial fidelity and identity preservation
  • Supports custom audio track lip-sync for voiceover and multi-language dubs
  • Enables multi-modal composition combining subject image, reference video, and reference audio
  • Includes model-specific prompting patterns to reduce iteration and improve output quality
  • Calls RunComfy CLI endpoints locally via `runcomfy run` commands

How to install image-to-video

npx skills add https://github.com/agentspace-so/runcomfy-agent-skills --skill image-to-video
Prerequisites
  • RunComfy CLI installed: `npm i -g @runcomfy/cli`
  • RunComfy account with `runcomfy login` authentication (or `RUNCOMFY_TOKEN` env var for CI/containers)
  • Source image URL (JPEG/PNG/WebP, min 300px, ≤10MB, aspect ratio 1:2.5 to 2.5:1)
  • For Wan 2.7 with audio: custom audio file (WAV/MP3, 3–30s, ≤15MB)
  • For Seedance 2.0 Pro: optional reference video (MP4/MOV, 2–15s) and/or reference audio (WAV/MP3, 2–15s, <15MB)

How to use image-to-video

  1. Install the skill: `npx skills add https://github.com/agentspace-so/runcomfy-agent-skills --skill image-to-video`
  2. Prepare your source image URL and determine your intent: general animation, custom voiceover lip-sync, or multi-modal composition
  3. For general portrait/product animation: call `runcomfy run happyhorse/happyhorse-1-0/image-to-video` with image_url and motion-focused prompt
  4. For custom voiceover lip-sync: call `runcomfy run wan-ai/wan-2-7/text-to-video` with prompt, audio_url, and matching duration
  5. For multi-modal animation: call `runcomfy run bytedance/seedance-v2/pro` with image_url, optional video_url and audio_url arrays, and prompt
  6. Specify output directory with `--output-dir` and retrieve the generated video file

Use cases

Good for
  • Animate a product reveal or 360-degree view with smooth camera motion and geometry preservation
  • Create a talking-head video by lip-syncing a custom voiceover track to a generated character
  • Generate multi-language dub variants of the same scene by swapping audio tracks while keeping visuals consistent
  • Build a brand narrative combining a character image, scene reference video, and voice reference audio
  • Produce portrait animations with subtle breathing and camera drift while maintaining facial identity
Who it's for
  • Content creators and video producers
  • Marketing and advertising teams building multi-language campaigns
  • Product teams creating animated demos and reveals
  • Developers building video generation workflows into applications
  • Anyone needing to turn still images into short video clips

image-to-video FAQ

Which model should I use if I'm not sure?

Use HappyHorse 1.0 I2V by default—it's #1 on the Artificial Analysis Arena (Elo 1392), handles general animations well, and includes native audio synthesis in a single pass.

Can I lip-sync a custom voiceover to a video?

Yes, use Wan 2.7 with the `audio_url` parameter. It accepts WAV/MP3 files (3–30s, ≤15MB) and drives lip-sync to your custom audio track.

How do I create multi-language dub variants?

Use Wan 2.7 with the same prompt and seed but swap the `audio_url` per language call. This keeps visuals consistent while changing the voiceover.

What image formats and sizes are supported?

JPEG, JPG, PNG, or WebP; minimum 300px; maximum 10MB; aspect ratio 1:2.5 to 2.5:1 (HappyHorse). Other models have similar specs.

Can I combine a character image, scene reference, and voice reference in one video?

Yes, use Seedance 2.0 Pro. It accepts up to 9 images (first is primary), 3 reference videos (2–15s each), and 3 reference audio files to compose a multi-modal animation.

Full instructions (SKILL.md)

Source of truth, from agentspace-so/runcomfy-agent-skills.


name: image-to-video
displayName: "Image-to-Video — Pro Pack on RunComfy"
description: >
  Animate any still image on RunComfy — this skill is a smart router that matches the user's intent to the right i2v model in the RunComfy catalog.
  Picks HappyHorse 1.0 I2V (Arena #1, native audio, identity preservation) for general animations, Wan 2.7 with audio_url for custom-voiceover lip-sync, or Seedance 2.0 Pro for multi-modal animation from image + reference video + reference audio.
  Bundles each model's documented prompting patterns so the caller gets sharper output without burning iterations on the wrong model.
  Calls runcomfy run <vendor>/<model>/image-to-video (or endpoint variant) through the local RunComfy CLI.
  Triggers on "image to video", "image-to-video", "i2v", "animate image", "make this move", or any explicit ask to turn a still into video.
homepage: https://www.runcomfy.com
license: MIT

Image-to-Video — Pro Pack on RunComfy

runcomfy.com · HappyHorse I2V · Wan 2.7 · Seedance 2.0 Pro · GitHub

Image-to-video, intent-routed. This skill doesn't lock you to one model — it picks the right i2v model in the RunComfy catalog based on what the user actually wants: portrait animation, custom-voiceover lip-sync, or multi-modal composition.

npx skills add agentspace-so/runcomfy-agent-skills --skill image-to-video -g

Pick the right model for the user's intent

| User intent | Model | Why |
|---|---|---|
| Animate a portrait, keeping identity stable | HappyHorse 1.0 I2V | #1 on Artificial Analysis Arena (Elo 1392); strong facial fidelity |
| Product reveal / 360 / macro motion | HappyHorse 1.0 I2V | Geometry preservation + smooth camera moves |
| Native synchronized ambient audio in one pass | HappyHorse 1.0 I2V | In-pass audio synthesis |
| Animate and lip-sync to a custom voiceover track | Wan 2.7 + audio_url | Accepts your own MP3/WAV (3–30s, ≤15MB) and drives lip-sync to it |
| Multi-language dub variants (same shot, different audio per call) | Wan 2.7 + audio_url | Same shot, swap audio_url per language |
| Multi-modal: image + reference video + reference audio together | Seedance 2.0 Pro | Up to 9 image refs, 3 video refs (2–15s each), 3 audio refs |
| Brand-consistent narrative with character ref + scene ref + voice ref | Seedance 2.0 Pro | Image holds identity, video holds scene, audio holds voice |
| Default if unspecified | HappyHorse 1.0 I2V | Best all-round quality + native audio |

The agent reads this table, classifies the user's intent, and picks the matching subsection below.
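The routing decision can be pictured as a simple lookup. The sketch below is only an illustration: the intent labels (`lipsync`, `multimodal`, and so on) are hypothetical names invented for this example, and in practice the agent classifies free-form requests rather than matching fixed keywords. The model IDs are the ones used in the routes below.

```shell
# Hypothetical helper: map a coarse intent label to a RunComfy model ID.
# The labels are assumptions for illustration; the model IDs come from this skill.
pick_model() {
  case "$1" in
    lipsync|voiceover|dub)
      echo "wan-ai/wan-2-7/text-to-video" ;;
    multimodal|reference-video|reference-audio)
      echo "bytedance/seedance-v2/pro" ;;
    *)
      # portrait, product, native ambient audio, or anything unspecified
      echo "happyhorse/happyhorse-1-0/image-to-video" ;;
  esac
}

pick_model dub
pick_model portrait
```

Falling through to HappyHorse in the catch-all branch mirrors the "Default if unspecified" row of the table.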

Prerequisites

  1. RunComfy CLI: `npm i -g @runcomfy/cli`
  2. RunComfy account: `runcomfy login` opens a browser device-code flow.
  3. CI / containers: set `RUNCOMFY_TOKEN=<token>`.
  4. A source image URL: JPEG/PNG/WebP, min 300px, ≤10MB; aspect 1:2.5 to 2.5:1 (HappyHorse). Other models have similar specs.

Route 1: HappyHorse 1.0 I2V — default for portrait / product / general animation

Model: happyhorse/happyhorse-1-0/image-to-video · Arena rank: #1 (Elo 1392)

Schema

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| image_url | string | yes | | JPEG/JPG/PNG/WEBP. Min 300px. Aspect 1:2.5–2.5:1. ≤10MB. |
| prompt | string | yes | | ≤5000 non-CJK or 2500 CJK chars. Motion / camera / lighting description. |
| resolution | enum | no | 1080P | 720P or 1080P. |
| duration | int | no | 5 | 3–15 seconds. |
| seed | int | no | 0 | Reuse for variant comparisons. |
| watermark | bool | no | true | Provider watermark toggle. |

Output aspect = input aspect. No independent reframing.
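Because the output inherits the input aspect, it can help to check the source image's dimensions before submitting. The following is a minimal sketch of such a preflight check. `check_image_dims` is a hypothetical helper, not part of the RunComfy CLI, and it assumes "min 300px" refers to the shorter side, which the schema does not state. Obtaining the width and height (for example with an image tool) is left out.

```shell
# Hypothetical preflight: check_image_dims WIDTH HEIGHT
# Assumption: "min 300px" applies to the shorter side.
check_image_dims() {
  w=$1; h=$2
  if [ "$w" -lt "$h" ]; then short=$w; long=$h; else short=$h; long=$w; fi
  if [ "$short" -lt 300 ]; then
    echo "too small: shorter side is ${short}px"; return 1
  fi
  # Aspect within 1:2.5 to 2.5:1 means long/short <= 2.5,
  # written as 2*long <= 5*short to stay in integer arithmetic.
  if [ $((2 * long)) -gt $((5 * short)) ]; then
    echo "aspect out of range"; return 1
  fi
  echo "ok"
}

check_image_dims 1080 1920
```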

Invoke

runcomfy run happyhorse/happyhorse-1-0/image-to-video \
  --input '{
    "image_url": "https://.../portrait.jpg",
    "prompt": "Gentle camera drift around the subject'\''s face, subtle breathing motion, identity-stable features, soft natural light."
  }' \
  --output-dir <absolute/path>
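The `'\''` sequence in the prompt above is there only because the JSON sits inside single quotes. One way to avoid that escaping, sketched here under the assumption that `--input` takes the same JSON string regardless of how the shell quotes it, is to build the payload from a quoted here-document. The function name `happyhorse_payload` and the `$OUT_DIR` variable are illustrative, and the elided image URL is kept as in the example above.

```shell
# Hypothetical helper: emit the JSON body from a quoted here-document,
# so apostrophes and dollar signs in the prompt need no shell escaping.
happyhorse_payload() {
cat <<'JSON'
{
  "image_url": "https://.../portrait.jpg",
  "prompt": "Gentle camera drift around the subject's face, subtle breathing motion, identity-stable features, soft natural light."
}
JSON
}

payload=$(happyhorse_payload)
printf '%s\n' "$payload"

# Then pass it as a single quoted argument (OUT_DIR is an assumed variable):
# runcomfy run happyhorse/happyhorse-1-0/image-to-video --input "$payload" --output-dir "$OUT_DIR"
```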

Prompting tips

  • Lead with motion verbs: "drift", "dolly in", "orbit", "tilt up", "reveal", "blink", "breathe". Front-load what's MOVING.
  • Don't restate the image — the model sees it. Focus tokens on what changes.
  • Preservation goals explicit: "identity-stable features", "packaging unchanged", "background geometry stable".
  • Lighting evolution: "rim light intensifying", "shadows shortening as camera rises".
  • One beat per clip — single primary motion (orbit OR dolly OR tilt OR character action).

Route 2: Wan 2.7 + audio_url — when the user has a custom voiceover

Model: wan-ai/wan-2-7/text-to-video (NOT /image-to-video — Wan 2.7's t2v endpoint accepts an audio_url that drives lip-sync)

Note on i2v with Wan 2.7: Wan 2.7's primary i2v animation isn't on a dedicated endpoint here. For pure i2v (image animated by motion prompt only), prefer HappyHorse i2v. Use Wan 2.7 specifically when the user has a custom audio track they want lip-synced to a generated talking-head clip.

Schema (Wan 2.7 t2v with audio)

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| prompt | string | yes | | Up to ~5000 chars. Describe the talking-head shot: framing, lighting, motion. |
| audio_url | string | yes (for lip-sync) | | WAV/MP3, 3–30s, ≤15MB. Drives lip-sync. |
| aspect_ratio | enum | no | 16:9 | 16:9, 9:16, 1:1, 4:3, 3:4. |
| resolution | enum | no | 1080p | 720p or 1080p. |
| duration | enum | no | 5 | 2–15 (whole seconds). Match your audio length. |
| negative_prompt | string | no | | Concrete issues to avoid (e.g. "no subtitles, no flicker"). |
| seed | int | no | | Reproducibility. |

Invoke

runcomfy run wan-ai/wan-2-7/text-to-video \
  --input '{
    "prompt": "Medium close-up of a confident spokesperson in a softly-lit recording booth, leaning slightly toward the camera, locked tripod, shallow DOF, warm key light from camera-left.",
    "audio_url": "https://.../voiceover-en.mp3",
    "duration": 12,
    "aspect_ratio": "9:16"
  }' \
  --output-dir <absolute/path>

Prompting tips

  • Describe the talking-head shot — framing, lighting, lens feel. The audio drives the lip-sync; the prompt builds the visual frame around it.
  • Match duration to audio length — clip will be silent past the audio if too long.
  • Use negative_prompt for issues: "no subtitles, no flicker, no distorted hands".
  • For multi-language dubs — same prompt, swap audio_url per call. Lock seed for visual consistency across languages.
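Matching `duration` to the audio length can be sketched as a small calculation. `wan_duration` below is a hypothetical helper that rounds the audio length up to a whole second and clamps it to the 2–15 range from the schema. Rounding up rather than down is an assumption made so that no audio is cut off, and audio longer than 15 seconds simply hits the cap.

```shell
# Hypothetical helper: wan_duration AUDIO_SECONDS
# Rounds up to whole seconds, then clamps to the 2-15 range from the schema.
wan_duration() {
  awk -v s="$1" 'BEGIN {
    d = int(s)
    if (d < s) d++          # round up fractional seconds
    if (d < 2) d = 2        # schema minimum
    if (d > 15) d = 15      # schema maximum
    print d
  }'
}

wan_duration 11.4
```

For multi-language dubs, the same calculation would be repeated per audio file, since each language's track may run to a different length.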

Route 3: Seedance 2.0 Pro — multi-modal animation (image + ref video + ref audio)

Model: bytedance/seedance-v2/pro

Use when the user wants a single clip that combines: a subject image + scene from a reference video + voice tone from a reference audio.

Schema (Seedance 2.0 Pro, i2v-relevant fields)

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| prompt | string | yes | | CN ≤500 chars OR EN ≤1000 words. |
| image_url | array | yes (for i2v) | [] | 0–9 images. First is the primary subject. |
| video_url | array | no | [] | 0–3 reference clips (MP4/MOV), 2–15s each. |
| audio_url | array | no | [] | 0–3 reference audio (WAV/MP3), 2–15s, < 15MB each. |
| aspect_ratio | enum | no | adaptive | adaptive, 16:9, 9:16, 4:3, 3:4, 1:1, 21:9. |
| duration | int | no | 5 | 4–15 (whole seconds). |
| resolution | enum | no | 720p | 480p or 720p. |
| generate_audio | bool | no | true | In-pass synchronized speech / SFX / music. |
| seed | int | no | | Reproducibility. |

Invoke

runcomfy run bytedance/seedance-v2/pro \
  --input '{
    "prompt": "Subject from image 1 walks through the café in video 1, voice tone matches audio 1. Medium close-up, slow push-in, warm light, gentle ambience.",
    "image_url": ["https://.../subject.jpg"],
    "video_url": ["https://.../cafe-locked-shot.mp4"],
    "audio_url": ["https://.../voice-tone.mp3"],
    "duration": 8
  }' \
  --output-dir <absolute/path>

Prompting tips

  • Image vs text division — use image_url for what must stay stable (face, costume, brand); use prompt for what should evolve (action, mood, lighting).
  • Number the refs in the prompt: "subject from image 1, lighting from video 1, voice from audio 1". Seedance routes cues correctly.
  • Reference media specs — videos / audio must be 2–15s; audio < 15MB.
  • Don't mix radically different aesthetics — if image 1 is a watercolor and video 1 is photoreal, output drifts.
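Before calling Seedance it may be worth confirming that the reference counts fit the schema. The sketch below assumes a hypothetical helper, `check_seedance_refs`, which takes the number of images, videos, and audio files. It requires at least one image because this skill uses Seedance for i2v, and it does not check durations or file sizes.

```shell
# Hypothetical helper: check_seedance_refs N_IMAGES N_VIDEOS N_AUDIOS
# Limits taken from the schema: up to 9 images, 3 videos, 3 audio files.
check_seedance_refs() {
  images=$1; videos=$2; audios=$3
  if [ "$images" -lt 1 ] || [ "$images" -gt 9 ]; then
    echo "image_url needs 1-9 entries for i2v"; return 1
  fi
  if [ "$videos" -gt 3 ]; then
    echo "video_url allows at most 3 clips"; return 1
  fi
  if [ "$audios" -gt 3 ]; then
    echo "audio_url allows at most 3 files"; return 1
  fi
  echo "ok"
}

check_seedance_refs 1 1 1
```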

Limitations

  • Each route inherits its model's limits. HappyHorse: 15s cap, output aspect = input aspect. Wan 2.7: 15s cap, audio 3–30s/15MB. Seedance: 720p ceiling on this template, 15s cap.
  • No multi-route blending. This skill picks one model per call. If the user wants HappyHorse animation + Wan-style lip-sync in the same clip, that's two calls + a stitch (out of scope here).
  • Brand-specific overrides — if the user named a specific model variant not listed (e.g. Wan 2.6, Seedance 1.5), route to the corresponding brand skill (wan-2-7, seedance-v2) instead of forcing it through here.

Exit codes

| code | meaning |
|---|---|
| 0 | success |
| 64 | bad CLI args |
| 65 | bad input JSON / schema mismatch |
| 69 | upstream 5xx |
| 75 | retryable: timeout / 429 |
| 77 | not signed in or token rejected |

Full reference: docs.runcomfy.com/cli/troubleshooting.
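Since only exit code 75 is marked retryable, a caller might wrap the invocation in a small retry loop. This is a sketch under assumptions: `run_with_retry` and the `RETRY_DELAY` variable are hypothetical names, the fixed delay is an arbitrary choice, and the function uses plain (global) shell variables for portability.

```shell
# Hypothetical wrapper: run_with_retry MAX_ATTEMPTS COMMAND [ARGS...]
# Retries only on exit code 75; any other code is returned immediately.
run_with_retry() {
  _max=$1; shift
  _attempt=1
  while :; do
    _code=0
    "$@" || _code=$?
    if [ "$_code" -ne 75 ] || [ "$_attempt" -ge "$_max" ]; then
      return "$_code"
    fi
    sleep "${RETRY_DELAY:-5}"   # assumed fixed delay between attempts
    _attempt=$((_attempt + 1))
  done
}

# Example use (payload and OUT_DIR are assumed to be set by the caller):
# run_with_retry 3 runcomfy run happyhorse/happyhorse-1-0/image-to-video \
#   --input "$payload" --output-dir "$OUT_DIR"
```

Codes such as 64, 65, and 77 are returned on the first attempt, since retrying bad arguments or a rejected token would not change the outcome.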

How it works

The skill picks one of HappyHorse 1.0 I2V / Wan 2.7 t2v+audio / Seedance 2.0 Pro based on user intent and invokes `runcomfy run <model_id>` with the matching JSON body. The CLI POSTs to the Model API, polls the request, fetches the result, and downloads any result URL under *.runcomfy.net or *.runcomfy.com into --output-dir. Ctrl-C cancels the remote request before exit.

Security & Privacy

  • Token storage: runcomfy login writes the API token to ~/.config/runcomfy/token.json with mode 0600 (owner-only read/write). Set RUNCOMFY_TOKEN env var to bypass the file entirely in CI / containers.
  • Input boundary: the user prompt is passed as a JSON string to the CLI via --input. The CLI does NOT shell-expand the prompt; it transmits the JSON body directly to the Model API over HTTPS. No shell injection surface from prompt content.
  • Third-party content: image / mask / video URLs you pass are fetched by the RunComfy model server, not by the CLI on your machine. Treat external URLs as untrusted; image-based prompt injection is a known risk for any image-edit / video-edit model.
  • Outbound endpoints: only model-api.runcomfy.net (request submission) and *.runcomfy.net / *.runcomfy.com (download whitelist for generated outputs). No telemetry, no callbacks.
  • Generated-file size cap: the CLI aborts any single download > 2 GiB to prevent disk-fill from a malicious or runaway model output.