ralphy generate is the single CLI gate for every model call. Image, video, voiceover, music, sfx, captions: all six sub-verbs land here, all log to the same JSONL, all update the same asset-manifest.json, and all version automatically on regen. Most of the time the agent drives these on your behalf during the one-beat-at-a-time intake loop. You’ll call them by hand when you want to regenerate a single slot, sweep variants, or preview cost before firing. The verb’s whole job is to be the choke point. No raw curl, no bunx tsx against a media API, no ffmpeg shells: every recipe lives behind a ralphy verb so the gen-log, the manifest, and the cost rollup all stay honest. That’s AGENTS invariant #2.

The six sub-verbs at a glance

| Sub-verb | Default model | Output | Per-call cost (typical) |
| --- | --- | --- | --- |
| generate image | google/gemini-3-pro-image-preview (multi-ref / character consistency); openai/gpt-5.4-image-2 for premium typography | PNG, slot-named | ~$0.04 |
| generate video | kwaivgi/kling-v3.0-pro | MP4, slot-named | $0.30 – $2.40 |
| generate voiceover | elevenlabs/eleven_multilingual_v2 | MP3, slot-named | ~$0.30 per 1k chars |
| generate music | ElevenLabs Music (instrumental default) | MP3, slot-named | ~$0.005/sec |
| generate sfx | ElevenLabs Sound Generation (≤22s) | MP3, slot-named | flat per call |
| generate captions | ElevenLabs Scribe v1 (word-level) | JSON Caption[], slot-named | ~$0.005 per minute |
Defaults are the live values from cli/commands/generate.ts as of v0.3.0. Always cross-check MODELS.md before assuming — Claude’s training is stale, and Ralphy reads MODELS.md before every call for a reason.
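As a rough illustration of how the per-call figures in the table add up, here is a minimal Python sketch of the budget arithmetic. The rates are the typical values quoted above, not live prices, and the function names are hypothetical; none of this is part of the ralphy CLI.

```python
# Typical rates copied from the table above; treat them as approximations.
IMAGE_USD_PER_CALL = 0.04
VOICEOVER_USD_PER_1K_CHARS = 0.30
MUSIC_USD_PER_SEC = 0.005
CAPTIONS_USD_PER_MIN = 0.005


def estimate_images(calls: int) -> float:
    """Cost of `calls` image generations at the typical per-call rate."""
    return calls * IMAGE_USD_PER_CALL


def estimate_voiceover(text: str) -> float:
    """Cost of one voiceover call, priced per 1,000 characters of text."""
    return len(text) / 1000 * VOICEOVER_USD_PER_1K_CHARS


def estimate_music(duration_sec: float) -> float:
    """Cost of a music bed, priced per second of output."""
    return duration_sec * MUSIC_USD_PER_SEC


def estimate_captions(audio_sec: float) -> float:
    """Cost of captioning, priced per minute of input audio."""
    return audio_sec / 60 * CAPTIONS_USD_PER_MIN


if __name__ == "__main__":
    total = (
        estimate_images(4)
        + estimate_voiceover("My name is Anna, and I brew the best coffee in town.")
        + estimate_music(30)
        + estimate_captions(30)
    )
    print(f"rough total: ${total:.2f}")
```

For anything that matters, the number reported by --dry-run takes precedence over this kind of back-of-envelope estimate.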

When the agent drives, when you drive

Most generation happens implicitly during the intake loop. Ralphy generates the location-master-plate, the persona masters, then scene anchors one at a time, surfacing each to you for approval. You don’t type the commands; you say “go” and Ralphy fires them. The CLI invocation is identical either way — what changes is who’s at the keyboard. You’ll run ralphy generate yourself in these cases:
  • Regenerate one slot after a miss: “scene-03 looks off, try seedance”.
  • Sweep variants: --variants 4 on an image to compare A/B/C/D in parallel.
  • Preview cost before firing a long video: --dry-run returns the resolved request and the cost estimate without spending money.
  • Model swap: pass --model <id> to override the default for a single call.
  • Recovery mid-batch: one slot of a batch failed; rerun just that slot.

Slot IDs and the manifest

Every generated file lands in a slot, and the slot id is the file name’s prefix. The convention is {scene-id}-{type}-{descriptor}:
scene-01-bg-image      → workspace/projects/<id>/assets/scene-01-bg-image.png
scene-01-vid           → workspace/projects/<id>/assets/scene-01-vid.mp4
scene-01-vo-primary    → workspace/projects/<id>/assets/scene-01-vo-primary.mp3
bed-01                 → workspace/projects/<id>/assets/bed-01.mp3
The slot id is canonical kebab-case. Pass anything close (uppercase, underscores) and Ralphy auto-normalizes with a stderr warning so you learn the canonical form for next time. The asset-manifest.json at the project root tracks every slot:
{
  "slots": {
    "scene-01-bg-image": {
      "kind": "image",
      "path": "workspace/projects/syrup-001/assets/scene-01-bg-image.png",
      "model": "google/gemini-3-pro-image-preview",
      "costUsd": 0.04,
      "generatedAt": "2026-05-19T14:32:11.483Z"
    }
  }
}
Read it any time with ralphy project show <id> --assets. Detail in Asset manifest.
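To make the slot-id normalization and manifest lookup concrete, here is a small Python sketch. It assumes the manifest has the shape shown above. The normalization rules (lowercase, underscores and whitespace to hyphens) are a guess based on the description, not Ralphy's actual code, and both function names are hypothetical.

```python
import json
import re
from typing import Optional


def normalize_slot_id(raw: str) -> str:
    """Return a kebab-case slot id.

    Assumed rules: lowercase, underscores/whitespace become hyphens,
    repeated hyphens collapse. The real CLI may apply different rules.
    """
    slot = re.sub(r"[_\s]+", "-", raw.strip().lower())
    return re.sub(r"-{2,}", "-", slot).strip("-")


def lookup_slot(manifest_text: str, raw_slot: str) -> Optional[dict]:
    """Return the manifest entry for a slot, or None if it is not recorded."""
    manifest = json.loads(manifest_text)
    return manifest.get("slots", {}).get(normalize_slot_id(raw_slot))


# Trimmed copy of the manifest example above.
MANIFEST = """
{
  "slots": {
    "scene-01-bg-image": {
      "kind": "image",
      "model": "google/gemini-3-pro-image-preview",
      "costUsd": 0.04
    }
  }
}
"""

entry = lookup_slot(MANIFEST, "Scene_01_BG_Image")
print(entry["kind"], entry["costUsd"])
```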

Versioning on regen

When you regenerate a slot that already exists, the new file lands at .v2.<ext>, then .v3, .v4, and so on. The existing file is preserved unchanged. The manifest tracks both; only “promoting” a chosen variant on your explicit say-so flips the manifest pointer to a new winner.
assets/scene-03-vid.mp4       ← original (v1, the manifest pointer)
assets/scene-03-vid.v2.mp4    ← first regen
assets/scene-03-vid.v3.mp4    ← second regen
This is AGENTS invariant #13. The --force-overwrite flag bypasses it and writes in place — you almost never want this. Detail in Reviewing and iterating.
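The naming rule can be sketched in a few lines of Python. This is an illustration of the .v2/.v3 convention described above, written under the assumption that the next version is chosen by scanning the assets directory; the helper name is hypothetical and this is not how Ralphy is necessarily implemented.

```python
import re
from pathlib import Path


def next_version_path(assets_dir: Path, slot: str, ext: str) -> Path:
    """Pick the path a new generation of `slot` would be written to.

    First generation -> <slot>.<ext>; later ones -> <slot>.v2.<ext>, .v3, ...
    Existing files are never overwritten.
    """
    original = assets_dir / f"{slot}.{ext}"
    if not original.exists():
        return original
    pattern = re.compile(rf"^{re.escape(slot)}\.v(\d+)\.{re.escape(ext)}$")
    versions = [
        int(match.group(1))
        for path in assets_dir.iterdir()
        if (match := pattern.match(path.name))
    ]
    # The unversioned original counts as v1, so the first regen is v2.
    return assets_dir / f"{slot}.v{max(versions, default=1) + 1}.{ext}"
```

Called three times in a row on an empty directory (creating each returned file in between), this yields scene-03-vid.mp4, scene-03-vid.v2.mp4, and scene-03-vid.v3.mp4, matching the listing above.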

Image — generate image

The bread-and-butter still gen.
ralphy generate image \
  --project syrup-001 \
  --slot scene-02-bg \
  --prompt "Anna pours syrup into iced coffee, kitchen counter, autumn light" \
  --ref workspace/projects/syrup-001/assets/product-master.png \
  --ref workspace/projects/syrup-001/assets/persona-master.png \
  --size 1080x1920
--model
string
OpenRouter model id. Default google/gemini-3-pro-image-preview (multi-ref / character consistency, nano-banana-pro lineage, ~$0.15/image, tolerates ≥4 concurrent calls). Switch to openai/gpt-5.4-image-2 for premium typography on labels and hero product shots where the wordmark must read crisp (~$0.20/image, caps at 1 concurrent).
--ref
string[]
Reference image(s). URL, local path, or data: URI. Local paths auto-convert to data: URI. Repeat the flag to pass multiple refs.
--size
string
Size hint, default 1080x1920. Passed as prompt-level guidance — gemini and gpt image models don’t accept exact pixel dimensions and round to their natural sizes.
--variants
number
Generate N parallel variants. Writes <slot>-v1.png through <slot>-vN.png. Capped at 8.
--negative
string
Negative prompt — what the image should not contain.
--dry-run
boolean
Print the resolved request and the cost estimate; do not submit. Always free.
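The --ref handling described above (URLs and data: URIs pass through, local paths are converted) can be approximated with the Python standard library. This is a sketch of the general technique, not Ralphy's code; the function name to_ref is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def to_ref(value: str) -> str:
    """Return `value` unchanged if it is already a URL or data: URI;
    otherwise read the local file and encode it as a base64 data: URI."""
    if value.startswith(("http://", "https://", "data:")):
        return value
    path = Path(value)
    # Guess the MIME type from the file extension, e.g. .png -> image/png.
    mime, _ = mimetypes.guess_type(path.name)
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{payload}"
```

Base64 inflates the payload by roughly a third, so large reference images make for noticeably larger requests than passing a URL.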

Video — generate video

The expensive call. Always --dry-run first if you’re not sure.
ralphy generate video \
  --project syrup-001 \
  --slot scene-02-vid \
  --prompt "Anna pours syrup, slow tilt-up to her smile, 35mm naturalistic" \
  --duration 5 \
  --first-frame workspace/projects/syrup-001/assets/scene-02-bg.png \
  --aspect-ratio 9:16 \
  --resolution 720p
--model
string
Default kwaivgi/kling-v3.0-pro. Switch to bytedance/seedance-2.0 for horror, POV, walking, jump-scares, or any non-default physics motion (per MODELS.md and the venom-bodywash postmortem).
--duration
number
required
Seconds. Per-model supported_durations may be discrete — hailuo accepts only 6 and 10. Run ralphy models show <id> to see the whitelist.
--first-frame
string
Anchor image for image-to-video. URL, local path, or data: URI. Strongly recommended for portrait orientation when the prompt has wide-shot bias.
--audio
boolean
Enable model-native audio. Supported on veo-3.1, kling-v3.0-pro (EN only — accent slip and age drift on RU), seedance-2.0, and most modern i2v endpoints. See MODELS.md per-model audio column.
--dry-run
boolean
Validate params, print the resolved request and cost estimate; do not submit.
The CLI validates --duration, --aspect-ratio, --resolution, and --first-frame / --last-frame against the per-model whitelist from the OpenRouter catalog before submitting. If your params don’t fit the model, you get an E_VALIDATION_FAILED with the violated field and a suggestion. Override with --no-validate if you know what you’re doing.
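The shape of that pre-submit check can be sketched as follows. Everything here is illustrative: the whitelist is a hand-written placeholder (only the hailuo durations 6 and 10 come from the text above), the model id example/hailuo is made up, and the exception class is hypothetical apart from the E_VALIDATION_FAILED code.

```python
# Hypothetical whitelist. The real one comes from the OpenRouter catalog.
WHITELIST = {
    "example/hailuo": {
        "duration": [6, 10],
        "aspect_ratio": ["9:16", "16:9"],
        "resolution": ["720p", "1080p"],
    }
}


class ValidationFailed(Exception):
    """Carries the violated field and a suggested replacement value."""

    code = "E_VALIDATION_FAILED"

    def __init__(self, field, value, allowed):
        self.field = field
        self.suggestion = _suggest(value, allowed)
        super().__init__(
            f"{self.code}: {field}={value!r} not in {allowed}; "
            f"try {self.suggestion!r}"
        )


def _suggest(value, allowed):
    # For numbers, suggest the nearest allowed value; otherwise the first one.
    if isinstance(value, (int, float)):
        return min(allowed, key=lambda a: abs(a - value))
    return allowed[0]


def validate(model: str, **params) -> None:
    """Raise ValidationFailed on the first param outside the model's whitelist."""
    limits = WHITELIST[model]
    for field, value in params.items():
        allowed = limits.get(field)
        if allowed is not None and value not in allowed:
            raise ValidationFailed(field, value, allowed)


validate("example/hailuo", duration=6, aspect_ratio="9:16")  # passes silently
```

Failing fast on the client side is cheaper than discovering an unsupported duration after the provider has already rejected, or billed, the call.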

Voiceover — generate voiceover

ralphy generate voiceover \
  --project syrup-001 \
  --slot scene-01-vo \
  --voice elevenlabs-voice-xyz \
  --text "My name is Anna, and I brew the best coffee in town." \
  --stability 0.55 \
  --style 0.25
--voice
string
required
ElevenLabs voice id — a cloned voice or a library voice.
--text
string
required
VO text. RU or EN supported by the default eleven_multilingual_v2 model.
--stability
number
0–1, default 0.55. Lower = more variation (good for emotional / cinematic deliveries); higher = monotone (good for analog-horror PSA / robo-narrator).
--style
number
0–1, default 0.25. 0 = monotone broadcast register, 1 = full dramatic. The analog-horror postmortem documented style 0 + stability 0.5 as the cold-robo-female PSA register.

Music — generate music

ralphy generate music \
  --project syrup-001 \
  --slot bed-01 \
  --prompt "warm acoustic indie folk, 90 BPM, fingerpicked guitar, soft brushed kit" \
  --duration 30
Instrumental is the default. Pass --with-vocals if you actually want vocals — usually you don’t for ad work, since ElevenLabs Music’s ToS blocks named-artist references and the post-mix sidechain-duck under your voiceover sounds cleaner instrumental.

Captions — generate captions

ralphy generate captions \
  --project syrup-001 \
  --audio workspace/projects/syrup-001/assets/scene-01-vo.mp3 \
  --language ru
Output lands at workspace/projects/<id>/assets/captions/<slot>.json (default) — per-slot captions, not the legacy shared captions.json (which clobbered between calls in the noski and venom postmortems). Pass --legacy-output if you have scripts that grep the old path.

Cost preview with --dry-run

Every paid sub-verb supports --dry-run. The output names the model, the resolved request, the file Ralphy would write, and the cost estimate — all free.
ralphy generate video \
  --project syrup-001 \
  --slot scene-02-vid \
  --prompt "Anna pours syrup, slow tilt-up to her smile" \
  --duration 5 \
  --first-frame ./assets/scene-02-bg.png \
  --dry-run
{
  "dryRun": true,
  "model": "kwaivgi/kling-v3.0-pro",
  "slot": "scene-02-vid",
  "durationSec": 5,
  "aspectRatio": "9:16",
  "resolution": "720p",
  "firstFrame": "[ref-supplied]",
  "generateAudio": false,
  "estimatedCostUsd": 1.2
}
Always --dry-run before a 10-second kling call or a long ElevenLabs Music render.
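If you script around --dry-run, a budget gate is a natural next step. The sketch below assumes the dry-run report is JSON shaped like the sample above and that you have already captured it as a string; the function name within_budget is hypothetical.

```python
import json


def within_budget(dry_run_output: str, max_usd: float) -> bool:
    """Return True if the estimated cost in a dry-run report fits the budget."""
    report = json.loads(dry_run_output)
    if not report.get("dryRun"):
        raise ValueError("expected the output of a --dry-run call")
    return report["estimatedCostUsd"] <= max_usd


# Trimmed copy of the sample dry-run output above.
SAMPLE = """
{
  "dryRun": true,
  "model": "kwaivgi/kling-v3.0-pro",
  "slot": "scene-02-vid",
  "durationSec": 5,
  "estimatedCostUsd": 1.2
}
"""

print(within_budget(SAMPLE, max_usd=1.50))
```

Only when the gate passes would a wrapper go on to rerun the same command without --dry-run.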

The reference-required gate

If your slot is for a named real entity and no --ref is attached, the agent layer refuses before the CLI ever fires. If you genuinely want to skip the gate on a specific call, pass --no-ref-consent "<reason>" — non-empty string required. The CLI logs stage: "no-ref-consent" to user-prompts.jsonl. Detail: Brands, personas, refs.
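The decision described above reduces to a small piece of logic. The Python below is a sketch under stated assumptions: the return labels, the function names, and the reason field in the log record are all hypothetical; only the non-empty-reason rule and the stage value "no-ref-consent" come from the text.

```python
import json
from typing import Optional, Sequence


def ref_gate(
    named_real_entity: bool,
    refs: Sequence[str],
    no_ref_consent: Optional[str] = None,
) -> str:
    """Decide whether a generation call may proceed."""
    if not named_real_entity or refs:
        return "allow"
    if no_ref_consent is None:
        return "refuse"
    if not no_ref_consent.strip():
        raise ValueError("--no-ref-consent requires a non-empty reason")
    return "allow-with-consent"


def consent_log_line(reason: str) -> str:
    """Build one JSONL record for user-prompts.jsonl (field names assumed)."""
    return json.dumps({"stage": "no-ref-consent", "reason": reason})


print(ref_gate(True, [], None))
print(ref_gate(True, [], "stylized parody, no likeness needed"))
```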