Generating assets

ralphy generate is the single CLI gate for every model call. Image, video, voiceover, music, sfx, captions: all six sub-verbs land here, all log to the same JSONL, all update the same asset-manifest.json, and all version automatically on regen. Most of the time the agent drives these on your behalf during the one-beat-at-a-time intake loop. You’ll call them by hand when you want to regenerate a single slot, sweep variants, or preview cost before firing.

The verb’s whole job is to be the choke point. No raw curl, no bunx tsx against a media API, no ffmpeg shells: every recipe lives behind a ralphy verb so the gen-log, the manifest, and the cost rollup all stay honest. That’s AGENTS invariant #2.

The six sub-verbs at a glance

Sub-verb	Default model	Output	Per-call cost (typical)
`generate image`	`google/gemini-3-pro-image-preview` (multi-ref / character consistency); `openai/gpt-5.4-image-2` for premium typography	PNG, slot-named	~$0.04
`generate video`	`kwaivgi/kling-v3.0-pro`	MP4, slot-named	$0.30–$2.40
`generate voiceover`	`elevenlabs/eleven_multilingual_v2`	MP3, slot-named	~$0.30 per 1k chars
`generate music`	ElevenLabs Music (instrumental default)	MP3, slot-named	~$0.005/sec
`generate sfx`	ElevenLabs Sound Generation (≤22s)	MP3, slot-named	flat per call
`generate captions`	ElevenLabs Scribe v1 (word-level)	JSON `Caption[]`, slot-named	~$0.005 per minute

Defaults are the live values from cli/commands/generate.ts as of v0.3.0. Always cross-check MODELS.md before assuming — Claude’s training is stale, and Ralphy reads MODELS.md before every call for a reason.
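
To get a feel for how the typical rates in the table add up, here is a back-of-the-envelope sketch. The rates are copied from the table; the helper names are illustrative and not part of Ralphy, and the numbers are only as good as the "typical" column. Treat --dry-run as the real estimate.

```python
# Rough estimates from the "typical" rates in the table above.
# These helpers are illustrative only; they are not part of the ralphy CLI.

VOICEOVER_USD_PER_1K_CHARS = 0.30
MUSIC_USD_PER_SEC = 0.005
CAPTIONS_USD_PER_MIN = 0.005


def estimate_voiceover_usd(text: str) -> float:
    """Voiceover is priced per 1k characters of input text."""
    return len(text) / 1000 * VOICEOVER_USD_PER_1K_CHARS


def estimate_music_usd(duration_sec: float) -> float:
    """Music is priced per second of rendered audio."""
    return duration_sec * MUSIC_USD_PER_SEC


def estimate_captions_usd(audio_sec: float) -> float:
    """Captions are priced per minute of transcribed audio."""
    return audio_sec / 60 * CAPTIONS_USD_PER_MIN
```

At these rates a 30-second music bed comes to roughly $0.15, and a 1,000-character voiceover to roughly $0.30.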

When the agent drives, when you drive

Most generation happens implicitly during the intake loop. Ralphy generates the location-master-plate, the persona masters, then scene anchors one at a time, surfacing each to you for approval. You don’t type the commands; you say “go” and Ralphy fires them. The CLI invocation is identical either way — what changes is who’s at the keyboard. You’ll run ralphy generate yourself in these cases:

Regenerate one slot after a miss — “scene-03 looks off, try seedance”.
Sweep variants — --variants 4 on an image to compare A/B/C/D in parallel.
Preview cost before firing a long video — --dry-run returns the resolved request and the cost estimate without spending money.
Model swap — pass --model <id> to override the default for a single call.
Recovery mid-batch — one slot of a batch failed; rerun just that slot.
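
If you script these hand-driven cases, it can help to build the argument list in one place. The sketch below only assembles an argv from flags that appear on this page (--project, --slot, --prompt, --model, --variants, --dry-run); the function itself is hypothetical and nothing is executed.

```python
import shlex
from typing import List, Optional


def build_generate_argv(
    sub_verb: str,
    project: str,
    slot: str,
    prompt: str,
    model: Optional[str] = None,
    variants: Optional[int] = None,
    dry_run: bool = False,
) -> List[str]:
    """Assemble a `ralphy generate <sub-verb>` command line (hypothetical helper)."""
    argv = [
        "ralphy", "generate", sub_verb,
        "--project", project,
        "--slot", slot,
        "--prompt", prompt,
    ]
    if model is not None:
        argv += ["--model", model]          # single-call model override
    if variants is not None:
        argv += ["--variants", str(variants)]  # parallel variant sweep
    if dry_run:
        argv.append("--dry-run")            # preview cost, spend nothing
    return argv


# Example: a four-way image sweep, previewed before spending anything.
cmd = build_generate_argv(
    "image", "syrup-001", "scene-02-bg",
    "Anna pours syrup into iced coffee",
    variants=4, dry_run=True,
)
print(shlex.join(cmd))
```

Passing the list to subprocess.run (rather than a shell string) avoids quoting problems in prompts.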

Slot IDs and the manifest

Every generated file lands in a slot, and the slot id is the file name’s prefix. The convention is {scene-id}-{type}-{descriptor}:

scene-01-bg-image      → .ralphy/workspaces/default/projects/<id>/assets/scene-01-bg-image.png
scene-01-vid           → .ralphy/workspaces/default/projects/<id>/assets/scene-01-vid.mp4
scene-01-vo-primary    → .ralphy/workspaces/default/projects/<id>/assets/scene-01-vo-primary.mp3
bed-01                 → .ralphy/workspaces/default/projects/<id>/assets/bed-01.mp3

The slot id is canonical kebab-case. Pass anything close (uppercase, underscores) and Ralphy auto-normalizes with a stderr warning so you learn the canonical form for next time. The asset-manifest.json at the project root tracks every slot:

{
  "slots": {
    "scene-01-bg-image": {
      "kind": "image",
      "path": ".ralphy/workspaces/default/projects/syrup-001/assets/scene-01-bg-image.png",
      "model": "google/gemini-3-pro-image-preview",
      "costUsd": 0.04,
      "generatedAt": "2026-05-19T14:32:11.483Z"
    }
  }
}

Read it any time with ralphy project show <id> --assets. Detail in Asset manifest.
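
If you would rather read the manifest from a script, a minimal sketch looks like the following. It assumes the manifest has exactly the shape shown above, and the normalization rule (lowercase, underscores and spaces to hyphens) is a guess at what "auto-normalizes" means, not the CLI's actual code.

```python
import json
import re
import sys


def normalize_slot_id(raw: str) -> str:
    """Assumed rule: lowercase, underscores/whitespace become hyphens."""
    canonical = re.sub(r"[_\s]+", "-", raw.strip().lower())
    canonical = re.sub(r"-{2,}", "-", canonical).strip("-")
    if canonical != raw:
        print(f'warning: slot "{raw}" normalized to "{canonical}"', file=sys.stderr)
    return canonical


def slot_path(manifest: dict, raw_slot: str) -> str:
    """Look up the file path for a slot, tolerating sloppy slot ids."""
    return manifest["slots"][normalize_slot_id(raw_slot)]["path"]


def total_cost_usd(manifest: dict) -> float:
    """Sum costUsd across every slot in the manifest."""
    return sum(entry.get("costUsd", 0.0) for entry in manifest["slots"].values())


manifest = json.loads("""
{
  "slots": {
    "scene-01-bg-image": {
      "kind": "image",
      "path": ".ralphy/workspaces/default/projects/syrup-001/assets/scene-01-bg-image.png",
      "model": "google/gemini-3-pro-image-preview",
      "costUsd": 0.04,
      "generatedAt": "2026-05-19T14:32:11.483Z"
    }
  }
}
""")

print(slot_path(manifest, "Scene_01_BG_Image"))
print(total_cost_usd(manifest))
```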

Versioning on regen

When you regenerate a slot that already exists, the new file lands at .v2.<ext>, then .v3, .v4, and so on. The existing file is preserved unchanged. The manifest tracks both; only “promoting” a chosen variant on your explicit say-so flips the manifest pointer to a new winner.

assets/scene-03-vid.mp4       ← original (v1, the manifest pointer)
assets/scene-03-vid.v2.mp4    ← first regen
assets/scene-03-vid.v3.mp4    ← second regen

This is AGENTS invariant #13. The --force-overwrite flag bypasses it and writes in place — you almost never want this. Detail in Reviewing and iterating.
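
The naming rule above can be sketched as a small function that picks the next free version for a slot. This is an illustration of the .v2, .v3 pattern only; the function name and the idea of scanning a list of file names are assumptions, not how the CLI is known to do it.

```python
import re
from typing import Iterable


def next_version_name(slot: str, ext: str, existing: Iterable[str]) -> str:
    """Return the file name a regen would use, given names already in assets/.

    The unversioned file counts as v1; regens land at .v2, .v3, and so on.
    Existing files are never reused, so nothing is overwritten.
    """
    pattern = re.compile(
        rf"^{re.escape(slot)}(?:\.v(\d+))?\.{re.escape(ext)}$"
    )
    versions = []
    for name in existing:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)) if match.group(1) else 1)
    if not versions:
        return f"{slot}.{ext}"  # first generation for this slot
    return f"{slot}.v{max(versions) + 1}.{ext}"


assets = ["scene-03-vid.mp4", "scene-03-vid.v2.mp4", "scene-01-vid.mp4"]
print(next_version_name("scene-03-vid", "mp4", assets))
```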

Image — `generate image`

The bread-and-butter still gen.

ralphy generate image \
  --project syrup-001 \
  --slot scene-02-bg \
  --prompt "Anna pours syrup into iced coffee, kitchen counter, autumn light" \
  --ref .ralphy/workspaces/default/projects/syrup-001/assets/product-master.png \
  --ref .ralphy/workspaces/default/projects/syrup-001/assets/persona-master.png \
  --size 1080x1920

--model (string)

OpenRouter model id. Default google/gemini-3-pro-image-preview (multi-ref / character consistency, nano-banana-pro lineage, ~$0.15/image, tolerates ≥4 concurrent calls). Switch to `openai/gpt-5.4-image-2` for premium typography on labels and hero product shots where the wordmark must read crisp (~$0.20/image, caps at 1 concurrent).

--ref (string[])

Reference image(s). URL, local path, or data: URI. Local paths auto-convert to data: URI. Repeat the flag to pass multiple refs.

--size (string)

Size hint, default 1080x1920. Passed as prompt-level guidance, because gemini and gpt image models don’t accept exact pixel dimensions and round to their natural sizes.

--variants (number)

Generate N parallel variants. Writes <slot>-v1.png through <slot>-vN.png. Capped at 8.

Negative prompt (string)

What the image should not contain.

--dry-run (boolean)

Print the resolved request and the cost estimate; do not submit. Always free.
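
For a sense of what "local paths auto-convert to data: URI" involves, here is a minimal sketch of that conversion using only the Python standard library. It is not Ralphy's implementation; the function name and the fallback MIME type are assumptions.

```python
import base64
import mimetypes


def to_data_uri(ref: str) -> str:
    """Pass URLs and data: URIs through; inline local files as base64."""
    if ref.startswith(("http://", "https://", "data:")):
        return ref
    # Guess the MIME type from the file extension (e.g. .png -> image/png).
    mime = mimetypes.guess_type(ref)[0] or "application/octet-stream"
    with open(ref, "rb") as handle:
        payload = base64.b64encode(handle.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

Inlining keeps the request self-contained, at the price of a payload roughly a third larger than the file on disk.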

Video — `generate video`

The expensive call. Always --dry-run first if you’re not sure.

ralphy generate video \
  --project syrup-001 \
  --slot scene-02-vid \
  --prompt "Anna pours syrup, slow tilt-up to her smile, 35mm naturalistic" \
  --duration 5 \
  --first-frame .ralphy/workspaces/default/projects/syrup-001/assets/scene-02-bg.png \
  --aspect-ratio 9:16 \
  --resolution 720p

--model (string)

Default kwaivgi/kling-v3.0-pro. Switch to bytedance/seedance-2.0 for horror, POV, walking, jump-scares, or any non-default physics motion (per MODELS.md and the venom-bodywash postmortem).

--duration (number, required)

Seconds. Per-model supported_durations may be discrete; hailuo accepts only 6 and 10. Run ralphy models show <id> to see the whitelist.

--first-frame (string)

Anchor image for image-to-video. URL, local path, or data: URI. Strongly recommended for portrait orientation when the prompt has wide-shot bias.

Model-native audio (boolean)

Enable model-native audio. Supported on veo-3.1, kling-v3.0-pro (EN only; accent slip and age drift on RU), seedance-2.0, and most modern i2v endpoints. See MODELS.md per-model audio column.

--dry-run (boolean)

Validate params, print the resolved request and cost estimate; do not submit.

The CLI validates --duration, --aspect-ratio, --resolution, and --first-frame / --last-frame against the per-model whitelist from the OpenRouter catalog before submitting. If your params don’t fit the model, you get an E_VALIDATION_FAILED with the violated field and a suggestion. Override with --no-validate if you know what you’re doing.
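
The whitelist check described above can be pictured roughly like this. The sketch covers --duration only; the exception class, its fields, and the "nearest supported value" suggestion are assumptions made for illustration, and only the E_VALIDATION_FAILED code and the 6/10 hailuo durations come from this page.

```python
from typing import List


class ValidationFailed(Exception):
    """Stand-in for the CLI's E_VALIDATION_FAILED error (shape is assumed)."""

    code = "E_VALIDATION_FAILED"

    def __init__(self, field: str, message: str, suggestion: str):
        super().__init__(f"{self.code}: {field}: {message} ({suggestion})")
        self.field = field
        self.suggestion = suggestion


def validate_duration(duration: int, supported_durations: List[int]) -> None:
    """Raise if the duration is not in the model's discrete whitelist."""
    if duration in supported_durations:
        return
    # Suggest the closest supported value; ties resolve to the earlier entry.
    nearest = min(supported_durations, key=lambda d: abs(d - duration))
    raise ValidationFailed(
        "duration",
        f"{duration}s is not in {supported_durations}",
        f"try --duration {nearest}",
    )


try:
    validate_duration(5, [6, 10])  # hailuo-style whitelist from the text above
except ValidationFailed as err:
    print(err)
```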

Voiceover — `generate voiceover`

ralphy generate voiceover \
  --project syrup-001 \
  --slot scene-01-vo \
  --voice elevenlabs-voice-xyz \
  --text "My name is Anna, and I brew the best coffee in town." \
  --stability 0.55 \
  --style 0.25

--voice (string, required)

ElevenLabs voice id: a cloned voice or a library voice.

--text (string, required)

VO text. RU or EN supported by the default eleven_multilingual_v2 model.

--stability (number)

0–1, default 0.55. Lower = more variation (good for emotional / cinematic deliveries); higher = monotone (good for analog-horror PSA / robo-narrator).

--style (number)

0–1, default 0.25. 0 = monotone broadcast register, 1 = full dramatic. The analog-horror postmortem documented style 0 + stability 0.5 as the cold-robo-female PSA register.
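
If you keep voice settings in a script, a tiny container with the documented defaults and range makes presets easy to reuse. The class and preset names below are hypothetical; only the defaults (0.55 / 0.25), the 0 to 1 range, and the PSA values come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSettings:
    """Voiceover knobs with the defaults documented above."""

    stability: float = 0.55
    style: float = 0.25

    def __post_init__(self) -> None:
        # Both knobs are documented as 0 to 1; reject anything outside that.
        for name in ("stability", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"--{name} must be between 0 and 1, got {value}")

    def as_flags(self) -> list:
        """Render as CLI flags for `ralphy generate voiceover`."""
        return ["--stability", str(self.stability), "--style", str(self.style)]


# The cold PSA register from the analog-horror postmortem.
PSA_REGISTER = VoiceSettings(stability=0.5, style=0.0)
print(PSA_REGISTER.as_flags())
```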

Music — `generate music`

ralphy generate music \
  --project syrup-001 \
  --slot bed-01 \
  --prompt "warm acoustic indie folk, 90 BPM, fingerpicked guitar, soft brushed kit" \
  --duration 30

Instrumental is the default. Pass --with-vocals if you actually want vocals — usually you don’t for ad work, since ElevenLabs Music’s ToS blocks named-artist references and the post-mix sidechain-duck under your voiceover sounds cleaner instrumental.

Captions — `generate captions`

ralphy generate captions \
  --project syrup-001 \
  --audio .ralphy/workspaces/default/projects/syrup-001/assets/scene-01-vo.mp3 \
  --language ru

Output lands at .ralphy/workspaces/default/projects/<id>/artifacts/captions/<slot>.json (default) — per-slot captions, not the legacy shared captions.json (which clobbered between calls in the noski and venom postmortems). Pass --legacy-output if you have scripts that grep the old path.

Cost preview with `--dry-run`

Every paid sub-verb supports --dry-run. The output names the model, the resolved request, the file Ralphy would write, and the cost estimate — all free.

ralphy generate video \
  --project syrup-001 \
  --slot scene-02-vid \
  --prompt "Anna pours syrup, slow tilt-up to her smile" \
  --duration 5 \
  --first-frame ./assets/scene-02-bg.png \
  --dry-run

{
  "dryRun": true,
  "model": "kwaivgi/kling-v3.0-pro",
  "slot": "scene-02-vid",
  "durationSec": 5,
  "aspectRatio": "9:16",
  "resolution": "720p",
  "firstFrame": "[ref-supplied]",
  "generateAudio": false,
  "estimatedCostUsd": 1.2
}

Always --dry-run before a 10-second kling call or a long ElevenLabs Music render.
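
One way to make that habit mechanical is to gate the real call on the dry-run estimate. The sketch below assumes the dry-run output is the JSON shown above, captured as a string; the function name and the budget threshold are illustrative.

```python
import json


def within_budget(dry_run_output: str, budget_usd: float) -> bool:
    """Return True if the dry-run estimate fits the budget."""
    report = json.loads(dry_run_output)
    if not report.get("dryRun"):
        raise ValueError("expected output from a --dry-run call")
    return report["estimatedCostUsd"] <= budget_usd


sample = """
{
  "dryRun": true,
  "model": "kwaivgi/kling-v3.0-pro",
  "slot": "scene-02-vid",
  "durationSec": 5,
  "estimatedCostUsd": 1.2
}
"""

if within_budget(sample, budget_usd=2.00):
    print("ok to fire")
else:
    print("over budget; shorten the clip or swap models")
```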

The reference-required gate

If your slot is for a named real entity and no --ref is attached, the agent layer refuses before the CLI ever fires. If you genuinely want to skip the gate on a specific call, pass --no-ref-consent "<reason>" — non-empty string required. The CLI logs stage: "no-ref-consent" to user-prompts.jsonl. Detail: Brands, personas, refs.
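
In rough terms the gate behaves like the sketch below. It assumes the caller has already decided the slot depicts a named real entity; the function, the exception types, and every log field other than stage are hypothetical.

```python
import json
from typing import List, Optional


def check_ref_gate(
    slot: str,
    refs: List[str],
    no_ref_consent: Optional[str] = None,
) -> Optional[str]:
    """Return a JSONL line to log if the gate was skipped, else None.

    Raises when a named-entity slot has no refs and no valid consent reason.
    """
    if refs:
        return None  # refs attached, gate satisfied, nothing to log
    if no_ref_consent is None:
        raise PermissionError(f"{slot}: named real entity requires --ref")
    if not no_ref_consent.strip():
        raise ValueError("--no-ref-consent needs a non-empty reason")
    # Only "stage" is documented; "slot" and "reason" are illustrative fields.
    return json.dumps(
        {"stage": "no-ref-consent", "slot": slot, "reason": no_ref_consent}
    )


line = check_ref_gate("scene-02-bg", refs=[], no_ref_consent="stylized silhouette only")
print(line)
```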

Related

Reviewing and iterating — versioning, promoting a winner
Brands, personas, refs — the --ref flag and the gate
Rendering — what generate feeds into
CLI: generation verbs — every flag, every sub-verb
Prompt library — battle-tested prompt entries indexed by goal
MODELS.md — per-model pricing, lifecycle, parameters
cli/commands/generate.ts — source of truth
