You’ll type one sentence in chat, answer 3-5 clarifying questions, watch Ralphy generate scenes one at a time, and end with final.mp4 on disk. Target time: 5-8 minutes cold-start, per the perf targets. This page assumes you’ve already cloned ugc-cli and opened it in Claude Code — if not, run through Install first.
1

Open the project in Claude Code

From the directory you cloned into:
cd ugc-cli
claude
In chat, confirm the agent sees the routing table:
are you reading AGENTS.md?
The agent should confirm and list the playbooks it routes to (intake, researcher, scenarist, art-director, editor, producer). If it doesn’t, you’re probably in the wrong directory — pwd should end in /ugc-cli. On Cursor / Copilot / Codex, run ralphy skill install once — see Connect your editor.
2

Type your brief in chat

Drop a one-liner into chat. Be specific about platform, vibe, and POV — the agent uses this to pick a template and route the generation pipeline.
make a 15-second TikTok about my espresso bar, morning vibe, selfie POV
The agent matches the brief against the intake playbook and starts the protocol.
3

Answer the intake questions

The agent will ask 3-5 clarifying questions in a single turn. Expect something like:
  1. Target audience language? EN / RU / other. Drives audio pipeline choice (Kling --audio for EN, ElevenLabs for non-EN).
  2. Aspect? 9:16 TikTok (default), 16:9 YouTube, or 1:1.
  3. Brand / named entity? Anything that names a real person, brand, or IP triggers the reference-required gate. For a generic “my espresso bar”, you’re fine without refs.
  4. Duration / clip count? 15s is the safe default for a first render; the agent confirms.
  5. Hard constraints? Banned music, brand colors, must-have shots.
Answer briefly. Two or three sentences is enough.
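To make the intake answers concrete, here is a minimal Python sketch of how the five answers could be modelled. It is an illustration only, not Ralphy's internal schema: the class name, field names, and the pipeline labels `kling-audio` and `elevenlabs` are hypothetical. The defaults (9:16, 15s) and the two rules (EN uses Kling audio, named entities trigger the reference gate) come from the questions above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntakeAnswers:
    """Hypothetical container for the five intake answers."""
    language: str = "EN"            # target audience language
    aspect: str = "9:16"            # TikTok default from the intake list
    named_entities: List[str] = field(default_factory=list)  # real people, brands, IP
    duration_s: int = 15            # safe default for a first render
    constraints: List[str] = field(default_factory=list)     # banned music, brand colors, ...

    def audio_pipeline(self) -> str:
        # EN goes to Kling audio, anything else to ElevenLabs (labels are made up here).
        return "kling-audio" if self.language.strip().upper() == "EN" else "elevenlabs"

    def needs_references(self) -> bool:
        # Any named entity triggers the reference-required gate.
        return len(self.named_entities) > 0
```

For the espresso brief, `IntakeAnswers()` with all defaults picks the Kling audio path and needs no references, which matches the "generic espresso bar" case described in question 3.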
4

Agent picks a template

Before improvising, the agent runs:
ralphy template suggest "15s TikTok about espresso bar, morning vibe, selfie POV"
It surfaces the top-3 matches with one-line descriptions, then proposes one. For an espresso brief the agent will likely pick a creator-lifestyle vibe-reference template. Confirm with “go” or ask for a different angle.
The template encodes the postmortem-validated workflow for that vibe: scene count, model picks, prompt vocabulary. You get a head start instead of starting from blank.
5

Agent generates scenes one beat at a time

The agent creates the project (espresso-001) and starts generating scene by scene. For each scene it:
  1. Writes the prompt to workspace/projects/espresso-001/prompts.json.
  2. Calls ralphy generate image --scene scene-01 ... to make the background plate.
  3. Shows you the image, asks for OK or a variant.
  4. Calls ralphy generate video --scene scene-01 ... to animate it.
  5. Calls ralphy generate voiceover --scene scene-01 ... for the VO line.
Files land in workspace/projects/espresso-001/assets/. Every model call writes an entry to logs/generations.jsonl with the input, output, and cost.
You can reject a scene at any time. Ask for a variant or say “make scene-02 brighter” and the agent regenerates. The old version is preserved on disk as .scene-02-bg-image.v1.png; per AGENTS.md invariant #13, Ralphy never overwrites a generation.
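The never-overwrite rule can be pictured with a short Python sketch. Ralphy handles this itself, so you never need to run anything like it; the function names are hypothetical, and the only thing taken from this page is the hidden `.<name>.vN.<ext>` naming pattern. How Ralphy numbers later versions is an assumption here.

```python
from pathlib import Path
from typing import Optional


def next_archive_path(asset: Path) -> Path:
    """Return the first unused hidden '.<stem>.vN<suffix>' path beside the asset."""
    n = 1
    while True:
        candidate = asset.with_name(f".{asset.stem}.v{n}{asset.suffix}")
        if not candidate.exists():
            return candidate
        n += 1  # assumption: versions count upward from v1


def preserve_before_regenerate(asset: Path) -> Optional[Path]:
    """Move an existing asset aside so a new generation cannot overwrite it."""
    if not asset.exists():
        return None  # nothing to preserve on the first generation
    target = next_archive_path(asset)
    asset.rename(target)
    return target
```

Under these assumptions, regenerating `scene-02-bg-image.png` once leaves `.scene-02-bg-image.v1.png` next to the new file, and a second regeneration adds a `.v2` copy rather than replacing the first.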
6

Agent reviews the full sequence with you

After all scenes pass, the agent shows you the asset manifest and asks for a “go” before rendering. This is your last chance to swap a model, regenerate a shot, or change a VO line; once you say render, ffmpeg kicks in.
Sample manifest:
{
  "project_id": "espresso-001",
  "scenes": [
    { "id": "scene-01", "image": "scene-01-bg-image.png", "video": "scene-01-vid.mp4", "vo": "scene-01-vo.mp3", "duration": 5 },
    { "id": "scene-02", "image": "scene-02-bg-image.png", "video": "scene-02-vid.mp4", "vo": "scene-02-vo.mp3", "duration": 5 },
    { "id": "scene-03", "image": "scene-03-bg-image.png", "video": "scene-03-vid.mp4", "vo": "scene-03-vo.mp3", "duration": 5 }
  ],
  "music": "music-bed.mp3",
  "total_duration": 15
}
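If you want to sanity-check a manifest yourself before saying “go”, a small Python sketch like the one below can help. It is not part of Ralphy; the function name is hypothetical. The field names are taken from the sample manifest above, and the rule that total_duration equals the sum of scene durations is an assumption inferred from the sample (5 + 5 + 5 = 15).

```python
from typing import Any, Dict, List


def check_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means it looks consistent."""
    problems: List[str] = []
    scenes = manifest.get("scenes", [])

    # Assumption: total_duration should equal the sum of per-scene durations.
    summed = sum(scene.get("duration", 0) for scene in scenes)
    if summed != manifest.get("total_duration"):
        problems.append(
            f"scene durations sum to {summed}, "
            f"but total_duration is {manifest.get('total_duration')}"
        )

    # Every scene in the sample carries an image, a video, and a VO file.
    for scene in scenes:
        for key in ("image", "video", "vo"):
            if not scene.get(key):
                problems.append(f"{scene.get('id', '?')}: missing {key}")
    return problems
```

Load the manifest with `json.load` and pass the resulting dict in. For the sample above the function returns an empty list.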
7

Render

Say “render” in chat. The agent runs:
ralphy render espresso-001
HyperFrames rasterizes the composition headlessly, ffmpeg encodes the final mp4, and final.mp4 lands in workspace/projects/espresso-001/render/. The agent reports back with the file path, duration, file size, and total spend pulled from generations.jsonl.
Sample output:
✓ Rendered espresso-001
  → workspace/projects/espresso-001/render/final.mp4
  duration   15.0s
  size       4.2 MB
  spend      $0.87
  runtime    47s
8

Watch the mp4

Open the file:
open workspace/projects/espresso-001/render/final.mp4
Or drag it into QuickTime / VLC / your browser. That’s your first Ralphy video.
Total time, cold start. Per the perf targets, a single 15-second video from brief to mp4 should land in ≤ 8 minutes. Most of that is model latency (Kling video gen is ~30s/scene). If you blow past 12 minutes, ask the agent for a postmortem — something’s off.

When things go sideways

  • A scene looks wrong. Tell the agent “regenerate scene-02 with X different”. The old file stays on disk as .v1; you can always promote it back.
  • Cost is climbing fast. Ask “show me the spend so far”. The agent reads generations.jsonl and gives you a per-model rollup.
  • The render fails. Run ralphy doctor — ffmpeg or bun usually went missing. The error message points at the fix.
  • You hate the template. Run ralphy template list (or ask the agent), pick another, restart with ralphy template use <slug>.
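The per-model spend rollup mentioned above can also be reproduced by hand. The Python sketch below is a rough illustration, not the agent's actual code: this page only says each entry in generations.jsonl records the input, output, and cost, so the `model` and `cost` key names used here are assumptions. Check a real log line before relying on them.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def spend_by_model(lines: Iterable[str]) -> Dict[str, float]:
    """Sum the cost of each JSONL entry, grouped by model name."""
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        # Assumed keys: "model" and "cost". Real entries may name them differently.
        model = entry.get("model", "unknown")
        totals[model] += float(entry.get("cost", 0.0))
    return dict(totals)
```

To try it, open logs/generations.jsonl and pass the file object straight in, for example `spend_by_model(open("logs/generations.jsonl"))`, then print the resulting dict.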

Next

You have an mp4. Now read What just happened to understand which files Ralphy wrote and why — that’s the foundation for everything you’ll do next.