Models registry

MODELS.md at the repo root is the single source of truth for which models Ralphy uses, what each one costs, and what’s known to break. The agent reads it before every model call (AGENTS invariant #6) because Claude’s training is stale and OpenRouter versions drift silently. This page documents the file’s structure so contributors can add entries that read cleanly to both humans and the agent.

File shape

MODELS.md is plain Markdown. There is no separate machine schema — the file is the schema. Sections are stable and ordered:

# Models registry

> Last reviewed: YYYY-MM-DD.

## How to use this file
## Image generation
## Video generation (text-to-video + image-to-video)
## Tried-and-dropped (postmortem cross-reference)
## Voiceover (TTS)
## Music generation
## Audio transcription / Captions
## LLM (for skills and analytics)
## Out-of-scope / dropped
## When to update this file

Each modality section has the same internal layout:

A one-sentence statement of the endpoint contract (which file in cli/lib/providers/ handles the call).
A matrix table of supported models — one row per model.
A “When to pick which” decision table.
A “Lessons” or “Discovered breakage” numbered list, postmortem-cited.
An Avoid: bullet list.

The agent reads sections out of order. Each section must be self-contained.

The opening contract

The file’s opener pins the two-key rule and points at the freshness gate:

A short, opinionated list of the models we actually use. **Two API keys only:**
`OPENROUTER_API_KEY` for media / LLM / transcription, `ELEVENLABS_API_KEY` for voice
and music. Everything else is out of scope.

> **Last reviewed: 2026-05-08.** If this file is older than 30 days, re-check the
> models on OpenRouter — versions drift silently.

The Last reviewed line is the freshness signal. The session-start meta playbook checks this on the first call (docs/playbooks/meta.md rule 2). Bump the date in the same commit as any factual edit. Don’t bump without an edit — that defeats the gate.
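The freshness gate described above can be sketched as a small date check. This is a minimal illustration, not the playbook's actual implementation: the function names, the sample string, and the choice to treat a missing date as stale are all assumptions.

```typescript
// Hypothetical helper: not part of the Ralphy CLI. Names are illustrative.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Extract the date from a "Last reviewed: YYYY-MM-DD" line, or null if absent. */
function parseLastReviewed(markdown: string): Date | null {
  const match = /Last reviewed:\s*(\d{4})-(\d{2})-(\d{2})/.exec(markdown);
  if (match === null) return null;
  // Date.UTC takes a zero-based month.
  return new Date(Date.UTC(Number(match[1]), Number(match[2]) - 1, Number(match[3])));
}

/** True when the date is missing (assumption) or more than maxAgeDays old. */
function isRegistryStale(markdown: string, now: Date, maxAgeDays = 30): boolean {
  const reviewed = parseLastReviewed(markdown);
  if (reviewed === null) return true;
  const ageDays = (now.getTime() - reviewed.getTime()) / MS_PER_DAY;
  return ageDays > maxAgeDays;
}

const sampleOpener = "> **Last reviewed: 2026-05-08.** If this file is older than 30 days, re-check.";
console.log(isRegistryStale(sampleOpener, new Date(Date.UTC(2026, 4, 20)))); // 12 days old
console.log(isRegistryStale(sampleOpener, new Date(Date.UTC(2026, 6, 1)))); // 54 days old
```

Passing `now` as a parameter instead of calling `new Date()` inside the function keeps the check deterministic in tests.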

Per-modality matrix tables

Every modality section has a table with the same column order. Here is the canonical shape for image:

| Use case | Model | Price | Why |
|---|---|---|---|
| **Default — multi-ref / character consistency** | `google/gemini-3-pro-image-preview` | ~$0.15 / image | Best at holding face / wardrobe / product identity across 2-3 refs. Tolerates ≥4 concurrent calls. Nano-banana-pro lineage. Default since 2026-05-20. |
| **Premium typography / label accuracy** | `openai/gpt-5.4-image-2` | ~$0.20 / image | Best typography on labels, cleanest hero product shots where the wordmark must read crisp. Caps at 1 concurrent — serialize batches. |
| **Budget OpenAI** | `openai/gpt-5-image-mini` | ~$0.08 / image | Cheap iteration during prompt exploration. |
| **Cheapest viable** | `google/gemini-2.5-flash-image` | ~$0.02 / image | Smoke-test only — quality dip is visible. |

Column conventions:

Use case: one short phrase. The first row’s use case is always **Default — <bucket>** (bold). Subsequent rows narrow.
Model: full OpenRouter path in backticks (provider/model-id). Never shorten to “Kling” inside the cell.
Price: leading ~ for a ballpark figure; drop the ~ once the rate is empirically verified. Formats: ~$0.20 / image, $0.14 / sec, or subscription for ElevenLabs.
Why: one sentence. Lead with the verb. No marketing language.
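Because there is no validator script, the column conventions above are checked by reviewers. As a rough illustration of what they amount to, here is a hypothetical lint function for one matrix row. The regular expressions, including the set of characters allowed in a model id, are assumptions inferred from the examples on this page.

```typescript
// Hypothetical lint helper for one matrix row; not part of the Ralphy codebase.
interface MatrixRow {
  useCase: string;
  model: string;
  price: string;
  why: string;
}

// Full OpenRouter path in backticks: `provider/model-id` (character set assumed).
const MODEL_CELL = /^`[a-z0-9-]+\/[a-z0-9.-]+`$/;
// "~$0.20 / image", "$0.14 / sec", or the literal word "subscription".
const PRICE_CELL = /^(~?\$\d+(\.\d+)?\s\/\s\w+|subscription)$/;

/** Return a list of convention violations; an empty list means the row passes. */
function lintMatrixRow(row: MatrixRow, isFirstRow: boolean): string[] {
  const problems: string[] = [];
  if (!MODEL_CELL.test(row.model)) {
    problems.push("Model must be a backticked provider/model-id path");
  }
  if (!PRICE_CELL.test(row.price)) {
    problems.push("Price must look like ~$0.20 / image, $0.14 / sec, or subscription");
  }
  if (isFirstRow && !(row.useCase.startsWith("**Default") && row.useCase.endsWith("**"))) {
    problems.push("First row's use case must be bold and start with Default");
  }
  return problems;
}

console.log(
  lintMatrixRow({ useCase: "Budget", model: "Kling", price: "cheap", why: "Iterates fast." }, false),
);
```

The Why column is left unchecked here because "one sentence, lead with the verb" is a judgment call better left to review.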

Video has an extra matrix

Video gets a second table, the live catalog snapshot, because the OpenRouter (OR) catalog drives runtime validation in validateVideoParams() (cli/commands/generate.ts):

| Model | Durations (s) | Resolutions | Aspects | Frame anchors | $/sec billed |
|---|---|---|---|---|---|
| `kwaivgi/kling-v3.0-pro` | 3-15 | 720p | 9:16, 16:9, 1:1 | first + last | $0.14 ✓ |
| `google/veo-3.1-fast` | 4, 6, 8 | 720p, 1080p, 4K | 9:16, 16:9 | first + last | $0.14 ✓ |
| `bytedance/seedance-2.0` | 4-15 | 480p, 720p, 1080p | 7 aspects incl 21:9 | first + last | $0.14 ✓ |

Columns:

Durations (s): list discrete values (6, 10) or a closed range (3-15). Match the model’s supported_durations array from ralphy models list.
Resolutions: comma-list, lowest first. 720p, 1080p, 4K are the canonical labels.
Aspects: comma-list of W:H strings. 7 aspects is shorthand for “all seven supported aspects” — only use it for seedance.
Frame anchors: first only or first + last. Drives whether --last-frame is legal on this model.
$/sec billed: rate with ✓ when verified against actual OR billing. Without ✓ it’s a ballpark from the catalog — verify on first use and add the tick.
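To show how the Durations and Frame anchors cells could translate into checks, here is a small sketch. It is not the code in validateVideoParams(), which this page does not show; the function names are hypothetical, and treating a range as whole seconds only is an assumption.

```typescript
// Illustrative only: the real check lives in validateVideoParams(), whose
// implementation is not shown on this page. All names below are hypothetical.

/** Parse a Durations cell: "4, 6, 8" (discrete values) or "3-15" (closed range). */
function parseDurationsCell(cell: string): (seconds: number) => boolean {
  const range = /^(\d+)\s*-\s*(\d+)$/.exec(cell.trim());
  if (range !== null) {
    const min = Number(range[1]);
    const max = Number(range[2]);
    // Assumption: ranges admit whole seconds only, both ends inclusive.
    return (seconds: number) => Number.isInteger(seconds) && seconds >= min && seconds <= max;
  }
  const allowed = new Set(cell.split(",").map((part) => Number(part.trim())));
  return (seconds: number) => allowed.has(seconds);
}

/** --last-frame is only legal when the Frame anchors cell reads "first + last". */
function allowsLastFrame(frameAnchorsCell: string): boolean {
  return frameAnchorsCell.trim() === "first + last";
}

const klingDurationOk = parseDurationsCell("3-15");
const veoDurationOk = parseDurationsCell("4, 6, 8");
console.log(klingDurationOk(10), veoDurationOk(5), allowsLastFrame("first only"));
```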

Decision table

Every modality has a “When to pick which” table directly below the matrix. Keep it 4-8 rows. Each row is one bolded user need plus one model name. No edge cases — those go in the lessons section.

Lessons / discovered breakage

The numbered list under “Lessons from this session” or “Discovered breakage” is the most valuable section in the file. Each entry has a fixed shape:

1. **`kwaivgi/kling-v3.0-pro` rotates "wide" prompts inside the 9:16 container.**
   Phrases like *"wide overhead cityscape"* bias the model toward landscape
   composition; OR returns a 1080×1920 file but the content is laid out for 16:9.
   **Fix:** anchor with `--first-frame` `<portrait-image>` and rewrite the prompt
   with explicit vertical wording. The first-frame image overrides the model's
   compositional bias.

Required pieces:

Lead sentence in bold. Names the model and the surprising behaviour.
Body. What you observed, with concrete prompt fragments.
Fix: — the workaround. Always present; if there is no fix yet, write Fix: TBD — open issue #N.
Postmortem cite at end of paragraph (Postmortem: glitter-cream.). The agent uses this to pull the original session if it needs more context.

Numbering is append-only. When a new lesson lands, take the next integer and add it at the bottom — do not renumber. Postmortems that pre-date the list cross-reference by number, and renumbering invalidates the cross-refs.
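The required pieces and the append-only rule can be expressed as two small checks. This is a hypothetical sketch for illustration; it assumes each lesson starts with "N. " at the beginning of a line and that continuation lines are indented, as in the example entry above.

```typescript
// Hypothetical lint sketch; the repo has no validator script for MODELS.md.
interface LessonCheck {
  number: number;
  problems: string[];
}

/** Check each "N. ..." entry for a bold lead, a Fix: marker, and a Postmortem: cite. */
function checkLessons(listMarkdown: string): LessonCheck[] {
  // Split before every line that begins with "<digits>. ".
  const entries = listMarkdown.split(/\n(?=\d+\.\s)/).filter((entry) => entry.trim() !== "");
  return entries.map((entry) => {
    const problems: string[] = [];
    const head = /^(\d+)\.\s+(.*)/.exec(entry.trim());
    const leadLine = head === null ? "" : (head[2] ?? "");
    if (!leadLine.startsWith("**")) problems.push("lead sentence must be bold");
    if (!entry.includes("Fix:")) problems.push("missing Fix:");
    if (!entry.includes("Postmortem:")) problems.push("missing Postmortem: cite");
    return { number: head === null ? NaN : Number(head[1]), problems };
  });
}

/** Append-only numbering: a new lesson takes the highest existing number plus one. */
function nextLessonNumber(listMarkdown: string): number {
  const numbers = listMarkdown
    .split("\n")
    .map((line) => /^(\d+)\.\s/.exec(line))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => Number(m[1]));
  return numbers.length === 0 ? 1 : Math.max(...numbers) + 1;
}

const sampleLessons = [
  "1. **`a/b` does X.** Body text.",
  "   **Fix:** do Y. Postmortem: glitter-cream.",
  "2. `c/d` does Z. Body only.",
].join("\n");
console.log(checkLessons(sampleLessons), nextLessonNumber(sampleLessons));
```

Using the highest existing number rather than the entry count keeps numbering stable even if the list ever has gaps.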

Tried-and-dropped table

The cross-reference table at the bottom of the video section is the bridge between “this model” and “this postmortem”:

| Model | Context where it failed | Why | Postmortem |
|---|---|---|---|
| `bytedance/seedance-2.0` | photoreal-human i2v anchors | privacy filter `InputImageSensitiveContentDetected` | tokyo, noski, venom |
| `kwaivgi/kling-v3.0-pro --audio` | non-English VO | accent slip + voice-age drift | noski, venom |

Rows are append-only. When a model gets re-validated (e.g. a provider patch fixes the issue), do not delete the row — flag the fix in the “Why” cell (— mitigated 2026-05-19 by …) so future agents see the history.

Adding a new entry

1. Confirm the model is live on OpenRouter. Run ralphy models list (24h cached, refresh with --refresh) and grep for the model id.
2. Pick the section. Image / video / voice / music / transcription / LLM. If the modality is new, propose a new section in a separate PR.
3. Decide the use case. What problem does it solve that the current default does not? If the answer is “none”, the entry does not go in.
4. Insert a row in the matrix table. Use the column shape above. Price is ~$X.XX / unit until you have billed it once; then drop the ~.
5. Add a “When to pick which” line if the model owns a distinct niche.
6. If the model behaves surprisingly, add a numbered lesson with the postmortem cite.
7. Bump the Last reviewed date at the top of the file in the same commit.

There is no validator script — MODELS.md is human-curated prose. The discipline is enforced by code review and the cross-reference checks in the providers test suite (any model id mentioned in cli/lib/providers/media.ts should also appear in MODELS.md).
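A cross-reference check of this kind might look like the sketch below. It is not the providers test suite's code: the function names and the model-id regex are assumptions. The regex is deliberately naive, so it would also pick up any quoted string shaped like `scope/name` (for example, some import paths).

```typescript
// Hypothetical sketch of a cross-reference check. The model-id regex and the
// function names are assumptions, not the providers test suite's code.

/** Collect provider/model-id strings that appear inside quotes or backticks. */
function extractModelIds(text: string): Set<string> {
  const ids = new Set<string>();
  const pattern = /["'`]([a-z0-9-]+\/[a-z0-9][a-z0-9.-]*)["'`]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const id = match[1];
    if (id !== undefined) ids.add(id);
  }
  return ids;
}

/** Model ids used in provider code that the registry text never mentions. */
function findUndocumentedModels(providerSource: string, modelsMd: string): string[] {
  const documented = extractModelIds(modelsMd);
  return Array.from(extractModelIds(providerSource))
    .filter((id) => !documented.has(id))
    .sort();
}

const sampleProviderSource =
  "const DEFAULT_IMAGE = \"google/gemini-3-pro-image-preview\";\n" +
  "const BUDGET_IMAGE = 'openai/gpt-5-image-mini';";
const sampleRegistry = "| **Default** | `google/gemini-3-pro-image-preview` | ~$0.15 / image | Holds identity. |";
console.log(findUndocumentedModels(sampleProviderSource, sampleRegistry));
```

In a test, a non-empty result would fail the suite and name the ids that still need a MODELS.md entry.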

Lifecycle

On the first session in a new chat: check Last reviewed. Refresh if stale.
After every failure mode on a new model: add it to “Avoid” / “Lessons” with the reason and a postmortem cite.
When you change a default in a skill or script: sync it here in the same commit.
When you add a verb or flag to ralphy generate: sync the price / param notes.
At least once a month: re-check OR catalog drift, bump the date.

Why one file, not a JSON catalog

MODELS.md is prose because the agent reads it as prose. A JSON catalog would force the agent to render it back into prose at decision time, costing tokens and losing the “why” context. The OR catalog is the JSON source-of-truth for parameter validation (supported_durations, supported_resolutions); MODELS.md is the prose layer on top — the rationale, the lessons, the postmortem cross-references. If you need machine-readable data, parse the matrix tables — the columns are stable.
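Since the columns are stable, a matrix table can be turned into records with a few lines of code. The sketch below is a minimal parser under stated assumptions: rows are pipe-delimited, the second line is the `|---|` separator, and cells contain no escaped pipes. The function names are hypothetical and nothing like this ships with Ralphy as far as this page says.

```typescript
// Minimal Markdown table parser. Hypothetical helper; assumes a header row,
// a |---| separator row, and no escaped pipes inside cells.
type TableRow = Record<string, string>;

/** Split one "| a | b |" line into trimmed cell strings. */
function splitCells(line: string): string[] {
  return line
    .trim()
    .replace(/^\|/, "")
    .replace(/\|$/, "")
    .split("|")
    .map((cell) => cell.trim());
}

/** Parse a table into one record per data row, keyed by header text. */
function parseMatrixTable(tableMarkdown: string): TableRow[] {
  const lines = tableMarkdown.split("\n").filter((line) => line.trim().startsWith("|"));
  if (lines.length < 2) return [];
  const headers = splitCells(lines[0] ?? "");
  // lines[1] is the |---| separator; data rows start at index 2.
  return lines.slice(2).map((line) => {
    const cells = splitCells(line);
    const row: TableRow = {};
    headers.forEach((header, i) => {
      row[header] = cells[i] ?? "";
    });
    return row;
  });
}

const sampleTable = [
  "| Use case | Model | Price | Why |",
  "|---|---|---|---|",
  "| **Cheapest viable** | `google/gemini-2.5-flash-image` | ~$0.02 / image | Smoke-test only. |",
].join("\n");
console.log(parseMatrixTable(sampleTable));
```

Keying each record by header text rather than by position means a consumer keeps working if a column is ever appended to the right.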

Related

CLI: ralphy models (live OR catalog access).
MODELS.md (the file itself).
AGENTS.md, invariant #6 (read MODELS.md before every model call).
cli/lib/providers/media.ts (the single gateway for every media call).
