3a347c3b feat: multimodal prompt for generateImage/generateVideo (image-to-image, image-to-video) (#624)
* feat(ai): add imageInputs / videoInputs / audioInputs for image-conditioned generation (closes #618)
Adds optional `imageInputs`, `videoInputs`, and `audioInputs` to `generateImage()`
and `generateVideo()` for image-to-image, multi-reference, mask / inpaint,
image-to-video, and starting-frame flows. Each input part may carry a
`metadata.role` hint (`'reference' | 'mask' | 'control' | 'start_frame' |
'end_frame' | 'character'`) that adapters use to route to the provider-specific
field.
Provider behavior:
- OpenAI image: gpt-image-1 / -mini route to `images.edit()` (up to 16 input
images plus a mask); dall-e-2 routes to `images.edit()` with one source;
dall-e-3 throws.
- OpenAI video: Sora-2 / -pro accept a single `input_reference`; throws on >1.
- Gemini: native models receive inputs as multimodal `contents` parts; Imagen
throws (text-only).
- fal: 1 input → `image_url`, >1 → `image_urls`; metadata roles map to
`mask_url` / `control_image_url` / `reference_image_urls`; video adds
`start_image_url` / `end_image_url`. Interim mapping until the fal schemas
library lands.
- Grok, OpenRouter: throw with a link back to #618 (pending native Imagine API
rewrite and multimodal injection work respectively).
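A minimal sketch of the interim fal mapping described above, for illustration only. The `SketchImageInput` type and `mapFalImageInputs` function are hypothetical names, not the adapter's actual code. Treating unlabelled inputs as sources and sending `'reference'` inputs to `reference_image_urls` are assumptions read from the field list above.

```typescript
// Hypothetical shapes for illustration; not the real @tanstack/ai types.
type SketchRole =
  | 'reference' | 'mask' | 'control'
  | 'start_frame' | 'end_frame' | 'character'

interface SketchImageInput {
  url: string
  metadata?: { role?: SketchRole }
}

type SketchFalFields = Record<string, string | string[]>

function mapFalImageInputs(inputs: SketchImageInput[]): SketchFalFields {
  const fields: SketchFalFields = {}
  const sources: string[] = []
  const references: string[] = []

  for (const input of inputs) {
    switch (input.metadata?.role) {
      case 'mask':
        fields['mask_url'] = input.url
        break
      case 'control':
        fields['control_image_url'] = input.url
        break
      case 'reference':
        references.push(input.url)
        break
      default:
        // Assumption: any other (or missing) role counts as a plain source image.
        sources.push(input.url)
    }
  }

  // 1 source -> image_url, more than 1 -> image_urls.
  const [first, ...rest] = sources
  if (first !== undefined) {
    if (rest.length === 0) fields['image_url'] = first
    else fields['image_urls'] = sources
  }
  if (references.length > 0) fields['reference_image_urls'] = references
  return fields
}
```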
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: apply automated fixes
* feat(ai-fal): resolve image-input fields per endpoint from generated SDK type map
Replace the fal image-input field heuristic with a per-endpoint mapping
generated from @fal-ai/client's EndpointTypeMap (scripts/
generate-fal-image-field-map.ts, run via pnpm generate:fal-image-fields).
The committed artifact stores only the 362 endpoints whose field names
deviate from the defaults (e.g. nano-banana edit -> image_urls, Kling i2v
start frame -> image_url, Veo first-last-frame -> first_frame_url /
last_frame_url, Fooocus masks -> mask_image_url); the old heuristic
remains the fallback for endpoints newer than the installed SDK.
Safety rails: the generated file `satisfies`-checks every field name
against the SDK endpoint types (type-only, erased at runtime), and a unit
test hashes the installed endpoints.d.ts against the recorded hash so an
SDK bump without regeneration fails test:lib with the regen command.
Mappers are now typed: both return FalImageInputFields<TModel>, Pick'ed
from the endpoint's real input type via a generated field-name union.
Roles resolving to the same list field merge (source + reference on
nano-banana); colliding scalar fields throw instead of overwriting.
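The merge-or-throw rule can be sketched as follows. This is an illustrative reduction, not the mapper itself: `assignField` and `SketchFieldValue` are hypothetical names, and the error wording is invented.

```typescript
type SketchFieldValue = string | string[]

// Assign one resolved value to a field, merging lists and rejecting collisions.
function assignField(
  fields: Record<string, SketchFieldValue>,
  name: string,
  value: SketchFieldValue,
): void {
  const existing: SketchFieldValue | undefined = fields[name]

  if (existing === undefined) {
    // Copy arrays so later merges never mutate the caller's input.
    fields[name] = Array.isArray(value) ? [...value] : value
    return
  }
  if (Array.isArray(existing) && Array.isArray(value)) {
    // Two roles resolving to the same list field: merge in arrival order.
    existing.push(...value)
    return
  }
  // Anything else would silently overwrite an earlier input, so fail loudly.
  throw new Error(`Conflicting inputs for field "${name}"`)
}
```

For example, a source and a reference that both resolve to `image_urls` end up in one array, while two inputs that both resolve to `image_url` raise an error.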
Also fixes the remaining CI lint failures: duplicate @tanstack/ai import
and non-null assertion in ai-fal video.ts, switch-exhaustiveness errors
in image-inputs.ts (restructured away), and the non-null assertion in
ai-openai image.ts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(ai-grok,ai-openrouter): support imageInputs for image-conditioned generation
Grok: add the xAI Imagine API image models (grok-imagine-image,
grok-imagine-image-quality) to model-meta. With imageInputs they route to
xAI's JSON POST /v1/images/edits endpoint via direct fetch (the OpenAI
SDK's images.edit() sends multipart/form-data, which xAI rejects) — a
single input as image:{url}, 2-3 inputs as images:[...] referenceable in
the prompt as <IMAGE_0>/<IMAGE_1>; >3 inputs and mask/control roles throw.
Their generic `size` uses an aspectRatio_resolution template ('16:9_2k',
suffix optional), mirroring Gemini's native image models, and maps to the
Imagine aspect_ratio/resolution parameters on both the generate and edit
paths. grok-2-image-1212 stays text-to-image only with a clear error.
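As an illustration of the `aspectRatio_resolution` template, a parser along these lines would split `'16:9_2k'` into the two Imagine parameters. `parseSizeTemplate` is a hypothetical helper and the accepted character set is an assumption; the adapter's real validation may differ.

```typescript
interface SketchImagineSize {
  aspect_ratio: string
  resolution?: string
}

// Split "<w>:<h>" with an optional "_<resolution>" suffix, e.g. "16:9_2k" or "1:1".
function parseSizeTemplate(size: string): SketchImagineSize {
  const match = /^(\d+:\d+)(?:_(\w+))?$/.exec(size)
  if (match === null) {
    throw new Error(`Unsupported size "${size}"`)
  }
  const aspectRatio = match[1] ?? ''
  const resolution: string | undefined = match[2]
  return resolution === undefined
    ? { aspect_ratio: aspectRatio }
    : { aspect_ratio: aspectRatio, resolution }
}
```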
OpenRouter: imageInputs are injected as multimodal image_url content parts
alongside the prompt in the chat-completions message and forwarded to the
underlying image model.
Neither path fetches or base64-encodes URL sources in-process — URLs pass
through verbatim and are fetched by the provider; data sources become data
URIs. Bumps ai-grok and ai-openrouter to minor in the existing changeset.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: adapt #618 branch to the packages/ restructure and post-rebase API drift
- Move the generated fal image-field map and the generator's paths from
packages/typescript/ai-fal to packages/ai-fal (repo flattened the layout)
- Add gpt-image-2 to EDIT_MAX_IMAGES (new model on main; same 16-image
edit limit as the other gpt-image models)
- Map edit-path usage through buildImagesUsage to match the new TokenUsage
shape, and drop two now-unnecessary type assertions
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(ai): make prompt multimodal for generateImage/generateVideo, pass text through verbatim
Replace the imageInputs / videoInputs / audioInputs fields with a multimodal
prompt: string | MediaPromptPart[]. Part order is meaningful — natively
multimodal providers (Gemini, OpenRouter) receive parts in interleaved order;
named-field providers (OpenAI, fal, xAI) extract media parts via the new
resolveMediaPrompt() utility and flatten the text.
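For named-field providers, the extract-and-flatten step might look roughly like the sketch below. The part shapes and the `flattenPrompt` name are hypothetical stand-ins, not the real `MediaPromptPart` type or `resolveMediaPrompt()` signature, and joining text parts with a newline is an assumption.

```typescript
// Hypothetical part shapes for illustration only.
type SketchPart =
  | { type: 'text'; text: string }
  | { type: 'image'; url: string; metadata?: { role?: string } }

type SketchImagePart = Extract<SketchPart, { type: 'image' }>
type SketchPrompt = string | SketchPart[]

function flattenPrompt(prompt: SketchPrompt): { text: string; images: SketchImagePart[] } {
  // A plain string prompt carries no media and is passed through untouched.
  if (typeof prompt === 'string') return { text: prompt, images: [] }

  const texts: string[] = []
  const images: SketchImagePart[] = []
  for (const part of prompt) {
    if (part.type === 'text') texts.push(part.text)
    else images.push(part) // media keeps its original relative order
  }
  // Assumption: text parts are joined with a newline; each part's text is unmodified.
  return { text: texts.join('\n'), images }
}
```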
Zero magic: prompt text is always sent verbatim. The SDK never injects or
rewrites in-prompt referencing markers — users write each provider's own
convention (fal Kling/Seedance @Image1, OpenAI/FLUX.2 "image 1" prose, Gemini
content descriptions), now documented per provider in the media docs. An
earlier grok <IMAGE_n> auto-injection was removed after research showed the
convention is absent from xAI's official docs (images are addressed by
request order).
- Per-model compile-time prompt narrowing via TModelInputModalitiesByName
adapter generic (e.g. dall-e-3 / Imagen reject image parts as a type
error); fal modality maps are derived at the type level from the SDK's
endpoint input types
- metadata.tag added as an informational label (never read by adapters)
- Gemini now preserves true interleaving in contents; OpenRouter maps parts
1:1 onto chat content parts in order
Closes #618
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: address PR review findings for image/video input support
- openai: add gpt-image-2 to the editImages error message and JSDoc
(the model is edit-capable via EDIT_MAX_IMAGES but was omitted from
user-facing guidance); same fix in docs, SKILL.md, and the changeset
- openai: throw when the images.edit() response contains no usable
images (matching grok's guard) instead of resolving to { images: [] }
- openai: drop the unnecessary input_reference cast in the Sora
adapter — the SDK types the field, so assign directly
- fal: reject metadata.role 'mask'/'control' in the video mapper
instead of silently folding them into source frames
- docs: mark Veo role mappings as planned (no Veo adapter yet), note
the Gemini ~14-image limit is provider-side, bump samples to
gpt-image-2
- tests: cover the Gemini image-conditioned path (interleaved
contents, fileData vs inlineData vs fetch+inline, Imagen/video/audio
rejection), the Sora input_reference upload and guards (new file),
the fal video createVideoJob field assembly and audio guard, and the
openai empty-edit-response guard
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(ai-openai): throw on empty generateImages responses too
Same defect class as the editImages guard in the previous commit: the
text-to-image path silently resolved to { images: [] } when response
items had neither b64_json nor url. Surface it as an error instead.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: client-side multimodal prompts, e2e coverage, media example, fal field demotion
- ai-client: widen ImageGenerateInput.prompt / VideoGenerateInput.prompt
from string to MediaPrompt so useGenerateImage/useGenerateVideo can
carry image parts from the browser; re-export the MediaPrompt types
from @tanstack/ai/client
- ai-fal: demote media-conditioning fields (FalImageFieldName set plus
video_url/video_urls/reference_video_urls/audio_url) from required to
optional in FalImageProviderOptions / FalVideoProviderOptions — i2v
endpoints declare e.g. image_url as required, but with a multimodal
prompt the start frame arrives as a prompt part; modelOptions stays
available as the explicit escape hatch
- e2e: real coverage for image-to-image (OpenAI /v1/images/edits) and
image-to-video (Sora multipart /v1/videos with input_reference) — the
installed aimock 1.29 mocks both multipart endpoints, so the previous
"aimock can't mock this" empty provider sets were stale. New specs run
all three transports and assert via aimock's request journal that the
expected wire endpoint was hit. ImageGenUI/VideoGenUI gain a file
input, feature routing/fixtures/onVideo registration added, README
matrix updated
- examples/ts-react-media: ImageGenerator gains a multi-image reference
picker (Gemini native models); VideoGenerator sends the start frame as
a prompt part with role 'start_frame' instead of modelOptions URLs;
server functions narrow the wire prompt per model and throw on
unsupported part kinds instead of dropping them
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: address CodeRabbit review findings
- fal image/video: spread modelOptions after derived media fields so
explicit user overrides win (matches documented intent)
- openai video: validate effective size (size ?? modelOptions.size)
- generate-fal-image-field-map: run arity check for default-selected
fields too
- ts-react-media example: correct reference-image support comment
(Gemini multimodal models, not NanoBanana)
- e2e VideoGenUI: reject on malformed data URL instead of resolving ''
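The override order in the first bullet comes down to object-spread precedence: later spreads win. A minimal sketch, with `buildFalInput` as a hypothetical helper rather than the adapter's function:

```typescript
type SketchFalInput = Record<string, unknown>

// Derived media fields first, explicit modelOptions last, so user overrides win.
function buildFalInput(
  prompt: string,
  derivedMediaFields: SketchFalInput,
  modelOptions: SketchFalInput = {},
): SketchFalInput {
  return { prompt, ...derivedMediaFields, ...modelOptions }
}
```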
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(ai,ai-gemini): add Google Veo video adapter on the typed-duration contract (#634)
Restacked on 618-image-to-image-and-image-to-video-support to adopt the
multimodal MediaPrompt format, carrying a minimal additive port of the
#534 typed-duration contract:
- @tanstack/ai (non-breaking): VideoAdapter/BaseVideoAdapter gain a
TModelDurationByName generic (default Record<string, number> preserves
existing duration?: number typing), DurationOptions, snapToDurationOption,
and default availableDurations()/snapDuration() implementations.
generateVideo's duration is typed via VideoDurationForAdapter.
- @tanstack/ai-gemini: GeminiVideoAdapter over generateVideos /
getVideosOperation with per-model typed durations (Veo 3.x 4|6|8,
Veo 2 5|6|8 per current Veo docs), MediaPrompt image routing
(start_frame → image, end_frame → lastFrame, reference/character →
referenceImages), RAI filter surfacing, geminiVideo/createGeminiVideo
factories, and finalized Veo model-meta entries.
- E2E: gemini added to video-gen with a custom aimock mount for
:predictLongRunning + operations polling; all transports pass.
- Docs + media-generation skill updated for Veo (typed durations,
image-to-video role table).
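To illustrate what snapping to a typed duration could mean, the sketch below picks the closest allowed value. `snapToNearest` is a hypothetical function, not the `snapToDurationOption` added here; nearest-value semantics and the lower-value tie-break are assumptions.

```typescript
// Pick the allowed option closest to the requested duration.
// Assumption: on a tie, the option that appears first in the list wins.
function snapToNearest<T extends number>(
  requested: number,
  options: readonly [T, ...T[]],
): T {
  let best: T = options[0]
  for (const option of options) {
    if (Math.abs(option - requested) < Math.abs(best - requested)) {
      best = option
    }
  }
  return best
}

// Veo 3.x durations from the notes above.
const VEO3_DURATIONS = [4, 6, 8] as const
const snapped: 4 | 6 | 8 = snapToNearest(7, VEO3_DURATIONS)
```

The return type narrows to the literal union of the options, which is the kind of per-model typing the contract aims for.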
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>