docs: plan audio/video context support #669
Closes #668 Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Review: PR #669 —
Greptile Summary
This PR adds a design plan (
| Filename | Overview |
|---|---|
| plans/668/audio-video-context.md | New planning document describing AudioContext/VideoContext API design, canonical block schema, provider translation boundaries, legacy migration strategy, and test plan. No executable code changed. |
Sequence Diagram
sequenceDiagram
participant User as User Config
participant LLMText as LLMTextColumnConfig
participant CtxObj as AudioContext / VideoContext
participant Engine as ColumnGeneratorWithModel
participant Prompt as prompt_to_messages()
participant Facade as ModelFacade
participant OAI as OpenAICompatibleClient
participant Anth as AnthropicClient
participant Provider as Provider API
User->>LLMText: "multi_modal_context=[AudioContext(...), VideoContext(...), ImageContext(...)]"
Note over LLMText: Pre-validator injects modality=image for legacy dicts
LLMText->>CtxObj: discriminated-union validation via MultiModalContextT
Engine->>CtxObj: _build_multi_modal_context(record, base_path)
CtxObj-->>Engine: canonical blocks
Engine->>Prompt: pass canonical blocks + text prompt
Prompt-->>Facade: user message
Facade->>OAI: ChatCompletionRequest
OAI->>OAI: capability gate + translation
OAI->>Provider: provider payload
Facade->>Anth: ChatCompletionRequest
Anth->>Anth: capability gate
alt audio or video block
Anth-->>Facade: ProviderError.unsupported_capability
else image block
Anth->>Provider: Claude image block
end
Reviews (4): Last reviewed commit: "Merge branch 'main' into nmulepati/feat-..."
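The first two steps of the diagram (legacy dicts defaulting to `modality=image`, then dispatch on the modality tag) can be sketched roughly as follows. This is a minimal illustration, not the plan's implementation: plain dataclasses stand in for whatever validation layer backs `MultiModalContextT`, and the `column_name` field and the `parse_context` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical stand-ins for the context classes named in the diagram.
# The real classes may carry different fields; `column_name` is assumed.
@dataclass(frozen=True)
class ImageContext:
    column_name: str
    modality: str = "image"


@dataclass(frozen=True)
class AudioContext:
    column_name: str
    modality: str = "audio"


@dataclass(frozen=True)
class VideoContext:
    column_name: str
    modality: str = "video"


_CONTEXT_BY_MODALITY = {
    "image": ImageContext,
    "audio": AudioContext,
    "video": VideoContext,
}


def parse_context(raw: dict[str, Any]):
    """Build a context object from a config dict (hypothetical helper)."""
    data = dict(raw)  # copy so the caller's dict is not mutated
    # Legacy image-context dicts predate the tag, so default it to "image".
    modality = data.pop("modality", "image")
    try:
        cls = _CONTEXT_BY_MODALITY[modality]
    except KeyError:
        raise ValueError(f"unknown modality: {modality!r}") from None
    return cls(**data)
```

Under these assumptions, `parse_context({"column_name": "photo"})` yields an `ImageContext`, so existing configs keep working without an explicit tag.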
- OpenAI-compatible:
  - Translate canonical `image` blocks to `image_url`.
  - Translate canonical base64 `audio` blocks to `input_audio`.
  - Translate supported `video` blocks to provider-specific media content parts only when the provider route supports them.
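A rough sketch of the image and audio translations listed above, for illustration only. The canonical block shape here (keys `modality`, `url`, `data`, `format`) and the `to_openai_part` name are assumptions, since the plan's actual schema is not shown in this thread. The output shapes follow the `image_url` and `input_audio` content parts of the OpenAI chat completions format. Video is left unsupported because its translation is provider-specific.

```python
from typing import Any


def to_openai_part(block: dict[str, Any]) -> dict[str, Any]:
    """Translate one assumed canonical block into an OpenAI-style content part."""
    modality = block["modality"]

    if modality == "image":
        # The URL may be a remote URL or a base64 data URL.
        return {"type": "image_url", "image_url": {"url": block["url"]}}

    if modality == "audio":
        # Only base64 audio is translated, matching the bullet above.
        if "data" not in block:
            raise ValueError("audio blocks must carry base64 data")
        return {
            "type": "input_audio",
            "input_audio": {"data": block["data"], "format": block["format"]},
        }

    # Video (and anything else) needs a provider-specific route.
    raise ValueError(f"no OpenAI-compatible translation for {modality!r}")
```

In a real adapter the final branch would presumably raise the canonical unsupported-capability error rather than `ValueError`; the built-in exception keeps this sketch self-contained.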
I think this needs one more concrete capability gate before implementation. Today DataDesigner only knows broad adapter capabilities like chat/image/embedding, but this plan needs modality/source/format decisions, like whether this provider/model accepts audio URL or video base64. Without that, unsupported media will likely fall through to raw provider 400s instead of the canonical unsupported error the plan is aiming for.
Addressed in b152eca: added an adapter-level capability gate for modality, source type, and media type before transport, raising ProviderError.unsupported_capability / ProviderErrorKind.UNSUPPORTED_CAPABILITY for unsupported combinations.
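One way such a gate could look, as a sketch under stated assumptions. `ProviderError` and `ProviderErrorKind` are named in this thread, but the class bodies below are simplified stand-ins rather than DataDesigner's real definitions. The `SUPPORTED` table, its entries, and the `check_capability` helper are hypothetical.

```python
from enum import Enum


class ProviderErrorKind(Enum):
    UNSUPPORTED_CAPABILITY = "unsupported_capability"


class ProviderError(Exception):
    """Simplified stand-in for the canonical provider error."""

    def __init__(self, kind: ProviderErrorKind, message: str) -> None:
        super().__init__(message)
        self.kind = kind

    @classmethod
    def unsupported_capability(cls, message: str) -> "ProviderError":
        return cls(ProviderErrorKind.UNSUPPORTED_CAPABILITY, message)


# Hypothetical per-adapter support table: (modality, source type, media type).
SUPPORTED = frozenset(
    {
        ("image", "url", "image/png"),
        ("image", "base64", "image/png"),
        ("audio", "base64", "audio/wav"),
    }
)


def check_capability(
    modality: str,
    source: str,
    media_type: str,
    supported: frozenset = SUPPORTED,
) -> None:
    """Raise the canonical error before transport for unsupported media."""
    if (modality, source, media_type) not in supported:
        raise ProviderError.unsupported_capability(
            f"{modality} via {source} ({media_type}) is not supported by this adapter"
        )
```

Running the check before building the provider payload means an unsupported combination such as a video URL surfaces as `UNSUPPORTED_CAPABILITY` instead of a raw provider 400.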
Tighten the plan around legacy image-context migration, audio/video auto-detection, config-layer canonical blocks, capability gating, ImageColumnConfig scope, and single-PR implementation rollout. Refs #668
📋 Summary
Adds a design plan for supporting audio and video as multimodal context, following the existing image-context pattern. The plan keeps the issue scope limited to design work while clarifying config boundaries, provider translation responsibilities, and test coverage for a future implementation.
🔗 Related Issue
Closes #668
🔄 Changes
plans/668/audio-video-context.md with the proposed AudioContext/VideoContext API shape.
🧪 Testing
make test passes (not run; markdown-only planning change)
✅ Checklist
plans/668/)