Skip to content

docs: plan audio/video context support#669

Open
nabinchha wants to merge 4 commits into
mainfrom
nmulepati/feat-support-audio-video-context
Open

docs: plan audio/video context support#669
nabinchha wants to merge 4 commits into
mainfrom
nmulepati/feat-support-audio-video-context

Conversation

@nabinchha
Copy link
Copy Markdown
Contributor

📋 Summary

Adds a design plan for supporting audio and video as multimodal context, following the existing image-context pattern. The plan keeps the issue scope limited to design work while clarifying config boundaries, provider translation responsibilities, and test coverage for a future implementation.

🔗 Related Issue

Closes #668

🔄 Changes

  • Adds plans/668/audio-video-context.md with the proposed AudioContext / VideoContext API shape.
  • Clarifies that audio/video context values are URL or base64 only, with local path handling and provider file metadata out of scope for v1.
  • Documents provider translation boundaries, unsupported-provider behavior, implementation phases, and test strategy.

🧪 Testing

  • make test passes (not run; markdown-only planning change)
  • Unit tests added/updated (N/A; no testable logic)
  • E2E tests added/updated (N/A; no testable logic)
  • Commit hooks passed for markdown-safe checks

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (plan added under plans/668/)

Closes #668

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha requested a review from a team as a code owner May 15, 2026 21:24
@github-actions
Copy link
Copy Markdown
Contributor

Review: PR #669docs: plan audio/video context support

Summary

A plan-only PR that adds plans/668/audio-video-context.md (275 lines, no code). It proposes a declarative AudioContext / VideoContext API mirroring the existing ImageContext pattern, a canonical media-block schema as the boundary between config and adapters, provider translation rules, and a 3-PR rollout. The scope is constrained to media-as-input-context — no generation, no heavy deps in config import paths, no local-path handling for audio/video in v1.

Findings

Strengths

  • Architectural alignment is solid. The plan explicitly preserves the interface → engine → config import direction and the "declare, don't orchestrate" contract from AGENTS.md. The canonical media-block schema lives between config (descriptive) and adapters (provider-specific), which keeps OpenAI/Anthropic shapes from leaking back into config — consistent with the existing principle that "errors normalize at boundaries."
  • Conservative non-goals. Excluding ffmpeg/moviepy from config imports, deferring frame-extraction fallbacks, deferring the Responses API migration, and deferring provider file-upload lifecycle are all the right calls for v1 and prevent scope creep.
  • Clear adapter responsibility split. Image stays unchanged on Anthropic; audio/video raise canonical unsupported errors before transport. OpenAI-compatible owns image_url / input_audio translation. This is the right place for capability gating.
  • Good rollout atomicity. Three PRs (config foundation → engine/adapters → docs) match the package layering and let the config layer ship and be tested independently.
  • Risks section is honest about video provider variance, base64 size blowup, and the silent-download trap for remote URLs.

Gaps and ambiguities

  1. Discriminated-union migration of ImageContext is under-specified. Verified locally: ImageContext.modality is currently Modality = Modality.IMAGE (not Literal). Step 1 says to "change concrete context modality fields to Literal[...] values." That's a schema change on an existing public type. The plan should call out:

    • whether previously serialized configs (which may omit modality and rely on the default) still round-trip — Pydantic discriminated unions typically require the discriminator key to be present;
    • whether the column-config field type change from list[ImageContext] | None to list[MultiModalContextT] | None is a breaking type-narrowing for downstream type-checkers or plugin authors who annotated against ImageContext.

    A one-line backward-compat note in the Design Decisions table (alongside the existing "legacy image_url" entry) would close this.

  2. data_type=None auto-detection behavior for audio/video is unspecified. ImageContext auto-detects URL vs base64 vs local path when data_type is omitted. The plan says audio/video are URL-or-base64 only, but doesn't say whether data_type is required for the new contexts or whether auto-detection is supported (URL vs base64). Either choice is fine, but the plan should pick one — otherwise implementers will diverge. Test Plan mentions "explicit base64 requires an explicit format" but not the data_type=None case.

  3. Canonical error type isn't named. Step 5 and the Test Plan reference "canonical unsupported-capability provider error" but don't say whether this is an existing class in clients/errors.py or a new one (e.g., UnsupportedModalityError). Naming it in the plan avoids ad-hoc choices in PR 2.

  4. Missing: schema-export impact. DataDesigner exports JSON schema for configs (used by docs/UI). Switching multi_modal_context to a discriminated union changes the emitted schema shape. Worth a one-liner in the Test Plan ("schema export snapshot updated and reviewed") or Risks.

  5. Step 2's "extract shared helpers" lacks a deprecation policy. The plan says to move reusable normalization out of image_helpers.py into media_helpers.py, but doesn't say whether the old image-helper public surface is preserved, deprecated, or re-exported. If anything imports from image_helpers.py outside config (or in plugins), this matters.

  6. MIME / format mapping is only sketched. The Canonical Block Shape gives audio/mpeg for mp3 and video/mp4 for mp4, but Step 2 just says "Add MIME helpers for audio/video formats." A small explicit table (format enum → media_type → OpenAI format value) would prevent inconsistency between Step 2 and Step 5.

  7. Open question worth surfacing: list-valued media columns. The plan supports list/JSON-list values (consistent with ImageContext), but for audio/video this can mean N base64 blobs in one user message. Provider per-message size limits and the order of multi-block flattening should be stated, even if just as "blocks are emitted in list order, no dedup."

Minor

  • The Mermaid diagram is correct but the engine-side label _build_multi_modal_context lives on ColumnGeneratorWithModel — fine as-is; the rendering will be tight on narrow viewers.
  • "OpenAI-compatible adapter forwards image_url blocks unchanged" in Current State is accurate; it might be worth noting in Step 5 that NIM and other OpenAI-compatible providers may not all accept input_audio, so the audio path may need a per-endpoint capability flag rather than a flat "OpenAI-compatible supports audio."
  • Provider links use stable docs URLs — good. Anthropic's vision URL pinned to platform.claude.com is the current canonical host.

Verdict

Approve with revisions. The plan is well-scoped, architecturally aligned, and rolls out cleanly across the three packages. Before PR 1 lands, the plan should be tightened on: (a) the ImageContext modality field migration and config backward compatibility, (b) data_type=None auto-detection policy for the new contexts, (c) the named canonical error class, and (d) explicit MIME / format mapping. None of these block the design itself; they're loose ends that will otherwise be re-litigated in the implementation PRs.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR adds a design plan (plans/668/audio-video-context.md) for extending DataDesigner's multimodal context support to include AudioContext and VideoContext, following the existing ImageContext pattern. It is a markdown-only change with no production code affected.

  • Proposes AudioContext / VideoContext config objects, a MultiModalContextT discriminated union for LLMTextColumnConfig, and a pre-validation migration for legacy image-only dict configs that omit the new modality field.
  • Defines a canonical block schema with adapter-level translation and capability gating; ImageColumnConfig.multi_modal_context intentionally stays list[ImageContext] | None.
  • Documents provider support boundaries: OpenAI-compatible adapters translate audio/video, Anthropic rejects audio/video with a canonical error in v1.

Confidence Score: 5/5

Safe to merge — markdown-only planning document with no production code changes.

The change is a single markdown file describing a design plan. The architecture is coherent with the existing codebase: the canonical block schema correctly uses only media_type, ImageColumnConfig.multi_modal_context stays image-only, and the legacy migration strategy mirrors existing patterns. No logic errors or correctness issues were found.

No files require special attention.

Important Files Changed

Filename Overview
plans/668/audio-video-context.md New planning document describing AudioContext/VideoContext API design, canonical block schema, provider translation boundaries, legacy migration strategy, and test plan. No executable code changed.

Sequence Diagram

sequenceDiagram
    participant User as User Config
    participant LLMText as LLMTextColumnConfig
    participant CtxObj as AudioContext / VideoContext
    participant Engine as ColumnGeneratorWithModel
    participant Prompt as prompt_to_messages()
    participant Facade as ModelFacade
    participant OAI as OpenAICompatibleClient
    participant Anth as AnthropicClient
    participant Provider as Provider API

    User->>LLMText: "multi_modal_context=[AudioContext(...), VideoContext(...), ImageContext(...)]"
    Note over LLMText: Pre-validator injects modality=image for legacy dicts
    LLMText->>CtxObj: discriminated-union validation via MultiModalContextT
    Engine->>CtxObj: _build_multi_modal_context(record, base_path)
    CtxObj-->>Engine: canonical blocks
    Engine->>Prompt: pass canonical blocks + text prompt
    Prompt-->>Facade: user message
    Facade->>OAI: ChatCompletionRequest
    OAI->>OAI: capability gate + translation
    OAI->>Provider: provider payload
    Facade->>Anth: ChatCompletionRequest
    Anth->>Anth: capability gate
    alt audio or video block
        Anth-->>Facade: ProviderError.unsupported_capability
    else image block
        Anth->>Provider: Claude image block
    end
Loading

Reviews (4): Last reviewed commit: "Merge branch 'main' into nmulepati/feat-..." | Re-trigger Greptile

Comment thread plans/668/audio-video-context.md
Comment thread plans/668/audio-video-context.md
- OpenAI-compatible:
- Translate canonical `image` blocks to `image_url`.
- Translate canonical base64 `audio` blocks to `input_audio`.
- Translate supported `video` blocks to provider-specific media content parts only when the provider route supports them.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs one more concrete capability gate before implementation. Today DataDesigner only knows broad adapter capabilities like chat/image/embedding, but this plan needs modality/source/format decisions, like whether this provider/model accepts audio URL or video base64. Without that, unsupported media will likely fall through to raw provider 400s instead of the canonical unsupported error the plan is aiming for.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in b152eca: added an adapter-level capability gate for modality, source type, and media type before transport, raising ProviderError.unsupported_capability / ProviderErrorKind.UNSUPPORTED_CAPABILITY for unsupported combinations.

Comment thread plans/668/audio-video-context.md Outdated
Comment thread plans/668/audio-video-context.md Outdated
Comment thread plans/668/audio-video-context.md
Comment thread plans/668/audio-video-context.md Outdated
nabinchha added 3 commits May 18, 2026 15:17
Tighten the plan around legacy image-context migration, audio/video auto-detection, config-layer canonical blocks, capability gating, ImageColumnConfig scope, and single-PR implementation rollout.

Refs #668
@nabinchha nabinchha requested a review from andreatgretel May 18, 2026 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Plan audio and video context support

2 participants