Agentic, transcript-grounded, on-brand metadata authoring #43

Description

@jdwit

Summary

Add agentic authoring of video metadata to ytstudio: let an agent (via the skills/ytstudio skill) draft and update titles, descriptions, and tags grounded in what was actually said in the video, and shaped by a per-channel brand voice. The CLI gains the raw materials (transcript/caption fetch, brand-voice storage) and the skill gains the workflow guidance. The agent does the writing, and the existing videos update path applies the change.

Motivation

Today the skill is great at mechanical bulk edits (videos search-replace, videos update), but writing good titles and descriptions is a creative, per-video task that an agent can do well only if it has two things it currently lacks:

  • the actual spoken content of the video (so the copy is accurate, not hallucinated), and
  • a consistent house style per channel (so output sounds on-brand instead of generic LLM voice).

With both available as first-class CLI/skill primitives, an agent can do grounded, on-brand metadata authoring at scale: backfill weak descriptions, rewrite titles for a topic, draft metadata for fresh uploads, all consistent across a channel.

Proposed scope

Three pillars. CLI shapes follow the existing ytstudio <group> <command> style, dry-run-by-default with --execute, and -o json for parseable output, matching src/ytstudio/commands/videos.py.

1. Transcript grounding (CLI provides the data)

A videos subcommand to pull the spoken content of a video so the agent can read it before writing:

ytstudio videos captions <video-id> -o json          # list available caption tracks
ytstudio videos transcript <video-id> [--lang nl] [-o text|json]   # download a track's text
  • Backed by the YouTube Data API captions.list (list tracks for a video) and captions.download (fetch the track body, e.g. tfmt=srt or sbv). These endpoints require the https://www.googleapis.com/auth/youtube.force-ssl scope, which this project already requests (see SCOPES in src/ytstudio/api.py). No new scope or re-consent is needed for caption read.
  • Output: transcript should default to a clean plain-text transcript with timestamps stripped. With -o json it also exposes the track metadata: id, language, trackKind (so the agent can tell standard tracks from ASR/auto-generated ones), and isDraft. captions lists the available tracks so the agent can pick a language and kind.
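  • Quota: caption reads are not ~1-unit reads. The YouTube Data API documentation lists captions.list at 50 units and captions.download at 200 units per call, so one transcript fetch costs about 250 units. The skill's Quota table should list both costs explicitly.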
  • Note on captions.download: the API only returns a track body the authenticated owner is allowed to download; for owned channels (the ytstudio use case) this is fine, but auto-generated (ASR) tracks can be restricted. The command should surface a clear message when a track exists but is not downloadable, rather than failing opaquely.
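The "timestamps stripped" default could be produced by a small post-processing step on the downloaded track body. The following is a minimal sketch, assuming the track was fetched with tfmt=srt; the function name srt_to_text is hypothetical and not part of the existing codebase.

```python
import re

# Matches an SRT cue timing line such as "00:00:01,000 --> 00:00:04,000".
_TIMING = re.compile(
    r"^\d{1,2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{1,2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Return the spoken text of an SRT document as one plain-text string.

    Hypothetical helper: for each cue, everything up to and including the
    timing line is dropped and the remaining text lines are kept.
    """
    text_lines = []
    normalized = srt.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for cue in re.split(r"\n\s*\n", normalized):
        lines = [line.strip() for line in cue.split("\n")]
        for i, line in enumerate(lines):
            if _TIMING.match(line):
                text_lines.extend(t for t in lines[i + 1:] if t)
                break
    return " ".join(text_lines)
```

Parsing per cue, rather than filtering out every all-digit line, keeps caption text that happens to be a bare number.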

Fallback when no caption track exists (out of scope here, flag only): transcribing the audio with an external ASR/STT tool (e.g. a local Whisper). That pulls in a heavy non-YouTube dependency and a media-download step, so it should be a separate follow-up issue; this issue assumes captions/ASR tracks are the primary source.

2. Per-channel brand voice (CLI stores it, skill passes it)

Each channel is already a named profile under ~/.config/ytstudio-cli/profiles/<name>/, with credentials.json and a meta.json written by save_profile_meta / read by load_profile_meta (see src/ytstudio/config.py). Brand voice is per-channel config, so it belongs next to the profile.

Proposed: a brand.md (free-form Markdown) inside the profile directory, e.g. profiles/<name>/brand.md, holding the tone-of-voice / house-style description an agent should follow (audience, voice, do/don't, title conventions, description template, language). Markdown rather than a JSON field because the content is prose meant to be dropped into a system prompt verbatim; meta.json stays for structured key/values, and can hold a small pointer/flag if useful.

CLI surface, consistent with the existing profile group:

ytstudio profile brand show [--profile <name>]            # print the brand voice
ytstudio profile brand edit [--profile <name>]            # open $EDITOR on brand.md
ytstudio profile brand set --file path/to/brand.md        # set non-interactively

This reuses the profile resolution (get_active_profile, YTSTUDIO_PROFILE override) already in config.py, and the owner-only dir permissions (_ensure_private_dir, 0o700) already enforced there. A new brand_path(name) helper alongside credentials_path / _meta_path keeps it in one place.
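As a rough illustration of the storage side, here is a minimal sketch of brand_path plus read/write helpers. It assumes the profile layout described above. The read_brand and write_brand names, the base parameter, and the 0o600 file mode are illustrative assumptions; the real implementation would reuse _ensure_private_dir and the path helpers in config.py rather than create directories itself.

```python
from pathlib import Path
from typing import Optional

# Assumption: mirrors the ~/.config/ytstudio-cli/profiles/<name>/ layout.
DEFAULT_BASE = Path.home() / ".config" / "ytstudio-cli" / "profiles"


def brand_path(name: str, base: Path = DEFAULT_BASE) -> Path:
    """Path of the brand-voice file for a named profile."""
    return base / name / "brand.md"


def read_brand(name: str, base: Path = DEFAULT_BASE) -> Optional[str]:
    """Return the brand voice text, or None when no brand.md exists yet."""
    path = brand_path(name, base)
    return path.read_text(encoding="utf-8") if path.exists() else None


def write_brand(name: str, text: str, base: Path = DEFAULT_BASE) -> Path:
    """Write brand.md, creating the profile directory owner-only if missing."""
    path = brand_path(name, base)
    # Stand-in for _ensure_private_dir: mode applies to the leaf directory only.
    path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    path.chmod(0o600)  # assumption: keep the file owner-only as well
    return path
```

Returning None for a missing file lets profile brand show print a hint to run profile brand edit instead of raising.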

3. Agentic authoring (skill ties it together)

No new "AI" command is required: the agent is the author, and it applies its result through the existing videos update <video-id> --title ... --description ... --tags ... (dry-run, then --execute) in src/ytstudio/commands/videos.py. The work here is skill guidance: a new authoring section in skills/ytstudio/SKILL.md that instructs the agent to:

  1. Read the brand voice (profile brand show) and treat it as house style / system context.
  2. Fetch grounding (videos transcript <id>, falling back to videos captions to choose a track); if no usable transcript, tell the user rather than inventing content.
  3. Read current metadata (videos show <id> -o json).
  4. Draft new title/description/tags grounded in the transcript and shaped by the brand voice.
  5. Preview with videos update (dry-run), show the user, then --execute, per the existing two-rules-first section of the skill.

The skill should be explicit that the transcript is the source of truth for claims and the brand file is the source of truth for tone, and that the agent must never fabricate content the transcript does not support.
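The loop above can be sketched as the argument lists an agent would run for one video. This is an illustration only: profile brand show and videos transcript are proposed in this issue and do not exist yet, the function names are hypothetical, and passing tags as a single comma-separated value is an assumption about how videos update parses --tags.

```python
from typing import List, Optional, Sequence


def grounding_commands(video_id: str) -> List[List[str]]:
    """Read-only steps 1-3: brand voice, transcript, current metadata."""
    return [
        ["ytstudio", "profile", "brand", "show"],
        ["ytstudio", "videos", "transcript", video_id],
        ["ytstudio", "videos", "show", video_id, "-o", "json"],
    ]


def update_command(
    video_id: str,
    title: Optional[str] = None,
    description: Optional[str] = None,
    tags: Optional[Sequence[str]] = None,
    execute: bool = False,
) -> List[str]:
    """Step 5: build the videos update call; dry-run unless execute is True."""
    cmd = ["ytstudio", "videos", "update", video_id]
    if title is not None:
        cmd += ["--title", title]
    if description is not None:
        cmd += ["--description", description]
    if tags:
        # Assumption: tags are passed as one comma-separated value.
        cmd += ["--tags", ",".join(tags)]
    if execute:
        cmd.append("--execute")
    return cmd
```

Building argument lists instead of shell strings means a drafted title or description containing quotes reaches the CLI unmodified when passed to subprocess.run.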

Open questions / decisions

  • Transcript source priority: owner caption tracks first, ASR/auto-generated as fallback, external STT as a later, separate dependency. Confirm this ordering and whether ASR should be opt-in.
  • Brand-voice format/location: profiles/<name>/brand.md (free-form prose) vs a structured field in meta.json. Proposal favors brand.md for verbatim prompt use; confirm.
  • Metadata fields in scope: title and description for sure; tags likely; category and localizations probably out of scope for v1. Confirm the field set.
  • Command placement: videos transcript / videos captions under the videos group, and brand under profile brand vs a flatter shape. Confirm naming.
  • Caption download edge cases: tracks that exist but are not downloadable (restricted ASR), isDraft tracks, multi-language channels. Decide default language selection (active profile default vs explicit --lang).
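  • Quota: at 50 units for captions.list plus 200 for captions.download, one transcript fetch costs about 250 units, so the default 10,000-unit daily quota covers roughly 40 videos. That is no concern for single-video authoring in v1, but bulk backfills will need batching or a local transcript cache. The external-STT fallback would add real cost/time and stays out of scope.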
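To make the transcript source priority concrete, here is a minimal sketch of one possible selection rule over the items array returned by captions.list: skip drafts, prefer standard tracks, and fall back to ASR only when allowed. The pick_track name and the allow_asr switch are hypothetical; whether ASR should be opt-in is still one of the open questions above.

```python
from typing import Iterable, Optional

# Lower rank wins; any other trackKind (e.g. "forced") sorts last.
_KIND_RANK = {"standard": 0, "asr": 1}


def pick_track(
    items: Iterable[dict],
    lang: Optional[str] = None,
    allow_asr: bool = True,
) -> Optional[dict]:
    """Pick one caption resource from a captions.list response, or None."""
    best = None
    best_rank = None
    for item in items:
        snippet = item.get("snippet", {})
        if snippet.get("isDraft"):
            continue  # draft tracks are not published captions
        if lang and snippet.get("language") != lang:
            continue
        kind = (snippet.get("trackKind") or "").lower()
        if kind == "asr" and not allow_asr:
            continue
        rank = _KIND_RANK.get(kind, 2)
        if best_rank is None or rank < best_rank:
            best, best_rank = item, rank
    return best
```

A None result is the signal for the agent to tell the user that no usable transcript exists instead of inventing content.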

Suggested phasing

  • Phase 1 (MVP): videos captions + videos transcript (Data API, existing scope) and profile brand show/edit/set storing profiles/<name>/brand.md. Skill section documenting the manual loop: brand show -> transcript -> show -> draft -> update dry-run -> execute. This is enough for an agent to author one video's metadata end to end.
  • Phase 2: skill polish and recommended workflow for bulk authoring (e.g. backfill descriptions for a set of videos), language/track selection guidance, and clear messaging for missing/undownloadable captions.
  • Phase 3 (separate dependency): external ASR/STT fallback for videos with no usable caption track (audio download + transcription), gated behind an optional dependency so the core install stays light.

Labels: enhancement (New feature or request)
