Summary
Add agentic authoring of video metadata to ytstudio: let an agent (via the skills/ytstudio skill) draft and update titles, descriptions, and tags grounded in what was actually said in the video, and shaped by a per-channel brand voice. The CLI gains the raw materials (transcript/caption fetch, brand-voice storage) and the skill gains the workflow guidance; the agent does the writing, the existing videos update path applies the change.
Motivation
Today the skill is great at mechanical bulk edits (videos search-replace, videos update), but writing good titles and descriptions is a creative, per-video task that an agent can do well only if it has two things it currently lacks:
- the actual spoken content of the video (so the copy is accurate, not hallucinated), and
- a consistent house style per channel (so output sounds on-brand instead of generic LLM voice).
With both available as first-class CLI/skill primitives, an agent can do grounded, on-brand metadata authoring at scale: backfill weak descriptions, rewrite titles for a topic, draft metadata for fresh uploads, all consistent across a channel.
Proposed scope
Three pillars. CLI shapes follow the existing ytstudio <group> <command> style, dry-run-by-default with --execute, and -o json for parseable output, matching src/ytstudio/commands/videos.py.
1. Transcript grounding (CLI provides the data)
A videos subcommand to pull the spoken content of a video so the agent can read it before writing:
ytstudio videos captions <video-id> -o json # list available caption tracks
ytstudio videos transcript <video-id> [--lang nl] [-o text|json] # download a track's text
- Backed by the YouTube Data API captions.list (list tracks for a video) and captions.download (fetch the track body, e.g. tfmt=srt or sbv). These endpoints require the https://www.googleapis.com/auth/youtube.force-ssl scope, which this project already requests (see SCOPES in src/ytstudio/api.py). No new scope or re-consent is needed for caption reads.
- videos transcript should default to a clean plain-text transcript (timestamps stripped), with -o json exposing the track metadata: id, language, trackKind (so the agent can tell standard from ASR/auto-generated), and isDraft. videos captions lists the tracks so the agent can pick a language/kind.
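- Quota: caption calls are not cheap reads. The YouTube Data API reference lists captions.list at 50 units and captions.download at 200 units per call, versus 1 unit for a typical list read, so one transcript fetch costs about 250 units against the default 10,000 units/day. The skill's Quota table should document these costs.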
- Note on captions.download: the API only returns a track body the authenticated owner is allowed to download; for owned channels (the ytstudio use case) this is fine, but auto-generated (ASR) tracks can be restricted. The command should surface a clear message when a track exists but is not downloadable, rather than failing opaquely.
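As a rough illustration of the plain-text default, the timestamp stripping could look like the sketch below. It assumes the track was downloaded with tfmt=srt; srt_to_text is a hypothetical helper name, not an existing function in ytstudio.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
_TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Reduce an SRT caption body to plain text, one cue per line.

    Hypothetical helper: assumes well-formed SRT cues separated by blank lines.
    """
    out = []
    normalized = srt.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        cue = [ln.strip() for ln in block.split("\n") if ln.strip()]
        # Drop the numeric cue index if present, then the timing line.
        if cue and cue[0].isdigit():
            cue = cue[1:]
        if cue and _TIMING.match(cue[0]):
            cue = cue[1:]
        if cue:
            # Multi-line cues are joined into a single line of text.
            out.append(" ".join(cue))
    return "\n".join(out)
```

Keeping one cue per line preserves rough sentence boundaries for the agent without exposing timing data; the -o json path could return the same text alongside the track metadata.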
Fallback when no caption track exists (out of scope here, flag only): transcribing the audio with an external ASR/STT tool (e.g. a local Whisper). That pulls in a heavy non-YouTube dependency and a media-download step, so it should be a separate follow-up issue; this issue assumes captions/ASR tracks are the primary source.
2. Per-channel brand voice (CLI stores it, skill passes it)
Each channel is already a named profile under ~/.config/ytstudio-cli/profiles/<name>/, with credentials.json and a meta.json written by save_profile_meta / read by load_profile_meta (see src/ytstudio/config.py). Brand voice is per-channel config, so it belongs next to the profile.
Proposed: a brand.md (free-form Markdown) inside the profile directory, e.g. profiles/<name>/brand.md, holding the tone-of-voice / house-style description an agent should follow (audience, voice, do/don't, title conventions, description template, language). Markdown rather than a JSON field because the content is prose meant to be dropped into a system prompt verbatim; meta.json stays for structured key/values, and can hold a small pointer/flag if useful.
CLI surface, consistent with the existing profile group:
ytstudio profile brand show [--profile <name>] # print the brand voice
ytstudio profile brand edit [--profile <name>] # open $EDITOR on brand.md
ytstudio profile brand set --file path/to/brand.md # set non-interactively
This reuses the profile resolution (get_active_profile, YTSTUDIO_PROFILE override) already in config.py, and the owner-only dir permissions (_ensure_private_dir, 0o700) already enforced there. A new brand_path(name) helper alongside credentials_path / _meta_path keeps it in one place.
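A minimal sketch of how the proposed helper might look. PROFILES_DIR, write_brand, and read_brand are illustrative names, and the explicit chmod only stands in for whatever _ensure_private_dir actually does; a real implementation should reuse the helpers already in config.py.

```python
import os
from pathlib import Path
from typing import Optional

# Assumption: mirrors the profile layout described above; the real root
# is defined in src/ytstudio/config.py.
PROFILES_DIR = Path.home() / ".config" / "ytstudio-cli" / "profiles"


def brand_path(name: str, root: Path = PROFILES_DIR) -> Path:
    """Location of the brand-voice file for a named profile."""
    return root / name / "brand.md"


def write_brand(name: str, text: str, root: Path = PROFILES_DIR) -> Path:
    """Store brand-voice Markdown next to the profile's other files."""
    path = brand_path(name, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Owner-only directory, standing in for _ensure_private_dir.
    os.chmod(path.parent, 0o700)
    path.write_text(text, encoding="utf-8")
    return path


def read_brand(name: str, root: Path = PROFILES_DIR) -> Optional[str]:
    """Return the brand voice, or None when the profile has none yet."""
    path = brand_path(name, root)
    return path.read_text(encoding="utf-8") if path.exists() else None
```

Returning None for a missing file lets profile brand show print a hint to run profile brand edit instead of raising.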
3. Agentic authoring (skill ties it together)
No new "AI" command is required: the agent is the author, and it applies its result through the existing videos update <video-id> --title ... --description ... --tags ... (dry-run, then --execute) in src/ytstudio/commands/videos.py. The work here is skill guidance, a new authoring section in skills/ytstudio/SKILL.md, instructing the agent to:
- Read the brand voice (profile brand show) and treat it as house style / system context.
- Fetch grounding (videos transcript <id>, falling back to videos captions to choose a track); if there is no usable transcript, tell the user rather than inventing content.
- Read current metadata (videos show <id> -o json).
- Draft new title/description/tags grounded in the transcript and shaped by the brand voice.
- Preview with videos update (dry-run), show the user, then --execute, per the existing two-rules-first section of the skill.
The skill should be explicit that the transcript is the source of truth for claims and the brand file is the source of truth for tone, and that the agent must never fabricate content the transcript does not support.
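The loop above could be sketched as two small helpers: one that assembles the context the agent reads, and one that builds the videos update invocation. This is illustrative only. The function names are hypothetical, the title/description keys are an assumed shape for videos show -o json, and passing tags as one comma-separated value is an assumption about --tags.

```python
from typing import Dict, List


def build_prompt(brand: str, transcript: str, current: Dict[str, str]) -> str:
    """Assemble the grounding context an agent reads before drafting."""
    return "\n\n".join([
        "HOUSE STYLE (source of truth for tone):\n" + brand.strip(),
        "TRANSCRIPT (source of truth for claims):\n" + transcript.strip(),
        # Assumed keys; adjust to the real `videos show -o json` output.
        "CURRENT TITLE:\n" + current.get("title", ""),
        "CURRENT DESCRIPTION:\n" + current.get("description", ""),
    ])


def build_update_command(
    video_id: str,
    title: str,
    description: str,
    tags: List[str],
    execute: bool = False,
) -> List[str]:
    """Build the argv for `ytstudio videos update`; dry-run unless execute."""
    cmd = [
        "ytstudio", "videos", "update", video_id,
        "--title", title,
        "--description", description,
        "--tags", ",".join(tags),  # assumption: comma-separated tag list
    ]
    if execute:
        cmd.append("--execute")
    return cmd
```

Building an argv list rather than a shell string avoids quoting problems with multi-line descriptions, and keeping execute off by default matches the dry-run-first rule.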
Open questions / decisions
- Transcript source priority: owner caption tracks first, ASR/auto-generated as fallback, external STT as a later, separate dependency. Confirm this ordering and whether ASR should be opt-in.
- Brand-voice format/location: profiles/<name>/brand.md (free-form prose) vs a structured field in meta.json. Proposal favors brand.md for verbatim prompt use; confirm.
- Metadata fields in scope: title and description for sure; tags likely; category and localizations probably out of scope for v1. Confirm the field set.
- Command placement: videos transcript / videos captions under the videos group, and brand under profile brand vs a flatter shape. Confirm naming.
- Caption download edge cases: tracks that exist but are not downloadable (restricted ASR), isDraft tracks, multi-language channels. Decide default language selection (active profile default vs explicit --lang).
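- Quota: captions.list (50 units) and captions.download (200 units) cost far more than ordinary 1-unit reads, so bulk authoring needs a quota budget even in v1. The external-STT fallback would add further cost/time and stays out of scope.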
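For the transcript-source and language questions above, one possible default is sketched below: skip drafts, filter by language when one is given, and prefer standard tracks over ASR. It assumes items is the items array of a captions.list response (each entry carrying snippet.language, snippet.trackKind, and snippet.isDraft); pick_track and allow_asr are hypothetical names, and the ordering is a proposal rather than settled behavior.

```python
from typing import List, Optional


def pick_track(
    items: List[dict],
    lang: Optional[str] = None,
    allow_asr: bool = True,
) -> Optional[dict]:
    """Choose a caption track: non-draft, matching lang, standard before ASR."""
    # Preference order; other kinds (e.g. "forced") are ignored in this sketch.
    order = ["standard"] + (["ASR"] if allow_asr else [])
    candidates = []
    for item in items:
        snippet = item.get("snippet", {})
        if snippet.get("isDraft"):
            continue  # drafts are not published captions
        if lang is not None and snippet.get("language") != lang:
            continue
        if snippet.get("trackKind") not in order:
            continue
        candidates.append(item)
    # min() returns the first best-ranked track, or None when nothing qualifies.
    return min(
        candidates,
        key=lambda i: order.index(i["snippet"]["trackKind"]),
        default=None,
    )
```

Setting allow_asr=False models the opt-in variant: with only an auto-generated track available, the function returns None and the command can tell the user why.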
Suggested phasing
- Phase 1 (MVP): videos captions + videos transcript (Data API, existing scope) and profile brand show/edit/set storing profiles/<name>/brand.md. Skill section documenting the manual loop: brand show -> transcript -> show -> draft -> update dry-run -> execute. This is enough for an agent to author one video's metadata end to end.
- Phase 2: skill polish and recommended workflow for bulk authoring (e.g. backfill descriptions for a set of videos), language/track selection guidance, and clear messaging for missing/undownloadable captions.
- Phase 3 (separate dependency): external ASR/STT fallback for videos with no usable caption track (audio download + transcription), gated behind an optional dependency so the core install stays light.