Agentic, transcript-grounded, on-brand metadata authoring

## Summary

Add agentic authoring of video metadata to ytstudio: let an agent (via the `skills/ytstudio` skill) draft and update titles, descriptions, and tags grounded in what was actually said in the video, and shaped by a per-channel brand voice. The CLI gains the raw materials (transcript/caption fetch, brand-voice storage) and the skill gains the workflow guidance; the agent does the writing, the existing `videos update` path applies the change.

## Motivation

Today the skill is great at mechanical bulk edits (`videos search-replace`, `videos update`), but writing good titles and descriptions is a creative, per-video task that an agent can do well only if it has two things it currently lacks:

- the actual spoken content of the video (so the copy is accurate, not hallucinated), and
- a consistent house style per channel (so output sounds on-brand instead of generic LLM voice).

With both available as first-class CLI/skill primitives, an agent can do grounded, on-brand metadata authoring at scale: backfill weak descriptions, rewrite titles for a topic, draft metadata for fresh uploads, all consistent across a channel.

## Proposed scope

Three pillars. CLI shapes follow the existing `ytstudio <group> <command>` style, dry-run-by-default with `--execute`, and `-o json` for parseable output, matching `src/ytstudio/commands/videos.py`.

### 1. Transcript grounding (CLI provides the data)

A `videos` subcommand to pull the spoken content of a video so the agent can read it before writing:

```
ytstudio videos captions <video-id> -o json          # list available caption tracks
ytstudio videos transcript <video-id> [--lang nl] [-o text|json]   # download a track's text
```

- Backed by the YouTube Data API `captions.list` (list tracks for a video) and `captions.download` (fetch the track body, e.g. `tfmt=srt` or `sbv`). These endpoints require the `https://www.googleapis.com/auth/youtube.force-ssl` scope, which this project already requests (see `SCOPES` in `src/ytstudio/api.py`). No new scope or re-consent is needed for caption read.
- Output should default to a clean plain-text transcript (timestamps stripped) for `transcript`, with `-o json` exposing the track metadata (`id`, `language`, `trackKind` so the agent can tell `standard` from `ASR`/auto-generated, `isDraft`). `captions` lists the tracks so the agent can pick a language/kind.
- Quota: caption list/download are cheap reads (~1 unit), consistent with the read costs already documented in the skill's Quota table.
- Note on `captions.download`: the API only returns a track body the authenticated owner is allowed to download; for owned channels (the ytstudio use case) this is fine, but auto-generated (ASR) tracks can be restricted. The command should surface a clear message when a track exists but is not downloadable, rather than failing opaquely.

Fallback when no caption track exists (out of scope here, flag only): transcribing the audio with an external ASR/STT tool (e.g. a local Whisper). That pulls in a heavy non-YouTube dependency and a media-download step, so it should be a separate follow-up issue; this issue assumes captions/ASR tracks are the primary source.

### 2. Per-channel brand voice (CLI stores it, skill passes it)

Each channel is already a named profile under `~/.config/ytstudio-cli/profiles/<name>/`, with `credentials.json` and a `meta.json` written by `save_profile_meta` / read by `load_profile_meta` (see `src/ytstudio/config.py`). Brand voice is per-channel config, so it belongs next to the profile.

Proposed: a `brand.md` (free-form Markdown) inside the profile directory, e.g. `profiles/<name>/brand.md`, holding the tone-of-voice / house-style description an agent should follow (audience, voice, do/don't, title conventions, description template, language). Markdown rather than a JSON field because the content is prose meant to be dropped into a system prompt verbatim; `meta.json` stays for structured key/values, and can hold a small pointer/flag if useful.

CLI surface, consistent with the existing `profile` group:

```
ytstudio profile brand show [--profile <name>]            # print the brand voice
ytstudio profile brand edit [--profile <name>]            # open $EDITOR on brand.md
ytstudio profile brand set --file path/to/brand.md        # set non-interactively
```

This reuses the profile resolution (`get_active_profile`, `YTSTUDIO_PROFILE` override) already in `config.py`, and the owner-only dir permissions (`_ensure_private_dir`, 0o700) already enforced there. A new `brand_path(name)` helper alongside `credentials_path` / `_meta_path` keeps it in one place.

### 3. Agentic authoring (skill ties it together)

No new "AI" command is required: the agent is the author, and it applies its result through the existing `videos update <video-id> --title ... --description ... --tags ...` (dry-run, then `--execute`) in `src/ytstudio/commands/videos.py`. The work here is skill guidance, a new authoring section in `skills/ytstudio/SKILL.md`, instructing the agent to:

1. Read the brand voice (`profile brand show`) and treat it as house style / system context.
2. Fetch grounding (`videos transcript <id>`, falling back to `videos captions` to choose a track); if no usable transcript, tell the user rather than inventing content.
3. Read current metadata (`videos show <id> -o json`).
4. Draft new title/description/tags grounded in the transcript and shaped by the brand voice.
5. Preview with `videos update` (dry-run), show the user, then `--execute`, per the existing two-rules-first section of the skill.

The skill should be explicit that the transcript is the source of truth for claims and the brand file is the source of truth for tone, and that the agent must never fabricate content the transcript does not support.

## Open questions / decisions

- Transcript source priority: owner caption tracks first, ASR/auto-generated as fallback, external STT as a later, separate dependency. Confirm this ordering and whether ASR should be opt-in.
- Brand-voice format/location: `profiles/<name>/brand.md` (free-form prose) vs a structured field in `meta.json`. Proposal favors `brand.md` for verbatim prompt use; confirm.
- Metadata fields in scope: title and description for sure; tags likely; category and localizations probably out of scope for v1. Confirm the field set.
- Command placement: `videos transcript` / `videos captions` under the `videos` group, and brand under `profile brand` vs a flatter shape. Confirm naming.
- Caption download edge cases: tracks that exist but are not downloadable (restricted ASR), `isDraft` tracks, multi-language channels. Decide default language selection (active profile default vs explicit `--lang`).
- Quota: reads are cheap; no concern for v1. The external-STT fallback would add real cost/time and stays out of scope.

## Suggested phasing

- Phase 1 (MVP): `videos captions` + `videos transcript` (Data API, existing scope) and `profile brand show/edit/set` storing `profiles/<name>/brand.md`. Skill section documenting the manual loop: brand show -> transcript -> show -> draft -> update dry-run -> execute. This is enough for an agent to author one video's metadata end to end.
- Phase 2: skill polish and recommended workflow for bulk authoring (e.g. backfill descriptions for a set of videos), language/track selection guidance, and clear messaging for missing/undownloadable captions.
- Phase 3 (separate dependency): external ASR/STT fallback for videos with no usable caption track (audio download + transcription), gated behind an optional dependency so the core install stays light.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agentic, transcript-grounded, on-brand metadata authoring #43

Summary

Motivation

Proposed scope

1. Transcript grounding (CLI provides the data)

2. Per-channel brand voice (CLI stores it, skill passes it)

3. Agentic authoring (skill ties it together)

Open questions / decisions

Suggested phasing

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Agentic, transcript-grounded, on-brand metadata authoring #43

Description

Summary

Motivation

Proposed scope

1. Transcript grounding (CLI provides the data)

2. Per-channel brand voice (CLI stores it, skill passes it)

3. Agentic authoring (skill ties it together)

Open questions / decisions

Suggested phasing

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions