feat(skills): add landing zone discovery skill with cross-shell scripts by arnaudlh · Pull Request #177 · Azure/git-ape

arnaudlh · 2026-06-15T02:35:13Z

Summary

Adds the azure-landing-zone-discovery skill, which auto-discovers enterprise Azure landing zone topology (management groups, platform vs. application subscriptions, policy assignments, hub-spoke networking, and shared services) and writes a machine-readable .azure/landing-zone-context.json. Git-Ape agents read that context to route workloads to the correct subscription, connect to shared services, and avoid policy conflicts.

What's included

Skill — .github/skills/azure-landing-zone-discovery/SKILL.md (user-invocable), with a weighted ALZ confidence scorer (high/medium/low/none) referencing the ALZ accelerator.
Cross-shell helper scripts — discover-lz and inject-lz ship as parity ports in both bash (.sh) and PowerShell (.ps1). The two ports produce a byte-compatible landing-zone-context.json, so Windows users without git-bash get first-class support.
Evals — eval.yaml plus 4 tasks (2 positive, 2 negative) registered in .github/evals/manifest.yaml.
Fixtures — tests/fixtures/landing-zone/ (flat tenant, hub-spoke tenant, skipped-network).
Agent integration — landing-zone context wired into the requirements gatherer, template generator, policy advisor, onboarding, and the git-ape agent, plus copilot-instructions.md.
Authoring convention — website/docs/authoring/skills.md now documents the dual-shell helper-script rule (user-invocable skills SHOULD ship .sh+.ps1; CI-only skills MAY ship .sh only).
Docs — generated skill doc, a deployment reference (landing-zone-context.md), and a use-case walkthrough (landing-zone-aware-deployment.md).
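
To make the "weighted ALZ confidence scorer" item above concrete, here is a minimal sketch of how such a scorer could map detected signals to high/medium/low/none. The signal names, weights, and thresholds are all hypothetical illustrations; the actual scoring rules live in SKILL.md and may differ.

```python
# Hypothetical signals and integer weights (out of 100).
# These are illustrative assumptions, not the values used by the skill.
WEIGHTS = {
    "management_group_hierarchy": 40,
    "platform_subscriptions": 30,
    "policy_assignments": 20,
    "hub_spoke_networking": 10,
}


def alz_confidence(signals: dict[str, bool]) -> str:
    """Sum the weights of the detected signals and bucket the total."""
    score = sum(weight for name, weight in WEIGHTS.items() if signals.get(name))
    if score >= 70:  # assumed threshold
        return "high"
    if score >= 40:  # assumed threshold
        return "medium"
    if score > 0:
        return "low"
    return "none"


print(alz_confidence({"management_group_hierarchy": True,
                      "platform_subscriptions": True}))
```

Integer weights are used so that the threshold comparisons do not depend on floating-point rounding.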

Validation

Both .sh ports pass bash -n; both .ps1 ports pass the PowerShell language parser.
inject-lz.ps1 functionally tested for schema parity vs inject-lz.sh, including cross-shell -Merge (bash creates base → pwsh merges correctly).
Docs regenerated via scripts/generate-docs.js; unrelated drift reverted to keep the PR scoped.
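
A parity check of the kind described above could be sketched as follows. This is an assumption about how one might compare the two ports' output, not the test that was actually run; compare_outputs and shape are hypothetical helper names, and the payloads in the usage lines are made up.

```python
import json


def compare_outputs(bash_bytes: bytes, pwsh_bytes: bytes) -> str:
    """Classify how closely two landing-zone-context.json payloads agree."""
    if bash_bytes == pwsh_bytes:
        return "byte-identical"
    a, b = json.loads(bash_bytes), json.loads(pwsh_bytes)
    if a == b:
        return "semantically-equal"  # same data, different whitespace or key order
    if shape(a) == shape(b):
        return "same-schema"  # same keys and value types, different values
    return "different"


def shape(value):
    """Reduce a JSON value to its structure: key names and value types only."""
    if isinstance(value, dict):
        return {key: shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # Collapse a list to the distinct shapes of its elements.
        return sorted({json.dumps(shape(item), sort_keys=True) for item in value})
    return type(value).__name__


# Made-up payloads for illustration only.
print(compare_outputs(b'{"a":1,"b":[1,2]}', b'{"b": [1, 2], "a": 1}'))
```

Checking byte equality first keeps the strict "byte-compatible" claim separate from the weaker "schema parity" claim.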

🧭 - Generated by Copilot

- add azure-landing-zone-discovery skill, evals, fixtures, and docs
- ship discover-lz/inject-lz in both bash and PowerShell parity ports
- document dual-shell helper-script convention in authoring docs
- wire landing-zone context into agents and copilot-instructions

github-actions · 2026-06-15T02:35:44Z

⚠️ Documentation Staleness Warning

Source files (agents, skills, workflows, or config) changed in this PR, but the generated documentation is out of date.

Changed docs that need regeneration:

website/docs/workflows/daily-repo-status-lock.md
website/docs/workflows/git-ape-actionlint.md
website/docs/workflows/git-ape-build.md
website/docs/workflows/git-ape-deck-build.md
website/docs/workflows/git-ape-docs-check.md
website/docs/workflows/git-ape-docs.md
website/docs/workflows/git-ape-onboarding-template-check.md
website/docs/workflows/git-ape-plugin-version-check.md
website/docs/workflows/git-ape-release.md
website/docs/workflows/git-ape-script-lint.md
website/docs/workflows/git-ape-workshop-content-updater-lock.md
website/docs/workflows/git-ape-workshop-sync.md
website/docs/workflows/issue-triage-agent-lock.md
website/docs/workflows/pr-validation.md
website/docs/workflows/waza-agent-evals.md
website/docs/workflows/waza-evals.md
website/docs/workflows/workshop-quality-check.md

To fix: Run the following command and commit the results:

node scripts/generate-docs.js

This is an advisory check — it does not block the PR.

github-actions · 2026-06-15T02:35:54Z

🤖 Waza agent evals (advisory)

ℹ️ No agents evaluated. The changed agent(s) have no eval directory: azure-policy-advisor azure-requirements-gatherer azure-template-generator git-ape git-ape-onboarding

Ran 0 agent evals against claude-sonnet-4.6. Each eval consumes ~5 premium Copilot requests; results are non-blocking — investigate failures via the workflow logs and the per-agent waza-agent-results-* artifacts.

How this works: This workflow auto-syncs the canonical .github/agents/<name>.agent.md into the sibling mirror inside .github/evals/agents/<name>/ before each run, so the score below reflects the version of the agent in this PR — not whatever was committed when the eval was first wired up.

📊 Agent file token comparison vs main (advisory)

No .agent.md files changed vs main (or token-compare returned no entries).

No agents in scope for this PR.

github-actions · 2026-06-15T02:42:23Z

🧪 Waza skill evals (advisory)

🔁 Full matrix run. A project-wide config change (.waza.yaml, manifest, or workflow file) triggers the full matrix.

Ran 16 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg waza-results-* artifacts.

Legend: Models flagged baseline: true in .github/evals/manifest.yaml (currently: gpt-5.4) run with --baseline (A/B mode) to cap quota. All other models run standard. Judge model is fixed at claude-opus-4.7 across all legs.

📊 Token comparison vs main (advisory)

{
  "baseRef": "main",
  "headRef": "WORKING",
  "threshold": 10,
  "passed": true,
  "timestamp": "2026-06-25T04:47:30.208471014Z",
  "summary": {
    "totalBefore": 0,
    "totalAfter": 47849,
    "totalDiff": 47849,
    "percentChange": 100,
    "filesAdded": 16,
    "filesRemoved": 0,
    "filesModified": 0,
    "filesIncreased": 16,
    "filesDecreased": 0
  },
  "files": [
    {
      "file": ".github/skills/azure-cost-estimator/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3231,
        "characters": 11940,
        "lines": 345
      },
      "diff": 3231,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-deployment-preflight/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1448,
        "characters": 6281,
        "lines": 212
      },
      "diff": 1448,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-drift-detector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3179,
        "characters": 13149,
        "lines": 460
      },
      "diff": 3179,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-integration-tester/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1563,
        "characters": 6807,
        "lines": 248
      },
      "diff": 1563,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-landing-zone-discovery/SKILL.md",
      "before": null,
      "after": {
        "tokens": 7117,
        "characters": 29287,
        "lines": 593
      },
      "diff": 7117,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-naming-research/SKILL.md",
      "before": null,
      "after": {
        "tokens": 486,
        "characters": 2108,
        "lines": 44
      },
      "diff": 486,
      "percentChange": 100,
      "status": "added",
      "limit": 500
    },
    {
      "file": ".github/skills/azure-policy-advisor/SKILL.md",
      "before": null,
      "after": {
        "tokens": 5118,
        "characters": 22803,
        "lines": 389
      },
      "diff": 5118,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-availability/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2413,
        "characters": 9881,
        "lines": 308
      },
      "diff": 2413,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-visualizer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1494,
        "characters": 6179,
        "lines": 192
      },
      "diff": 1494,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-rest-api-reference/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1831,
        "characters": 8430,
        "lines": 200
      },
      "diff": 1831,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-role-selector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1280,
        "characters": 5641,
        "lines": 162
      },
      "diff": 1280,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-security-analyzer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 5326,
        "characters": 21419,
        "lines": 451
      },
      "diff": 5326,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-stack-deploy/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1912,
        "characters": 7525,
        "lines": 159
      },
      "diff": 1912,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-stack-destroy/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2644,
        "characters": 10670,
        "lines": 180
      },
      "diff": 2644,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/git-ape-onboarding/SKILL.md",
      "before": null,
      "after": {
        "tokens": 6667,
        "characters": 27794,
        "lines": 531
      },
      "diff": 6667,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/prereq-check/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2140,
        "characters": 8023,
        "lines": 147
      },
      "diff": 2140,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    }
  ]
}
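
As a reading aid, the report above can be condensed to the files that exceed their limit and by how much. The sketch below assumes only the field names visible in the JSON (files, after.tokens, limit, overLimit); over_limit_summary is a hypothetical helper, not part of the Waza tooling, and the excerpt copies two entries from the report.

```python
import json


def over_limit_summary(report: dict) -> list[tuple[str, int]]:
    """Return (file, tokens over the limit) for each flagged file, largest first."""
    rows = []
    for entry in report["files"]:
        # Entries under the limit omit "overLimit" entirely, so use .get().
        if entry.get("overLimit"):
            rows.append((entry["file"], entry["after"]["tokens"] - entry["limit"]))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Two entries excerpted from the report above.
excerpt = json.loads("""
{
  "files": [
    {"file": ".github/skills/azure-naming-research/SKILL.md",
     "before": null, "after": {"tokens": 486}, "limit": 500},
    {"file": ".github/skills/prereq-check/SKILL.md",
     "before": null, "after": {"tokens": 2140}, "limit": 500, "overLimit": true}
  ]
}
""")

for path, excess in over_limit_summary(excerpt):
    print(f"{path}: {excess} tokens over")
```

For the prereq-check entry this gives 1640 tokens over the 500-token limit.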

Skill: `prereq-check`

📈 Score (per model) + Suggestions/Recommendations

Model: claude-opus-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 3C23:3379D4:1C2EED3:1F4EA88:6A3CB2F9)

✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
✗ [3/4] Positive — "command not found" failure
✓ [4/4] Positive — "What do I need to install?"

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 1m27.868s

Tests: 4 total, 3 passed, 1 failed, 0 errors
Success Rate: 75.0%
Score Range: 0.57 - 1.00 (σ=0.1839)

Task Results

Task	Score	Status	Graders
Negative — Editing an ARM template	0.57	✅	budget, trigger_relevance_negative
Negative — Azure service concept question	0.60	✅	budget, trigger_relevance_negative
Positive — "command not found" failure	0.89	❌	answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?"	1.00	✅	answer_quality, budget, trigger_relevance_positive

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

Positive — "command not found" failure: 67% pass rate, score=0.89±0.16

Failed Task Details

Positive — "command not found" failure

Run 1/3 (error):

❌ answer_quality (0.00): fail: No previous assistant response exists to grade: There is no prior assistant response in this session to evaluate. The conversation only contains the user's question and the grading instruction — none of the four PASS criteria (naming az/gh/jq/git, install command, version verification, verdict/next step) can be satisfied because no response was produced.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
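
The reported score=0.89±0.16 for this task is consistent with averaging grader scores within each run and then across runs. The sketch below reproduces that arithmetic under two assumptions: the aggregation rule is inferred from the numbers, not confirmed from the Waza source, and runs 2 and 3 (not shown in the log) are assumed to have passed every grader.

```python
from statistics import mean, pstdev


def task_score(runs: list[list[float]]) -> tuple[float, float]:
    """Average grader scores within each run, then across runs.

    Returns the mean and the population standard deviation of the per-run scores.
    """
    per_run = [mean(graders) for graders in runs]
    return mean(per_run), pstdev(per_run)


# Run 1 grader scores come from the log above (answer_quality, budget, trigger).
# Runs 2 and 3 are assumed to have scored 1.00 on every grader.
runs = [[0.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
score, spread = task_score(runs)
print(f"score={score:.2f}±{spread:.2f}")  # score=0.89±0.16
```

With one errored run out of three, the same data also gives the 67% pass rate listed under Flaky Tasks.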

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6

Results saved to: .waza-results/prereq-check-claude-opus-4.6.json

Model: claude-sonnet-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
[ERROR] waiting for session.idle: context deadline exceeded

✓ [4/4] Positive — "What do I need to install?"
✗ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 1m54.666s

Tests: 4 total, 3 passed, 1 failed, 0 errors
Success Rate: 75.0%
Score Range: 0.57 - 1.00 (σ=0.1839)

Task Results

Task	Score	Status	Graders
Negative — Editing an ARM template	0.57	✅	budget, trigger_relevance_negative
Negative — Azure service concept question	0.60	✅	budget, trigger_relevance_negative
Positive — "command not found" failure	0.89	❌	answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?"	1.00	✅	answer_quality, budget, trigger_relevance_positive

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

Positive — "command not found" failure: 67% pass rate, score=0.89±0.16

Failed Task Details

Positive — "command not found" failure

Run 1/3 (error):

❌ answer_quality (0.00): fail: No previous assistant response exists to grade: There is no prior assistant response in this session to evaluate. None of the four PASS criteria can be met: (1) no tools named, (2) no install command provided, (3) no version verification recommended, (4) no verdict/next step given.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6

Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] waiting for session.idle: context deadline exceeded

✓ [2/4] Negative — Azure service concept question
✗ [1/4] Negative — Editing an ARM template
✓ [3/4] Positive — "command not found" failure
✓ [4/4] Positive — "What do I need to install?"

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.79 | Duration: 1m30.15s

Tests: 4 total, 3 passed, 1 failed, 0 errors
Success Rate: 75.0%
Score Range: 0.57 - 1.00 (σ=0.2074)

Task Results

Task	Score	Status	Graders
Negative — Editing an ARM template	0.57	❌	budget, trigger_relevance_negative
Negative — Azure service concept question	0.60	✅	budget, trigger_relevance_negative
Positive — "command not found" failure	1.00	✅	answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?"	1.00	✅	answer_quality, budget, trigger_relevance_positive

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

Negative — Editing an ARM template: 67% pass rate, score=0.57±0.00

Failed Task Details

Negative — Editing an ARM template

Run 1/3 (error):

✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_negative (0.14): Prompt correctly treated as non-trigger (score 0.14 < 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.3-codex

Results saved to: .waza-results/prereq-check-gpt-5.3-codex.json

Model: gpt-5.4 *(baseline — A/B mode)*

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 8C11:3E4A2D:1CB8715:1FC6DDC:6A3CB312)

✓ [2/4] Negative — Azure service concept question
✗ [1/4] Negative — Editing an ARM template
✓ [4/4] Positive — "What do I need to install?"
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 8C12:3E4A2D:1CC150E:1FD0981:6A3CB32F)

✗ [3/4] Positive — "command not found" failure

════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✗ [3/4] Positive — "command not found" failure
✓ [2/4] Negative — Azure service concept question
✗ [4/4] Positive — "What do I need to install?"

════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 50.0% (2/4 tasks passed)
Without Skills: 50.0% (2/4 tasks passed)
Impact: no change

Per-Task Breakdown:
• Negative — Editing an ARM template [REGRESSED] 100% → 67% (-33pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [IMPROVED] 0% → 67% (+67pp)
• Positive — "What do I need to install?" [IMPROVED] 67% → 100% (+33pp)

Verdict: Skills have NEUTRAL IMPACT (no net change)
════════════════════════════════════════════════════════════════
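
The per-task labels in the analysis above follow from the sign of the difference between the two pass rates. A minimal sketch, assuming the first percentage is the baseline (skills stripped) and the second is the skills-enabled run; classify is a hypothetical name and the rule is inferred from the output, not taken from the tool.

```python
def classify(without_pct: int, with_pct: int) -> str:
    """Label a task by its change in pass rate, in percentage points (pp)."""
    delta = with_pct - without_pct
    if delta > 0:
        label = "IMPROVED"
    elif delta < 0:
        label = "REGRESSED"
    else:
        label = "NEUTRAL"
    # "+d" always prints the sign, which yields "+0pp" for unchanged tasks.
    return f"[{label}] {without_pct}% → {with_pct}% ({delta:+d}pp)"


print(classify(100, 67))   # [REGRESSED] 100% → 67% (-33pp)
print(classify(100, 100))  # [NEUTRAL] 100% → 100% (+0pp)
print(classify(0, 67))     # [IMPROVED] 0% → 67% (+67pp)
```

Note that the overall verdict compares tasks passed (2/4 in both passes here), so per-task gains and losses can cancel to "no change".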

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 1m15.231s

Tests: 4 total, 2 passed, 2 failed, 0 errors
Success Rate: 50.0%
Score Range: 0.57 - 1.00 (σ=0.1839)

Task Results

Task	Score	Status	Graders
Negative — Editing an ARM template	0.57	❌	budget, trigger_relevance_negative
Negative — Azure service concept question	0.60	✅	budget, trigger_relevance_negative
Positive — "command not found" failure	0.89	❌	answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?"	1.00	✅	answer_quality, budget, trigger_relevance_positive

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

Negative — Editing an ARM template: 67% pass rate, score=0.57±0.00
Positive — "command not found" failure: 67% pass rate, score=0.89±0.16

Failed Task Details

Negative — Editing an ARM template

Run 2/3 (error):

✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_negative (0.14): Prompt correctly treated as non-trigger (score 0.14 < 0.50)

Positive — "command not found" failure

Run 3/3 (error):

❌ answer_quality (0.00): fail: Previous response did not deliver a final answer: The previous assistant turn only ran tool calls (platform detection, check-tools.sh, viewing install-commands.md) with a one-line progress note ("Checking the local prerequisite status and the Linux install guidance now."). It never produced the user-facing summary, so none of the four PASS criteria are met: (1) the tools az/gh/jq/git are not enumerated in a response to the user, (2) no install command for az is presented, (3) no version-verification step is recommended, and (4) no verdict / next step is emitted. The information was gathered but not delivered.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4

Results saved to: .waza-results/prereq-check-gpt-5.4.json

🔢 Tokens (count + profile)

📊 prereq-check: 2,140 tokens (detailed ✓), 10 sections, 2 code blocks

🎯 Quality (5-dim table)

time=2026-06-25T04:49:49.419Z level=INFO msg="using Copilot CLI" source=embedded path=/home/runner/.cache/copilot-sdk/copilot_1.0.64-0
DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Instructions are exceptionally clear with well-structured tables, numbered steps, explicit status mappings, and platform-specific code blocks. The purpose is immediately obvious from the frontmatter description alone.
completeness       █████  Covers tool checks, version thresholds, auth sessions, platform detection, error handling for 8 distinct failure modes, and even edge cases like permission-denied scripts and execution policy restrictions. Nothing obvious is missing.
trigger_precision  ████░  USE FOR triggers are rich with concrete error string patterns (e.g., 'az: command not found'), which is excellent for routing. DO NOT USE FOR is definitive but terse — adding one or two anti-example scenarios (e.g., 'do not use to validate ARM templates') would reduce ambiguity at the margin.
scope_coverage     █████  Scope is tightly and explicitly bounded: read-only, 4 specific tools, 2 auth sessions, clear handoff to related skills. The 'Never' constraints list and the explicit 'Side effects: Read-only' quick-reference entry leave no ambiguity about boundaries.
anti_patterns      ████░  Avoids nearly all anti-patterns: no vague verbs, no conflicting directives, solid error handling table. Minor gap: the step table references external scripts (check-tools.sh, install-commands.md) without a fallback if those files are absent, which could leave the agent stuck in a fresh-clone scenario.
────────────────────────────────────────────
Overall: 4.6/5.0

A high-quality, production-ready skill definition. It is unusually thorough in error handling, platform coverage, and boundary-setting. The only meaningful improvement would be adding a graceful fallback for missing reference scripts and slightly expanding the DO NOT USE FOR section with concrete counter-examples.

✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/prereq-check/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: prereq-check

📋 Compliance Score: Medium-High
   ⚠️  Good, but could be improved. Missing routing clarity.

   Issues found:
   ❌  SKILL.md is 2140 tokens (hard limit 500)

📐 Spec Compliance: 9/9 checks passed
   ✅  Meets agentskills.io specification.

📎 Links: 4/4 valid
   ✅  All links valid.

📊 Token Budget: 2140 / 500 tokens
   ❌  Exceeds limit by 1640 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 1 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2140 tokens, 1 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ❌  [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Reduce SKILL.md by 1640 tokens. Run 'waza tokens suggest' for optimization tips

Skill: `git-ape-onboarding`

📈 Score (per model) + Suggestions/Recommendations

Model: claude-opus-4.6

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✓ [2/4] Positive — First-time repo setup

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.79 | Duration: 38.278s

Tests: 4 total, 3 passed, 1 failed, 0 errors
Success Rate: 75.0%
Score Range: 0.56 - 1.00 (σ=0.2022)

Task Results

Task	Score	Status	Graders
Negative — Storage service comparison (off-topic)	0.56	✅	budget, trigger_relevance_negative
Positive — First-time repo setup	1.00	✅	answer_quality, budget, trigger_relevance_positive
Positive — Multi-environment onboarding	0.62	❌	answer_quality, budget, trigger_relevance_positive
Positive — Scaffold honors skip-with-notice on collision	0.98	✅	answer_quality, budget, trigger_relevance_positive

Failed Task Details

Positive — Multi-environment onboarding

Run 1/1 (failed):

❌ answer_quality (0.00): fail: Missing prereq check and auth gate: Criterion 1 FAIL: No prerequisite check was performed or presented. The assistant did not inspect the local environment (no az --version, gh --version, az account show, gh auth status, or equivalent tool/auth status table). It jumped straight into prescriptive command snippets.

Criterion 2 FAIL: No auth/prereq gate was surfaced. Since no inspection occurred, the assistant could not have surfaced a blocking auth state (e.g., "Azure CLI not authenticated, run az login"). The response assumes auth is ready without verifying.

Criterion 3 PASS: The assistant requested 4 inputs (GitHub repo URL, staging subscription ID, existing App Registration client ID, RBAC role) — meets the ≥3 threshold.

Criterion 4 PASS: Multi-environment awareness is demonstrated — explicitly creates a new federated credential scoped to repo:<org>/<repo>:environment:azure-deploy-staging, names the new GitHub environment azure-deploy-staging, and sets a per-environment AZURE_SUBSCRIPTION_ID variable scoped to that environment.

Overall: Response acted as a "how-to guide + input request" but skipped the gated prereq-inspection step the skill requires before any state-changing flow.

✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.87): Prompt is trigger-aligned (score 0.87 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-opus-4.6

Results saved to: .waza-results/git-ape-onboarding-claude-opus-4.6.json

Model: claude-sonnet-4.6

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

✓ [1/4] Negative — Storage service comparison (off-topic)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 880B:564D3:1C47375:1F56037:6A3CB2F6)

✗ [3/4] Positive — Multi-environment onboarding
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [2/4] Positive — First-time repo setup

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.79 | Duration: 35.456s

Tests: 4 total, 3 passed, 1 failed, 0 errors
Success Rate: 75.0%
Score Range: 0.56 - 1.00 (σ=0.2022)

Task Results

Task	Score	Status	Graders
Negative — Storage service comparison (off-topic)	0.56	✅	budget, trigger_relevance_negative
Positive — First-time repo setup	1.00	✅	answer_quality, budget, trigger_relevance_positive
Positive — Multi-environment onboarding	0.62	❌	answer_quality, budget, trigger_relevance_positive
Positive — Scaffold honors skip-with-notice on collision	0.98	✅	answer_quality, budget, trigger_relevance_positive

Failed Task Details

Positive — Multi-environment onboarding

Run 1/1 (error):

❌ answer_quality (0.00): fail: No prior assistant response to grade: There is no visible previous assistant response in this session to evaluate. As a result, none of the four required criteria can be satisfied:

No prereq check results presented.
No auth/prereq gate surfaced.
No input questions asked (need ≥3 of: target repo, staging subscription ID, RBAC role, App Registration reuse decision, env name confirmation, onboarding mode).
No multi-environment awareness demonstrated (no mention of separate federated credential, azure-deploy-staging env name, SP reuse vs new, or per-env RBAC scoping).

All four criteria are missing.

✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.87): Prompt is trigger-aligned (score 0.87 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-sonnet-4.6

Results saved to: .waza-results/git-ape-onboarding-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: CC10:D4778:174BE24:19BB8A4:6A3CB2F3)

✗ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [2/4] Positive — First-time repo setup
✗ [3/4] Positive — Multi-environment onboarding

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.79 | Duration: 46.099s

Tests: 4 total, 2 passed, 2 failed, 0 errors
Success Rate: 50.0%
Score Range: 0.56 - 1.00 (σ=0.2022)

Task Results

Task	Score	Status	Graders
Negative — Storage service comparison (off-topic)	0.56	❌	budget, trigger_relevance_negative
Positive — First-time repo setup	1.00	✅	answer_quality, budget, trigger_relevance_positive
Positive — Multi-environment onboarding	0.62	❌	answer_quality, budget, trigger_relevance_positive
Positive — Scaffold honors skip-with-notice on collision	0.98	✅	answer_quality, budget, trigger_relevance_positive

Failed Task Details

Negative — Storage service comparison (off-topic)

Run 1/1 (error):

✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_negative (0.11): Prompt correctly treated as non-trigger (score 0.11 < 0.50)

Positive — Multi-environment onboarding

Run 1/1 (failed):

❌ answer_quality (0.00): fail: Missing multi-environment awareness (criterion 4): Criteria 1, 2, 3 met: prereq table presented, Azure auth gate explicitly surfaced as blocking, and 5 numbered inputs requested. However criterion 4 (multi-environment awareness) is not satisfied — the response does not (a) mention creating a separate federated-credential entry for staging, (b) name the new azure-deploy-staging environment, (c) ask about reusing the existing App Registration / SP vs creating a new one for staging isolation, or (d) discuss per-environment secret/RBAC scoping. It treats this like a generic onboarding rather than an additive staging-env onboarding on an already-onboarded repo.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.87): Prompt is trigger-aligned (score 0.87 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.3-codex

Results saved to: .waza-results/git-ape-onboarding-gpt-5.3-codex.json

Model: gpt-5.4 *(baseline — A/B mode)*

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [2/4] Positive — First-time repo setup
✓ [3/4] Positive — Multi-environment onboarding

════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
[ERROR] waiting for session.idle: context deadline exceeded

✗ [2/4] Positive — First-time repo setup

════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 100.0% (4/4 tasks passed)
Without Skills: 50.0% (2/4 tasks passed)
Impact: +50.0 percentage points

Per-Task Breakdown:
• Negative — Storage service comparison (off-topic) [NEUTRAL] 100% → 100% (+0pp)
• Positive — First-time repo setup [IMPROVED] 0% → 100% (+100pp)
• Positive — Multi-environment onboarding [IMPROVED] 0% → 100% (+100pp)
• Positive — Scaffold honors skip-with-notice on collision [NEUTRAL] 100% → 100% (+0pp)

Verdict: Skills have POSITIVE IMPACT (improved 2/4 tasks)
════════════════════════════════════════════════════════════════
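The impact figures above can be recomputed from the pass/fail marks of the two passes. The sketch below is illustrative only: the helper names are hypothetical, and the `REGRESSED` label is an assumption, since the log shows only `IMPROVED` and `NEUTRAL`.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Percentage of tasks that passed."""
    return 100.0 * sum(results.values()) / len(results)

def label(with_skill: bool, without_skill: bool) -> str:
    """Classify one task by comparing the skills-enabled and baseline runs."""
    if with_skill and not without_skill:
        return "IMPROVED"
    if with_skill == without_skill:
        return "NEUTRAL"
    return "REGRESSED"  # assumed label; not shown in the log

# Pass/fail marks taken from PASS 1 and PASS 2 above.
with_skills = {"off-topic": True, "first-time": True, "multi-env": True, "scaffold": True}
baseline = {"off-topic": True, "first-time": False, "multi-env": False, "scaffold": True}

delta = pass_rate(with_skills) - pass_rate(baseline)
labels = {task: label(with_skills[task], baseline[task]) for task in with_skills}
print(f"Impact: {delta:+.1f} percentage points")
print(labels)
```

Note that one of the two baseline failures is a `context deadline exceeded` timeout rather than a graded answer, so part of the +50.0 point delta may reflect infrastructure noise rather than skill quality.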

🧪 Waza Eval Results

Status: ✅ Passed | Score: 0.87 | Duration: 36.179s

Tests: 4 total, 4 passed, 0 failed, 0 errors
Success Rate: 100.0%
Score Range: 0.56 - 1.00 (σ=0.1839)

Task Results

Task	Score	Status	Graders
Negative — Storage service comparison (off-topic)	0.56	✅	budget, trigger_relevance_negative
Positive — First-time repo setup	1.00	✅	answer_quality, budget, trigger_relevance_positive
Positive — Multi-environment onboarding	0.96	✅	answer_quality, budget, trigger_relevance_positive
Positive — Scaffold honors skip-with-notice on collision	0.98	✅	answer_quality, budget, trigger_relevance_positive

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.4

Results saved to: .waza-results/git-ape-onboarding-gpt-5.4.json
JUnit XML saved to: .waza-results/git-ape-onboarding-gpt-5.4.junit.xml

🔢 Tokens (count + profile)

📊 git-ape-onboarding: 6,667 tokens (detailed ✓), 30 sections, 26 code blocks
   ⚠️  token count 6667 exceeds 3000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Instructions are exceptionally well-ordered with numbered steps, canonical command examples, and explicit invariants. The two-mode distinction (CI/CD vs enterprise distribution) is clearly separated, and the 'First-turn rule' for agent behavior eliminates ambiguity about when to act vs. gather inputs.
completeness       █████  Covers prereqs, auth, multi-env scenarios, OIDC subject format variations, idempotency on re-run, optional drift detection, compliance preferences, landing zone discovery, and enterprise distribution. Edge cases like disabled subscriptions, org OIDC overrides, and collision handling during scaffolding are all explicitly addressed.
trigger_precision  ████░  USE FOR and DO NOT USE FOR triggers in the description and 'When to Use' section are well-defined and non-overlapping. Minor gap: 'rotating or updating an existing secret or federated credential' is mentioned as out-of-scope in prose but not in the frontmatter description trigger list, creating slight inconsistency between the two locations.
scope_coverage     █████  Scope boundaries are explicit throughout — the enterprise mode clearly states it configures tooling only (not Azure access), UI-only steps are flagged as hand-offs, and the drift detector step is marked optional with a clear dependency explanation. Neither over-broad nor too narrow.
anti_patterns      ████░  Avoids nearly all common anti-patterns: no vague instructions, no conflicting directives, good error handling (OIDC mismatch fix, disabled subscription check, partial-failure recovery). Minor issue: the 'Suggested Agent Flow' section partially duplicates the 'Command Playbook' numbered steps, which could cause an agent to second-guess which is authoritative — a brief cross-reference note would resolve this.
────────────────────────────────────────────
Overall: 4.6/5.0

An exceptionally well-crafted skill document. It is production-ready with comprehensive edge-case coverage, strong safety rails (safe-execution rules, invariants, explicit hand-offs for UI-only steps), and clear separation between two distinct modes. The two minor deductions are for slight trigger-list inconsistency between the frontmatter and prose, and minor duplication between the Command Playbook and Suggested Agent Flow that could confuse an agent about the single source of truth for step ordering.

✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/git-ape-onboarding/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: git-ape-onboarding

📋 Compliance Score: Medium-High
   ⚠️  Good, but could be improved. Missing routing clarity.

   Issues found:
   ❌  SKILL.md is 6667 tokens (hard limit 500)

📐 Spec Compliance: 9/9 checks passed
   ✅  Meets agentskills.io specification.

📎 Links: 11/15 valid
   ⚠️  4 link issue(s) found.
   ❌  [templates/copilot-instructions.md] → .github/skills/azure-stack-deploy/SKILL.md: target does not exist
   ❌  [templates/copilot-instructions.md] → website/docs/deployment/state.md: target does not exist
   ❌  [templates/copilot-instructions.md] → .github/skills/azure-stack-destroy/SKILL.md: target does not exist
   ⚠️  [templates/github-private/README.md] → agents/: target is a directory, not a file

📊 Token Budget: 6667 / 500 tokens
   ❌  Exceeds limit by 6167 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (6667 tokens, 0 modules)
   ❌  [negative-delta-risk] Negative delta risk patterns detected: excessive constraints (19 constraint keywords found)
   ✅  [procedural-content] Description contains procedural language
   ❌  [over-specificity] Over-specificity detected: absolute Windows paths
   ❌  [cross-model-density] Advisory 16: word count is 79 (>60 may reduce cross-model effectiveness); first sentence doesn't lead with action verb (reduces clarity)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found
   ❌  [progressive-disclosure] Advisory 18: progressive disclosure — SKILL.md body is 525 lines (>500 lines reduces scannability; consider moving detail to references/)
   ✅  [scope-reduction] Capability scope: 12 signal(s) detected (12 level-2 heading(s), 9 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix 3 broken link(s) — targets do not exist
4. Fix 1 link(s) pointing to directories instead of files
5. Reduce SKILL.md by 6167 tokens. Run 'waza tokens suggest' for optimization tips

Skill: `azure-stack-deploy`

📈 Score (per model) + Suggestions/Recommendations

Model: claude-sonnet-4.6

Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: D80B:68C07:1DFC1D2:2100AA2:6A3CB2F7)

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: D80B:68C07:1DFC216:2100AE3:6A3CB2FA)

✗ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [3/5] Negative — What-if preview / preflight validation
✓ [5/5] Positive — Re-deploy after template edit
✗ [4/5] Positive — Local deploy of an existing deployment artifact
✗ [1/5] Negative — Destroying / tearing down an existing deployment

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.78 | Duration: 1m26.656s

Tests: 5 total, 1 passed, 4 failed, 0 errors
Success Rate: 20.0%
Score Range: 0.60 - 0.86 (σ=0.0946)

Task Results

Task	Score	Status	Graders
Negative — Destroying / tearing down an existing deployment	0.86	❌	budget, trigger_relevance_negative
Negative — Off-topic prompt (Linux kernel scheduling)	0.60	❌	budget, trigger_relevance_negative
Negative — What-if preview / preflight validation	0.82	❌	budget, trigger_relevance_negative
Positive — Local deploy of an existing deployment artifact	0.78	❌	answer_quality, budget, trigger_relevance_positive
Positive — Re-deploy after template edit	0.85	✅	answer_quality, budget, trigger_relevance_positive
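The headline score and the σ in the summary can be approximately reproduced from the task table. The sketch assumes the overall score is the mean of the task scores and that σ is a population standard deviation; both are inferences from the numbers, and the inputs here are the rounded two-decimal values, so small differences from the report are expected.

```python
from statistics import mean, pstdev

# Task scores as displayed in the table above (already rounded to two decimals).
scores = [0.86, 0.60, 0.82, 0.78, 0.85]

mean_score = mean(scores)
sigma = pstdev(scores)  # population standard deviation (assumed)

print(f"mean  = {mean_score:.3f}")
print(f"sigma = {sigma:.4f}")
print(f"range = {min(scores):.2f} - {max(scores):.2f}")
```

The result lands close to the reported 0.78 and σ=0.0946. The sample standard deviation (`statistics.stdev`) would give a noticeably larger value, which is why the population form is the more plausible reading.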

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

Negative — Off-topic prompt (Linux kernel scheduling): 50% pass rate, score=0.60±0.00
Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17

Failed Task Details

Negative — Destroying / tearing down an existing deployment

Run 1/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Run 2/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Negative — Off-topic prompt (Linux kernel scheduling)

Run 1/2 (error):

✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_negative (0.20): Prompt correctly treated as non-trigger (score 0.20 < 0.50)

Negative — What-if preview / preflight validation

Run 1/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Run 2/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
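The pass/fail messages for the trigger graders imply a single 0.50 cut-off applied in opposite directions for positive and negative tasks. A minimal sketch of that rule, with hypothetical names (`trigger_check`, `THRESHOLD`) that are not taken from Waza itself:

```python
THRESHOLD = 0.50  # cut-off quoted in the grader messages

def trigger_check(kind: str, score: float) -> bool:
    """Return True when a trigger-relevance grader would pass.

    Positive tasks pass when the prompt looks trigger-aligned (score >= 0.50);
    negative tasks pass when it does not (score < 0.50).
    """
    if kind == "positive":
        return score >= THRESHOLD
    if kind == "negative":
        return score < THRESHOLD
    raise ValueError(f"unknown task kind: {kind!r}")

# Scores copied from the runs above.
print(trigger_check("negative", 0.71))  # destroy / teardown prompt
print(trigger_check("negative", 0.65))  # what-if / preflight prompt
print(trigger_check("negative", 0.20))  # Linux kernel scheduling prompt
print(trigger_check("positive", 0.83))  # local deploy prompt
```

Read this way, the destroy and what-if negatives fail because their prompts score as trigger-aligned with the deploy skill, which points at the skill description rather than at model behavior: the same 0.71 and 0.65 scores appear for both models.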

Positive — Local deploy of an existing deployment artifact

Run 1/2 (error):

❌ answer_quality (0.00): fail: No previous assistant response exists to grade: There is no prior assistant response in this session to evaluate. All four required criteria are missing: (1) no mention of az stack sub create, (2) no --action-on-unmanage deleteAll flag, (3) no reference to .github/skills/azure-stack-deploy/scripts/deploy-stack.sh or deploy-stack.ps1, (4) no mention of state.json (schemaVersion 1.0) capturing stack ID and managed resources.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)

Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-stack-deploy-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 1401:17320D:1C063DF:1F24675:6A3CB2F9)

✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [4/5] Positive — Local deploy of an existing deployment artifact
✓ [5/5] Positive — Re-deploy after template edit
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [3/5] Negative — What-if preview / preflight validation

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.75 | Duration: 1m12.48s

Tests: 5 total, 2 passed, 3 failed, 0 errors
Success Rate: 40.0%
Score Range: 0.60 - 0.86 (σ=0.1167)

Task Results

Task	Score	Status	Graders
Negative — Destroying / tearing down an existing deployment	0.86	❌	budget, trigger_relevance_negative
Negative — Off-topic prompt (Linux kernel scheduling)	0.60	✅	budget, trigger_relevance_negative
Negative — What-if preview / preflight validation	0.82	❌	budget, trigger_relevance_negative
Positive — Local deploy of an existing deployment artifact	0.61	❌	answer_quality, budget, trigger_relevance_positive
Positive — Re-deploy after template edit	0.85	✅	answer_quality, budget, trigger_relevance_positive

Failed Task Details

Negative — Destroying / tearing down an existing deployment

Run 1/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Run 2/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Negative — What-if preview / preflight validation

Run 1/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Run 2/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Positive — Local deploy of an existing deployment artifact

Run 1/2 (error):

❌ answer_quality (0.00): fail: No prior assistant response exists in this session to grade: There is no previous assistant response in the conversation to evaluate. All four required criteria are therefore missing: (1) no mention of az stack sub create; (2) no --action-on-unmanage deleteAll flag referenced; (3) no reference to .github/skills/azure-stack-deploy/scripts/deploy-stack.sh or deploy-stack.ps1; (4) no mention of state.json (schemaVersion 1.0) capturing stack ID and managed resources.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)

Run 2/2 (failed):

❌ answer_quality (0.00): fail: Criterion 4 missing: response mentions state.json but does not specify schemaVersion 1.0 or that it captures stack ID and managed resources. Criteria 1, 2, 3 are met (az stack sub create, --action-on-unmanage deleteAll, and deploy-stack.sh script reference all present).
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)

Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: gpt-5.3-codex

Results saved to: .waza-results/azure-stack-deploy-gpt-5.3-codex.json

🔢 Tokens (count + profile)

📊 azure-stack-deploy: 1,912 tokens (detailed ✓), 13 sections, 5 code blocks

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious, steps are well-ordered with numbered procedures, code blocks are clean and consistent across bash/PowerShell, and output expectations are explicit. The 'What to tell the user after running' section eliminates ambiguity about agent response requirements.
completeness       █████  Covers prerequisites, arguments, failure modes, state schema, fallback behavior, soft-deletable resource classification, and cross-skill references. Edge cases like race conditions, policy blocks, and missing parameters.json are all addressed.
trigger_precision  ████░  USE FOR and DO NOT USE FOR sections are clear and well-separated with explicit anti-cases (destroy, what-if, IaC authoring). Slightly loses a point because the boundary between 'local deploy' and 'CI deploy' could confuse agents — the skill says it matches CI but also implies local-only use.
scope_coverage     █████  Scope is tightly defined: subscription-scoped stack creation only, with explicit out-of-scope redirects to three other named skills. Capabilities (stack vs fallback path, state.json writing, metadata update) and limitations (no template generation, no destroy) are explicit.
anti_patterns      ████░  Avoids vague instructions and conflicting directives well. The fallback behavior is disclosed with a clear trade-off warning. Minor issue: the idempotency claim ('stacks de-duplicate on --name') in the race condition recovery row is stated without caveats about concurrent deployments, which could mislead in multi-agent scenarios.
────────────────────────────────────────────
Overall: 4.6/5.0

High-quality skill definition with exceptional completeness and clarity. The schema example, failure table, soft-delete tracking, and mandatory post-run reply format are standout elements. Minor deductions for a subtle local-vs-CI scope ambiguity and an uncaveated idempotency claim under concurrent conditions. Ready for production use with minimal revision.

✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-deploy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-stack-deploy

📋 Compliance Score: Low
   ❌  Needs significant improvement. Description too short or missing triggers.

   Issues found:
   ❌  SKILL.md is 1912 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
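The `spec-allowed-fields` finding amounts to a set difference between the frontmatter keys and the allowed list quoted above. A small sketch, assuming the frontmatter has already been parsed into a dict; the function name and the sample values are hypothetical:

```python
# Allowed keys as listed in the check output above.
ALLOWED_FIELDS = {"name", "description", "license", "allowed-tools", "metadata", "compatibility"}

def unknown_fields(frontmatter: dict) -> list[str]:
    """Return frontmatter keys that are not in the allowed list, sorted."""
    return sorted(set(frontmatter) - ALLOWED_FIELDS)

# Illustrative frontmatter; the values are placeholders, not the real SKILL.md content.
frontmatter = {
    "name": "azure-stack-deploy",
    "description": "placeholder description",
    "argument-hint": "placeholder hint",
    "user-invocable": True,
}

print(unknown_fields(frontmatter))
```

For this input the result is the same pair the check reports, `argument-hint` and `user-invocable`.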

📎 Links: 0/8 valid
   ⚠️  8 link issue(s) found.
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../../../website/docs/deployment/state.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-security-analyzer/SKILL.md: link escapes skill directory
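The "link escapes skill directory" rule can be approximated by normalizing each relative link against the file that contains it and checking whether the result climbs above the skill root. This is a sketch of the idea, not Waza's actual implementation; `escapes_skill_dir` is a hypothetical helper and paths are treated as relative to the skill directory.

```python
import posixpath

def escapes_skill_dir(link: str, source_file: str = "SKILL.md") -> bool:
    """True when a relative link resolves outside the skill directory.

    source_file is the path of the linking file relative to the skill root.
    """
    base = posixpath.dirname(source_file)
    resolved = posixpath.normpath(posixpath.join(base, link))
    return (
        resolved == ".."
        or resolved.startswith("../")
        or posixpath.isabs(resolved)
    )

# Links copied from the check output above, plus one in-directory example.
for link in [
    "../azure-stack-destroy/SKILL.md",
    "../../../website/docs/deployment/state.md",
    "scripts/deploy-stack.sh",
]:
    print(link, "->", "escapes" if escapes_skill_dir(link) else "ok")
```

Under this rule every cross-skill reference of the form `../other-skill/SKILL.md` is flagged, which matches the 0/8 result: the links may well resolve inside the repository, but they are invalid for a skill that is meant to be self-contained.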

📊 Token Budget: 1912 / 500 tokens
   ❌  Exceeds limit by 1412 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  5 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (1912 tokens, 0 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Description density is optimal for cross-model use
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix 8 link(s) that escape the skill directory
7. Reduce SKILL.md by 1412 tokens. Run 'waza tokens suggest' for optimization tips

Skill: `azure-stack-destroy`

📈 Score (per model) + Suggestions/Recommendations

Model: claude-sonnet-4.6

Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 3501:177763:1D39A50:203F505:6A3CB2FC)

✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✓ [5/5] Positive — Local destroy of a Git-Ape deployment
✗ [4/5] Positive — Clean up the deployment stack

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 1m35.88s

Tests: 5 total, 2 passed, 3 failed, 0 errors
Success Rate: 40.0%
Score Range: 0.60 - 0.96 (σ=0.1399)

Task Results

Task	Score	Status	Graders
Negative — Deploying a new stack (opposite operation)	0.81	❌	budget, trigger_relevance_negative
Negative — Deleting a non-Git-Ape resource group	0.87	❌	budget, trigger_relevance_negative
Negative — Off-topic prompt (Linux kernel scheduling)	0.60	✅	budget, trigger_relevance_negative
Positive — Clean up the deployment stack	0.62	❌	answer_quality, budget, trigger_relevance_positive
Positive — Local destroy of a Git-Ape deployment	0.96	✅	answer_quality, budget, trigger_relevance_positive

Failed Task Details

Negative — Deploying a new stack (opposite operation)

Run 1/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Run 2/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Negative — Deleting a non-Git-Ape resource group

Run 1/2 (error):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Run 2/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Positive — Clean up the deployment stack

Run 1/2 (failed):

❌ answer_quality (0.00): fail: Missing several required PASS criteria: Criterion 1 partial: The response recommends the destroy script, but does NOT explicitly explain that raw az group delete is inadequate because it misses soft-delete cleanup and multi-RG resources. In fact, Option C actively suggests az group delete as a workaround without warning about its limitations.

Criterion 3 missing: The response never explicitly mentions az stack sub delete --action-on-unmanage deleteAll or describes the stack-delete semantics (one delete cleans every resource the stack owns). It only refers to the script wrapper.

Criterion 4 missing: The response does not describe the skill's automatic soft-delete purge sweep (Key Vault, Cognitive Services purged after stack delete), nor does it mention that resources flagged purgeProtected: true in state.json are intentionally retained. Option C mentions manual Key Vault purge only as a fallback when bypassing the skill entirely.

Criterion 2 met: state.json prerequisite is clearly called out.

✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Run 2/2 (failed):

❌ answer_quality (0.00): fail: Missing explicit justification for skill vs raw az group delete, and skill's purge sweep behavior: Criterion 1: Recommended the destroy script first, but did not explicitly explain that raw az group delete misses soft-delete cleanup and multi-RG/subscription-scoped resources — instead offered it as a casual alternative ("option 2"). Criterion 4: Did not mention the skill's automatic soft-delete purge sweep behavior (Key Vault / Cognitive Services purged after stack delete) nor the purgeProtected: true retention semantics; only mentioned manual Key Vault purge as a workaround.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-stack-destroy-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✗ [5/5] Positive — Local destroy of a Git-Ape deployment
✗ [4/5] Positive — Clean up the deployment stack

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.71 | Duration: 1m4.167s

Tests: 5 total, 1 passed, 4 failed, 0 errors
Success Rate: 20.0%
Score Range: 0.60 - 0.87 (σ=0.1093)

Task Results

Task	Score	Status	Graders
Negative — Deploying a new stack (opposite operation)	0.81	❌	budget, trigger_relevance_negative
Negative — Deleting a non-Git-Ape resource group	0.87	❌	budget, trigger_relevance_negative
Negative — Off-topic prompt (Linux kernel scheduling)	0.60	✅	budget, trigger_relevance_negative
Positive — Clean up the deployment stack	0.62	❌	answer_quality, budget, trigger_relevance_positive
Positive — Local destroy of a Git-Ape deployment	0.63	❌	answer_quality, budget, trigger_relevance_positive

Failed Task Details

Negative — Deploying a new stack (opposite operation)

Run 1/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Run 2/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Negative — Deleting a non-Git-Ape resource group

Run 1/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Run 2/2 (failed):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Positive — Clean up the deployment stack

Run 1/2 (failed):

❌ answer_quality (0.00): fail: Missing criteria 1 (did not explicitly explain that raw az group delete is wrong because it misses soft-delete cleanup and multi-RG resources), 3 (did not mention az stack sub delete --action-on-unmanage deleteAll or its semantics), and 4 (did not describe the soft-delete purge sweep for Key Vault/Cognitive Services nor the purgeProtected: true retention behavior). The response only stated that state.json was missing and gave the script invocation without explaining the destroy semantics.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Run 2/2 (failed):

❌ answer_quality (0.00): fail: Missing criteria: (1) Did not explicitly explain that raw az group delete is wrong because it misses soft-delete cleanup and multi-RG resources, and instead even suggested az group delete as a fallback. (3) Did not mention az stack sub delete --action-on-unmanage deleteAll or its semantics. (4) Only briefly mentioned "soft-delete purge" in passing without covering Key Vault / Cognitive Services purge behavior or purgeProtected: true retention. Only criterion (2) (state.json path requirement) was clearly met.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Positive — Local destroy of a Git-Ape deployment

Run 1/2 (failed):

❌ answer_quality (0.00): fail: Missing criteria 3 and 4. The response invoked the destroy script and referenced state.json under .azure/deployments/deploy-20260506-001/, but did not name the stack-delete command az stack sub delete --action-on-unmanage deleteAll or its semantics, and did not address the soft-delete purge sweep / az keyvault purge for reusing the Key Vault name.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)

Run 2/2 (failed):

❌ answer_quality (0.00): fail: Missing criteria 3 and 4: response did not name az stack sub delete --action-on-unmanage deleteAll semantics, and did not explicitly mention az keyvault purge / az keyvault list-deleted or describe the purge sweep mechanics for non-purge-protected vaults. Criteria 1 (invoked destroy-stack.sh) and 2 (referenced state.json under .azure/deployments/deploy-20260506-001/) were met.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)

Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: gpt-5.3-codex

Results saved to: .waza-results/azure-stack-destroy-gpt-5.3-codex.json

🔢 Tokens (count + profile)

📊 azure-stack-destroy: 2,644 tokens (detailed ✓), 14 sections, 7 code blocks

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            ████░  Exceptionally well-structured with tables, code examples, and fast vs sync mode comparison. Minor clarity issue: the 'When to Use' section (after DO NOT USE FOR) nearly duplicates the 'USE FOR' section, creating redundant reading and potential confusion about which block is authoritative.
completeness       █████  Excellent coverage of prerequisites, procedure steps, all argument flags, failure modes with recovery paths, state.json field semantics, and terminal statuses. Soft-delete purge behavior and purge-protection edge cases are explicitly documented — difficult to find gaps.
trigger_precision  ████░  USE FOR and DO NOT USE FOR are precise with concrete user phrases and clear exclusion rationale. However, the duplicate 'When to Use' section after DO NOT USE FOR creates ambiguity about canonical trigger definitions; consolidating them would sharpen routing accuracy.
scope_coverage     █████  Scope is tightly and explicitly bounded: Git-Ape deployments only, requires state.json, full-stack teardown only (no surgical mode). The 'Prefer this over raw az group delete' subsection proactively closes a common mis-use path, and limitations are stated without being vague.
anti_patterns      ████░  No conflicting directives, strong error-handling guidance, and instructions explain 'why' alongside 'what' — all good. The one notable anti-pattern is the duplicated trigger content ('USE FOR' vs 'When to Use'), which adds noise and could lead an agent to inconsistently weigh the two blocks.
────────────────────────────────────────────
Overall: 4.4/5.0

A high-quality, production-ready skill document with thorough failure-mode coverage, excellent scope definition, and strong prerequisite documentation. The primary improvement opportunity is removing the redundant 'When to Use' section that duplicates 'USE FOR', which would tighten trigger precision and eliminate the only meaningful structural anti-pattern.

✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-destroy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-stack-destroy

📋 Compliance Score: Low
   ❌  Needs significant improvement. Description too short or missing triggers.

   Issues found:
   ❌  SKILL.md is 2644 tokens (hard limit 500)

📐 Spec Compliance: 7/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
   ❌  [spec-security] Security risks detected: description contains XML angle brackets
     📎  XML angle brackets and reserved prefixes pose injection and naming conflict risks

📎 Links: 0/4 valid
   ⚠️  4 link issue(s) found.
   ❌  [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-drift-detector/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-resource-visualizer/SKILL.md: link escapes skill directory

📊 Token Budget: 2644 / 500 tokens
   ❌  Exceeds limit by 2144 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  5 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2644 tokens, 0 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix spec violation [spec-security]: Security risks detected: description contains XML angle brackets
7. Fix 4 link(s) that escape the skill directory
8. Reduce SKILL.md by 2144 tokens. Run 'waza tokens suggest' for optimization tips

Skill: `azure-landing-zone-discovery`

📈 Score (per model) + Suggestions/Recommendations

Model: claude-sonnet-4.6

Running benchmark: azure-landing-zone-discovery-eval
Skill: azure-landing-zone-discovery
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] prompt references relative paths but no workspace files were loaded; use inputs.files to copy fixtures into the sandbox

✗ [4/4] Positive — Manual landing-zone context injection
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: F819:13968D:15A1D10:181EFE0:6A3CB2F9)

✗ [2/4] Negative — CAF naming lookup (off-topic)
✓ [3/4] Positive — Discover the landing zone
[ERROR] waiting for session.idle: context deadline exceeded

✗ [1/4] Negative — Plain function-app deployment (off-topic)

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.67 | Duration: 2m0.102s

Tests: 4 total, 1 passed, 3 failed, 0 errors
Success Rate: 25.0%
Score Range: 0.00 - 0.97 (σ=0.3932)

Task Results

Task	Score	Status	Graders
Negative — Plain function-app deployment (off-topic)	0.77	❌	budget, trigger_relevance_negative
Negative — CAF naming lookup (off-topic)	0.93	❌	budget, trigger_relevance_negative
Positive — Discover the landing zone	0.97	✅	answer_quality, budget, trigger_relevance_positive
Positive — Manual landing-zone context injection	0.00	❌	-

Failed Task Details

Negative — Plain function-app deployment (off-topic)

Run 1/1 (error):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.55): Prompt appears trigger-aligned unexpectedly (score 0.55 >= 0.50)

Negative — CAF naming lookup (off-topic)

Run 1/1 (error):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.86): Prompt appears trigger-aligned unexpectedly (score 0.86 >= 0.50)

Positive — Manual landing-zone context injection

Run 1/1 (error):

Benchmark: azure-landing-zone-discovery-eval | Skill: azure-landing-zone-discovery | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-landing-zone-discovery-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: azure-landing-zone-discovery-eval
Skill: azure-landing-zone-discovery
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] prompt references relative paths but no workspace files were loaded; use inputs.files to copy fixtures into the sandbox

✗ [4/4] Positive — Manual landing-zone context injection
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 1819:2A3231:1D2A535:2043E17:6A3CB2F9)

✗ [2/4] Negative — CAF naming lookup (off-topic)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 181B:290FB7:1B242CF:1E4584F:6A3CB2F4)

✗ [3/4] Positive — Discover the landing zone
[ERROR] waiting for session.idle: context deadline exceeded

✗ [1/4] Negative — Plain function-app deployment (off-topic)

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.59 | Duration: 2m0.059s

Tests: 4 total, 0 passed, 4 failed, 0 errors
Success Rate: 0.0%
Score Range: 0.00 - 0.93 (σ=0.3530)

Task Results

Task	Score	Status	Graders
Negative — Plain function-app deployment (off-topic)	0.77	❌	budget, trigger_relevance_negative
Negative — CAF naming lookup (off-topic)	0.93	❌	budget, trigger_relevance_negative
Positive — Discover the landing zone	0.64	❌	answer_quality, budget, trigger_relevance_positive
Positive — Manual landing-zone context injection	0.00	❌	-

Failed Task Details

Negative — Plain function-app deployment (off-topic)

Run 1/1 (error):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.55): Prompt appears trigger-aligned unexpectedly (score 0.55 >= 0.50)

Negative — CAF naming lookup (off-topic)

Run 1/1 (error):

✅ budget (1.00): All behavior checks passed
❌ trigger_relevance_negative (0.86): Prompt appears trigger-aligned unexpectedly (score 0.86 >= 0.50)

Positive — Discover the landing zone

Run 1/1 (error):

❌ answer_quality (0.00): fail: Missing all 4 criteria — no prior response to evaluate: No prior assistant response exists in the session to grade. The assistant has not yet responded to the user's landing zone discovery request, so none of the four PASS criteria are met: (1) no reference to the azure-landing-zone-discovery skill or discover-lz.sh, (2) no mention of .azure/landing-zone-context.json output artifact, (3) no mention of management group hierarchy / subscription classification / hub-spoke / policy / shared services, (4) no acknowledgment of permission limitations or inject-lz.sh fallback.
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.92): Prompt is trigger-aligned (score 0.92 >= 0.50)

Positive — Manual landing-zone context injection

Run 1/1 (error):

Benchmark: azure-landing-zone-discovery-eval | Skill: azure-landing-zone-discovery | Model: gpt-5.3-codex

Results saved to: .waza-results/azure-landing-zone-discovery-gpt-5.3-codex.json

🔢 Tokens (count + profile)

📊 azure-landing-zone-discovery: 7,117 tokens (detailed ✓), 27 sections, 17 code blocks
   ⚠️  token count 7117 exceeds 3000

🎯 Quality (5-dim table)

time=2026-06-25T04:51:58.714Z level=INFO msg="using Copilot CLI" source=embedded path=/home/runner/.cache/copilot-sdk/copilot_1.0.64-0
DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            ████░  The skill is well-structured with clear section headers, tables, and code examples. However, the procedure mixes discovery steps with inline code that references scripts which may not exist yet, creating ambiguity about whether the agent runs scripts or executes the inline commands directly. Step 8 says 'No manual assembly is needed' but the preceding steps show manual assembly logic.
completeness       █████  Exceptionally thorough — covers auto-discovery, manual injection, cross-tenant edge cases, RBAC fallbacks, policy classification, confidence scoring, and downstream integration. The weighted signal table, confidence buckets, edge cases table, and policy effect classification table leave very little to guesswork.
trigger_precision  █████  USE FOR and DO NOT USE FOR triggers are crisp, non-overlapping, and include concrete examples with explicit redirects to alternative skills. The boundary between landing-zone topology (this skill) vs. per-resource actions (other skills) is clearly articulated.
scope_coverage     ████░  Scope is well-defined with explicit capability boundaries and integration points. Minor gap: the skill doesn't clarify what happens if the discovery scripts don't exist in the repo yet (first-run bootstrap scenario), nor does it specify whether the agent should generate those scripts or expect them pre-installed.
anti_patterns      ████░  Avoids most anti-patterns — error handling is explicit, fallbacks are documented, and the manual injection precedence (script > direct edit > questionnaire) is clearly ordered. Minor issue: Step 8 instructs the agent to 'summarize the result back to the user' but doesn't specify a format, leaving output consistency to chance across invocations.
────────────────────────────────────────────
Overall: 4.4/5.0

A high-quality, production-grade skill document. It demonstrates exceptional completeness with its confidence scoring model, weighted signals, edge case tables, and policy effect classification. Trigger precision and anti-pattern avoidance are strong. The main areas for improvement are clarifying whether inline code snippets vs. scripts are the authoritative execution path, and addressing the bootstrap scenario where discovery scripts haven't been installed yet.

✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-landing-zone-discovery/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-landing-zone-discovery

📋 Compliance Score: Medium-High
   ⚠️  Good, but could be improved. Missing routing clarity.

   Issues found:
   ❌  SKILL.md is 7117 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, last_updated, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 2/2 valid
   ✅  All links valid.

📊 Token Budget: 7117 / 500 tokens
   ❌  Exceeds limit by 6617 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (7117 tokens, 0 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ❌  [over-specificity] Over-specificity detected: IP addresses, hardcoded URLs with paths
   ❌  [cross-model-density] Advisory 16: word count is 65 (>60 may reduce cross-model effectiveness); first sentence doesn't lead with action verb (reduces clarity)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found
   ❌  [progressive-disclosure] Advisory 18: progressive disclosure — SKILL.md body is 587 lines (>500 lines reduces scannability; consider moving detail to references/); 1 code block(s) exceed 50 lines (suggest moving to references/)
   ✅  [scope-reduction] Capability scope: 9 signal(s) detected (9 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, last_updated, user-invocable
4. Reduce SKILL.md by 6617 tokens. Run 'waza tokens suggest' for optimization tips

Skill: `azure-policy-advisor`

📈 Score (per model) + Suggestions/Recommendations

Model: claude-sonnet-4.6

Running benchmark: azure-policy-advisor-eval
Skill: azure-policy-advisor
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 4803:1B6C88:1CC97EA:1FD49E0:6A3CB2FC)

✓ [3/5] Negative — Off-topic Linux kernel question
✓ [2/5] Negative — CAF naming lookup (azure-naming-research territory)
✓ [5/5] Positive — Compliance framework audit (CIS)
✗ [1/5] Negative — Pricing / cost estimation (azure-cost-estimator territory)
✓ [4/5] Positive — Post-template-generation policy recommendations

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.78 | Duration: 2m53.766s

Tests: 5 total, 4 passed, 1 failed, 0 errors
Success Rate: 80.0%
Score Range: 0.61 - 0.90 (σ=0.1064)

Task Results

Task	Score	Status	Graders
Negative — Pricing / cost estimation (azure-cost-estimator territory)	0.61	❌	budget, out_of_scope_acknowledgement, trigger_relevance_negative
Negative — CAF naming lookup (azure-naming-research territory)	0.77	✅	budget, out_of_scope_acknowledgement, trigger_relevance_negative
Negative — Off-topic Linux kernel question	0.75	✅	budget, out_of_scope_acknowledgement, trigger_relevance_negative
Positive — Post-template-generation policy recommendations	0.90	✅	answer_quality, budget, trigger_relevance_positive
Positive — Compliance framework audit (CIS)	0.89	✅	answer_quality, budget, trigger_relevance_positive

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

Negative — Pricing / cost estimation (azure-cost-estimator territory): 50% pass rate, score=0.61±0.17

Failed Task Details

Negative — Pricing / cost estimation (azure-cost-estimator territory)

Run 1/2 (error):

✅ budget (1.00): All behavior checks passed
❌ out_of_scope_acknowledgement (0.00): fail: : No previous assistant response exists in the session to evaluate. Cannot verify that the response (1) avoids policy/governance recommendations and (2) either answers the cost question with retail pricing or routes to the cost-estimation skill.
✅ trigger_relevance_negative (0.32): Prompt correctly treated as non-trigger (score 0.32 < 0.50)

Benchmark: azure-policy-advisor-eval | Skill: azure-policy-advisor | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-policy-advisor-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: azure-policy-advisor-eval
Skill: azure-policy-advisor
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: BC42:D4778:174C4A9:19BBF77:6A3CB2F9)

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: BC44:24A621:16A4364:19188B5:6A3CB2F9)

✗ [4/5] Positive — Post-template-generation policy recommendations
✗ [3/5] Negative — Off-topic Linux kernel question
✗ [5/5] Positive — Compliance framework audit (CIS)
✓ [2/5] Negative — CAF naming lookup (azure-naming-research territory)
✓ [1/5] Negative — Pricing / cost estimation (azure-cost-estimator territory)

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.75 | Duration: 2m29.946s

Tests: 5 total, 2 passed, 3 failed, 0 errors
Success Rate: 40.0%
Score Range: 0.72 - 0.77 (σ=0.0201)

Task Results

Task	Score	Status	Graders
Negative — Pricing / cost estimation (azure-cost-estimator territory)	0.77	✅	budget, out_of_scope_acknowledgement, trigger_relevance_negative
Negative — CAF naming lookup (azure-naming-research territory)	0.77	✅	budget, out_of_scope_acknowledgement, trigger_relevance_negative
Negative — Off-topic Linux kernel question	0.75	❌	budget, out_of_scope_acknowledgement, trigger_relevance_negative
Positive — Post-template-generation policy recommendations	0.73	❌	answer_quality, budget, trigger_relevance_positive
Positive — Compliance framework audit (CIS)	0.72	❌	answer_quality, budget, trigger_relevance_positive

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

Negative — Off-topic Linux kernel question: 50% pass rate, score=0.75±0.00
Positive — Post-template-generation policy recommendations: 50% pass rate, score=0.73±0.17
Positive — Compliance framework audit (CIS): 50% pass rate, score=0.72±0.17

Failed Task Details

Negative — Off-topic Linux kernel question

Run 1/2 (error):

✅ budget (1.00): All behavior checks passed
✅ out_of_scope_acknowledgement (1.00): All prompts passed
✅ trigger_relevance_negative (0.24): Prompt correctly treated as non-trigger (score 0.24 < 0.50)

Positive — Post-template-generation policy recommendations

Run 1/2 (error):

❌ answer_quality (0.00): fail: No prior assistant response exists in this session to grade: There is no previous assistant response in the conversation to evaluate. The session contains only the user's original request followed immediately by the grading instructions. With no response present:
Criterion 1 (covers all three resource types): not met — nothing to evaluate
Criterion 2 (≥2 recognizable built-in Azure Policy checks by name): not met — nothing to evaluate
Criterion 3 (both template-level fixes AND subscription-level policy assignments tracks): not met — nothing to evaluate
Criterion 4 (Microsoft Learn link, built-in policy GUID, or named policies with a verification caveat): not met — nothing to evaluate

All four PASS criteria are unmet because no response content exists.

✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.69): Prompt is trigger-aligned (score 0.69 >= 0.50)

Positive — Compliance framework audit (CIS)

Run 1/2 (failed):

❌ answer_quality (0.00): fail: Missing verification note for initiative ID: Criterion 1 not fully met: the response recommends assigning the CIS Azure Foundations regulatory compliance initiative, but does not note that the current initiative ID / display name should be verified from Microsoft Learn or via az policy set-definition list before assignment. Criteria 2, 3, and 4 are satisfied (initiative-vs-individual trade-off discussed; audit-first with staged promotion to deny recommended; storage, SQL, VM, and monitoring/diagnostics all covered with named controls).
✅ budget (1.00): All behavior checks passed
✅ trigger_relevance_positive (0.67): Prompt is trigger-aligned (score 0.67 >= 0.50)

Benchmark: azure-policy-advisor-eval | Skill: azure-policy-advisor | Model: gpt-5.3-codex

Results saved to: .waza-results/azure-policy-advisor-gpt-5.3-codex.json

🔢 Tokens (count + profile)

📊 azure-policy-advisor: 5,118 tokens (detailed ✓), 29 sections, 14 code blocks
   ⚠️  token count 5118 exceeds 3000

🎯 Quality (5-dim table)

time=2026-06-25T04:52:55.201Z level=INFO msg="using Copilot CLI" source=embedded path=/home/runner/.cache/copilot-sdk/copilot_1.0.64-0
DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            ████░  Instructions are well-structured with clear step-by-step progression, code blocks, and tables. However, the skill is dense — critical steps like 'read this reference file before proceeding' appear mid-paragraph and could be missed. Consider using callout boxes or bold warnings for mandatory reads.
completeness       █████  Exceptional coverage: handles az unavailability, LZ context, confidence gating, management group scope, deduplication logic, output artifacts with schema references, troubleshooting table, and integration points. Edge cases (government clouds, initiative member expansion gap) are explicitly documented.
trigger_precision  █████  USE FOR and DO NOT USE FOR sections are precise, non-overlapping, and redirect to specific alternative skills by name. The boundary between this skill and azure-security-analyzer (policy enforcement vs resource config) is clearly articulated and would route correctly.
scope_coverage     █████  Scope is explicitly bounded: assesses ARM templates and subscription policy state, does NOT enumerate live deployed resources. Capabilities (two-part report, CLI+ARM implementation options, JSON sidecar) and limitations (initiative member expansion not yet supported) are both documented.
anti_patterns      ████░  Avoids most anti-patterns well: error handling is explicit, references are externalized rather than inlined, and the advisory-only policy gate prevents over-prescription. Minor issue: references to external files (references/classification-rules.yaml, references/policy-assessment-template.md) create hidden dependencies — if those files are missing, the agent has no fallback guidance.
────────────────────────────────────────────
Overall: 4.6/5.0

A high-quality, production-grade skill with exceptional completeness, precise routing triggers, and explicit scope boundaries. The main improvement areas are reducing cognitive load by surfacing mandatory external file reads more prominently, and adding fallback behavior when referenced files (classification-rules.yaml, policy-assessment-template.md) are unavailable.

✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-policy-advisor/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-policy-advisor

📋 Compliance Score: High
   ✅  Excellent! Your skill meets all compliance requirements.

   Issues found:
   ❌  SKILL.md is 5118 tokens (hard limit 500)

📐 Spec Compliance: 9/9 checks passed
   ✅  Meets agentskills.io specification.

🔌 MCP Integration: 4/4
   ✅  All MCP integration checks passed.

📎 Links: 5/5 valid
   ✅  All links valid.

📊 Token Budget: 5118 / 500 tokens
   ❌  Exceeds limit by 4618 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  5 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 3 reference modules (2-3 is optimal)
   ❌  [complexity] Complexity: comprehensive (5118 tokens, 3 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ❌  [cross-model-density] Advisory 16: word count is 113 (>60 may reduce cross-model effectiveness); first sentence doesn't lead with action verb (reduces clarity)
   ✅  [body-structure] Advisory 17: body structure quality
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 12 signal(s) detected (12 level-2 heading(s), 4 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Reduce SKILL.md by 4618 tokens. Run 'waza tokens suggest' for optimization tips

- add a Do NOT use for block and tighten the description for routing
- replace the Step 8 cat stub with assembly summary and report-back guidance
- make manual injection prescriptive about inject-lz flags

🧭 - Generated by Copilot

The LZ feature added a Landing Zone Context section to the mirror .github/copilot-instructions.md but not its canonical onboarding template, tripping the template mirror sync CI gate. Port the section into the template so the two stay byte-identical. 🧭 - Generated by Copilot

…-zone-discovery # Conflicts: # .github/evals/manifest.yaml

Resolve conflicts in git-ape-onboarding SKILL.md and docs: keep main's drift-detector Step 10 and Compliance Step 11, renumber Landing Zone Discovery to Step 12, and retain the Enterprise Distribution mode section. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

QA against a live canonical Azure Landing Zone tenant showed discover-lz scored it low/30/isLandingZone=false. Four scorer/discovery bugs caused the miss; a fifth broke the documented manual-injection fallback.

discover-lz (.sh + .ps1):
- Classify Corp/Online MGs via substring fallback. The ALZ accelerator prefixes MG names (e.g. "alz-corp") so the exact-name checks never fire; the fallback had tests for every archetype except corp/online.
- Query policy assignments at every discovered management-group scope, not just the subscription. Canonical ALZ policies live at MG scope (alz, alz-platform) and were invisible. Uses default atScope() per MG since --disable-scope-strict-match errors at MG scope.
- Refresh the canonical policy-name pattern (Deploy-AzActivity-Log -> Deploy-AzActivityLog, add Deploy-VM-Monitoring, Deploy-VMSS-Monitoring, Deny-Classic-Resources, ...) and match against BOTH .name and .displayName (the canonical token lives in .name; displayName is a long human description).

inject-lz (.sh + .ps1):
- Emit a landingZoneDetection block (source=manual, confidence high by default) so the manual fallback actually flips LZ-aware behaviour. Adds --confidence/-Confidence and --not-landing-zone/-NotLandingZone. On --merge an explicit confidence overrides; otherwise injection only raises.

Live re-run after fixes: medium/55/isLandingZone=true (signals: alz-top-level-mgs 30, alz-lz-archetypes 10, alz-canonical-policies 15).

Docs (SKILL.md + generated mirror) document the new flags. shellcheck (severity=warning) and PSScriptAnalyzer (Error+Warning) pass; bash -n and the PowerShell parser pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
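For readers who want to see the shape of the Corp/Online substring fallback described in this commit, here is a minimal sketch. It is not the code from discover-lz.sh: the function name `classify_mg`, the lowercasing step, and the `unknown` result are assumptions made for illustration, and only the corp/online archetypes are covered.

```bash
# Hypothetical sketch: try the exact archetype names first, then fall back
# to a substring match for accelerator-prefixed names such as "alz-corp".
classify_mg() {
  local name
  # Lowercasing is an assumption made here so the match is case-insensitive.
  name="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$name" in
    corp|online) echo "$name" ;;    # exact-name check
    *corp*)      echo "corp" ;;     # substring fallback
    *online*)    echo "online" ;;   # substring fallback
    *)           echo "unknown" ;;  # other archetypes omitted in this sketch
  esac
}

classify_mg "corp"
classify_mg "alz-corp"
classify_mg "alz-online"
classify_mg "alz-sandbox"
```

With only an exact-name check, "alz-corp" would fall through to the last branch, which is the miss this commit describes.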

sendtoshailesh

Review: strong skill, one real bug + parity gaps before merge

Really nice addition: the dual-shell design is well thought out, and I verified a lot of it end-to-end. Approval is one fix (plus two small parity fixes) away.

✅ What I verified

Confidence scorer parity holds (the headline risk): signal weights 30/20/10/10/5/5/15 and thresholds (≥70 high / ≥40 medium / ≥10 low / else none) are identical between discover-lz.sh:752-774 and the .ps1 port, and match SKILL.md.
inject-lz.sh works end-to-end: create + --merge produce valid, well-structured JSON; resource-ID parsing extracts name/subscription correctly; merge preserves existing data (I confirmed policies.alzCanonicalAssignments survives a merge).
bash scripts are bash -n + shellcheck --severity=warning clean; set -euo pipefail; all az calls are read-only; no command-injection or secret-leak patterns.
Agent wiring uses the canonical landing-zone-context.json path and azure-landing-zone-discovery skill name consistently across all 5 agents + copilot-instructions.md.
Generated doc mirrors are in sync (only pre-existing, unrelated website/docs/workflows/* drift appears on regen); README / intro / sidebars / overview are clean additive entries.
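To make the threshold check concrete, the bucket mapping quoted above (≥70 high / ≥40 medium / ≥10 low / else none) can be sketched as follows. The function name `confidence_bucket` is hypothetical; the real scorer lives in discover-lz.sh and its .ps1 port and may be structured differently.

```bash
# Hypothetical helper (not the PR's code): map a summed signal score to a
# confidence bucket using the thresholds quoted in the review.
confidence_bucket() {
  local score="$1"
  if   [ "$score" -ge 70 ]; then echo "high"
  elif [ "$score" -ge 40 ]; then echo "medium"
  elif [ "$score" -ge 10 ]; then echo "low"
  else                           echo "none"
  fi
}

# Scores reported elsewhere in this PR: 30 before the scorer fixes, and
# 30 + 10 + 15 = 55 after them.
confidence_bucket 30
confidence_bucket $((30 + 10 + 15))
```

Under these thresholds the two calls print `low` and `medium`, which matches the low/30 and medium/55 results reported for the live tenant.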

🔴 Blocking — `discover-lz.sh` crashes with `--skip-network`

HAS_GRAPH is only assigned inside the network block (discover-lz.sh:491). When --skip-network is passed, the else branch (discover-lz.sh:593-595) never sets it, and the shared-services block then references $HAS_GRAPH under set -u at discover-lz.sh:610 → HAS_GRAPH: unbound variable, script aborts. The .ps1 port initializes $HasGraph = $false up front, so it does not crash — this is both a correctness bug and a cross-shell divergence, on a path that the tests/fixtures/landing-zone/skipped-network.json fixture implies is supported.

Fix: initialize HAS_GRAPH=false alongside the other defaults (near discover-lz.sh:480, before the if [[ "$SKIP_NETWORK" != "true" ]] block).
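A reduced sketch of the suggested fix, assuming the control flow described above. The function `discover`, its argument, and the echoed messages are stand-ins rather than the script's actual code, and only `set -u` is used here instead of the script's full `set -euo pipefail`.

```bash
# Hypothetical reduction of the control flow described above. The function
# body runs in a subshell so `set -u` does not leak into the caller.
discover() (
  set -u
  skip_network="$1"

  HAS_GRAPH=false   # the fix: default assigned before the network block

  if [ "$skip_network" != "true" ]; then
    # Stand-in for network discovery, which is where the flag was first set.
    HAS_GRAPH=true
  fi

  # Shared-services block: reads HAS_GRAPH on both paths.
  if [ "$HAS_GRAPH" = "true" ]; then
    echo "shared-services: graph path"
  else
    echo "shared-services: non-graph path"
  fi
)

discover true    # simulates --skip-network
discover false   # network block runs
```

Without the default assignment, the `discover true` call would stop at the first read of `$HAS_GRAPH` with an unbound-variable error, which is the crash reported here.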

🟡 Parity gaps — these undercut the "byte-compatible JSON" claim

Since byte-level cross-shell parity is the headline feature, these are worth fixing:

1. Injected evidence string differs between ports. inject-lz.sh:343 emits ...asserted via inject-lz.sh (--confidence X) while inject-lz.ps1:273 emits ...asserted via inject-lz.ps1 (-Confidence X). The resulting landing-zone-context.json is therefore not byte-identical across shells. Use shell-neutral evidence text.
2. List parsing diverges. Bash preserves spaces/empty entries (inject-lz.sh:298,303) while PowerShell trims and drops empties (inject-lz.ps1:238,244), so e.g. --allowed-locations "eastus, westus2" yields different arrays. Normalize (trim + drop-empty) in both ports.
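One way the trim + drop-empty normalization could look on the bash side is sketched below. The helper name `normalize_list` is hypothetical and this is not the code in inject-lz.sh; it only illustrates the behaviour the PowerShell port is described as having.

```bash
# Hypothetical helper: split a comma-separated list into one entry per line,
# trimming surrounding whitespace and dropping empty entries.
normalize_list() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

normalize_list "eastus, westus2"
normalize_list " eastus ,, westus2 ,"
```

Both calls print `eastus` and `westus2` on separate lines, so the spacing and stray commas in the input no longer change the resulting array.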

ℹ️ Non-blocking nits (optional)

Discover array ordering can diverge: bash unique/unique_by (jq, sorted) vs PS Select-Object -Unique (first-seen).
--output-format validation differs: bash accepts anything and falls back to JSON (discover-lz.sh:60,910); PS restricts to json|markdown.
--not-landing-zone + --confidence precedence is parse-order dependent in bash (inject-lz.sh:118-126) vs always-forces-none in PS (inject-lz.ps1:94).
Resource-ID subscription parse is case-sensitive in bash (inject-lz.sh:190) but case-insensitive in PS (inject-lz.ps1:125).
There's no cross-shell parity smoke CI job for this skill (the onboarding scaffolders have one). Adding a "discover/inject .sh vs .ps1 produce identical JSON" check to CI would prevent silent parity regressions, and would have caught items 1–2 above.
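The core of such a parity check can be sketched as a byte-level comparison of the two ports' outputs. `assert_byte_identical` and the file names below are hypothetical, and the demo uses stand-in files rather than real script output; a real job would first write one file with the .sh port and one with the .ps1 port from the same inputs:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only when the two files are byte-identical.
assert_byte_identical() {
  local bash_out="$1" pwsh_out="$2"
  if cmp -s "$bash_out" "$pwsh_out"; then
    echo "parity: ok"
  else
    echo "parity: MISMATCH between $bash_out and $pwsh_out" >&2
    return 1
  fi
}

# Demo with stand-in files (placeholder content, not the real schema).
demo_dir="$(mktemp -d)"
printf '{"example":true}\n' > "$demo_dir/from-bash.json"
printf '{"example":true}\n' > "$demo_dir/from-pwsh.json"
assert_byte_identical "$demo_dir/from-bash.json" "$demo_dir/from-pwsh.json"
rm -rf "$demo_dir"
```

Comparing bytes rather than parsed JSON is deliberate here: it matches the "byte-compatible" claim and would flag ordering and whitespace drift as well as value differences.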

Happy to re-review quickly once the --skip-network crash is fixed and the two parity items are addressed. Great work overall. 🐒

arnaudlh added 3 commits June 15, 2026 10:57

Merge remote-tracking branch 'upstream/main' into copilot/add-landing-zone-discovery (d8d0068)

Conflicts: .github/evals/manifest.yaml

arnaudlh requested a review from sendtoshailesh June 15, 2026 03:27

arnaudlh added this to the v0.4.0 milestone Jun 16, 2026

github-actions Bot mentioned this pull request Jun 25, 2026

[repo-status] 🐒 Git-Ape Daily Status — June 25, 2026 #210

Closed

arnaudlh and others added 2 commits June 25, 2026 12:16

sendtoshailesh requested changes Jun 25, 2026


github-actions Bot commented Jun 15, 2026 (edited)

⚠️ Documentation Staleness Warning

github-actions Bot commented Jun 15, 2026 (edited)

🤖 Waza agent evals (advisory)

github-actions Bot commented Jun 15, 2026 (edited)

🧪 Waza skill evals (advisory)

Skill: `prereq-check`

Failed Task Details (across runs)

Positive — "command not found" failure
Negative — Editing an ARM template

Skill: `git-ape-onboarding`

Failed Task Details (across runs)

Positive — Multi-environment onboarding
Negative — Storage service comparison (off-topic)

Skill: `azure-stack-deploy`

Failed Task Details (across runs)

Negative — Destroying / tearing down an existing deployment
Negative — Off-topic prompt (Linux kernel scheduling)
Negative — What-if preview / preflight validation
Positive — Local deploy of an existing deployment artifact

Skill: `azure-stack-destroy`

Failed Task Details (across runs)

Negative — Deploying a new stack (opposite operation)
Negative — Deleting a non-Git-Ape resource group
Positive — Clean up the deployment stack

Skill: `azure-landing-zone-discovery`

Skill: `azure-policy-advisor`
