feat(skills): add landing zone discovery skill with cross-shell scripts#177
feat(skills): add landing zone discovery skill with cross-shell scripts#177arnaudlh wants to merge 6 commits into
Conversation
- add azure-landing-zone-discovery skill, evals, fixtures, and docs - ship discover-lz/inject-lz in both bash and PowerShell parity ports - document dual-shell helper-script convention in authoring docs - wire landing-zone context into agents and copilot-instructions 🧭 - Generated by Copilot
|
🤖 Waza agent evals (advisory)
Ran 0 agent evals against
📊 Agent file token comparison vs
|
🧪 Waza skill evals (advisory)
Ran 16 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg
📊 Token comparison vs
|
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
Failed Task Details
Positive — "command not found" failure
Run 1/3 (error):
- ❌ answer_quality (0.00): fail: No previous assistant response exists to grade: There is no prior assistant response in this session to evaluate. The conversation only contains the user's question and the grading instruction — none of the four PASS criteria (naming az/gh/jq/git, install command, version verification, verdict/next step) can be satisfied because no response was produced.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6
Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
Model: claude-sonnet-4.6
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
[ERROR] waiting for session.idle: context deadline exceeded
✓ [4/4] Positive — "What do I need to install?"
✗ [3/4] Positive — "command not found" failure
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.77 | Duration: 1m54.666s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.57 - 1.00 (σ=0.1839)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
Failed Task Details
Positive — "command not found" failure
Run 1/3 (error):
- ❌ answer_quality (0.00): fail: No previous assistant response exists to grade: There is no prior assistant response in this session to evaluate. None of the four PASS criteria can be met: (1) no tools named, (2) no install command provided, (3) no version verification recommended, (4) no verdict/next step given.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6
Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
[ERROR] waiting for session.idle: context deadline exceeded
✓ [2/4] Negative — Azure service concept question
✗ [1/4] Negative — Editing an ARM template
✓ [3/4] Positive — "command not found" failure
✓ [4/4] Positive — "What do I need to install?"
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.79 | Duration: 1m30.15s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.57 - 1.00 (σ=0.2074)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ❌ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Editing an ARM template: 67% pass rate, score=0.57±0.00
Failed Task Details
Negative — Editing an ARM template
Run 1/3 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_negative (0.14): Prompt correctly treated as non-trigger (score 0.14 < 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.3-codex
Results saved to: .waza-results/prereq-check-gpt-5.3-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 8C11:3E4A2D:1CB8715:1FC6DDC:6A3CB312)
✓ [2/4] Negative — Azure service concept question
✗ [1/4] Negative — Editing an ARM template
✓ [4/4] Positive — "What do I need to install?"
[ERROR] session error: Execution failed: CAPIError: 422 422 Unprocessable Entity
(Request ID: 8C12:3E4A2D:1CC150E:1FD0981:6A3CB32F)
✗ [3/4] Positive — "command not found" failure
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✗ [3/4] Positive — "command not found" failure
✓ [2/4] Negative — Azure service concept question
✗ [4/4] Positive — "What do I need to install?"
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 50.0% (2/4 tasks passed)
Without Skills: 50.0% (2/4 tasks passed)
Impact: no change
Per-Task Breakdown:
• Negative — Editing an ARM template [REGRESSED] 100% → 67% (-33pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [IMPROVED] 0% → 67% (+67pp)
• Positive — "What do I need to install?" [IMPROVED] 67% → 100% (+33pp)
Verdict: Skills have NEUTRAL IMPACT (no net change)
════════════════════════════════════════════════════════════════
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.77 | Duration: 1m15.231s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.57 - 1.00 (σ=0.1839)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ❌ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Editing an ARM template: 67% pass rate, score=0.57±0.00
- Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
Failed Task Details
Negative — Editing an ARM template
Run 2/3 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_negative (0.14): Prompt correctly treated as non-trigger (score 0.14 < 0.50)
Positive — "command not found" failure
Run 3/3 (error):
- ❌ answer_quality (0.00): fail: Previous response did not deliver a final answer: The previous assistant turn only ran tool calls (platform detection, check-tools.sh, viewing install-commands.md) with a one-line progress note ("Checking the local prerequisite status and the Linux install guidance now."). It never produced the user-facing summary, so none of the four PASS criteria are met: (1) the tools az/gh/jq/git are not enumerated in a response to the user, (2) no install command for az is presented, (3) no version-verification step is recommended, and (4) no verdict / next step is emitted. The information was gathered but not delivered.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4
Results saved to: .waza-results/prereq-check-gpt-5.4.json
🔢 Tokens (count + profile)
📊 prereq-check: 2,140 tokens (detailed ✓), 10 sections, 2 code blocks
🎯 Quality (5-dim table)
time=2026-06-25T04:49:49.419Z level=INFO msg="using Copilot CLI" source=embedded path=/home/runner/.cache/copilot-sdk/copilot_1.0.64-0
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Instructions are exceptionally clear with well-structured tables, numbered steps, explicit status mappings, and platform-specific code blocks. The purpose is immediately obvious from the frontmatter description alone.
completeness █████ Covers tool checks, version thresholds, auth sessions, platform detection, error handling for 8 distinct failure modes, and even edge cases like permission-denied scripts and execution policy restrictions. Nothing obvious is missing.
trigger_precision ████░ USE FOR triggers are rich with concrete error string patterns (e.g., 'az: command not found'), which is excellent for routing. DO NOT USE FOR is definitive but terse — adding one or two anti-example scenarios (e.g., 'do not use to validate ARM templates') would reduce ambiguity at the margin.
scope_coverage █████ Scope is tightly and explicitly bounded: read-only, 4 specific tools, 2 auth sessions, clear handoff to related skills. The 'Never' constraints list and the explicit 'Side effects: Read-only' quick-reference entry leave no ambiguity about boundaries.
anti_patterns ████░ Avoids nearly all anti-patterns: no vague verbs, no conflicting directives, solid error handling table. Minor gap: the step table references external scripts (check-tools.sh, install-commands.md) without a fallback if those files are absent, which could leave the agent stuck in a fresh-clone scenario.
────────────────────────────────────────────
Overall: 4.6/5.0
A high-quality, production-ready skill definition. It is unusually thorough in error handling, platform coverage, and boundary-setting. The only meaningful improvement would be adding a graceful fallback for missing reference scripts and slightly expanding the DO NOT USE FOR section with concrete counter-examples.
✅ Check (compliance summary) (56 lines — click to expand)
ℹ️
waza checkexpectseval.yamlcolocated withSKILL.md. This repo separates them into.github/evals/prereq-check/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: prereq-check
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 2140 tokens (hard limit 500)
📐 Spec Compliance: 9/9 checks passed
✅ Meets agentskills.io specification.
📎 Links: 4/4 valid
✅ All links valid.
📊 Token Budget: 2140 / 500 tokens
❌ Exceeds limit by 1640 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 1 reference module(s)
❌ [complexity] Complexity: comprehensive (2140 tokens, 1 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
❌ [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
❌ [body-structure] Advisory 17: body structure quality — no examples section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Reduce SKILL.md by 1640 tokens. Run 'waza tokens suggest' for optimization tips
Skill: git-ape-onboarding
📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✓ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.79 | Duration: 38.278s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.56 - 1.00 (σ=0.2022)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq check and auth gate: Criterion 1 FAIL: No prerequisite check was performed or presented. The assistant did not inspect the local environment (no
az --version,gh --version,az account show,gh auth status, or equivalent tool/auth status table). It jumped straight into prescriptive command snippets.
Criterion 2 FAIL: No auth/prereq gate was surfaced. Since no inspection occurred, the assistant could not have surfaced a blocking auth state (e.g., "Azure CLI not authenticated, run az login"). The response assumes auth is ready without verifying.
Criterion 3 PASS: The assistant requested 4 inputs (GitHub repo URL, staging subscription ID, existing App Registration client ID, RBAC role) — meets the ≥3 threshold.
Criterion 4 PASS: Multi-environment awareness is demonstrated — explicitly creates a new federated credential scoped to repo:<org>/<repo>:environment:azure-deploy-staging, names the new GitHub environment azure-deploy-staging, and sets a per-environment AZURE_SUBSCRIPTION_ID variable scoped to that environment.
Overall: Response acted as a "how-to guide + input request" but skipped the gated prereq-inspection step the skill requires before any state-changing flow.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.87): Prompt is trigger-aligned (score 0.87 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-opus-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-opus-4.6.json
Model: claude-sonnet-4.6
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
✓ [1/4] Negative — Storage service comparison (off-topic)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 880B:564D3:1C47375:1F56037:6A3CB2F6)
✗ [3/4] Positive — Multi-environment onboarding
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.79 | Duration: 35.456s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.56 - 1.00 (σ=0.2022)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — Multi-environment onboarding
Run 1/1 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response to grade: There is no visible previous assistant response in this session to evaluate. As a result, none of the four required criteria can be satisfied:
- No prereq check results presented.
- No auth/prereq gate surfaced.
- No input questions asked (need ≥3 of: target repo, staging subscription ID, RBAC role, App Registration reuse decision, env name confirmation, onboarding mode).
- No multi-environment awareness demonstrated (no mention of separate federated credential,
azure-deploy-stagingenv name, SP reuse vs new, or per-env RBAC scoping).
All four criteria are missing.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.87): Prompt is trigger-aligned (score 0.87 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-sonnet-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: CC10:D4778:174BE24:19BB8A4:6A3CB2F3)
✗ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [2/4] Positive — First-time repo setup
✗ [3/4] Positive — Multi-environment onboarding
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.79 | Duration: 46.099s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.56 - 1.00 (σ=0.2022)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ❌ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Storage service comparison (off-topic)
Run 1/1 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_negative (0.11): Prompt correctly treated as non-trigger (score 0.11 < 0.50)
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing multi-environment awareness (criterion 4): Criteria 1, 2, 3 met: prereq table presented, Azure auth gate explicitly surfaced as blocking, and 5 numbered inputs requested. However criterion 4 (multi-environment awareness) is not satisfied — the response does not (a) mention creating a separate federated-credential entry for staging, (b) name the new
azure-deploy-stagingenvironment, (c) ask about reusing the existing App Registration / SP vs creating a new one for staging isolation, or (d) discuss per-environment secret/RBAC scoping. It treats this like a generic onboarding rather than an additive staging-env onboarding on an already-onboarded repo. - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.87): Prompt is trigger-aligned (score 0.87 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.3-codex
Results saved to: .waza-results/git-ape-onboarding-gpt-5.3-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [2/4] Positive — First-time repo setup
✓ [3/4] Positive — Multi-environment onboarding
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
[ERROR] waiting for session.idle: context deadline exceeded
✗ [2/4] Positive — First-time repo setup
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 100.0% (4/4 tasks passed)
Without Skills: 50.0% (2/4 tasks passed)
Impact: +50.0 percentage points
Per-Task Breakdown:
• Negative — Storage service comparison (off-topic) [NEUTRAL] 100% → 100% (+0pp)
• Positive — First-time repo setup [IMPROVED] 0% → 100% (+100pp)
• Positive — Multi-environment onboarding [IMPROVED] 0% → 100% (+100pp)
• Positive — Scaffold honors skip-with-notice on collision [NEUTRAL] 100% → 100% (+0pp)
Verdict: Skills have POSITIVE IMPACT (improved 2/4 tasks)
════════════════════════════════════════════════════════════════
🧪 Waza Eval Results
Status: ✅ Passed | Score: 0.87 | Duration: 36.179s
- Tests: 4 total, 4 passed, 0 failed, 0 errors
- Success Rate: 100.0%
- Score Range: 0.56 - 1.00 (σ=0.1839)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.96 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.4
Results saved to: .waza-results/git-ape-onboarding-gpt-5.4.json
JUnit XML saved to: .waza-results/git-ape-onboarding-gpt-5.4.junit.xml
🔢 Tokens (count + profile)
📊 git-ape-onboarding: 6,667 tokens (detailed ✓), 30 sections, 26 code blocks
⚠️ token count 6667 exceeds 3000
🎯 Quality (5-dim table)
time=2026-06-25T04:50:00.872Z level=INFO msg="using Copilot CLI" source=embedded path=/home/runner/.cache/copilot-sdk/copilot_1.0.64-0
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Instructions are exceptionally well-ordered with numbered steps, canonical command examples, and explicit invariants. The two-mode distinction (CI/CD vs enterprise distribution) is clearly separated, and the 'First-turn rule' for agent behavior eliminates ambiguity about when to act vs. gather inputs.
completeness █████ Covers prereqs, auth, multi-env scenarios, OIDC subject format variations, idempotency on re-run, optional drift detection, compliance preferences, landing zone discovery, and enterprise distribution. Edge cases like disabled subscriptions, org OIDC overrides, and collision handling during scaffolding are all explicitly addressed.
trigger_precision ████░ USE FOR and DO NOT USE FOR triggers in the description and 'When to Use' section are well-defined and non-overlapping. Minor gap: 'rotating or updating an existing secret or federated credential' is mentioned as out-of-scope in prose but not in the frontmatter description trigger list, creating slight inconsistency between the two locations.
scope_coverage █████ Scope boundaries are explicit throughout — the enterprise mode clearly states it configures tooling only (not Azure access), UI-only steps are flagged as hand-offs, and the drift detector step is marked optional with a clear dependency explanation. Neither over-broad nor too narrow.
anti_patterns ████░ Avoids nearly all common anti-patterns: no vague instructions, no conflicting directives, good error handling (OIDC mismatch fix, disabled subscription check, partial-failure recovery). Minor issue: the 'Suggested Agent Flow' section partially duplicates the 'Command Playbook' numbered steps, which could cause an agent to second-guess which is authoritative — a brief cross-reference note would resolve this.
────────────────────────────────────────────
Overall: 4.6/5.0
An exceptionally well-crafted skill document. It is production-ready with comprehensive edge-case coverage, strong safety rails (safe-execution rules, invariants, explicit hand-offs for UI-only steps), and clear separation between two distinct modes. The two minor deductions are for slight trigger-list inconsistency between the frontmatter and prose, and minor duplication between the Command Playbook and Suggested Agent Flow that could confuse an agent about the single source of truth for step ordering.
✅ Check (compliance summary) (62 lines — click to expand)
ℹ️
waza checkexpectseval.yamlcolocated withSKILL.md. This repo separates them into.github/evals/git-ape-onboarding/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: git-ape-onboarding
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 6667 tokens (hard limit 500)
📐 Spec Compliance: 9/9 checks passed
✅ Meets agentskills.io specification.
📎 Links: 11/15 valid
⚠️ 4 link issue(s) found.
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-deploy/SKILL.md: target does not exist
❌ [templates/copilot-instructions.md] → website/docs/deployment/state.md: target does not exist
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-destroy/SKILL.md: target does not exist
⚠️ [templates/github-private/README.md] → agents/: target is a directory, not a file
📊 Token Budget: 6667 / 500 tokens
❌ Exceeds limit by 6167 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (6667 tokens, 0 modules)
❌ [negative-delta-risk] Negative delta risk patterns detected: excessive constraints (19 constraint keywords found)
✅ [procedural-content] Description contains procedural language
❌ [over-specificity] Over-specificity detected: absolute Windows paths
❌ [cross-model-density] Advisory 16: word count is 79 (>60 may reduce cross-model effectiveness); first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found
❌ [progressive-disclosure] Advisory 18: progressive disclosure — SKILL.md body is 525 lines (>500 lines reduces scannability; consider moving detail to references/)
✅ [scope-reduction] Capability scope: 12 signal(s) detected (12 level-2 heading(s), 9 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix 3 broken link(s) — targets do not exist
4. Fix 1 link(s) pointing to directories instead of files
5. Reduce SKILL.md by 6167 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-deploy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: D80B:68C07:1DFC1D2:2100AA2:6A3CB2F7)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: D80B:68C07:1DFC216:2100AE3:6A3CB2FA)
✗ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [3/5] Negative — What-if preview / preflight validation
✓ [5/5] Positive — Re-deploy after template edit
✗ [4/5] Positive — Local deploy of an existing deployment artifact
✗ [1/5] Negative — Destroying / tearing down an existing deployment
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.78 | Duration: 1m26.656s
- Tests: 5 total, 1 passed, 4 failed, 0 errors
- Success Rate: 20.0%
- Score Range: 0.60 - 0.86 (σ=0.0946)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ❌ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Off-topic prompt (Linux kernel scheduling): 50% pass rate, score=0.60±0.00
- Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Negative — Off-topic prompt (Linux kernel scheduling)
Run 1/2 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_negative (0.20): Prompt correctly treated as non-trigger (score 0.20 < 0.50)
Negative — What-if preview / preflight validation
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Positive — Local deploy of an existing deployment artifact
Run 1/2 (error):
- ❌ answer_quality (0.00): fail: No previous assistant response exists to grade: There is no prior assistant response in this session to evaluate. All four required criteria are missing: (1) no mention of
az stack sub create, (2) no--action-on-unmanage deleteAllflag, (3) no reference to.github/skills/azure-stack-deploy/scripts/deploy-stack.shordeploy-stack.ps1, (4) no mention ofstate.json(schemaVersion 1.0) capturing stack ID and managed resources. - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-deploy-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 1401:17320D:1C063DF:1F24675:6A3CB2F9)
✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [4/5] Positive — Local deploy of an existing deployment artifact
✓ [5/5] Positive — Re-deploy after template edit
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [3/5] Negative — What-if preview / preflight validation
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.75 | Duration: 1m12.48s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.60 - 0.86 (σ=0.1167)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.61 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Negative — What-if preview / preflight validation
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Positive — Local deploy of an existing deployment artifact
Run 1/2 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response exists in this session to grade: There is no previous assistant response in the conversation to evaluate. All four required criteria are therefore missing: (1) no mention of
az stack sub create; (2) no--action-on-unmanage deleteAllflag referenced; (3) no reference to.github/skills/azure-stack-deploy/scripts/deploy-stack.shordeploy-stack.ps1; (4) no mention ofstate.json(schemaVersion 1.0) capturing stack ID and managed resources. - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: : Criterion 4 missing: response mentions
state.jsonbut does not specify schemaVersion 1.0 or that it captures stack ID and managed resources. Criteria 1, 2, 3 are met (az stack sub create, --action-on-unmanage deleteAll, and deploy-stack.sh script reference all present). - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-stack-deploy-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-stack-deploy: 1,912 tokens (detailed ✓), 13 sections, 5 code blocks
🎯 Quality (5-dim table)
time=2026-06-25T04:51:10.706Z level=INFO msg="using Copilot CLI" source=embedded path=/home/runner/.cache/copilot-sdk/copilot_1.0.64-0
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious, steps are well-ordered with numbered procedures, code blocks are clean and consistent across bash/PowerShell, and output expectations are explicit. The 'What to tell the user after running' section eliminates ambiguity about agent response requirements.
completeness █████ Covers prerequisites, arguments, failure modes, state schema, fallback behavior, soft-deletable resource classification, and cross-skill references. Edge cases like race conditions, policy blocks, and missing parameters.json are all addressed.
trigger_precision ████░ USE FOR and DO NOT USE FOR sections are clear and well-separated with explicit anti-cases (destroy, what-if, IaC authoring). Slightly loses a point because the boundary between 'local deploy' and 'CI deploy' could confuse agents — the skill says it matches CI but also implies local-only use.
scope_coverage █████ Scope is tightly defined: subscription-scoped stack creation only, with explicit out-of-scope redirects to three other named skills. Capabilities (stack vs fallback path, state.json writing, metadata update) and limitations (no template generation, no destroy) are explicit.
anti_patterns ████░ Avoids vague instructions and conflicting directives well. The fallback behavior is disclosed with a clear trade-off warning. Minor issue: the idempotency claim ('stacks de-duplicate on --name') in the race condition recovery row is stated without caveats about concurrent deployments, which could mislead in multi-agent scenarios.
────────────────────────────────────────────
Overall: 4.6/5.0
High-quality skill definition with exceptional completeness and clarity. The schema example, failure table, soft-delete tracking, and mandatory post-run reply format are standout elements. Minor deductions for a subtle local-vs-CI scope ambiguity and an uncaveated idempotency claim under concurrent conditions. Ready for production use with minimal revision.
✅ Check (compliance summary) (70 lines — click to expand)
ℹ️
waza checkexpectseval.yamlcolocated withSKILL.md. This repo separates them into.github/evals/azure-stack-deploy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-deploy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 1912 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 0/8 valid
⚠️ 8 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../../../website/docs/deployment/state.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-security-analyzer/SKILL.md: link escapes skill directory
📊 Token Budget: 1912 / 500 tokens
❌ Exceeds limit by 1412 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (1912 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Description density is optimal for cross-model use
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix 8 link(s) that escape the skill directory
7. Reduce SKILL.md by 1412 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-destroy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 3501:177763:1D39A50:203F505:6A3CB2FC)
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✓ [5/5] Positive — Local destroy of a Git-Ape deployment
✗ [4/5] Positive — Clean up the deployment stack
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.77 | Duration: 1m35.88s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.60 - 0.96 (σ=0.1399)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ❌ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.96 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Positive — Clean up the deployment stack
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Missing several required PASS criteria: Criterion 1 partial: The response recommends the destroy script, but does NOT explicitly explain that raw
az group deleteis inadequate because it misses soft-delete cleanup and multi-RG resources. In fact, Option C actively suggestsaz group deleteas a workaround without warning about its limitations.
Criterion 3 missing: The response never explicitly mentions az stack sub delete --action-on-unmanage deleteAll or describes the stack-delete semantics (one delete cleans every resource the stack owns). It only refers to the script wrapper.
Criterion 4 missing: The response does not describe the skill's automatic soft-delete purge sweep (Key Vault, Cognitive Services purged after stack delete), nor does it mention that resources flagged purgeProtected: true in state.json are intentionally retained. Option C mentions manual Key Vault purge only as a fallback when bypassing the skill entirely.
Criterion 2 met: state.json prerequisite is clearly called out.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: Missing explicit justification for skill vs raw az group delete, and skill's purge sweep behavior: Criterion 1: Recommended the destroy script first, but did not explicitly explain that raw
az group deletemisses soft-delete cleanup and multi-RG/subscription-scoped resources — instead offered it as a casual alternative ("option 2"). Criterion 4: Did not mention the skill's automatic soft-delete purge sweep behavior (Key Vault / Cognitive Services purged after stack delete) nor thepurgeProtected: trueretention semantics; only mentioned manual Key Vault purge as a workaround. - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-destroy-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✗ [5/5] Positive — Local destroy of a Git-Ape deployment
✗ [4/5] Positive — Clean up the deployment stack
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.71 | Duration: 1m4.167s
- Tests: 5 total, 1 passed, 4 failed, 0 errors
- Success Rate: 20.0%
- Score Range: 0.60 - 0.87 (σ=0.1093)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ❌ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.63 | ❌ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Positive — Clean up the deployment stack
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: : Missing criteria 1 (did not explicitly explain that raw
az group deleteis wrong because it misses soft-delete cleanup and multi-RG resources), 3 (did not mentionaz stack sub delete --action-on-unmanage deleteAllor its semantics), and 4 (did not describe the soft-delete purge sweep for Key Vault/Cognitive Services nor thepurgeProtected: trueretention behavior). The response only stated that state.json was missing and gave the script invocation without explaining the destroy semantics. - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: : Missing criteria: (1) Did not explicitly explain that raw
az group deleteis wrong because it misses soft-delete cleanup and multi-RG resources — instead even suggestedaz group deleteas a fallback. (3) Did not mentionaz stack sub delete --action-on-unmanage deleteAllor its semantics. (4) Only briefly mentioned "soft-delete purge" in passing without covering Key Vault / Cognitive Services purge behavior orpurgeProtected: trueretention. Only criterion (2) (state.json path requirement) was clearly met. - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Positive — Local destroy of a Git-Ape deployment
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: : Missing criteria 3 and 4. The response invoked the destroy script and referenced state.json under .azure/deployments/deploy-20260506-001/, but did not name the stack-delete command
az stack sub delete --action-on-unmanage deleteAllor its semantics, and did not address the soft-delete purge sweep /az keyvault purgefor reusing the Key Vault name. - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: : Missing criteria 3 and 4: response did not name
az stack sub delete --action-on-unmanage deleteAllsemantics, and did not explicitly mentionaz keyvault purge/az keyvault list-deletedor describe the purge sweep mechanics for non-purge-protected vaults. Criteria 1 (invoked destroy-stack.sh) and 2 (referenced state.json under .azure/deployments/deploy-20260506-001/) were met. - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-stack-destroy-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-stack-destroy: 2,644 tokens (detailed ✓), 14 sections, 7 code blocks
🎯 Quality (5-dim table)
time=2026-06-25T04:51:39.777Z level=INFO msg="using Copilot CLI" source=embedded path=/home/runner/.cache/copilot-sdk/copilot_1.0.64-0
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity ████░ Exceptionally well-structured with tables, code examples, and fast vs sync mode comparison. Minor clarity issue: the 'When to Use' section (after DO NOT USE FOR) nearly duplicates the 'USE FOR' section, creating redundant reading and potential confusion about which block is authoritative.
completeness █████ Excellent coverage of prerequisites, procedure steps, all argument flags, failure modes with recovery paths, state.json field semantics, and terminal statuses. Soft-delete purge behavior and purge-protection edge cases are explicitly documented — difficult to find gaps.
trigger_precision ████░ USE FOR and DO NOT USE FOR are precise with concrete user phrases and clear exclusion rationale. However, the duplicate 'When to Use' section after DO NOT USE FOR creates ambiguity about canonical trigger definitions; consolidating them would sharpen routing accuracy.
scope_coverage █████ Scope is tightly and explicitly bounded: Git-Ape deployments only, requires state.json, full-stack teardown only (no surgical mode). The 'Prefer this over raw az group delete' subsection proactively closes a common mis-use path, and limitations are stated without being vague.
anti_patterns ████░ No conflicting directives, strong error-handling guidance, and instructions explain 'why' alongside 'what' — all good. The one notable anti-pattern is the duplicated trigger content ('USE FOR' vs 'When to Use'), which adds noise and could lead an agent to inconsistently weigh the two blocks.
────────────────────────────────────────────
Overall: 4.4/5.0
A high-quality, production-ready skill document with thorough failure-mode coverage, excellent scope definition, and strong prerequisite documentation. The primary improvement opportunity is removing the redundant 'When to Use' section that duplicates 'USE FOR', which would tighten trigger precision and eliminate the only meaningful structural anti-pattern.
✅ Check (compliance summary) (69 lines — click to expand)
ℹ️
waza checkexpectseval.yamlcolocated withSKILL.md. This repo separates them into.github/evals/azure-stack-destroy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-destroy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 2644 tokens (hard limit 500)
📐 Spec Compliance: 7/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
❌ [spec-security] Security risks detected: description contains XML angle brackets
📎 XML angle brackets and reserved prefixes pose injection and naming conflict risks
📎 Links: 0/4 valid
⚠️ 4 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-drift-detector/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-resource-visualizer/SKILL.md: link escapes skill directory
📊 Token Budget: 2644 / 500 tokens
❌ Exceeds limit by 2144 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (2644 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix spec violation [spec-security]: Security risks detected: description contains XML angle brackets
7. Fix 4 link(s) that escape the skill directory
8. Reduce SKILL.md by 2144 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-landing-zone-discovery
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-landing-zone-discovery-eval
Skill: azure-landing-zone-discovery
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
[ERROR] prompt references relative paths but no workspace files were loaded; use inputs.files to copy fixtures into the sandbox
✗ [4/4] Positive — Manual landing-zone context injection
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: F819:13968D:15A1D10:181EFE0:6A3CB2F9)
✗ [2/4] Negative — CAF naming lookup (off-topic)
✓ [3/4] Positive — Discover the landing zone
[ERROR] waiting for session.idle: context deadline exceeded
✗ [1/4] Negative — Plain function-app deployment (off-topic)
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.67 | Duration: 2m0.102s
- Tests: 4 total, 1 passed, 3 failed, 0 errors
- Success Rate: 25.0%
- Score Range: 0.00 - 0.97 (σ=0.3932)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Plain function-app deployment (off-topic) | 0.77 | ❌ | budget, trigger_relevance_negative |
| Negative — CAF naming lookup (off-topic) | 0.93 | ❌ | budget, trigger_relevance_negative |
| Positive — Discover the landing zone | 0.97 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Manual landing-zone context injection | 0.00 | ❌ | - |
Failed Task Details
Negative — Plain function-app deployment (off-topic)
Run 1/1 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.55): Prompt appears trigger-aligned unexpectedly (score 0.55 >= 0.50)
Negative — CAF naming lookup (off-topic)
Run 1/1 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.86): Prompt appears trigger-aligned unexpectedly (score 0.86 >= 0.50)
Positive — Manual landing-zone context injection
Run 1/1 (error):
Benchmark: azure-landing-zone-discovery-eval | Skill: azure-landing-zone-discovery | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-landing-zone-discovery-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-landing-zone-discovery-eval
Skill: azure-landing-zone-discovery
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
[ERROR] prompt references relative paths but no workspace files were loaded; use inputs.files to copy fixtures into the sandbox
✗ [4/4] Positive — Manual landing-zone context injection
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 1819:2A3231:1D2A535:2043E17:6A3CB2F9)
✗ [2/4] Negative — CAF naming lookup (off-topic)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 181B:290FB7:1B242CF:1E4584F:6A3CB2F4)
✗ [3/4] Positive — Discover the landing zone
[ERROR] waiting for session.idle: context deadline exceeded
✗ [1/4] Negative — Plain function-app deployment (off-topic)
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.59 | Duration: 2m0.059s
- Tests: 4 total, 0 passed, 4 failed, 0 errors
- Success Rate: 0.0%
- Score Range: 0.00 - 0.93 (σ=0.3530)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Plain function-app deployment (off-topic) | 0.77 | ❌ | budget, trigger_relevance_negative |
| Negative — CAF naming lookup (off-topic) | 0.93 | ❌ | budget, trigger_relevance_negative |
| Positive — Discover the landing zone | 0.64 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Manual landing-zone context injection | 0.00 | ❌ | - |
Failed Task Details
Negative — Plain function-app deployment (off-topic)
Run 1/1 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.55): Prompt appears trigger-aligned unexpectedly (score 0.55 >= 0.50)
Negative — CAF naming lookup (off-topic)
Run 1/1 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.86): Prompt appears trigger-aligned unexpectedly (score 0.86 >= 0.50)
Positive — Discover the landing zone
Run 1/1 (error):
- ❌ answer_quality (0.00): fail: Missing all 4 criteria — no prior response to evaluate: No prior assistant response exists in the session to grade. The assistant has not yet responded to the user's landing zone discovery request, so none of the four PASS criteria are met: (1) no reference to the azure-landing-zone-discovery skill or discover-lz.sh, (2) no mention of .azure/landing-zone-context.json output artifact, (3) no mention of management group hierarchy / subscription classification / hub-spoke / policy / shared services, (4) no acknowledgment of permission limitations or inject-lz.sh fallback.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.92): Prompt is trigger-aligned (score 0.92 >= 0.50)
Positive — Manual landing-zone context injection
Run 1/1 (error):
Benchmark: azure-landing-zone-discovery-eval | Skill: azure-landing-zone-discovery | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-landing-zone-discovery-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-landing-zone-discovery: 7,117 tokens (detailed ✓), 27 sections, 17 code blocks
⚠️ token count 7117 exceeds 3000
🎯 Quality (5-dim table)
time=2026-06-25T04:51:58.714Z level=INFO msg="using Copilot CLI" source=embedded path=/home/runner/.cache/copilot-sdk/copilot_1.0.64-0
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity ████░ The skill is well-structured with clear section headers, tables, and code examples. However, the procedure mixes discovery steps with inline code that references scripts which may not exist yet, creating ambiguity about whether the agent runs scripts or executes the inline commands directly. Step 8 says 'No manual assembly is needed' but the preceding steps show manual assembly logic.
completeness █████ Exceptionally thorough — covers auto-discovery, manual injection, cross-tenant edge cases, RBAC fallbacks, policy classification, confidence scoring, and downstream integration. The weighted signal table, confidence buckets, edge cases table, and policy effect classification table leave very little to guesswork.
trigger_precision █████ USE FOR and DO NOT USE FOR triggers are crisp, non-overlapping, and include concrete examples with explicit redirects to alternative skills. The boundary between landing-zone topology (this skill) vs. per-resource actions (other skills) is clearly articulated.
scope_coverage ████░ Scope is well-defined with explicit capability boundaries and integration points. Minor gap: the skill doesn't clarify what happens if the discovery scripts don't exist in the repo yet (first-run bootstrap scenario), nor does it specify whether the agent should generate those scripts or expect them pre-installed.
anti_patterns ████░ Avoids most anti-patterns — error handling is explicit, fallbacks are documented, and the manual injection precedence (script > direct edit > questionnaire) is clearly ordered. Minor issue: Step 8 instructs the agent to 'summarize the result back to the user' but doesn't specify a format, leaving output consistency to chance across invocations.
────────────────────────────────────────────
Overall: 4.4/5.0
A high-quality, production-grade skill document. It demonstrates exceptional completeness with its confidence scoring model, weighted signals, edge case tables, and policy effect classification. Trigger precision and anti-pattern avoidance are strong. The main areas for improvement are clarifying whether inline code snippets vs. scripts are the authoritative execution path, and addressing the bootstrap scenario where discovery scripts haven't been installed yet.
✅ Check (compliance summary) (59 lines — click to expand)
ℹ️
waza checkexpectseval.yamlcolocated withSKILL.md. This repo separates them into.github/evals/azure-landing-zone-discovery/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-landing-zone-discovery
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 7117 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, last_updated, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 2/2 valid
✅ All links valid.
📊 Token Budget: 7117 / 500 tokens
❌ Exceeds limit by 6617 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (7117 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
❌ [over-specificity] Over-specificity detected: IP addresses, hardcoded URLs with paths
❌ [cross-model-density] Advisory 16: word count is 65 (>60 may reduce cross-model effectiveness); first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found
❌ [progressive-disclosure] Advisory 18: progressive disclosure — SKILL.md body is 587 lines (>500 lines reduces scannability; consider moving detail to references/); 1 code block(s) exceed 50 lines (suggest moving to references/)
✅ [scope-reduction] Capability scope: 9 signal(s) detected (9 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, last_updated, user-invocable
4. Reduce SKILL.md by 6617 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-policy-advisor
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-policy-advisor-eval
Skill: azure-policy-advisor
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 4803:1B6C88:1CC97EA:1FD49E0:6A3CB2FC)
✓ [3/5] Negative — Off-topic Linux kernel question
✓ [2/5] Negative — CAF naming lookup (azure-naming-research territory)
✓ [5/5] Positive — Compliance framework audit (CIS)
✗ [1/5] Negative — Pricing / cost estimation (azure-cost-estimator territory)
✓ [4/5] Positive — Post-template-generation policy recommendations
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.78 | Duration: 2m53.766s
- Tests: 5 total, 4 passed, 1 failed, 0 errors
- Success Rate: 80.0%
- Score Range: 0.61 - 0.90 (σ=0.1064)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Pricing / cost estimation (azure-cost-estimator territory) | 0.61 | ❌ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Negative — CAF naming lookup (azure-naming-research territory) | 0.77 | ✅ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Negative — Off-topic Linux kernel question | 0.75 | ✅ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Positive — Post-template-generation policy recommendations | 0.90 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Compliance framework audit (CIS) | 0.89 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Pricing / cost estimation (azure-cost-estimator territory): 50% pass rate, score=0.61±0.17
Failed Task Details
Negative — Pricing / cost estimation (azure-cost-estimator territory)
Run 1/2 (error):
- ✅ budget (1.00): All behavior checks passed
- ❌ out_of_scope_acknowledgement (0.00): fail: : No previous assistant response exists in the session to evaluate. Cannot verify that the response (1) avoids policy/governance recommendations and (2) either answers the cost question with retail pricing or routes to the cost-estimation skill.
- ✅ trigger_relevance_negative (0.32): Prompt correctly treated as non-trigger (score 0.32 < 0.50)
Benchmark: azure-policy-advisor-eval | Skill: azure-policy-advisor | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-policy-advisor-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-policy-advisor-eval
Skill: azure-policy-advisor
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers requested
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: BC42:D4778:174C4A9:19BBF77:6A3CB2F9)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: BC44:24A621:16A4364:19188B5:6A3CB2F9)
✗ [4/5] Positive — Post-template-generation policy recommendations
✗ [3/5] Negative — Off-topic Linux kernel question
✗ [5/5] Positive — Compliance framework audit (CIS)
✓ [2/5] Negative — CAF naming lookup (azure-naming-research territory)
✓ [1/5] Negative — Pricing / cost estimation (azure-cost-estimator territory)
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.75 | Duration: 2m29.946s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.72 - 0.77 (σ=0.0201)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Pricing / cost estimation (azure-cost-estimator territory) | 0.77 | ✅ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Negative — CAF naming lookup (azure-naming-research territory) | 0.77 | ✅ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Negative — Off-topic Linux kernel question | 0.75 | ❌ | budget, out_of_scope_acknowledgement, trigger_relevance_negative |
| Positive — Post-template-generation policy recommendations | 0.73 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Compliance framework audit (CIS) | 0.72 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Off-topic Linux kernel question: 50% pass rate, score=0.75±0.00
- Positive — Post-template-generation policy recommendations: 50% pass rate, score=0.73±0.17
- Positive — Compliance framework audit (CIS): 50% pass rate, score=0.72±0.17
Failed Task Details
Negative — Off-topic Linux kernel question
Run 1/2 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ out_of_scope_acknowledgement (1.00): All prompts passed
- ✅ trigger_relevance_negative (0.24): Prompt correctly treated as non-trigger (score 0.24 < 0.50)
Positive — Post-template-generation policy recommendations
Run 1/2 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response exists in this session to grade: There is no previous assistant response in the conversation to evaluate. The session contains only the user's original request followed immediately by the grading instructions. With no response present:
- Criterion 1 (covers all three resource types): not met — nothing to evaluate
- Criterion 2 (≥2 recognizable built-in Azure Policy checks by name): not met — nothing to evaluate
- Criterion 3 (both template-level fixes AND subscription-level policy assignments tracks): not met — nothing to evaluate
- Criterion 4 (Microsoft Learn link, built-in policy GUID, or named policies with a verification caveat): not met — nothing to evaluate
All four PASS criteria are unmet because no response content exists.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.69): Prompt is trigger-aligned (score 0.69 >= 0.50)
Positive — Compliance framework audit (CIS)
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Missing verification note for initiative ID: Criterion 1 not fully met: the response recommends assigning the CIS Azure Foundations regulatory compliance initiative, but does not note that the current initiative ID / display name should be verified from Microsoft Learn or via
az policy set-definition listbefore assignment. Criteria 2, 3, and 4 are satisfied (initiative-vs-individual trade-off discussed; audit-first with staged promotion to deny recommended; storage, SQL, VM, and monitoring/diagnostics all covered with named controls). - ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.67): Prompt is trigger-aligned (score 0.67 >= 0.50)
Benchmark: azure-policy-advisor-eval | Skill: azure-policy-advisor | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-policy-advisor-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-policy-advisor: 5,118 tokens (detailed ✓), 29 sections, 14 code blocks
⚠️ token count 5118 exceeds 3000
🎯 Quality (5-dim table)
time=2026-06-25T04:52:55.201Z level=INFO msg="using Copilot CLI" source=embedded path=/home/runner/.cache/copilot-sdk/copilot_1.0.64-0
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity ████░ Instructions are well-structured with clear step-by-step progression, code blocks, and tables. However, the skill is dense — critical steps like 'read this reference file before proceeding' appear mid-paragraph and could be missed. Consider using callout boxes or bold warnings for mandatory reads.
completeness █████ Exceptional coverage: handles az unavailability, LZ context, confidence gating, management group scope, deduplication logic, output artifacts with schema references, troubleshooting table, and integration points. Edge cases (government clouds, initiative member expansion gap) are explicitly documented.
trigger_precision █████ USE FOR and DO NOT USE FOR sections are precise, non-overlapping, and redirect to specific alternative skills by name. The boundary between this skill and azure-security-analyzer (policy enforcement vs resource config) is clearly articulated and would route correctly.
scope_coverage █████ Scope is explicitly bounded: assesses ARM templates and subscription policy state, does NOT enumerate live deployed resources. Capabilities (two-part report, CLI+ARM implementation options, JSON sidecar) and limitations (initiative member expansion not yet supported) are both documented.
anti_patterns ████░ Avoids most anti-patterns well: error handling is explicit, references are externalized rather than inlined, and the advisory-only policy gate prevents over-prescription. Minor issue: references to external files (references/classification-rules.yaml, references/policy-assessment-template.md) create hidden dependencies — if those files are missing, the agent has no fallback guidance.
────────────────────────────────────────────
Overall: 4.6/5.0
A high-quality, production-grade skill with exceptional completeness, precise routing triggers, and explicit scope boundaries. The main improvement areas are reducing cognitive load by surfacing mandatory external file reads more prominently, and adding fallback behavior when referenced files (classification-rules.yaml, policy-assessment-template.md) are unavailable.
✅ Check (compliance summary) (57 lines — click to expand)
ℹ️
waza checkexpectseval.yamlcolocated withSKILL.md. This repo separates them into.github/evals/azure-policy-advisor/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-policy-advisor
📋 Compliance Score: High
✅ Excellent! Your skill meets all compliance requirements.
Issues found:
❌ SKILL.md is 5118 tokens (hard limit 500)
📐 Spec Compliance: 9/9 checks passed
✅ Meets agentskills.io specification.
🔌 MCP Integration: 4/4
✅ All MCP integration checks passed.
📎 Links: 5/5 valid
✅ All links valid.
📊 Token Budget: 5118 / 500 tokens
❌ Exceeds limit by 4618 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 3 reference modules (2-3 is optimal)
❌ [complexity] Complexity: comprehensive (5118 tokens, 3 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
❌ [cross-model-density] Advisory 16: word count is 113 (>60 may reduce cross-model effectiveness); first sentence doesn't lead with action verb (reduces clarity)
✅ [body-structure] Advisory 17: body structure quality
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 12 signal(s) detected (12 level-2 heading(s), 4 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Reduce SKILL.md by 4618 tokens. Run 'waza tokens suggest' for optimization tips
- add a Do NOT use for block and tighten the description for routing - replace the Step 8 cat stub with assembly summary and report-back guidance - make manual injection prescriptive about inject-lz flags 🧭 - Generated by Copilot
The LZ feature added a Landing Zone Context section to the mirror .github/copilot-instructions.md but not its canonical onboarding template, tripping the template mirror sync CI gate. Port the section into the template so the two stay byte-identical. 🧭 - Generated by Copilot
…-zone-discovery # Conflicts: # .github/evals/manifest.yaml
Resolve conflicts in git-ape-onboarding SKILL.md and docs: keep main's drift-detector Step 10 and Compliance Step 11, renumber Landing Zone Discovery to Step 12, and retain the Enterprise Distribution mode section. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
QA against a live canonical Azure Landing Zone tenant showed discover-lz scored it low/30/isLandingZone=false. Four scorer/discovery bugs caused the miss; a fifth broke the documented manual-injection fallback. discover-lz (.sh + .ps1): - Classify Corp/Online MGs via substring fallback. The ALZ accelerator prefixes MG names (e.g. "alz-corp") so the exact-name checks never fire; the fallback had tests for every archetype except corp/online. - Query policy assignments at every discovered management-group scope, not just the subscription. Canonical ALZ policies live at MG scope (alz, alz-platform) and were invisible. Uses default atScope() per MG since --disable-scope-strict-match errors at MG scope. - Refresh the canonical policy-name pattern (Deploy-AzActivity-Log -> Deploy-AzActivityLog, add Deploy-VM-Monitoring, Deploy-VMSS-Monitoring, Deny-Classic-Resources, ...) and match against BOTH .name and .displayName (the canonical token lives in .name; displayName is a long human description). inject-lz (.sh + .ps1): - Emit a landingZoneDetection block (source=manual, confidence high by default) so the manual fallback actually flips LZ-aware behaviour. Adds --confidence/-Confidence and --not-landing-zone/-NotLandingZone. On --merge an explicit confidence overrides; otherwise injection only raises. Live re-run after fixes: medium/55/isLandingZone=true (signals: alz-top-level-mgs 30, alz-lz-archetypes 10, alz-canonical-policies 15). Docs (SKILL.md + generated mirror) document the new flags. shellcheck (severity=warning) and PSScriptAnalyzer (Error+Warning) pass; bash -n and the PowerShell parser pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
sendtoshailesh
left a comment
There was a problem hiding this comment.
Review: strong skill, one real bug + parity gaps before merge
Really nice addition — the dual-shell design is well thought out, and I verified a lot of it end-to-end. Approving is one fix (plus two small parity fixes) away.
✅ What I verified
- Confidence scorer parity holds (the headline risk): signal weights
30/20/10/10/5/5/15and thresholds (≥70high /≥40medium /≥10low / else none) are identical betweendiscover-lz.sh:752-774and the.ps1port, and match SKILL.md. inject-lz.shworks end-to-end: create +--mergeproduce valid, well-structured JSON; resource-ID parsing extracts name/subscription correctly; merge preserves existing data (I confirmedpolicies.alzCanonicalAssignmentssurvives a merge).- bash scripts are
bash -n+shellcheck --severity=warningclean;set -euo pipefail; allazcalls are read-only; no command-injection or secret-leak patterns. - Agent wiring uses the canonical
landing-zone-context.jsonpath andazure-landing-zone-discoveryskill name consistently across all 5 agents +copilot-instructions.md. - Generated doc mirrors are in sync (only pre-existing, unrelated
website/docs/workflows/*drift appears on regen); README / intro / sidebars / overview are clean additive entries.
🔴 Blocking — discover-lz.sh crashes with --skip-network
HAS_GRAPH is only assigned inside the network block (discover-lz.sh:491). When --skip-network is passed, the else branch (discover-lz.sh:593-595) never sets it, and the shared-services block then references $HAS_GRAPH under set -u at discover-lz.sh:610 → HAS_GRAPH: unbound variable, script aborts. The .ps1 port initializes $HasGraph = $false up front, so it does not crash — this is both a correctness bug and a cross-shell divergence, on a path that the tests/fixtures/landing-zone/skipped-network.json fixture implies is supported.
Fix: initialize HAS_GRAPH=false alongside the other defaults (near discover-lz.sh:480, before the if [[ "$SKIP_NETWORK" != "true" ]] block).
🟡 Parity gaps — these undercut the "byte-compatible JSON" claim
Since byte-level cross-shell parity is the headline feature, these are worth fixing:
- Injected
evidencestring differs between ports.inject-lz.sh:343emits...asserted via inject-lz.sh (--confidence X)whileinject-lz.ps1:273emits...asserted via inject-lz.ps1 (-Confidence X). The resultinglanding-zone-context.jsonis therefore not byte-identical across shells. Use shell-neutral evidence text. - List parsing diverges. Bash preserves spaces/empty entries (
inject-lz.sh:298,303) while PowerShell trims and drops empties (inject-lz.ps1:238,244), so e.g.--allowed-locations "eastus, westus2"yields different arrays. Normalize (trim + drop-empty) in both ports.
ℹ️ Non-blocking nits (optional)
- Discover array ordering can diverge: bash
unique/unique_by(jq, sorted) vs PSSelect-Object -Unique(first-seen). --output-formatvalidation differs: bash accepts anything and falls back to JSON (discover-lz.sh:60,910); PS restricts tojson|markdown.--not-landing-zone+--confidenceprecedence is parse-order dependent in bash (inject-lz.sh:118-126) vs always-forces-none in PS (inject-lz.ps1:94).- Resource-ID subscription parse is case-sensitive in bash (
inject-lz.sh:190) but case-insensitive in PS (inject-lz.ps1:125). - There's no cross-shell parity smoke CI job for this skill (the onboarding scaffolders have one). Adding
discover/inject .sh vs .ps1 produce identical JSONto CI would prevent silent parity regressions — and would have caught items 1–2 above.
Happy to re-review quickly once the --skip-network crash is fixed and the two parity items are addressed. Great work overall. 🐒
Summary
Adds the
azure-landing-zone-discoveryskill, which auto-discovers enterprise Azure landing zone topology (management groups, platform vs. application subscriptions, policy assignments, hub-spoke networking, and shared services) and writes a machine-readable.azure/landing-zone-context.json. Git-Ape agents read that context to route workloads to the correct subscription, connect to shared services, and avoid policy conflicts.What's included
.github/skills/azure-landing-zone-discovery/SKILL.md(user-invocable), with a weighted ALZ confidence scorer (high/medium/low/none) referencing the ALZ accelerator.discover-lzandinject-lzship in both bash (.sh) and PowerShell (.ps1) parity ports. Both produce a byte-compatiblelanding-zone-context.json, so Windows users withoutgit-bashare first-class.eval.yamlplus 4 tasks (2 positive, 2 negative) registered in.github/evals/manifest.yaml.tests/fixtures/landing-zone/(flat tenant, hub-spoke tenant, skipped-network).git-apeagent, pluscopilot-instructions.md.website/docs/authoring/skills.mdnow documents the dual-shell helper-script rule (user-invocable skills SHOULD ship.sh+.ps1; CI-only skills MAY ship.shonly).landing-zone-context.md), and a use-case walkthrough (landing-zone-aware-deployment.md).Validation
.shports passbash -n; both.ps1ports pass the PowerShell language parser.inject-lz.ps1functionally tested for schema parity vsinject-lz.sh, including cross-shell-Merge(bash creates base → pwsh merges correctly).scripts/generate-docs.js; unrelated drift reverted to keep the PR scoped.🧭 - Generated by Copilot