Raidriar7170/multiagent

Verifiable Browser Runtime

Verifiable Browser Runtime (VBR) is a compact contract-first, verifier-first, trace-first reliability runtime for controlled local browser agent tasks.

It turns supported local browser goals into explicit task contracts, routes actions through deterministic guards and adapters, verifies completion with objective success criteria, and records structured traces for failure attribution. It is not a general browser agent, not a public-web automation system, and not a benchmark leaderboard.

Why This Exists

Browser agents can report "done" while the page state is wrong, unsafe, or unreviewable. VBR keeps a small set of controlled local browser tasks inside a contracted runtime loop: plan the task, guard the action, execute locally, verify objectively, and leave trace evidence a reviewer can inspect.
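The loop described above can be sketched in a few lines. This is a minimal illustration rather than VBR's actual implementation: `Action`, `RunRecord`, `run_contracted`, and the `#fake-done` selector are hypothetical names, and the page is modeled as a plain dictionary instead of a real browser.

```python
from dataclasses import dataclass, field


# Hypothetical types for illustration only; not VBR's real classes.
@dataclass
class Action:
    kind: str    # e.g. "click", "fill", "submit"
    target: str  # e.g. a CSS selector


@dataclass
class RunRecord:
    status: str = "failed"  # "succeeded", "failed", or "blocked"
    events: list = field(default_factory=list)


def run_contracted(actions, is_allowed, execute, verify) -> RunRecord:
    """Guard each action, execute it, then verify page state objectively."""
    record = RunRecord()
    page_state: dict = {}
    for action in actions:
        if not is_allowed(action):
            # Blocked before execution: the executor is never called.
            record.events.append(("policy", action.kind, "blocked"))
            record.status = "blocked"
            return record
        self_report = execute(action, page_state)
        # The self-report is traced but never decides the outcome.
        record.events.append(("executor", action.kind, self_report))
    passed = verify(page_state)
    record.events.append(("verifier", "objective_check", passed))
    record.status = "succeeded" if passed else "failed"
    return record


def allow_non_submit(action):
    return action.kind != "submit"


def execute_no_effect(action, state):
    return "done"  # claims success without changing the page


def execute_real(action, state):
    state["done_clicked"] = True
    return "done"


def verify_done(state):
    return state.get("done_clicked") is True


false_success = run_contracted(
    [Action("click", "#fake-done")], allow_non_submit, execute_no_effect, verify_done
)
real_success = run_contracted(
    [Action("click", "#real-done")], allow_non_submit, execute_real, verify_done
)
blocked = run_contracted(
    [Action("submit", "#form")], allow_non_submit, execute_real, verify_done
)
print(false_success.status, real_success.status, blocked.status)
```

The point of the sketch is the ordering: the guard runs before the executor, and only the verifier's check of page state sets the final status.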

Proof Points

| Proof | Status | Evidence |
| --- | --- | --- |
| Contracted local runtime path | passed | demo-evidence.md |
| False-success interception | failed as expected | false-success-summary.md |
| Bounded verifier-driven retry/replan | passed | verifier-retry-summary.md |
| Unsafe action policy block | blocked | policy-block-summary.md |
| Static trace summaries | tracked | trace-summaries/ |
| Optional local vision evidence | bounded local | vision-screenshot-fixture-smoke.md, real-vision-end-to-end-controlled-task.md |

Architecture

flowchart LR
    goal["Supported local goal"] --> planner["ContractPlanner<br/>TaskContract + initial plan"]
    planner --> router["RuleRouter<br/>explicit route decision"]
    router --> policy["PolicyGuard<br/>pre-execution safety gate"]
    policy -->|allowed| executor["PlaywrightExecutor<br/>controlled local actions"]
    policy -->|blocked| trace["TraceWriter<br/>trace.json + summary.md + summary.html + curated evidence"]
    router -. "DOM-insufficient core demo" .-> mock["MockVisionGrounder<br/>adapter contract only"]
    router -. "optional local evidence" .-> buv["browser-use-vision fixture smoke<br/>controlled screenshot + local Florence backend<br/>not core dependency"]
    executor --> verifier["Verifier<br/>objective success conditions"]
    mock --> verifier
    buv --> verifier
    verifier --> trace

Quick Start

python3 -m pip install -e ".[dev]"
python3 -m playwright install chromium
vbr-demo all
python3 -m pytest -q
OPENSPEC_TELEMETRY=0 openspec validate --all --strict

Representative demo output:

dom: succeeded -> runs/<run_id>
visual: succeeded -> runs/<run_id>
false-success: failed -> runs/<run_id>
verifier-retry: succeeded -> runs/<run_id>
policy-block: blocked -> runs/<run_id>

Evidence And Boundaries

The optional vision rows are intentionally narrow: they are controlled local fixture and deterministic fake-real task results, not public-web generalization, not universal icon localization, and not benchmark-grade vision performance.

Detailed Evidence Snapshot

| Proof point | Current outcome | What it proves | Evidence |
| --- | --- | --- | --- |
| Core runtime demo | passed | Supported local goals can flow through contract planning, routing, guarded execution, objective verification, and trace writing. | demo-evidence.md, demo-cli-output.txt |
| False-success verifier proof | failed as expected | Executor self-report is not accepted when objective success conditions are missing. | false-success-summary.md |
| Verifier-driven retry proof | passed | One controlled deterministic verifier-driven retry/replan proof: a no-effect self-report fails verification, one bounded recovery replan runs, and final objective verification passes. | verifier-retry-summary.md |
| Policy-block proof | blocked | PolicyGuard stops an unsafe submit action before executor execution. | policy-block-summary.md |
| Mock visual fallback route | passed | The core runtime can route DOM-insufficient targets through a typed vision-grounder adapter path. This is mock adapter routing only. | visual-summary.md |
| Static trace summary | tracked | Runtime runs now write summary.html as a static trace summary for reviewer-readable contract, route, policy, executor, verifier, retry, failure, and claim-boundary evidence. It is not a trace replay UI or benchmark evidence. | trace-summaries/ |
| Optional browser-use-vision + local Florence fixture | passed | A controlled local screenshot fixture returned a non-mock bbox grounding result through the optional backend-backed adapter path. | vision-screenshot-fixture-smoke.md, vision-screenshot-fixture-smoke.json |
| Optional real-vision end-to-end controlled task | passed | A deterministic fake-real controlled run passed screenshot bytes to a non-mock provider-shaped grounder, executed a normalized bbox click, and verified the controlled page objective. | real-vision-end-to-end-controlled-task.md, real-vision-end-to-end-controlled-task.json |

Runtime Handoffs

VBR uses explicit component handoffs rather than a single black-box browser agent. Safety-critical roles are deterministic-first.

| Role | Runtime object | Responsibility |
| --- | --- | --- |
| Contract/planning | ContractPlanner | Converts supported local goals into TaskContract plus an initial plan. |
| Routing | RuleRouter | Selects Playwright, mock vision adapter, policy stop, or verifier route. |
| Safety gate | PolicyGuard | Blocks forbidden actions, high-risk actions, credential-like entry, and off-allowlist navigation before executor execution. |
| Browser execution | PlaywrightExecutor | Executes deterministic local browser actions and captures observations/screenshots. |
| Visual fallback | MockVisionGrounder | Demonstrates an adapter path for DOM-insufficient targets; it is not real visual grounding. |
| Success gate | Verifier | Checks task-specific success conditions instead of trusting executor self-report. |
| Evidence | TraceWriter | Emits trace.json, summary.md, and static summary.html for review and failure attribution. |

For developer-facing module paths, handoff objects, and current adapter boundaries, see docs/adapter-interface.md. It documents today's VisionGrounder injection point and current Executor / Verifier shapes without adding a stable plugin registry, production SDK, public-web automation, or benchmark evidence.
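As an illustration of the safety-gate role, a deterministic pre-execution check might look like the sketch below. It is not the real PolicyGuard: `ProposedAction`, `check_policy`, the allowlist, the credential hints, and every reason code except `unsafe_action_blocked` (which appears in the demo output) are assumptions made for this example.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ProposedAction:  # hypothetical shape, for illustration only
    kind: str          # "click", "fill", "navigate", "submit"
    target: str = ""   # selector or URL
    value: str = ""


# Assumed policy settings; a real contract would supply these.
FORBIDDEN_KINDS = {"submit"}
ALLOWED_HOSTS = {"localhost", "127.0.0.1"}
CREDENTIAL_HINTS = ("password", "passwd", "token", "secret")


def check_policy(action: ProposedAction) -> tuple[bool, str]:
    """Return (allowed, reason) before any executor call is made."""
    if action.kind in FORBIDDEN_KINDS:
        return False, "unsafe_action_blocked"
    if action.kind == "navigate":
        host = urlparse(action.target).hostname
        if host not in ALLOWED_HOSTS:
            return False, "off_allowlist_navigation"
    if action.kind == "fill":
        target = action.target.lower()
        if any(hint in target for hint in CREDENTIAL_HINTS):
            return False, "credential_like_entry"
    return True, "allowed"


print(check_policy(ProposedAction("submit", "#form")))
print(check_policy(ProposedAction("navigate", "http://localhost:8000/form")))
```

Because the check is a pure function of the proposed action, the same input always produces the same decision, which is what lets a blocked run be reproduced and attributed from the trace.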

Evidence Snippet

core demo outcomes:
  dom: succeeded
  visual: succeeded (mock adapter routing only)
  false-success: failed (button_click_no_effect)
  verifier-retry: succeeded
    first verification: failed (button_click_no_effect)
    retry_decision: retryable_verifier_failure
    replan_created: #real-done
    final verification: passed
    claim: one controlled deterministic verifier-driven retry/replan proof
  policy-block: blocked (unsafe_action_blocked)

optional vision fixture:
  source_status=importable
  smoke_status=passed
  adapter_executed=true
  run_kind=controlled-local-screenshot-fixture
  provider=browser-use-vision
  live_backend=true
  backend_configured=true
  is_mock=false
  method=florence-phrase-grounding
  selected_target_ref=bbox:0.2415,0.3395,0.7615,0.6865
  claim=controlled local screenshot fixture returned non-mock bbox grounding result

optional real-vision controlled task:
  run_kind=deterministic-fake-real-controlled-task
  provider_shape=browser-use-vision
  live_backend=false
  is_mock=false
  method=fake-real-bbox-grounding
  selected_target_ref=bbox:0.0050,0.0700,0.0550,0.1700
  bbox_execution=succeeded
  verification=passed
  claim=optional real-vision end-to-end controlled task exercised the full controlled visual demo chain; not live backend evidence
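The `selected_target_ref` values above can be turned into a click point with a small helper. The sketch assumes the four numbers are `x1,y1,x2,y2` corners normalized to the range 0 to 1; the snippet does not state the ordering, so treat that as an assumption. `parse_bbox_ref`, `click_point`, and the viewport sizes are hypothetical.

```python
def parse_bbox_ref(ref: str) -> tuple[float, float, float, float]:
    """Parse 'bbox:x1,y1,x2,y2' (assumed normalized corner order)."""
    prefix = "bbox:"
    if not ref.startswith(prefix):
        raise ValueError(f"not a bbox target ref: {ref!r}")
    parts = [float(p) for p in ref[len(prefix):].split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}")
    x1, y1, x2, y2 = parts
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError(f"coordinates out of range or inverted: {ref!r}")
    return x1, y1, x2, y2


def click_point(ref: str, viewport_width: int, viewport_height: int) -> tuple[int, int]:
    """Return the pixel center of the box for a given viewport size."""
    x1, y1, x2, y2 = parse_bbox_ref(ref)
    center_x = (x1 + x2) / 2 * viewport_width
    center_y = (y1 + y2) / 2 * viewport_height
    return round(center_x), round(center_y)


# The viewport sizes here are illustrative, not taken from the evidence files.
print(click_point("bbox:0.0050,0.0700,0.0550,0.1700", 1000, 1000))
print(click_point("bbox:0.2415,0.3395,0.7615,0.6865", 1280, 720))
```

Validating the range and corner order before clicking keeps a malformed grounding result from becoming an off-page click.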

The tracked evidence records only sanitized fixture metadata. It does not record backend URLs, hosts, IP addresses, credentials, raw screenshot bytes, unredacted screenshot content, backend logs, or absolute local paths.

Public Release Readiness

The repository includes an MIT license and a curated public-release readiness note: docs/evidence/public-release-readiness.md. The v0.1.0 source-review release is documented in docs/release-notes/v0.1.0.md. The tag/GitHub Release publication records the bounded source snapshot after local validation, Reviewer pass, and remote CI success. A manual GitHub visibility change or tag/release publication is repository-display evidence, not runtime capability evidence.

Optional Browser-Use-Vision Adapter

The default runtime remains mock-compatible: vbr-demo all and CI use the mock vision adapter path unless a real adapter is explicitly selected. This keeps the core reliability loop deterministic and avoids requiring optional local packages, credentials, GPU services, backend URLs, or repository secrets.

browser-use-vision and the local Florence backend remain optional local integrations. They are not core dependencies and are not part of required CI. If a compatible adapter is installed, it can be selected explicitly:

VBR_BROWSER_USE_VISION_MODULE=<importable_module> vbr-demo --vision-provider browser-use-vision visual
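The "mock by default, real adapter only when explicitly selected" behavior could be sketched as follows. Only the `VBR_BROWSER_USE_VISION_MODULE` variable, the `browser-use-vision` provider name, and the existence of a `ground(...)` entrypoint come from this README. `MockGrounder`, `select_grounder`, and the `ground` signature are assumptions, and VBR's real resolution logic may differ.

```python
import importlib
import os


class MockGrounder:
    """Stand-in for the mock adapter path; hypothetical shape."""

    is_mock = True

    def ground(self, screenshot: bytes, phrase: str) -> str:
        return "mock:target"


def select_grounder(provider: str = "mock", environ=None):
    """Return the mock grounder unless a real provider is explicitly requested."""
    env = os.environ if environ is None else environ
    if provider != "browser-use-vision":
        return MockGrounder()
    module_name = env.get("VBR_BROWSER_USE_VISION_MODULE")
    if not module_name:
        raise RuntimeError("VBR_BROWSER_USE_VISION_MODULE is not set")
    module = importlib.import_module(module_name)
    # Require a callable ground(...) entrypoint before accepting the module.
    if not callable(getattr(module, "ground", None)):
        raise RuntimeError(f"{module_name} does not expose ground(...)")
    return module


grounder = select_grounder()
print(type(grounder).__name__, grounder.is_mock)
```

Failing loudly when the optional module is requested but missing keeps the default path deterministic: a misconfigured environment never silently falls back to a different adapter than the one asked for.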

A local-only smoke command records whether the optional adapter is available:

BUV_SOURCE_PATH=/path/to/browser-use-vision vbr-demo vision-smoke
# or
VBR_BROWSER_USE_VISION_SOURCE_PATH=/path/to/browser-use-vision vbr-demo vision-smoke
# or
VBR_BROWSER_USE_VISION_MODULE=<importable_module> vbr-demo vision-smoke
# or
VBR_BROWSER_USE_VISION_MODULE=<importable_module> vbr-vision-smoke

There is also an opt-in controlled screenshot fixture smoke:

BUV_SOURCE_PATH=/path/to/browser-use-vision vbr-demo vision-fixture-smoke
# optional, only when a local screenshot backend is already running:
VBR_BROWSER_USE_VISION_BACKEND_URL=<local-backend-url> \
  BUV_SOURCE_PATH=/path/to/browser-use-vision vbr-demo vision-fixture-smoke

The tracked smoke artifact (docs/evidence/vision-adapter-smoke.md and docs/evidence/vision-adapter-smoke.json) currently records a structured local pass: a local browser-use-vision source path was supplied, the package exposed a compatible ground(...) adapter entrypoint, and VBR normalized a non-mock adapter result. Source status is recorded separately from smoke status in docs/evidence/vision-source-discovery.json: the current source status is importable, the current smoke status is passed, and adapter_executed=true. The adapter evidence method is local-contract-smoke, meaning the smoke proves the optional entrypoint and contract-normalization path, not screenshot-based visual grounding.

The tracked controlled screenshot fixture artifact (docs/evidence/vision-screenshot-fixture-smoke.md and docs/evidence/vision-screenshot-fixture-smoke.json) currently records a bounded local pass: the source is importable, the backend-aware adapter was executed with controlled PNG screenshot bytes, and Florence phrase grounding returned a normalized non-mock bounding-box result. The fixture evidence records only backend_configured=true/false and live_backend=true/false; it does not record backend URLs, hosts, IP addresses, credentials, raw screenshot bytes, or backend logs.
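One way to get evidence with this property is to copy an explicit allowlist of fields and reduce the backend URL to a boolean. The sketch below is an assumption about how such sanitization could work, not the project's code: `sanitize_fixture_metadata` and the raw keys `backend_url`, `credentials`, and `screenshot_png` are hypothetical, while the output field names mirror the evidence snippet earlier in this README.

```python
import json

# Fields that are safe to copy verbatim into tracked evidence.
SAFE_FIELDS = ("run_kind", "provider", "method", "selected_target_ref")


def sanitize_fixture_metadata(raw: dict) -> dict:
    """Keep allowlisted fields; reduce backend details to booleans."""
    sanitized = {key: raw[key] for key in SAFE_FIELDS if key in raw}
    # Record only whether a backend was configured, never its URL.
    sanitized["backend_configured"] = bool(raw.get("backend_url"))
    sanitized["live_backend"] = bool(raw.get("live_backend", False))
    sanitized["is_mock"] = bool(raw.get("is_mock", False))
    return sanitized


raw_result = {
    "run_kind": "controlled-local-screenshot-fixture",
    "provider": "browser-use-vision",
    "method": "florence-phrase-grounding",
    "selected_target_ref": "bbox:0.2415,0.3395,0.7615,0.6865",
    "backend_url": "http://127.0.0.1:9000",  # illustrative value only
    "credentials": "not-a-real-secret",
    "screenshot_png": b"\x89PNG...",
    "live_backend": True,
    "is_mock": False,
}
evidence = sanitize_fixture_metadata(raw_result)
print(json.dumps(evidence, indent=2))
```

An allowlist is the safer default here: a new sensitive field added to the raw result is dropped automatically instead of leaking until someone remembers to redact it.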

The tracked optional real-vision end-to-end controlled task artifact (docs/evidence/real-vision-end-to-end-controlled-task.md and docs/evidence/real-vision-end-to-end-controlled-task.json) records a deterministic fake-real controlled run through the full controlled visual demo chain: screenshot bytes were passed to a non-mock provider-shaped grounder, provider_shape=browser-use-vision and live_backend=false were recorded, a normalized bbox target was executed by Playwright, and objective verification passed. This artifact proves the runtime wiring and trace shape for the controlled task. It is not evidence from a live backend run, and it does not make optional vision components required for CI or core runtime use.

This optional path does not claim universal visual grounding, public-web automation, or benchmark-grade visual performance.

Demo Tasks And Commands

The controlled local demo set includes these proof points:

  • dom: completes a DOM-rich form task through Playwright and objective verification.
  • visual: routes a DOM-insufficient target through the mock VisionGrounder adapter path. This proves adapter routing only; it does not claim real visual grounding.
  • false-success: clicks a no-effect button that self-reports completion, then catches the false success through objective verification and failure attribution.
  • verifier-retry: starts from the same false-success failure, records button_click_no_effect, permits one contract-bounded retry decision, creates one deterministic recovery replan for #real-done, and passes final objective verification. This is one controlled deterministic verifier-driven retry/replan proof; it does not prove arbitrary replanning, public-web recovery, benchmark reliability, or production autonomy.
  • policy-block: attempts a forbidden submit action and is blocked by PolicyGuard before an executor result is recorded.

These demo tasks are regression evidence for runtime behavior, not a benchmark leaderboard.
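The verifier-retry behavior can be pictured as a loop with a hard retry budget. This is a minimal sketch under stated assumptions: `run_with_bounded_retry`, the callables passed to it, and the `#fake-done` selector are hypothetical, while `#real-done`, `retry_decision`, `replan_created`, and `retryable_verifier_failure` are names taken from the evidence snippet in this README.

```python
def run_with_bounded_retry(initial_target, replan, execute, verify, max_retries=1):
    """Execute, verify objectively, and replan at most max_retries times."""
    trace = []
    target = initial_target
    for attempt in range(max_retries + 1):
        state = execute(target)
        passed = verify(state)
        trace.append(
            {
                "attempt": attempt,
                "target": target,
                "verification": "passed" if passed else "failed",
            }
        )
        if passed:
            return "succeeded", trace
        if attempt < max_retries:
            # Budget remains: record the decision and the replanned target.
            target = replan(target)
            trace.append(
                {
                    "retry_decision": "retryable_verifier_failure",
                    "replan_created": target,
                }
            )
    return "failed", trace


def execute_click(target):
    # Only the real button changes the modeled page state.
    return {"done": target == "#real-done"}


def verify_done(state):
    return state["done"]


def replan_to_real_button(_failed_target):
    return "#real-done"


status, trace = run_with_bounded_retry(
    "#fake-done", replan_to_real_button, execute_click, verify_done
)
print(status)
for entry in trace:
    print(entry)
```

With `max_retries=0` the same run ends as `failed` after one verification, which corresponds to the false-success case; the retry budget is the only difference between the two demos in this sketch.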

Representative output:

dom: succeeded -> runs/<run_id>
visual: succeeded -> runs/<run_id>
false-success: failed -> runs/<run_id>
verifier-retry: succeeded -> runs/<run_id>
policy-block: blocked -> runs/<run_id>

Curated evidence is tracked in docs/evidence/demo-evidence.md and docs/evidence/demo-cli-output.txt. Curated static trace summary HTML evidence is tracked under docs/evidence/trace-summaries/; these summary.html examples are static reviewer-readable summaries, not a trace replay UI. The false-success evidence shows the key verifier-first proof: executor self-report is not accepted when objective success conditions are missing.

Run them:

PYTHONPATH=src python3 -m verifiable_browser_runtime.cli all

or install the package and use:

vbr-demo all

Render an existing run trace into its static HTML summary:

vbr-trace-summary runs/<run_id>/trace.json

If Playwright browsers are not installed in a clean environment, install the Chromium browser once:

python3 -m playwright install chromium

Run the test suite:

python3 -m pytest -q

Non-Goals And Claim Boundaries

This project intentionally does not claim or implement:

  • a general browser agent;
  • arbitrary public-web automation;
  • a benchmark leaderboard or success-rate competition;
  • benchmark-grade visual performance;
  • universal icon localization or universal visual grounding;
  • production visual support;
  • trace replay UI;
  • arbitrary verifier-driven replanning or production autonomy;
  • voice input;
  • required browser-use-vision or Florence backend dependency in core;
  • required CI dependency on optional vision packages, GPU services, backend URLs, credentials, or local project paths;
  • Stagehand, Browser Use, or Playwright MCP backend integration.

The local pages are controlled proof cases for contract, policy, verification, and trace behavior. The optional visual evidence is limited to controlled local fixture and deterministic fake-real task results that returned or executed non-mock bbox grounding outputs.

CI Validation

The minimal GitHub Actions workflow mirrors the local review gates by running:

python -m pytest -q
openspec validate --all --strict

The workflow also installs Playwright Chromium so browser-backed tests can run on a fresh runner. It is validation-only: it does not publish, deploy, or create releases, and it does not require repository secrets. CI runs deterministic fake-real runtime tests, but it does not run the optional live browser-use-vision / Florence backend fixture evidence path.

About

Contract-first, verifier-first, trace-first browser reliability runtime prototype.
