This is the official repository for the paper "WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements".
TL;DR: WebTestPilot converts what a multimodal agent sees on the web into symbolic representations that can be asserted in automated end-to-end tests.
New here? Start with examples/ for step-by-step executions with screenshots, traces, and bug demonstrations.
/baselines # Baseline implementations + test runners
/benchmark # Test cases and injected bugs
/examples # Visual walkthroughs with screenshots, traces, and logs
/experiments # Scripts for RQ1–RQ4 experiments
/webapps # Containerized benchmark applications
/webtestpilot # Core implementation
1. Clone and initialize
   Run the setup script:
   ./setup.sh
   This checks required tools (uv, docker, docker-compose) and guides you interactively.
2. Configure environment variables
   cp .env.example .env
3. Configure runtime settings
   Set the provider and execution mode in:
   /webtestpilot/src/webtestpilot/config.yaml
   Supported providers:
   - Claude (Anthropic)
   - GPT (OpenAI)
   - Gemini (Google)
   - OpenRouter (self-hosted via OpenAI-compatible API)
   Notes:
   - Ensure corresponding API keys/endpoints for your provider are set in .env (Step 2).
   - /experiments uses this config by default (see /baselines/config.py to override).
   - For standalone usage, you can provide a custom config path (see example below).
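The setup steps above can be double-checked with a small preflight script. The following is a minimal sketch, not part of the repository: it looks for the three tools that setup.sh checks and reads .env as plain KEY=VALUE lines. The function names are illustrative, and the key name `OPENAI_API_KEY` is a hypothetical placeholder; the actual variable names are whatever .env.example lists for your provider.

```python
import shutil
from pathlib import Path

# Tools named in the setup step above.
REQUIRED_TOOLS = ["uv", "docker", "docker-compose"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, which .env files often use.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env_values, required_keys):
    """Return required keys that are absent or empty."""
    return [key for key in required_keys if not env_values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    env_values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    print("Missing tools:", missing_tools(REQUIRED_TOOLS))
    # Hypothetical key name; replace with the names from .env.example.
    print("Missing keys:", missing_keys(env_values, ["OPENAI_API_KEY"]))
```

The parser deliberately handles only the simplest .env syntax (no multi-line values or variable expansion), which is enough for a presence check.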
Navigate to:
cd experiments
Follow the README.md in each submodule.
Install as editable package:
pip install -e ./webtestpilot
# or
uv pip install -e ./webtestpilot

The default mode is browser-use: a one-shot LLM agent navigates the browser directly, with no GUI grounding model required. The browser must expose a CDP endpoint so browser-use can connect to the existing session.
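Because browser-use mode depends on a reachable CDP endpoint, it can help to confirm the endpoint responds before starting a run. The sketch below is an assumption-laden helper, not part of WebTestPilot: it queries Chromium's `/json/version` HTTP endpoint and reads the `webSocketDebuggerUrl` field. The default port 9222 matches the `--remote-debugging-port` value used in this README; the function names are illustrative.

```python
import json
from urllib.request import urlopen


def cdp_version_url(host="localhost", port=9222):
    """Build the URL of Chromium's CDP version endpoint."""
    return f"http://{host}:{port}/json/version"


def websocket_url(version_payload):
    """Extract the browser-level WebSocket URL from a /json/version response."""
    return version_payload.get("webSocketDebuggerUrl")


def fetch_cdp_version(host="localhost", port=9222, timeout=3.0):
    """Fetch and decode the version document; raises OSError if unreachable."""
    with urlopen(cdp_version_url(host, port), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        payload = fetch_cdp_version()
        print("CDP endpoint is up:", websocket_url(payload))
    except OSError as error:
        # URLError and timeouts are both OSError subclasses.
        print("CDP endpoint not reachable:", error)
```

If the check fails, the usual cause is that the browser was launched without the remote debugging flag or on a different port.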
from webtestpilot import WebTestPilot, Config, BugReport, Session, Step
from playwright.sync_api import sync_playwright

def hook(report: BugReport):
    print("A bug was reported:", report)

steps = [
    Step(condition="", action="From the dashboard click 'Page Template' link", expectation="Page contains title 'Page Template'"),
    Step(condition="", action="Click 'Add Comment'", expectation="A WYSIWYG comment editor is open"),
]

playwright = sync_playwright().start()
# Expose the CDP endpoint so browser-use can connect to the same browser session
browser = playwright.chromium.launch(headless=True, args=["--remote-debugging-port=9222"])
page = browser.new_page()
config = Config.load("path/to/config.yaml")
session = Session(page, config)
WebTestPilot.run(session, steps, assertion=True, hooks=[hook])

SoM (Set-of-Mark) mode uses a two-stage grounding pipeline with a local vision model for element localization. This is the configuration used in the paper’s experiments.
To switch to SoM mode, set in config.yaml:
executor:
  mode: "som"

SoM mode requires deploying inclusionAI/UI-Venus-Ground-7B as a local model server. Install and configure vLLM with:
- vllm==0.19.0
- torch==2.10.0 (pinned for ABI compatibility)
- transformers (custom revision 21fac7ab)
- accelerate>=1.10.0, openai>=1.99.9, pillow>=11.3.0
Then run:
vllm serve inclusionAI/UI-Venus-Ground-7B \
--max_model_len 4K \
--max_num_seqs 8 \
--trust-remote-code \
--limit-mm-per-prompt '{"image": 1, "video": 0}'

SoM mode does not require --remote-debugging-port.
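Once the vLLM server is running, a quick way to confirm that it is serving the grounding model is to list the models it exposes. This is a minimal sketch under stated assumptions: that the server listens on vLLM's default address `http://localhost:8000`, that no API key is required, and that the model is registered under the name passed to `vllm serve`. The helper names are illustrative and not part of WebTestPilot.

```python
import json
from urllib.request import urlopen

# Model name as passed to `vllm serve` above.
MODEL_ID = "inclusionAI/UI-Venus-Ground-7B"


def served_model_ids(models_payload):
    """Collect model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_models(base_url="http://localhost:8000/v1", timeout=5.0):
    """Fetch the model list; base_url is an assumed default, adjust as needed."""
    with urlopen(f"{base_url}/models", timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        ids = served_model_ids(fetch_models())
        status = "ready" if MODEL_ID in ids else "not listed"
        print(f"{MODEL_ID}: {status} (served: {ids})")
    except OSError as error:
        # Connection failures and timeouts are OSError subclasses.
        print("vLLM server not reachable:", error)
```

If the server was started with a different host, port, or served model name, pass the matching `base_url` and compare against that name instead.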
@article{teoh2026webtestpilot,
title = {WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements},
author = {Teoh, Xiwen and Lin, Yun and Nguyen, Duc-Minh and Ren, Ruofei and Zhang, Wenjie and Dong, Jin Song},
journal = {Proceedings of the ACM on Software Engineering},
volume = {3},
number = {FSE},
article = {FSE087},
year = {2026},
month = {7},
doi = {10.1145/3797115}
}