WebTestPilot

This is the official repository for the paper "WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements".

TL;DR: WebTestPilot converts what a multimodal agent sees on the web into symbolic representations that can be asserted in automated end-to-end tests.

New here? Start with examples/ for step-by-step executions with screenshots, traces, and bug demonstrations.

📂 Structure

```
/baselines    # Baseline implementations + test runners
/benchmark    # Test cases and injected bugs
/examples     # Visual walkthroughs with screenshots, traces, and logs
/experiments  # Scripts for RQ1–RQ4 experiments
/webapps      # Containerized benchmark applications
/webtestpilot # Core implementation
```

⚙️ Setup

1. Clone and initialize

Run the setup script:
```
./setup.sh
```
This checks required tools (uv, docker, docker-compose) and guides you interactively.

2. Configure environment variables
```
cp .env.example .env
```

3. Configure runtime settings

Set the provider and execution mode in:
```
/webtestpilot/src/webtestpilot/config.yaml
```
Supported providers:
- Claude (Anthropic)
- GPT (OpenAI)
- Gemini (Google)
- OpenRouter (self-hosted via OpenAI-compatible API)

Notes:
- Ensure corresponding API keys/endpoints for your provider are set in .env (Step 2).
- /experiments uses this config by default (see /baselines/config.py to override).
- For standalone usage, you can provide a custom config path (see example below).
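
As a pre-flight check for the first note, the sketch below parses the text of a `.env` file and reports whether the variable for a chosen provider is set. The variable names in `PROVIDER_KEYS` are assumptions based on the names the vendors' SDKs conventionally read; the names in this repository's `.env.example` may differ, so adjust the mapping to match.

```python
from __future__ import annotations

# Hypothetical mapping from provider to the variable expected in .env.
# Check .env.example for the names this repository actually uses.
PROVIDER_KEYS = {
    "claude": "ANTHROPIC_API_KEY",
    "gpt": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_key(provider: str, env_text: str) -> str | None:
    """Return the name of the unset variable, or None if it has a value."""
    name = PROVIDER_KEYS[provider.lower()]
    return None if parse_env(env_text).get(name) else name
```

For example, `missing_key("claude", open(".env").read())` returns `None` when the key is present and non-empty, and the variable name otherwise.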

🚀 Running Experiments

Navigate to the experiments directory:

```
cd experiments
```

Follow the README.md in each submodule.

🖥 Running WebTestPilot (Standalone)

Install as editable package:

```
pip install -e ./webtestpilot
# or
uv pip install -e ./webtestpilot
```

Minimal example

The default mode is browser-use: a one-shot LLM agent navigates the browser directly with no GUI grounding model required. The browser must expose a CDP endpoint so browser-use can connect to the existing session.
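
Once a browser has been launched with `--remote-debugging-port=9222`, one way to confirm that the CDP endpoint is reachable is to query Chrome's `/json/version` HTTP endpoint. This is a minimal sketch using only the standard library; the helper name `cdp_version` is illustrative and not part of WebTestPilot, and the host and port are assumed to match the launch arguments.

```python
import http.client
import json
import urllib.request


def cdp_version(host: str = "127.0.0.1", port: int = 9222, timeout: float = 2.0):
    """Return the browser's /json/version payload, or None if unreachable."""
    url = f"http://{host}:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, malformed HTTP, or a non-JSON reply.
        return None
```

If the call returns a dictionary, its `webSocketDebuggerUrl` entry is the address a CDP client connects to; a `None` result suggests the browser was started without the debugging port or on a different one.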

```python
from playwright.sync_api import sync_playwright
from webtestpilot import BugReport, Config, Session, Step, WebTestPilot


def hook(report: BugReport) -> None:
    # Invoked when a bug is reported.
    print("A bug was reported:", report)


# Each step pairs an action with the expectation to check after it.
steps = [
    Step(condition="", action="From the dashboard click 'Page Template' link", expectation="Page contains title 'Page Template'"),
    Step(condition="", action="Click 'Add Comment'", expectation="A WYSIWYG comment editor is open"),
]

# The context manager stops Playwright even if the run raises.
with sync_playwright() as playwright:
    # Expose the CDP endpoint so browser-use can connect to the same browser session
    browser = playwright.chromium.launch(headless=True, args=["--remote-debugging-port=9222"])
    page = browser.new_page()

    config = Config.load("path/to/config.yaml")
    session = Session(page, config)

    WebTestPilot.run(session, steps, assertion=True, hooks=[hook])
    browser.close()
```
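
The `hook` above only prints each report. A hook that keeps reports for later inspection could look like the sketch below. It assumes that `hooks` accepts any callable taking a single report argument, as the function-based example suggests; `ReportCollector` is a hypothetical helper, not part of WebTestPilot, and it stores only `str(report)` because the fields of `BugReport` are not described here.

```python
import json
from datetime import datetime, timezone


class ReportCollector:
    """Callable hook that records every report it receives."""

    def __init__(self):
        self.entries = []

    def __call__(self, report):
        # Keep a timestamp and the report's string form only.
        self.entries.append({
            "received_at": datetime.now(timezone.utc).isoformat(),
            "report": str(report),
        })

    def dump(self, path):
        """Write all collected entries to a JSON file."""
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.entries, fh, indent=2)
```

Under that assumption, it would be passed as `hooks=[collector]` with `collector = ReportCollector()`, followed by `collector.dump("bug_reports.json")` after the run.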

⚙️ SoM Mode (Optional)

SoM (Set-of-Mark) mode uses a two-stage grounding pipeline with a local vision model for element localization. This is the configuration used in the paper’s experiments.


To switch to SoM mode, set in config.yaml:

```
executor:
  mode: "som"
```

SoM mode requires deploying inclusionAI/UI-Venus-Ground-7B as a local model server. Install and configure vLLM with:

- vllm==0.19.0
- torch==2.10.0 (pinned for ABI compatibility)
- transformers (custom revision 21fac7ab)
- accelerate>=1.10.0, openai>=1.99.9, pillow>=11.3.0

Then run:

```
vllm serve inclusionAI/UI-Venus-Ground-7B \
  --max_model_len 4K \
  --max_num_seqs 8 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 1, "video": 0}'
```

SoM mode does not require --remote-debugging-port.
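
Before switching the executor to SoM mode, it can help to confirm that the model server is up and serving the expected model. The sketch below queries the OpenAI-compatible `/v1/models` route that `vllm serve` exposes. It assumes vLLM's default address `http://localhost:8000`; the address WebTestPilot expects is set in its own configuration and may differ. The helper names are illustrative.

```python
import http.client
import json
import urllib.request


def model_ids(models_payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def served_models(base_url: str = "http://localhost:8000/v1", timeout: float = 5.0) -> list:
    """Return the model IDs the server reports, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
            return model_ids(json.load(response))
    except (OSError, ValueError, http.client.HTTPException):
        # Server not started yet, wrong port, or an unexpected reply.
        return []
```

A result containing `inclusionAI/UI-Venus-Ground-7B` indicates the server has finished loading the model; an empty list usually means it is still starting or listening elsewhere.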

📝 Citation

```
@article{teoh2026webtestpilot,
  title   = {WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements},
  author  = {Teoh, Xiwen and Lin, Yun and Nguyen, Duc-Minh and Ren, Ruofei and Zhang, Wenjie and Dong, Jin Song},
  journal = {Proceedings of the ACM on Software Engineering},
  volume  = {3},
  number  = {FSE},
  article = {FSE087},
  year    = {2026},
  month   = {7},
  doi     = {10.1145/3797115}
}
```
