waybackwhen

A multi-source passive URL enumerator that aggregates historical endpoints from the Wayback Machine, Common Crawl, AlienVault OTX, URLScan, VirusTotal, and more — all in one run.

What It Does

Runs five passive tools against a domain and merges the results into a single deduplicated output file. Useful for surface mapping, endpoint discovery, and parameter hunting without touching the target directly.

Sources covered:

 TOOL           SOURCES
 ----------------------------------------------------------------------
 waybackurls    Wayback Machine
 gau            Wayback Machine, Common Crawl, AlienVault OTX, URLScan
 waymore        Wayback Machine, URLScan, VirusTotal, Common Crawl
 urlfinder      Wayback Machine, Common Crawl
 paramspider    Wayback Machine (parameter-focused)

Tool parallelism: gau and urlfinder run in parallel (different backends). waybackurls, waymore, and paramspider run sequentially to avoid hammering the Wayback CDX API simultaneously.

Rate-limit resilience: every tool runs through a retry/backoff wrapper, domains that come back empty are requeued (a common symptom of being throttled), and an optional archive-lane semaphore caps how many domains hit the Wayback/archive tools at once even when overall concurrency is high. See Rate limiting & resilience.


Installation

Dependencies

Install the Go tools:

go install github.com/tomnomnom/waybackurls@latest
go install github.com/lc/gau/v2/cmd/gau@latest
go install github.com/projectdiscovery/urlfinder/cmd/urlfinder@latest

Install waymore:

git clone https://github.com/xnl-h4ck3r/waymore.git ~/tools/waymore
pip install -r ~/tools/waymore/requirements.txt

Install paramspider:

pip install paramspider

Install tldextract (used by the default apex extraction; without it the script falls back to a two-label split):

pip install tldextract

Install waybackwhen

git clone https://github.com/DFC302/waybackwhen.git
cd waybackwhen
chmod +x waybackwhen
sudo cp waybackwhen /usr/local/bin/   # optional: add to PATH

Verify your setup

waybackwhen --check

This prints a status table showing which tools are installed and where:

 waybackwhen — tool check
 TOOL            STATUS     PATH
 -----------------------------------------------
  waybackurls    [OK]       /home/user/go/bin/waybackurls
  gau            [OK]       /home/user/go/bin/gau
  urlfinder      [OK]       /home/user/go/bin/urlfinder
  paramspider    [OK]       /usr/local/bin/paramspider
  waymore        [OK]       /home/user/tools/waymore/waymore.py
  python3        [OK]       /usr/bin/python3
  tldextract     [OK]       /usr/lib/python3/dist-packages/tldextract

Missing tools are silently skipped at runtime — the script will not crash if a tool is absent, it just won't contribute results.


Usage

waybackwhen [options] [domain]

Options

 FLAG       LONG FORM            DESCRIPTION
 ----------------------------------------------------------------------
 -e         --exclude            Filter out static assets (images, fonts, CSS, JS libraries, etc.) from all results
 -x         --exact              Disable automatic apex extraction and scan the literal input domain (e.g. only api.foo.com, not foo.com)
 -c         --check              Check which tools are installed and where, then exit
 -s TOOLS   --skip TOOLS         Comma-separated list of tools to skip (e.g. waymore or gau,waybackurls)
 -f FILE    --file FILE          Read domains from a file (one per line)
 -p N       --parallel N         Number of domains to process concurrently (default: 1)
 -l FILE    --log FILE           Write a timestamped run log to FILE (.log extension added if missing)
 (none)     --stdout             Print merged URLs to stdout instead of writing .wbw files (status/progress goes to stderr, so output stays pipe-clean)
 -r N       --retries N          Per-tool retries on a hard failure (non-zero exit), with a backoff that grows linearly per attempt (default: 2)
 -q N       --requeue N          Whole-domain retries when a pass returns zero URLs, with backoff (default: 1). Catches silent throttling, where a tool exits cleanly but returns nothing
 -b N       --backoff N          Base backoff in seconds; the delay for attempt n is N × n (default: 5)
 -a N       --archive-slots N    Max domains allowed to hit the archive lane (waybackurls/waymore/paramspider) at once. 0 = unlimited (default). Clamped to --parallel. Lets you fan out wide with -p without stampeding the Wayback API

Valid tool names for --skip: waybackurls, gau, urlfinder, waymore, paramspider

All numeric flags accept non-negative integers; --parallel must be at least 1.

Input methods

Single domain:

waybackwhen example.com

From file:

waybackwhen -f domains.txt

From stdin:

cat domains.txt | waybackwhen

Examples

Basic run on a single domain:

waybackwhen example.com

Run on a subdomain (apex is extracted automatically by default):

waybackwhen api.example.com
# Strips to example.com and runs all tools against it.
# Prints: [*] Apex extracted: api.example.com → example.com

Scan the literal subdomain only (no apex strip):

waybackwhen --exact api.example.com
# Scans api.example.com directly, does NOT widen to example.com.

Exclude static assets (cleaner output for endpoint hunting):

waybackwhen --exclude example.com
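
As a rough illustration of what this kind of filtering amounts to, the sketch below drops URLs whose path ends in a static-asset extension. The extension list and the helper names (is_static_asset, exclude_static) are assumptions made for this example; the list waybackwhen actually applies may well differ.

```python
from urllib.parse import urlsplit

# Hypothetical extension list, for illustration only; not taken from waybackwhen.
STATIC_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
    ".css", ".woff", ".woff2", ".ttf", ".eot",
}


def is_static_asset(url: str) -> bool:
    """Return True if the URL path (query string ignored) ends in a static extension."""
    path = urlsplit(url).path.lower()
    return any(path.endswith(ext) for ext in STATIC_EXTENSIONS)


def exclude_static(urls):
    """Keep only the URLs that do not look like static assets."""
    return [u for u in urls if not is_static_asset(u)]


if __name__ == "__main__":
    sample = [
        "https://example.com/api/v1/users",
        "https://example.com/static/logo.PNG?v=3",
        "https://example.com/login?redirect=/dashboard",
    ]
    for url in exclude_static(sample):
        print(url)
```

Matching on the parsed path rather than the raw string keeps a query such as ?v=3 from hiding the extension.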

Multiple domains in parallel with logging:

waybackwhen -f domains.txt -p 5 -l run.log

Skip a slow tool for a faster run:

waybackwhen --skip waymore example.com

Skip multiple tools:

waybackwhen --skip gau,waybackurls example.com

Full combination — exclude, skip, parallel, log (apex is the default):

waybackwhen -f subdomains.txt --exclude --skip waymore -p 3 -l hunt.log

Pipe URLs straight into another tool (stdout mode):

waybackwhen --stdout example.com | httpx -silent

Fan out wide but stay gentle on the Wayback API:

# 20 domains in flight, but only 4 hitting the archive tools at any moment
waybackwhen -f domains.txt -p 20 -a 4

Recommended rate-limit-safe profile for large lists:

waybackwhen -f domains.txt -p 20 -a 4 -q 1 -b 8 -l run.log

-p 20 keeps 20 domains in flight (gau/urlfinder run at full width) while -a 4 lets only 4 touch the archive tools at once; -q 1 retries a domain that came back empty (a common throttling symptom) and -b 8 sets the backoff base.
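
The interaction between -p and -a can be pictured as a worker pool with a counting semaphore around the archive step. This is a minimal sketch under stated assumptions, not waybackwhen's own code: run_domains, fast_job, and archive_job are hypothetical names, and running the two lanes one after the other inside a worker is a simplification.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_domains(domains, parallel, archive_slots, archive_job, fast_job):
    """Process domains with `parallel` workers; cap concurrent archive_job calls."""
    # Documented semantics: 0 means unlimited, and the value is clamped to --parallel.
    slots = parallel if archive_slots == 0 else min(archive_slots, parallel)
    archive_lane = threading.BoundedSemaphore(slots)

    def worker(domain):
        fast_job(domain)          # stands in for gau/urlfinder: limited only by the pool
        with archive_lane:        # stands in for waybackurls/waymore/paramspider
            archive_job(domain)

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        list(pool.map(worker, domains))  # list() forces completion and surfaces errors


if __name__ == "__main__":
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def archive_job(domain):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.05)          # pretend to query the archive
        with lock:
            state["active"] -= 1

    run_domains([f"d{i}.example" for i in range(20)], 20, 4, archive_job, lambda d: None)
    print("peak archive concurrency:", state["peak"])
```

With 20 workers and 4 slots, the recorded peak never exceeds 4, even though all 20 domains are in flight.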

Running on large lists

One thing to know before pointing this at a big file: apexes are not de-duplicated. In the default apex mode, api.foo.com, www.foo.com, and foo.com are three separate jobs that all collapse to foo.com — three full enumerations of the same apex (and three times the archive API calls, all writing to the same foo_com.wbw). On a large subdomain list this multiplies your rate-limit exposure and works against -a. Two ways to avoid it:

Collapse to unique apexes first (each apex runs once):

# requires tldextract (already a dependency)
python3 -c 'import sys,tldextract
for l in sys.stdin:
    l=l.strip()
    if not l: continue
    e=tldextract.extract(l)
    print(getattr(e,"top_domain_under_public_suffix",None) or e.registered_domain)' \
  < subdomains.txt | sort -u > apexes.txt
waybackwhen -f apexes.txt -p 20 -a 4 -q 1 -b 8 -l run.log

Or scan each host literally (no apex collapse, per-subdomain output):

waybackwhen -f subdomains.txt --exact -p 20 -a 4 -q 1 -b 8 -l run.log

For very large lists, also consider --skip waymore for a faster (if slightly less thorough) run, since waymore is the slowest and most rate-limited source.


Output

Each domain produces a .wbw file in the current directory named after the domain (dots replaced with underscores):

example_com.wbw
api_example_com.wbw

Files contain one URL per line, sorted and deduplicated. If a domain returns zero results the output file is deleted automatically.

Example output:

https://example.com/api/v1/users
https://example.com/login?redirect=/dashboard
https://example.com/search?q=FUZZ
https://example.com/wp-login.php
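
The naming and merge rules described in this section can be sketched as follows. The function names (output_name, merge_urls, write_output) are hypothetical, and the sort here is Python's default string ordering, which is an assumption; the script's own ordering may differ.

```python
from pathlib import Path


def output_name(domain: str) -> str:
    """Dots replaced with underscores, plus the .wbw extension."""
    return domain.replace(".", "_") + ".wbw"


def merge_urls(*tool_outputs):
    """Merge per-tool URL lists into one sorted, deduplicated list, skipping blanks."""
    merged = set()
    for lines in tool_outputs:
        merged.update(line.strip() for line in lines if line.strip())
    return sorted(merged)


def write_output(domain, urls, directory="."):
    """Write one URL per line; remove the file instead when there are no results."""
    path = Path(directory) / output_name(domain)
    if not urls:
        path.unlink(missing_ok=True)  # requires Python 3.8+
        return None
    path.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    urls = merge_urls(
        ["https://example.com/login?redirect=/dashboard", "https://example.com/api/v1/users"],
        ["https://example.com/api/v1/users", ""],
    )
    print(output_name("api.example.com"))
    print(urls)
```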

Rate limiting & resilience

Running many domains in parallel (-p) used to mean up to that many waybackurls + waymore processes all hammering web.archive.org at once, with no recovery if any of them got throttled. Three layers address that:

  1. Per-tool retry with backoff (-r, default 2). Each tool runs through a wrapper that re-runs it on a hard failure (non-zero exit), waiting backoff × attempt seconds between tries. This catches crashes and network errors.

  2. Whole-domain requeue on empty (-q, default 1). The archive tools usually exit 0 even when they were rate-limited, so retry-on-error alone misses silent throttling. When a domain's merged result is empty, the entire tool battery is re-run after a backoff. A genuinely empty domain just costs one extra (bounded) backoff cycle before its output is dropped.

  3. Archive-lane semaphore (-a, default unlimited). A counting semaphore caps how many domains hit the archive tools (waybackurls/waymore/paramspider) simultaneously, while gau and urlfinder keep running at full -p. This is the cleanest way to prevent throttling in the first place: e.g. -p 20 -a 4 fans out to 20 domains but lets only 4 touch the archives at a time. The value is clamped to --parallel.

Tune the backoff base with -b (seconds). Set -q 0 to disable requeuing, or -a 0 for the old unlimited behavior.
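
The first two layers can be sketched in a few lines. This is an illustration under assumptions rather than the script's actual wrapper: a tool is modeled as a zero-argument callable returning (exit_code, urls), the names run_with_retries and enumerate_domain are hypothetical, and the requeue delay is assumed to follow the same backoff × attempt rule as the per-tool retry.

```python
import time


def run_with_retries(tool, retries=2, backoff=5, sleep=time.sleep):
    """Run tool() until it exits 0, allowing up to `retries` extra attempts."""
    for attempt in range(retries + 1):
        code, urls = tool()
        if code == 0:
            return urls
        if attempt < retries:
            sleep(backoff * (attempt + 1))  # linear: backoff x attempt number
    return []  # every attempt failed; contribute nothing


def enumerate_domain(tools, retries=2, requeue=1, backoff=5, sleep=time.sleep):
    """Run the whole tool battery; re-run it when the merged result is empty."""
    merged = set()
    for pass_no in range(requeue + 1):
        merged = set()
        for tool in tools:
            merged.update(run_with_retries(tool, retries, backoff, sleep))
        if merged:
            break  # got results, no requeue needed
        if pass_no < requeue:
            sleep(backoff * (pass_no + 1))  # assumed delay rule for requeues
    return sorted(merged)


if __name__ == "__main__":
    attempts = {"n": 0}

    def silently_throttled():
        # Exits cleanly but returns nothing on the first pass only.
        attempts["n"] += 1
        return (0, []) if attempts["n"] == 1 else (0, ["https://example.com/api/v1/users"])

    # sleep is injected so the demo does not actually wait.
    print(enumerate_domain([silently_throttled], sleep=lambda s: None))
```

A clean exit with no URLs never triggers run_with_retries, which is why the outer requeue loop is needed to catch silent throttling.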

Notes

  • Default apex extraction and multi-part TLDs: apex extraction (the default) uses tldextract, which handles complex TLDs correctly (api.example.co.uk → example.co.uk). Falls back to a two-label split if tldextract is not installed. Use --exact / -x to bypass and scan the literal input.
  • paramspider output: paramspider replaces parameter values with FUZZ placeholders (e.g., ?id=FUZZ). This is intentional — it surfaces parameter-bearing endpoints cleanly.
  • waymore config: waymore returns significantly more results when configured with API keys for URLScan and VirusTotal. See waymore's README for setup.
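
The apex fallback in the first note can be sketched like this. It assumes the two-label split simply keeps the last two dot-separated labels; two_label_apex and apex are hypothetical names, and the attribute lookup mirrors the one-liner in "Running on large lists" so that it works across tldextract versions.

```python
try:
    import tldextract  # optional dependency
except ImportError:
    tldextract = None


def two_label_apex(host: str) -> str:
    """Naive fallback: keep the last two dot-separated labels."""
    labels = host.strip().strip(".").lower().split(".")
    return ".".join(labels[-2:])


def apex(host: str) -> str:
    """Prefer tldextract; fall back to the two-label split if it is missing or finds nothing."""
    if tldextract is not None:
        ext = tldextract.extract(host)
        found = getattr(ext, "top_domain_under_public_suffix", None) or ext.registered_domain
        if found:
            return found
    return two_label_apex(host)


if __name__ == "__main__":
    for host in ("api.example.com", "api.example.co.uk"):
        print(host, "->", two_label_apex(host))
```

Under that assumption the fallback reduces api.example.co.uk to co.uk rather than example.co.uk, which is why installing tldextract matters for targets under multi-part TLDs.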
