AliExpress DrissionPage MVP

Local MVP for validating AliExpress product-selection logic without an AliExpress API key.

Setup

python -m venv .venv
.venv\Scripts\python -m pip install -r requirements.txt

LLM Review Setup

Prefer selecting a non-secret LLM profile in the local project .env file:

ALI_MVP_LLM_PROFILE=cheap-review
ALI_MVP_LLM_MODEL=gpt-5.4

The selected profile is read from LLM_PROFILES_PATH or the platform default profile file:

Windows: %USERPROFILE%\.config\llm-profiles\profiles.toml
WSL/Linux: ~/.config/llm-profiles/profiles.toml
Linux server fallback: /etc/llm-profiles/profiles.toml
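
The lookup above can be sketched roughly as follows. This is only an illustration: the function name candidate_profile_paths, its parameters, and the assumption that an explicit LLM_PROFILES_PATH is tried first with the platform defaults kept as later candidates are hypothetical, not taken from the project code.

```python
from pathlib import Path


def candidate_profile_paths(env: dict, home: Path, is_windows: bool) -> list:
    """Return profile file candidates in the order they would be tried."""
    paths = []
    explicit = env.get("LLM_PROFILES_PATH")
    if explicit:
        # An explicit path, when set, is assumed to take priority.
        paths.append(Path(explicit))
    # Per-user default: the same relative layout on Windows and WSL/Linux.
    paths.append(home / ".config" / "llm-profiles" / "profiles.toml")
    if not is_windows:
        # Server-wide fallback, only meaningful on Linux hosts.
        paths.append(Path("/etc/llm-profiles/profiles.toml"))
    return paths
```

A caller would typically pass os.environ and Path.home(), then pick the first candidate that exists on disk.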

Each profile stores base_url, model, and either api_key or api_key_env. For a machine-local global profile you can store api_key directly; for repo-shared templates, keep using api_key_env so secrets stay out of versioned files.

Legacy explicit .env values are still supported for temporary overrides:

ALI_MVP_LLM_BASE_URL=https://example.test/v1
ALI_MVP_LLM_API_KEY=sk-example
ALI_MVP_LLM_MODEL=gpt-4.1-mini

Resolution order is: CLI arguments, explicit ALI_MVP_LLM_* values, selected profile, then standard OPENAI_* environment variables.
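
A minimal sketch of that precedence chain, assuming each source is already loaded into a plain dict. The helper name resolve_setting and the exact OPENAI_* variable names derived here (for example OPENAI_MODEL) are assumptions for illustration; the real logic lives in the project's own config code.

```python
def resolve_setting(name: str, cli_args: dict, env: dict, profile: dict):
    """Resolve 'base_url', 'api_key', or 'model' using the documented order."""
    candidates = [
        cli_args.get(name),                       # 1. CLI arguments
        env.get(f"ALI_MVP_LLM_{name.upper()}"),   # 2. explicit ALI_MVP_LLM_* values
        profile.get(name),                        # 3. selected profile
        env.get(f"OPENAI_{name.upper()}"),        # 4. OPENAI_* variables (names assumed)
    ]
    for value in candidates:
        if value:  # skip missing and empty values
            return value
    return None
```

Treating empty strings as "not set" is a design choice in this sketch, so that a blank .env entry does not mask a lower-priority source.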

CLI flags can override config for a single run:

--llm-base-url
--llm-api-key
--llm-model

See docs/llm-profile-config.md for the cross-project Windows/WSL/server setup.

Login

Open AliExpress in the browser profile used by DrissionPage and log in manually before scraping. By default the MVP stores that profile in .browser-profile and uses local port 9333, avoiding conflicts with other Chrome debugging sessions.

Usage

python -m ali_mvp scrape --keyword "women dress" --max-items 80
python -m ali_mvp scrape --keyword "women dress" --max-items 80 --pages 2
python -m ali_mvp scrape --url "https://www.aliexpress.com/..." --max-items 80
python -m ali_mvp scrape --category-url "https://www.aliexpress.com/category/100003109/women-clothing.html" --max-items 80

Browser hardening:

--browser-hardening off|minimal
default: minimal

Proxy and browser identity:

--proxy http://127.0.0.1:8080
--proxy-file proxies.txt
--max-blocks-per-proxy 2
--user-agent "ua-fixed"
--accept-language "en-US,en;q=0.9"

Recommended default for a single-account workflow:

keep one logged-in profile stable
keep one exit path stable
keep one stable browser major version / UA pair per account
do not enable proxy-pool rotation unless you explicitly need a fallback path
if you do not pass --proxy or --proxy-file, the default --proxy-provider manual mode runs without a proxy pool
treat --proxy-provider v2rayn as an opt-in fallback mode, not the default path

v2rayN sidecar proxy pool

Use the local v2rayN installation as a proxy source:

python -m ali_mvp scrape \
  --keyword "Home appliance accessories" \
  --proxy-provider v2rayn \
  --v2rayn-dir "C:\Users\lxy\Desktop\v2rayN-windows-64" \
  --enrich-detail \
  --user-data-dir .browser-profile

Behavior in this phase:

reads nodes from guiConfigs/guiNDB.db -> ProfileItem
generates per-node sidecar xray configs under <run_dir>/proxy_runtime
probes each local socks5 endpoint before opening the browser
picks one healthy endpoint for the current run
attempts to restore the last persisted proxy selection on resume when that proxy is still eligible
proxy health cooldown is fallback memory, not a periodic rotation scheduler
cleans sidecar processes on exit
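
The "probe, then pick one healthy endpoint" steps could look roughly like the sketch below. It only checks that the local port accepts a TCP connection, which is weaker than a real SOCKS5 handshake or an end-to-end request; the function names are hypothetical and the project's actual probe may differ.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_first_healthy(endpoints):
    """Return the first (host, port) pair that accepts a connection, else None."""
    for host, port in endpoints:
        if is_port_open(host, port):
            return (host, port)
    return None
```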

Current limitations:

no mid-run live proxy hot-swap inside one browser session
no automatic CAPTCHA solving
no adaptive long-term health scoring beyond startup probe and per-run rotation

Pagination semantics:

--max-items is the total number of products requested for the run.
--pages is an optional maximum page limit.
If --pages is omitted, the scraper auto-advances until --max-items is reached or no next page is available.
If you only want the first listing page, pass --pages 1.
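
These stop conditions can be illustrated with a small loop. The sketch assumes a hypothetical fetch_page(page_number) callable that returns a list of items, or an empty list or None when no further page exists; it is not the scraper's real interface.

```python
def collect_items(fetch_page, max_items: int, pages=None) -> list:
    """Advance page by page until max_items, the page limit, or the last page."""
    items = []
    page = 1
    while len(items) < max_items:
        if pages is not None and page > pages:
            break  # optional --pages limit reached
        batch = fetch_page(page)
        if not batch:
            break  # no next page available
        items.extend(batch)
        page += 1
    # The final page may overshoot, so trim to the requested total.
    return items[:max_items]
```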

Optional detail-page enrichment:

python -m ali_mvp scrape --keyword "women dress" --max-items 20 --enrich-detail

Optional product blacklist filtering:

python -m ali_mvp scrape --keyword "Home appliance accessories" --blacklist-file rules/product_blacklist.json
python -m ali_mvp scrape --keyword "Home appliance accessories" --blacklist-file rules/product_blacklist.json --reject-keyword sensor --reject-keyword relay

Run a standalone LLM review for an existing run:

python -m ali_mvp llm-review --run-dir data/home-appliance-accessories/20260513_151040
python -m ali_mvp llm-review --run-dir data/home-appliance-accessories/20260513_151040 --llm-max-items 5

Chain LLM review after scrape:

python -m ali_mvp scrape --keyword "Home appliance accessories" --max-items 20 --llm-review
python -m ali_mvp scrape --keyword "Home appliance accessories" --max-items 20 --llm-review --llm-force

Resume a blocked run:

python -m ali_mvp resume --run-dir data/home-appliance-accessories/20260511_120000

Retry only unfinished details:

python -m ali_mvp resume --run-dir data/home-appliance-accessories/20260511_120000 --details-only

Resume with temporary proxy or browser identity override:

python -m ali_mvp resume --run-dir data/home-appliance-accessories/20260511_120000 --proxy http://127.0.0.1:8080 --user-agent "ua-fixed" --accept-language "en-US,en;q=0.9"

Notes for resume:

resume attempts to restore the last persisted proxy selection when that proxy is still eligible after health / cooldown filtering
if the persisted proxy is no longer eligible, resume falls back to another eligible proxy
proxy overrides apply when a new browser session is opened for resume
resume does not do live proxy swap inside one browser session after the browser is already open
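
The first two notes amount to a simple preference rule, sketched below. The function name choose_proxy and the idea that eligibility has already been computed into an ordered list are assumptions for illustration.

```python
def choose_proxy(persisted, eligible: list):
    """Prefer the persisted proxy if still eligible, else the first eligible one."""
    if persisted is not None and persisted in eligible:
        return persisted
    # Fall back to another eligible proxy, or None when the list is empty.
    return eligible[0] if eligible else None
```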

Detail enrichment adds these columns to products.csv:

entry_type
search_card_url
is_promoted
promo_channel
promotion_text
promo_landing_url
shop_name
shipping_text
detail_rating
detail_review_count
breadcrumb
attributes_text
description_text

Outputs:

data/<keyword-slug>/<YYYYMMDD_HHMMSS>/products.csv
data/<keyword-slug>/<YYYYMMDD_HHMMSS>/products_filter_audit.csv
data/<keyword-slug>/<YYYYMMDD_HHMMSS>/products_review.csv
data/<keyword-slug>/<YYYYMMDD_HHMMSS>/category_rank.csv
data/<keyword-slug>/<YYYYMMDD_HHMMSS>/run_manifest.json
data/<keyword-slug>/<YYYYMMDD_HHMMSS>/run_state.json
data/<keyword-slug>/<YYYYMMDD_HHMMSS>/run_summary.json

For example, --keyword "women dress" writes to:

data/women-dress/20260508_224530/products.csv
data/women-dress/20260508_224530/products_filter_audit.csv
data/women-dress/20260508_224530/products_review.csv
data/women-dress/20260508_224530/category_rank.csv

URL-based runs are grouped under data/url/<YYYYMMDD_HHMMSS>/.

Category URL runs are grouped by the category slug when the URL exposes one:

data/category-women-clothing/20260508_224530/products.csv
data/category-women-clothing/20260508_224530/products_filter_audit.csv
data/category-women-clothing/20260508_224530/products_review.csv
data/category-women-clothing/20260508_224530/category_rank.csv
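
For keyword runs, the directory naming shown above is consistent with a lowercase, hyphen-joined slug plus a timestamp. The sketch below reproduces the documented examples; the slugify rule itself is an assumption (the project may treat non-ASCII keywords or other edge cases differently).

```python
import re
from datetime import datetime
from pathlib import PurePosixPath


def slugify(text: str) -> str:
    """Lowercase the text and join runs of letters/digits with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def run_dir(keyword: str, started_at: datetime) -> PurePosixPath:
    """Build data/<keyword-slug>/<YYYYMMDD_HHMMSS> for one run."""
    stamp = started_at.strftime("%Y%m%d_%H%M%S")
    return PurePosixPath("data") / slugify(keyword) / stamp
```

For example, run_dir("women dress", datetime(2026, 5, 8, 22, 45, 30)) yields data/women-dress/20260508_224530.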

Postprocess outputs:

python -m ali_mvp postprocess --run-dir data/home-appliance-accessories/20260511_120000

Use MyMemory for free zh translation:

python -m ali_mvp postprocess --run-dir data/home-appliance-accessories/20260511_120000 --translator mymemory

Optional higher-quota hint for MyMemory:

python -m ali_mvp postprocess --run-dir data/home-appliance-accessories/20260511_120000 --translator mymemory --translator-email you@example.com

Additional outputs:

products_zh.csv
products_filter_audit_zh.csv
review_only.csv
products_report.html
translation_cache.json

LLM review outputs:

products_llm_review.csv
products_final_keep.csv
products_final_drop.csv
products_llm_report.html

LLM review behavior:

llm-review reads products_review.csv from one existing run directory
scrape --llm-review only triggers the LLM step after scrape exits with 0 or 2
--llm-force ignores reusable cached rows and re-runs all eligible rows
--llm-max-items only limits the current LLM review batch for debugging
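
A rough sketch of how these rules might combine. The exit-code check follows the text above; the cache reuse by hashing each input row with SHA-256, and all function names, are assumptions made for illustration only.

```python
import hashlib
import json


def should_chain_llm_review(scrape_exit_code: int) -> bool:
    """The LLM step runs only after scrape exits with 0 or 2."""
    return scrape_exit_code in (0, 2)


def input_hash(row: dict) -> str:
    """Stable hash of a row's content (hashing scheme is an assumption)."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def select_rows_to_review(rows, cached_hashes, force=False, max_items=None):
    """Skip rows with a reusable cached result unless force is set."""
    pending = [r for r in rows if force or input_hash(r) not in cached_hashes]
    # max_items only caps the current batch; it does not change eligibility.
    return pending if max_items is None else pending[:max_items]
```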

Recommended review workflow for non-technical staff:

Open products_report.html
- Use the built-in filters to switch between:
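  - 只看拒绝入库 (show rejected items only)
  - 只看建议入库 (show recommended items only)
  - specific reject reasons such as 遥控控制类 (remote-control category) or 点火控制类 (ignition-control category)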
Use review_only.csv for spreadsheet review
- This is the compact handoff file for staff
- Key columns:
  - title / title_zh
  - decision_label
  - stage_label
  - review_note
Use products_zh.csv only when more product context is needed
- It keeps the fuller translated dataset for deeper review
Use products_filter_audit_zh.csv when blacklist hit details must be audited
- It retains the rule-hit columns and zh labels

Blacklist filtering semantics:

When blacklist filtering is enabled, --max-items means final accepted product count.
The scraper first runs a listing-title prefilter and skips detail-page visits for obvious blacklist hits.
Remaining products can still be rejected after detail enrichment from title and attributes_text.
breadcrumb and description_text only create warnings.
products.csv only contains accepted products.
products_filter_audit.csv contains all accepted/rejected decisions that were kept for the run and adds filter_stage:
- listing_title
- detail_post_enrich
- accepted

Local-first verification before any live-site validation:

python -m pytest tests/test_filtering.py tests/test_cli.py tests/test_output.py tests/test_browser.py -q

Output Files and Code Map

`products.csv`

Purpose: product-level detail table. One row is one scraped product. Use this file to inspect and filter individual products.

Columns:

source_type: scrape source type, one of keyword, category, or url.
source_value: keyword, category URL, or generic URL used for this run.
title: product title.
price: displayed listing price.
sold_count: parsed sold/order count.
rating: parsed product rating; 0.0 means the listing/detail page did not expose a reliable rating.
review_count: parsed review count; currently often 0 because AliExpress listing cards do not consistently expose it.
product_url: resolved product detail URL. For promo cards, this is the resolved item URL.
search_card_url: original search-card URL before any promo resolution.
image_url: primary image URL.
entry_type: item_card for normal item cards, promo_card for BundleDeals2 / Dollar Express cards.
is_promoted: whether the row came through a promo landing flow.
promo_channel: promo channel name, such as Dollar Express.
promotion_text: flattened promo text such as Free shipping on 3 items | Free returns | Buy more,save more.
promo_landing_url: promo landing page URL for promo cards; empty for normal item cards.
shop_name: store name from the product detail page when --enrich-detail is enabled.
shipping_text: shipping-related text from the product detail page when available.
detail_rating: rating parsed from the product detail page.
detail_review_count: review count parsed from the product detail page.
breadcrumb: flattened breadcrumb text from the product detail page.
attributes_text: JSON string of detail-page attribute key/value pairs.
description_text: cleaned plain-text product description from the detail page.
scraped_at: UTC scrape timestamp.
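
Because attributes_text is a JSON string inside a CSV cell, downstream scripts need to decode it after reading the row. A minimal sketch, assuming the JSON value is an object of key/value pairs; the helper names and the sample rows are made up for illustration.

```python
import csv
import io
import json


def parse_attributes(row: dict) -> dict:
    """Decode attributes_text; return {} when it is empty or not a JSON object."""
    raw = row.get("attributes_text") or ""
    try:
        parsed = json.loads(raw) if raw else {}
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def read_products(csv_text: str) -> list:
    """Read CSV text into dict rows and attach a decoded 'attributes' key."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["attributes"] = parse_attributes(row)
    return rows


# Build a tiny in-memory sample so the csv module handles quoting of the JSON cell.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "attributes_text"])
writer.writeheader()
writer.writerow({"title": "Filter A", "attributes_text": json.dumps({"Material": "ABS"})})
writer.writerow({"title": "Filter B", "attributes_text": ""})
SAMPLE_CSV = buffer.getvalue()
```

For a real run you would read the products.csv file from the run directory instead of SAMPLE_CSV.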

Promo-card behavior:

Search results may contain Dollar Express / BundleDeals2 cards whose href is not /item/....
The scraper keeps those rows as valid search hits when the card itself carries product content.
For promo rows, the scraper resolves the entry product's real /item/<id>.html URL and stores that in product_url.
The scraper does not expand all products inside the promo landing page; it only follows the entry product and preserves promo metadata.

Code locations:

Schema/dataclass: ali_mvp/scoring.py -> ProductRecord
CSV columns: ali_mvp/output.py -> PRODUCT_FIELDS
CSV writer: ali_mvp/output.py -> write_products_csv()
Raw browser extraction: ali_mvp/browser.py -> PRODUCT_SCRIPT
Raw-to-record normalization: ali_mvp/extractor.py -> normalize_products()
Output path and write call: ali_mvp/cli.py -> run_scrape()

`category_rank.csv`

Purpose: source-level summary table. One row summarizes one scrape source, such as one keyword or one category URL. Use this file to compare whether a keyword/category is worth deeper analysis.

When blacklist filtering is enabled, this file is calculated from accepted products only.

Columns:

source_value: keyword, category URL, or generic URL being summarized.
product_count: number of normalized products in this run.
total_sold_count: sum of sold_count.
avg_rating: average rating.
avg_review_count: average review_count.
heat_score: simple ranking score for quick comparison.

Current heat score formula:

heat_score = total_sold_count + total_review_count + product_count * 10 + avg_rating * 10
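
As a worked example of the formula, the sketch below aggregates a few rows and applies it. It assumes total_review_count is the sum of review_count and that avg_rating is a plain mean over all rows (including any 0.0 ratings); the actual aggregation in ali_mvp/scoring.py may differ.

```python
def heat_score(total_sold_count, total_review_count, product_count, avg_rating):
    """Apply the documented formula directly."""
    return total_sold_count + total_review_count + product_count * 10 + avg_rating * 10


def score_rows(rows: list) -> float:
    """Aggregate product rows, then compute the heat score."""
    product_count = len(rows)
    total_sold = sum(r["sold_count"] for r in rows)
    total_reviews = sum(r["review_count"] for r in rows)
    avg_rating = sum(r["rating"] for r in rows) / product_count if product_count else 0.0
    return heat_score(total_sold, total_reviews, product_count, avg_rating)


rows = [
    {"sold_count": 100, "review_count": 10, "rating": 4.0},
    {"sold_count": 50, "review_count": 0, "rating": 5.0},
]
# 150 sold + 10 reviews + 2 products * 10 + 4.5 average rating * 10 = 225.0
print(score_rows(rows))
```

Note that sold and review totals dominate the score for any sizeable run, so the rating term mostly acts as a tie-breaker.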

Code locations:

Schema/dataclass: ali_mvp/scoring.py -> RankRecord
Aggregation and formula: ali_mvp/scoring.py -> aggregate_rank() and _build_rank()
CSV columns: ali_mvp/output.py -> RANK_FIELDS
CSV writer: ali_mvp/output.py -> write_rank_csv()
Output path and write call: ali_mvp/cli.py -> run_scrape()

`products_filter_audit.csv`

Purpose: filtering audit table. This file records the accepted/rejected decisions kept for the current run, including rows rejected at the listing_title prefilter stage before detail enrichment. Because of that prefilter stage, the rows are not limited to normalized products.

Columns:

source_type: scrape source type.
source_value: keyword, category URL, or generic URL used for this run.
title: product title.
product_url: resolved product detail URL.
filter_decision: accepted or rejected.
filter_stage: decision stage for the row:
- listing_title
- detail_post_enrich
- accepted
reject_groups: matched blacklist group names from strong fields.
reject_terms: matched blacklist terms from strong fields.
reject_fields: strong fields that triggered rejection.
warning_groups: matched blacklist group names from weak fields.
warning_terms: matched blacklist terms from weak fields.
warning_fields: weak fields that produced warnings.

Code locations:

Filter engine: ali_mvp/filtering.py
CSV columns: ali_mvp/output.py -> FILTER_AUDIT_FIELDS
CSV writer: ali_mvp/output.py -> write_filter_audit_csv()
Output path and write call: ali_mvp/cli.py -> run_scrape()

`products_review.csv`

Purpose: review-oriented table for accepted and rejected rows with enough product context to audit blacklist decisions quickly.

Columns:

source_type
source_value
title
product_url
image_url
price
search_card_url
entry_type
is_promoted
promo_channel
promotion_text
shop_name
shipping_text
attributes_text
description_text
detail_status
filter_decision
filter_stage
reject_groups
reject_terms
reject_fields
warning_groups
warning_terms
warning_fields

Code locations:

Review row join: ali_mvp/review.py -> build_review_rows()
CSV columns: ali_mvp/output.py -> REVIEW_FIELDS
Output path and write call: ali_mvp/cli.py -> run_scrape()

LLM review artifacts

Run with either:

python -m ali_mvp llm-review --run-dir data/home-appliance-accessories/20260513_151040
python -m ali_mvp scrape --keyword "Home appliance accessories" --max-items 20 --llm-review

Generated files:

products_llm_review.csv
- full LLM review table for all processed products_review.csv rows
- keeps product context, rule-layer context, LLM decision, risk tags, summary, model, prompt version, input hash, and any row-level error
products_final_keep.csv
- subset where llm_decision == keep
products_final_drop.csv
- subset where llm_decision == drop
products_llm_report.html
- HTML review page grouped into keep / drop / error
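
The keep/drop subsets are simple filters on llm_decision. The sketch below shows one way to split rows in a script; treating any row whose decision is neither keep nor drop as an error row is an assumption, as is the function name.

```python
def split_by_decision(rows: list) -> dict:
    """Group review rows into keep / drop / error buckets by llm_decision."""
    buckets = {"keep": [], "drop": [], "error": []}
    for row in rows:
        decision = (row.get("llm_decision") or "").strip().lower()
        # Anything without a clear keep/drop decision is treated as an error row.
        key = decision if decision in ("keep", "drop") else "error"
        buckets[key].append(row)
    return buckets
```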

Code locations:

Config resolution and OpenAI-compatible client: ali_mvp/llm_client.py
Review orchestration, reuse, keep/drop slicing: ali_mvp/llm_review.py
HTML rendering: ali_mvp/llm_reporting.py
CLI entrypoints: ali_mvp/cli.py -> run_llm_review() and run_scrape()

Postprocess artifacts

python -m ali_mvp postprocess --run-dir ... reads the scrape outputs in one run directory and generates:

products_review.csv
products_zh.csv
products_filter_audit_zh.csv
review_only.csv
products_report.html
translation_cache.json

Suggested reviewer usage:

products_report.html
- visual review page
- best for quick pass/fail inspection and reason filtering
review_only.csv
- smallest handoff file for staff
- sorted for manual review with rejected rows first
products_zh.csv
- fuller translated product dataset
products_filter_audit_zh.csv
- full blacklist audit trail with zh labels

Translator options:

--translator identity|mymemory
--translator-email you@example.com sets the optional MyMemory de (contact email) parameter

Code locations:

Orchestration: ali_mvp/postprocess.py -> run_postprocess_for_dir()
HTML rendering: ali_mvp/reporting.py -> render_report_html()
Translation/cache: ali_mvp/translation.py

Limitations

This MVP is for low-frequency validation. It now supports a minimal sequential proxy pool and fixed browser identity per run, but it still does not handle automated CAPTCHA solving, account pools, checkout, or official AliExpress API access.

Current anti-risk status:

Done in this phase:
- session preflight + warm-up
- session risk persistence
- proxy health / cooldown
- browser identity warning
- optional browser pacing / stealth hardening via --browser-hardening off|minimal
- single-proxy or proxy-file based sequential rotation via --proxy, --proxy-file, and --max-blocks-per-proxy
- fixed browser identity per run via --user-agent and --accept-language
- preflight stops the run before scraping when AliExpress is on login, phone verification, or captcha pages
- captcha page detection
- manual captcha wait-and-resume flow
- graceful detail-status fallback when captcha is not cleared
Not done in this phase:
- automatic slider / captcha solving
- aggressive header / fingerprint pool rotation
- proxy health scoring or adaptive pool management
- fully automated recovery under sustained risk-control pressure
- live proxy swap inside one browser session
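
The sequential rotation driven by --max-blocks-per-proxy can be pictured with the small class below. It is a sketch under assumptions: the class name, the per-proxy block counter, and the "move on after N blocks" rule are illustrative, and in line with the limitation above any switch would only take effect when a new browser session is opened.

```python
class SequentialProxyPool:
    """Walk a proxy list in order, moving on after too many blocks."""

    def __init__(self, proxies, max_blocks_per_proxy: int = 2):
        self._proxies = list(proxies)
        self._max_blocks = max_blocks_per_proxy
        self._index = 0
        self._blocks = 0

    @property
    def current(self):
        """Proxy to use for the next session, or None when the pool is exhausted."""
        if self._index < len(self._proxies):
            return self._proxies[self._index]
        return None

    def record_block(self):
        """Count one block on the current proxy and advance once the limit is hit."""
        self._blocks += 1
        if self._blocks >= self._max_blocks:
            self._index += 1
            self._blocks = 0
        return self.current
```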

Manual Validation

After logging in, run:

python -m ali_mvp scrape --keyword "Home appliance accessories" --max-items 20 --enrich-detail --blacklist-file rules/product_blacklist.json --user-data-dir .browser-profile

If the run is blocked, clear the CAPTCHA manually in the same profile and then resume:

python -m ali_mvp resume --run-dir data/home-appliance-accessories/<timestamp>

If no products are extracted, open the browser window and check for region selection, CAPTCHA, cookie banners, or page layout changes.
