Local MVP for validating AliExpress product-selection logic without an AliExpress API key.
python -m venv .venv
.venv\Scripts\python -m pip install -r requirements.txt
Prefer selecting a non-secret LLM profile in the local project .env file:
ALI_MVP_LLM_PROFILE=cheap-review
ALI_MVP_LLM_MODEL=gpt-5.4
The selected profile is read from LLM_PROFILES_PATH or the platform default profile file:
- Windows: %USERPROFILE%\.config\llm-profiles\profiles.toml
- WSL/Linux: ~/.config/llm-profiles/profiles.toml
- Linux server fallback: /etc/llm-profiles/profiles.toml
Each profile stores base_url, model, and either api_key or api_key_env.
For a machine-local global profile you can store api_key directly; for repo-shared templates, keep using api_key_env so secrets stay out of versioned files.
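The profile file lookup and the api_key / api_key_env choice described above can be sketched in Python. This is a minimal illustration, not the project's actual loader: candidate_profile_paths and resolve_api_key are hypothetical names, and trying /etc/llm-profiles/profiles.toml only after the per-user file is an assumption drawn from the "fallback" label.

```python
import os
import sys
from pathlib import Path


def candidate_profile_paths(env=None, platform=None, home=None):
    """Return profile file locations to try, most specific first."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    home = Path.home() if home is None else Path(home)

    explicit = env.get("LLM_PROFILES_PATH")
    if explicit:
        # An explicit path wins; no platform default is consulted.
        return [Path(explicit)]

    user_file = home / ".config" / "llm-profiles" / "profiles.toml"
    if platform.startswith("win"):
        return [user_file]
    # Assumption: the /etc file is tried only after the per-user file.
    return [user_file, Path("/etc/llm-profiles/profiles.toml")]


def resolve_api_key(profile, env=None):
    """Return the key stored in the profile, or the one named by api_key_env."""
    env = os.environ if env is None else env
    if profile.get("api_key"):
        return profile["api_key"]
    # Assumption: api_key_env holds the NAME of an environment variable.
    name = profile.get("api_key_env")
    return env.get(name) if name else None
```

A repo-shared template would carry only api_key_env, so the secret itself lives in the environment of whoever runs the tool.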
Legacy explicit .env values are still supported for temporary overrides:
ALI_MVP_LLM_BASE_URL=https://example.test/v1
ALI_MVP_LLM_API_KEY=sk-example
ALI_MVP_LLM_MODEL=gpt-4.1-mini
Resolution order is: CLI arguments, explicit ALI_MVP_LLM_* values, selected profile, then standard OPENAI_* environment variables.
CLI flags can override config for a single run:
- --llm-base-url
- --llm-api-key
- --llm-model
See docs/llm-profile-config.md for the cross-project Windows/WSL/server setup.
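The resolution order above amounts to "first non-empty value wins". A minimal sketch, assuming each source has already been loaded into a plain dict: resolve_llm_setting is a hypothetical helper, and the mechanical mapping to OPENAI_BASE_URL, OPENAI_API_KEY, and OPENAI_MODEL is an assumption about which OPENAI_* names are read.

```python
def resolve_llm_setting(name, cli_args, dotenv, profile, environ):
    """Return the first non-empty value for name: 'base_url', 'api_key', or 'model'."""
    candidates = [
        cli_args.get(name),                         # 1. CLI flag, e.g. --llm-model
        dotenv.get(f"ALI_MVP_LLM_{name.upper()}"),  # 2. explicit ALI_MVP_LLM_* value
        profile.get(name),                          # 3. selected profile
        environ.get(f"OPENAI_{name.upper()}"),      # 4. OPENAI_* variable (assumed name)
    ]
    for value in candidates:
        if value:
            return value
    return None
```

With this shape, an explicit ALI_MVP_LLM_MODEL in .env overrides the model of the selected profile while base_url and the key still come from the profile.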
Open AliExpress in the browser profile used by DrissionPage and log in manually before scraping.
By default the MVP stores that profile in .browser-profile and uses local port 9333, avoiding conflicts with other Chrome debugging sessions.
python -m ali_mvp scrape --keyword "women dress" --max-items 80
python -m ali_mvp scrape --keyword "women dress" --max-items 80 --pages 2
python -m ali_mvp scrape --url "https://www.aliexpress.com/..." --max-items 80
python -m ali_mvp scrape --category-url "https://www.aliexpress.com/category/100003109/women-clothing.html" --max-items 80
Browser hardening:
- --browser-hardening off|minimal
- default: minimal
Proxy and browser identity:
- --proxy http://127.0.0.1:8080
- --proxy-file proxies.txt
- --max-blocks-per-proxy 2
- --user-agent "ua-fixed"
- --accept-language "en-US,en;q=0.9"
Recommended default for a single-account workflow:
- keep one logged-in profile stable
- keep one exit path stable
- keep one stable browser major version / UA pair per account
- do not enable proxy-pool rotation unless you explicitly need a fallback path
- if you do not pass --proxy or --proxy-file, the default --proxy-provider manual mode runs without a proxy pool
- treat --proxy-provider v2rayn as an opt-in fallback mode, not the default path
Use the local v2rayN installation as a proxy source:
python -m ali_mvp scrape \
--keyword "Home appliance accessories" \
--proxy-provider v2rayn \
--v2rayn-dir "C:\Users\lxy\Desktop\v2rayN-windows-64" \
--enrich-detail \
--user-data-dir .browser-profile
Behavior in this phase:
- reads nodes from guiConfigs/guiNDB.db -> ProfileItem
- generates per-node sidecar xray configs under <run_dir>/proxy_runtime
- probes each local socks5 endpoint before opening the browser
- picks one healthy endpoint for the current run
- attempts to restore the last persisted proxy selection on resume when that proxy is still eligible
- proxy health cooldown is fallback memory, not a periodic rotation scheduler
- cleans sidecar processes on exit
Current limitations:
- no mid-run live proxy hot-swap inside one browser session
- no automatic CAPTCHA solving
- no adaptive long-term health scoring beyond startup probe and per-run rotation
Pagination semantics:
- --max-items is the total number of products requested for the run.
- --pages is an optional maximum page limit.
- If --pages is omitted, the scraper auto-advances until --max-items is reached or no next page is available.
- If you only want the first listing page, pass --pages 1.
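The interaction between --max-items and --pages can be sketched as a small loop. This is an illustration of the semantics only, not the scraper's code: collect_items and the fetch_page callback are hypothetical, and an empty page stands in for "no next page is available".

```python
def collect_items(fetch_page, max_items, pages=None):
    """Collect up to max_items, stopping early at the page limit or the last page.

    fetch_page(n) returns the list of items on page n, or [] when there is no such page.
    """
    items = []
    page = 1
    while len(items) < max_items and (pages is None or page <= pages):
        batch = fetch_page(page)
        if not batch:
            break  # no next page available
        items.extend(batch)
        page += 1
    return items[:max_items]
```

Under this sketch, pages=1 returns only what the first listing page holds even when max_items is larger.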
Optional detail-page enrichment:
python -m ali_mvp scrape --keyword "women dress" --max-items 20 --enrich-detail
Optional product blacklist filtering:
python -m ali_mvp scrape --keyword "Home appliance accessories" --blacklist-file rules/product_blacklist.json
python -m ali_mvp scrape --keyword "Home appliance accessories" --blacklist-file rules/product_blacklist.json --reject-keyword sensor --reject-keyword relay
Run a standalone LLM review for an existing run:
python -m ali_mvp llm-review --run-dir data/home-appliance-accessories/20260513_151040
python -m ali_mvp llm-review --run-dir data/home-appliance-accessories/20260513_151040 --llm-max-items 5
Chain LLM review after scrape:
python -m ali_mvp scrape --keyword "Home appliance accessories" --max-items 20 --llm-review
python -m ali_mvp scrape --keyword "Home appliance accessories" --max-items 20 --llm-review --llm-force
Resume a blocked run:
python -m ali_mvp resume --run-dir data/home-appliance-accessories/20260511_120000
Retry only unfinished details:
python -m ali_mvp resume --run-dir data/home-appliance-accessories/20260511_120000 --details-only
Resume with temporary proxy or browser identity override:
python -m ali_mvp resume --run-dir data/home-appliance-accessories/20260511_120000 --proxy http://127.0.0.1:8080 --user-agent "ua-fixed" --accept-language "en-US,en;q=0.9"
Notes for resume:
- resume attempts to restore the last persisted proxy selection when that proxy is still eligible after health / cooldown filtering
- if the persisted proxy is no longer eligible, resume falls back to another eligible proxy
- proxy overrides apply when a new browser session is opened for resume
- resume does not do live proxy swap inside one browser session after the browser is already open
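The proxy choice made at the start of a resume can be sketched as follows. This is a hedged illustration: choose_resume_proxy is a hypothetical name, eligible is assumed to be the list left after health / cooldown filtering, and both "an override wins" and "fall back to the first eligible entry" are assumptions, since the notes above do not say which other proxy is picked.

```python
def choose_resume_proxy(persisted, eligible, override=None):
    """Pick the proxy for the new browser session opened by resume."""
    if override:
        # Assumption: a --proxy override applies before any persisted choice.
        return override
    if persisted in eligible:
        return persisted  # restore the last persisted selection
    # Assumption: fall back to the first remaining eligible proxy, if any.
    return eligible[0] if eligible else None
```

The choice is made once per browser session, which matches the note that no live swap happens after the browser is open.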
Detail enrichment adds these columns to products.csv:
- entry_type
- search_card_url
- is_promoted
- promo_channel
- promotion_text
- promo_landing_url
- shop_name
- shipping_text
- detail_rating
- detail_review_count
- breadcrumb
- attributes_text
- description_text
Outputs:
- data/<keyword-slug>/<YYYYMMDD_HHMMSS>/products.csv
- data/<keyword-slug>/<YYYYMMDD_HHMMSS>/products_filter_audit.csv
- data/<keyword-slug>/<YYYYMMDD_HHMMSS>/products_review.csv
- data/<keyword-slug>/<YYYYMMDD_HHMMSS>/category_rank.csv
- data/<keyword-slug>/<YYYYMMDD_HHMMSS>/run_manifest.json
- data/<keyword-slug>/<YYYYMMDD_HHMMSS>/run_state.json
- data/<keyword-slug>/<YYYYMMDD_HHMMSS>/run_summary.json
For example, --keyword "women dress" writes to:
data/women-dress/20260508_224530/products.csv
data/women-dress/20260508_224530/products_filter_audit.csv
data/women-dress/20260508_224530/products_review.csv
data/women-dress/20260508_224530/category_rank.csv
URL-based runs are grouped under data/url/<YYYYMMDD_HHMMSS>/.
Category URL runs are grouped by the category slug when the URL exposes one:
data/category-women-clothing/20260508_224530/products.csv
data/category-women-clothing/20260508_224530/products_filter_audit.csv
data/category-women-clothing/20260508_224530/products_review.csv
data/category-women-clothing/20260508_224530/category_rank.csv
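The directory naming shown in these examples can be reproduced with a short sketch. The helper names are hypothetical and the slug rule (lowercase, non-alphanumeric runs collapsed to a hyphen) is an assumption that happens to match the examples above; the project's real slug function may differ on edge cases.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath


def keyword_slug(keyword):
    # Lowercase and collapse every run of non-alphanumerics into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", keyword.lower()).strip("-")


def category_group(url):
    # Returns None when the URL exposes no category slug.
    match = re.search(r"/category/\d+/([^/?#]+)\.html", url)
    return f"category-{match.group(1)}" if match else None


def run_dir(group, started_at):
    # data/<group>/<YYYYMMDD_HHMMSS>
    return PurePosixPath("data") / group / started_at.strftime("%Y%m%d_%H%M%S")
```

For instance, run_dir(keyword_slug("women dress"), datetime(2026, 5, 8, 22, 45, 30)) yields the data/women-dress/20260508_224530 directory used in the keyword example.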
Postprocess outputs:
python -m ali_mvp postprocess --run-dir data/home-appliance-accessories/20260511_120000
Use MyMemory for free zh translation:
python -m ali_mvp postprocess --run-dir data/home-appliance-accessories/20260511_120000 --translator mymemory
Optional higher-quota hint for MyMemory:
python -m ali_mvp postprocess --run-dir data/home-appliance-accessories/20260511_120000 --translator mymemory --translator-email you@example.com
Additional outputs:
- products_zh.csv
- products_filter_audit_zh.csv
- review_only.csv
- products_report.html
- translation_cache.json
LLM review outputs:
- products_llm_review.csv
- products_final_keep.csv
- products_final_drop.csv
- products_llm_report.html
LLM review behavior:
- llm-review reads products_review.csv from one existing run directory
- scrape --llm-review only triggers the LLM step after scrape exits with 0 or 2
- --llm-force ignores reusable cached rows and re-runs all eligible rows
- --llm-max-items only limits the current LLM review batch for debugging
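How --llm-force and --llm-max-items interact with cached rows can be sketched as below. Everything here is an assumption for illustration: the README only states that cached rows are reusable and that an input hash is recorded, so the hash recipe, the cached_hashes set, and the helper names are hypothetical.

```python
import hashlib
import json


def input_hash(row, prompt_version):
    # Assumption: a row counts as unchanged when its content and prompt version match.
    payload = json.dumps({"row": row, "prompt_version": prompt_version}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def select_rows_for_review(rows, cached_hashes, prompt_version, force=False, max_items=None):
    """Return the rows that would be sent to the LLM in this batch."""
    pending = [
        row for row in rows
        if force or input_hash(row, prompt_version) not in cached_hashes
    ]
    # max_items only trims the current batch; it does not change eligibility.
    return pending if max_items is None else pending[:max_items]
```

In this sketch a forced run ignores the cache entirely, while a capped run leaves the remaining rows for a later invocation.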
Recommended review workflow for non-technical staff:
- Open products_report.html
  - Use the built-in filters to switch between:
    - 只看拒绝入库 (rejected only)
    - 只看建议入库 (recommended only)
    - specific reject reasons such as 遥控控制类 (remote control) or 点火控制类 (ignition control)
- Use review_only.csv for spreadsheet review
  - This is the compact handoff file for staff
  - Key columns:
    - title / title_zh
    - decision_label
    - stage_label
    - review_note
- Use products_zh.csv only when more product context is needed
  - It keeps the fuller translated dataset for deeper review
- Use products_filter_audit_zh.csv when blacklist hit details must be audited
  - It retains the rule-hit columns and zh labels
Blacklist filtering semantics:
- When blacklist filtering is enabled, --max-items means final accepted product count.
- The scraper first runs a listing-title prefilter and skips detail-page visits for obvious blacklist hits.
- Remaining products can still be rejected after detail enrichment from title and attributes_text.
- breadcrumb and description_text only create warnings.
- products.csv only contains accepted products.
- products_filter_audit.csv contains all accepted/rejected decisions that were kept for the run and adds filter_stage:
  - listing_title
  - detail_post_enrich
  - accepted
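The strong-field / weak-field split described above can be sketched like this. It is a simplified illustration, not the engine in ali_mvp/filtering.py: classify is a hypothetical function, and case-insensitive substring matching is an assumption about how blacklist terms are compared.

```python
STRONG_FIELDS = ("title", "attributes_text")        # a hit here rejects the product
WEAK_FIELDS = ("breadcrumb", "description_text")    # a hit here only warns


def classify(product, terms):
    """Return (decision, reject_fields, warning_fields) for one enriched product."""

    def hits(fields):
        # Assumption: a term matches when it appears anywhere in the field text.
        return [
            field for field in fields
            if any(term.lower() in str(product.get(field, "")).lower() for term in terms)
        ]

    reject_fields = hits(STRONG_FIELDS)
    warning_fields = hits(WEAK_FIELDS)
    decision = "rejected" if reject_fields else "accepted"
    return decision, reject_fields, warning_fields
```

A product whose description mentions a blacklisted term is still accepted in this sketch, but the hit is kept so it can surface in the warning columns of the audit file.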
Local-first verification before any live-site validation:
python -m pytest tests/test_filtering.py tests/test_cli.py tests/test_output.py tests/test_browser.py -q
Purpose of products.csv: product-level detail table. One row is one scraped product. Use this file to inspect and filter individual products.
Columns:
- source_type: scrape source type, one of keyword, category, or url.
- source_value: keyword, category URL, or generic URL used for this run.
- title: product title.
- price: displayed listing price.
- sold_count: parsed sold/order count.
- rating: parsed product rating; 0.0 means the listing/detail page did not expose a reliable rating.
- review_count: parsed review count; currently often 0 because AliExpress listing cards do not consistently expose it.
- product_url: resolved product detail URL. For promo cards, this is the resolved item URL.
- search_card_url: original search-card URL before any promo resolution.
- image_url: primary image URL.
- entry_type: item_card for normal item cards, promo_card for BundleDeals2 / Dollar Express cards.
- is_promoted: whether the row came through a promo landing flow.
- promo_channel: promo channel name, such as Dollar Express.
- promotion_text: flattened promo text such as Free shipping on 3 items | Free returns | Buy more,save more.
- promo_landing_url: promo landing page URL for promo cards; empty for normal item cards.
- shop_name: store name from the product detail page when --enrich-detail is enabled.
- shipping_text: shipping-related text from the product detail page when available.
- detail_rating: rating parsed from the product detail page.
- detail_review_count: review count parsed from the product detail page.
- breadcrumb: flattened breadcrumb text from the product detail page.
- attributes_text: JSON string of detail-page attribute key/value pairs.
- description_text: cleaned plain-text product description from the detail page.
- scraped_at: UTC scrape timestamp.
Promo-card behavior:
- Search results may contain Dollar Express / BundleDeals2 cards whose href is not /item/....
- The scraper keeps those rows as valid search hits when the card itself carries product content.
- For promo rows, the scraper resolves the entry product's real /item/<id>.html URL and stores that in product_url.
- The scraper does not expand all products inside the promo landing page; it only follows the entry product and preserves promo metadata.
Code locations:
- Schema/dataclass: ali_mvp/scoring.py -> ProductRecord
- CSV columns: ali_mvp/output.py -> PRODUCT_FIELDS
- CSV writer: ali_mvp/output.py -> write_products_csv()
- Raw browser extraction: ali_mvp/browser.py -> PRODUCT_SCRIPT
- Raw-to-record normalization: ali_mvp/extractor.py -> normalize_products()
- Output path and write call: ali_mvp/cli.py -> run_scrape()
Purpose of category_rank.csv: source-level summary table. One row summarizes one scrape source, such as one keyword or one category URL. Use this file to compare whether a keyword/category is worth deeper analysis.
When blacklist filtering is enabled, this file is calculated from accepted products only.
Columns:
- source_value: keyword, category URL, or generic URL being summarized.
- product_count: number of normalized products in this run.
- total_sold_count: sum of sold_count.
- avg_rating: average rating.
- avg_review_count: average review_count.
- heat_score: simple ranking score for quick comparison.
Current heat score formula:
heat_score = total_sold_count + total_review_count + product_count * 10 + avg_rating * 10
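The formula can be checked by hand with a short sketch. The function below is illustrative rather than the code in ali_mvp/scoring.py; it assumes total_review_count is the sum of review_count, that avg_rating is a plain mean that includes 0.0 ratings, and that an empty source scores 0.0.

```python
def heat_score(products):
    """Compute the heat score for one source.

    products: list of dicts with sold_count, review_count, and rating.
    """
    product_count = len(products)
    if product_count == 0:
        return 0.0  # assumption: an empty source has no heat
    total_sold_count = sum(p["sold_count"] for p in products)
    total_review_count = sum(p["review_count"] for p in products)
    avg_rating = sum(p["rating"] for p in products) / product_count
    return total_sold_count + total_review_count + product_count * 10 + avg_rating * 10
```

For two products with sold counts 100 and 50, review counts 10 and 0, and ratings 4.5 and 0.0, the terms are 150 + 10 + 20 + 22.5, giving 202.5. Because sold counts are usually much larger than the other terms, they dominate the score.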
Code locations:
- Schema/dataclass: ali_mvp/scoring.py -> RankRecord
- Aggregation and formula: ali_mvp/scoring.py -> aggregate_rank() and _build_rank()
- CSV columns: ali_mvp/output.py -> RANK_FIELDS
- CSV writer: ali_mvp/output.py -> write_rank_csv()
- Output path and write call: ali_mvp/cli.py -> run_scrape()
Purpose of products_filter_audit.csv: filtering audit table. This file records the accepted/rejected decisions kept for the current run, including rejected rows from the listing_title prefilter stage before detail enrichment. Because of that prefilter stage, not every row corresponds to a normalized product.
Columns:
- source_type: scrape source type.
- source_value: keyword, category URL, or generic URL used for this run.
- title: product title.
- product_url: resolved product detail URL.
- filter_decision: accepted or rejected.
- filter_stage: decision stage for the row:
  - listing_title
  - detail_post_enrich
  - accepted
- reject_groups: matched blacklist group names from strong fields.
- reject_terms: matched blacklist terms from strong fields.
- reject_fields: strong fields that triggered rejection.
- warning_groups: matched blacklist group names from weak fields.
- warning_terms: matched blacklist terms from weak fields.
- warning_fields: weak fields that produced warnings.
Code locations:
- Filter engine: ali_mvp/filtering.py
- CSV columns: ali_mvp/output.py -> FILTER_AUDIT_FIELDS
- CSV writer: ali_mvp/output.py -> write_filter_audit_csv()
- Output path and write call: ali_mvp/cli.py -> run_scrape()
Purpose of products_review.csv: review-oriented table for accepted and rejected rows with enough product context to audit blacklist decisions quickly.
Columns:
- source_type
- source_value
- title
- product_url
- image_url
- price
- search_card_url
- entry_type
- is_promoted
- promo_channel
- promotion_text
- shop_name
- shipping_text
- attributes_text
- description_text
- detail_status
- filter_decision
- filter_stage
- reject_groups
- reject_terms
- reject_fields
- warning_groups
- warning_terms
- warning_fields
Code locations:
- Review row join: ali_mvp/review.py -> build_review_rows()
- CSV columns: ali_mvp/output.py -> REVIEW_FIELDS
- Output path and write call: ali_mvp/cli.py -> run_scrape()
Run the LLM review with either:
python -m ali_mvp llm-review --run-dir data/home-appliance-accessories/20260513_151040
python -m ali_mvp scrape --keyword "Home appliance accessories" --max-items 20 --llm-review
Generated files:
- products_llm_review.csv
  - full LLM review table for all processed products_review.csv rows
  - keeps product context, rule-layer context, LLM decision, risk tags, summary, model, prompt version, input hash, and any row-level error
- products_final_keep.csv
  - subset where llm_decision == keep
- products_final_drop.csv
  - subset where llm_decision == drop
- products_llm_report.html
  - HTML review page grouped into keep / drop / error
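The keep / drop / error grouping can be sketched as a simple partition over the rows of products_llm_review.csv. split_by_decision is a hypothetical helper, and treating every row whose llm_decision is neither keep nor drop as an error row is an assumption.

```python
def split_by_decision(rows):
    """Partition review rows (dicts) into keep, drop, and error groups."""
    keep = [row for row in rows if row.get("llm_decision") == "keep"]
    drop = [row for row in rows if row.get("llm_decision") == "drop"]
    # Assumption: anything without a keep/drop decision failed at row level.
    errors = [row for row in rows if row.get("llm_decision") not in ("keep", "drop")]
    return keep, drop, errors
```

Rows loaded with csv.DictReader fit this shape directly, since each row is a dict keyed by column name.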
Code locations:
- Config resolution and OpenAI-compatible client: ali_mvp/llm_client.py
- Review orchestration, reuse, keep/drop slicing: ali_mvp/llm_review.py
- HTML rendering: ali_mvp/llm_reporting.py
- CLI entrypoints: ali_mvp/cli.py -> run_llm_review() and run_scrape()
python -m ali_mvp postprocess --run-dir ... reads the scrape outputs in one run directory and generates:
- products_review.csv
- products_zh.csv
- products_filter_audit_zh.csv
- review_only.csv
- products_report.html
- translation_cache.json
Suggested reviewer usage:
- products_report.html
  - visual review page
  - best for quick pass/fail inspection and reason filtering
- review_only.csv
  - smallest handoff file for staff
  - sorted for manual review with rejected rows first
- products_zh.csv
  - fuller translated product dataset
- products_filter_audit_zh.csv
  - full blacklist audit trail with zh labels
Translator options:
- --translator identity|mymemory
- --translator-email you@example.com for the optional MyMemory de parameter
Code locations:
- Orchestration: ali_mvp/postprocess.py -> run_postprocess_for_dir()
- HTML rendering: ali_mvp/reporting.py -> render_report_html()
- Translation/cache: ali_mvp/translation.py
This MVP is for low-frequency validation. It now supports a minimal sequential proxy pool and fixed browser identity per run, but it still does not handle automated CAPTCHA solving, account pools, checkout, or official AliExpress API access.
Current anti-risk status:
- Done in this phase:
  - session preflight + warm-up
  - session risk persistence
  - proxy health / cooldown
  - browser identity warning
  - optional browser pacing / stealth hardening via --browser-hardening off|minimal
  - single-proxy or proxy-file based sequential rotation via --proxy, --proxy-file, and --max-blocks-per-proxy
  - fixed browser identity per run via --user-agent and --accept-language
  - preflight stops the run before scraping when AliExpress is on login, phone verification, or captcha pages
  - captcha page detection
  - manual captcha wait-and-resume flow
  - graceful detail-status fallback when captcha is not cleared
- Not done in this phase:
  - automatic slider / captcha solving
  - aggressive header / fingerprint pool rotation
  - proxy health scoring or adaptive pool management
  - fully automated recovery under sustained risk-control pressure
  - live proxy swap inside one browser session
After logging in, run:
python -m ali_mvp scrape --keyword "Home appliance accessories" --max-items 20 --enrich-detail --blacklist-file rules/product_blacklist.json --user-data-dir .browser-profile
If the run is blocked, clear the CAPTCHA manually in the same profile and then resume:
python -m ali_mvp resume --run-dir data/home-appliance-accessories/<timestamp>
If no products are extracted, open the browser window and check for region selection, CAPTCHA, cookie banners, or page layout changes.