HCG is a standalone HeartBioPortal module for preparing cardiovascular guideline PDFs and converting them into structured JSON artifacts.
HCG supports the HBP 3.0 ecosystem by serving as the clinical guideline extraction resource upstream of the HeartBioPortal guideline dossier and the HCG-KG knowledge graph. It discovers or stages guideline source PDFs, prepares page images, extracts structured page-level JSON, validates gene symbols, and builds guideline release artifacts that downstream HBP services can transform into gene-first guideline context.
The repository now centers on the locally refreshed guideline corpus while retaining the dataset-aware scraper/sync pipeline for future upstream checks:
- acc_aha: ACC guideline discovery on acc.org, with browser-backed PDF resolution for JACC-hosted files.
- esc: ESC guideline discovery on escardio.org, with article-first capture from the linked Oxford Academic guideline pages.
It also supports a staged rerun workflow for locally downloaded PDFs. This is the preferred path for
the data/ESC-NEW and data/AHA-ACC-NEW folders because it separates cheap page-image preparation
from later OpenAI vision extraction.
- src/hcg: Python package with the scraper, OpenAI extractor, release builder, schemas, and CLI.
- data/reference/gene_names.json: Canonical gene reference used during normalization.
- data/ESC-NEW and data/AHA-ACC-NEW: Newly downloaded local rerun corpus. These are treated as source inputs, not OpenAI outputs.
- data/prepared_images/rerun_2026_05_18_api_ready: Page images rendered from the new local PDFs, with one directory and manifest per source PDF.
- docs/project_audit.md: Current project audit and remaining caveats.
- The old checked-in ACC/AHA and ESC raw extraction folders have been removed.
- The refreshed local corpus contains 42 PDFs: 22 ACC/AHA PDFs and 20 ESC PDFs.
- The image-only preparation stage rendered 4,290 page images with 0 failures under data/prepared_images/rerun_2026_05_18_api_ready.
- The rendered page images are intentionally ignored by git; regenerate them locally with hcg prepare-images when needed.
- OpenAI extraction and release building have not been rerun yet for the refreshed corpus.
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
playwright install chromium

On Ubuntu or other minimal Linux hosts, you may also need:
sudo .venv/bin/playwright install --with-deps chromium

pdf2image requires Poppler on the host system.
- macOS: brew install poppler
- Ubuntu/Debian: sudo apt-get install poppler-utils
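
Because pdf2image shells out to Poppler's command-line tools, a missing install usually surfaces only when the first PDF is rendered. The following is a minimal pre-flight sketch, not part of hcg: the function name `poppler_available` is hypothetical, and it only checks that the `pdftoppm` and `pdfinfo` binaries are discoverable on PATH.

```python
import shutil

# Poppler binaries that pdf2image invokes for rendering and page counting.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def poppler_available(which=shutil.which) -> bool:
    """Return True if every required Poppler tool is found on PATH.

    `which` is injectable so the check can be exercised without Poppler.
    """
    return all(which(tool) is not None for tool in POPPLER_TOOLS)
```

Calling `poppler_available()` before a long preparation run gives an early, explicit signal instead of a failure partway through the corpus.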
Scrape both upstream sources and download any missing PDFs:
hcg scrape

Inspect live discovery without downloading files:
hcg scrape --datasets esc --limit 5 --dry-run

Run the end-to-end update flow. This scrapes the source sites, downloads missing PDFs, extracts newly downloaded pages to JSON, aggregates outputs, and rebuilds the ACC/AHA release if that dataset changed:
OPENAI_API_KEY=... hcg sync

Target a specific dataset:
OPENAI_API_KEY=... hcg sync --datasets esc --model gpt-5-mini

For ACC/AHA updates on a desktop session, prefer the visible browser mode because JACC can block headless Chromium with a Cloudflare verification page:
OPENAI_API_KEY=... hcg sync --datasets acc_aha --model gpt-5-mini --show-browser

Extract pages directly for a single scraper-managed dataset:
hcg extract --dataset acc_aha --api-key "$OPENAI_API_KEY"

Prepare the new locally downloaded PDFs as page images without calling the OpenAI API:
PYTHONPATH=src python3 -m hcg prepare-images --overwrite --jobs 4 --image-format jpg

That command defaults to:
- data/ESC-NEW -> data/prepared_images/rerun_2026_05_18_api_ready/esc_new
- data/AHA-ACC-NEW -> data/prepared_images/rerun_2026_05_18_api_ready/acc_aha_new
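
To make the default mapping concrete, here is a rough sketch that pairs each source PDF with the directory its page images would be expected to land in. It is illustrative only and does not reproduce hcg's implementation: `slugify` is a hypothetical slug rule (the real document-slug derivation is not described here), and `plan_outputs` is an invented helper name.

```python
import re
from pathlib import Path

# Default source -> output mapping, as listed above.
PREPARED_ROOT = Path("data/prepared_images/rerun_2026_05_18_api_ready")
DEFAULT_MAPPING = {
    Path("data/ESC-NEW"): PREPARED_ROOT / "esc_new",
    Path("data/AHA-ACC-NEW"): PREPARED_ROOT / "acc_aha_new",
}


def slugify(stem: str) -> str:
    """Hypothetical slug rule: lowercase, runs of other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-")


def plan_outputs(repo_root: Path) -> list[tuple[Path, Path]]:
    """Pair every source PDF with its per-document output directory."""
    plan = []
    for source_dir, output_dir in DEFAULT_MAPPING.items():
        source_path = repo_root / source_dir
        if not source_path.is_dir():
            continue  # Skip corpus folders that are not present locally.
        for pdf in sorted(source_path.glob("*.pdf")):
            plan.append((pdf, repo_root / output_dir / slugify(pdf.stem)))
    return plan
```

Printing the result of `plan_outputs(Path("."))` from the repository root gives a quick view of which PDFs would be processed before any rendering starts.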
Each guideline gets its own directory:
data/prepared_images/rerun_2026_05_18_api_ready/<dataset>/<document-slug>/
  manifest.json
  pages/
    page_0001.jpg
    page_0002.jpg
The top-level corpus_manifest.json records every source PDF, rendered page count, output
directory, and any failures. This is the checkpoint to inspect before spending API credits.
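
As a complement to reading the manifest, the rendered output can be cross-checked directly on disk. The sketch below relies only on the directory layout shown above (`<dataset>/<document-slug>/pages/page_NNNN.jpg`); it deliberately does not parse corpus_manifest.json, since its schema is not described here. The helper names are hypothetical, and the `.jpg` extension assumes the `--image-format jpg` option was used.

```python
from pathlib import Path


def count_rendered_pages(prepared_root: Path) -> dict[str, int]:
    """Map '<dataset>/<document-slug>' to the number of page images on disk."""
    counts = {}
    for pages_dir in sorted(prepared_root.glob("*/*/pages")):
        doc_dir = pages_dir.parent
        key = f"{doc_dir.parent.name}/{doc_dir.name}"
        counts[key] = sum(1 for p in pages_dir.glob("page_*.jpg") if p.is_file())
    return counts


def empty_documents(counts: dict[str, int]) -> list[str]:
    """Return documents whose pages directory holds no rendered images."""
    return sorted(key for key, n in counts.items() if n == 0)
```

Comparing `sum(counts.values())` with the page total reported by the preparation stage, and confirming that `empty_documents` returns an empty list, is an inexpensive sanity check before starting extraction.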
Rerun only stored error pages:
OPENAI_API_KEY=... hcg extract --rerun-error-pages

Build the ACC/AHA release from raw outputs after extraction artifacts are available:
hcg build-release

Without installing the package:
PYTHONPATH=src python -m hcg sync --datasets all --model gpt-5-mini

pytest
python -m hcg scrape --datasets esc --limit 1 --dry-run
python -m hcg build-release

- ACC scraping uses Playwright because the ACC site links out to JACC-hosted documents that are not reliably downloadable through plain HTTP requests.
- New reruns should keep each PDF compartmentalized: source PDF in the local corpus folder, rendered images under data/prepared_images/.../<document-slug>/pages, page JSONs under a matching document output directory, and final release/serving artifacts built only after validation.
- The extraction prompt is image-layout aware. It asks the model to preserve recommendation rows, Class/COR, Level/LOE, figures/tables, and page-continuity flags rather than flattening pages into loose prose.
- Gene extraction is intentionally conservative. The model is instructed to omit bare clinical abbreviations unless the page visually supports a human gene interpretation, and downstream release building still validates symbols against data/reference/gene_names.json.
- If Playwright Chromium is not installed, ACC scraping now fails with a direct instruction to run .venv/bin/playwright install chromium.
- JACC can still block automated access behind a Cloudflare verification page, even in a visible browser. When that happens, the ACC scraper records the item as blocked in the manifest and continues instead of hanging.
- ESC scraping now ignores ESC declaration-of-interest attachments, follows the linked journal article, and renders the article page to PDF for extraction.
- Existing ESC PDFs that look like declaration-of-interest reports are treated as stale and replaced on the next hcg scrape or hcg sync run.
- Scraper logs are written to data/<dataset>/scraper.log.
- Scraper manifests are written to data/<dataset>/scraper_manifest.json.
- hcg sync and hcg extract now fail immediately with a clear error if OPENAI_API_KEY is not set.
- hcg sync extracts any tracked PDFs that are still missing JSON outputs, even if those PDFs were downloaded in an earlier run.
- hcg sync does not redownload PDFs that already exist locally and match the upstream scraper catalog.
- If you ever install hcg non-editably, set HCG_PROJECT_ROOT=/absolute/path/to/HCG so outputs still land in the repo data/ directory.
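
The conservative gene handling described in the notes above can be pictured as a strict membership test against the canonical reference. The sketch below is an illustration under stated assumptions rather than the release builder's actual logic: `validate_symbols` is a hypothetical helper, the reference is passed in as a plain set of symbols because the structure of gene_names.json is not described here, and matching is exact and case-sensitive by assumption.

```python
def validate_symbols(candidates, reference):
    """Split candidate symbols into (accepted, rejected) lists.

    A symbol is accepted only if it matches an entry in `reference` exactly;
    duplicates and blank entries are dropped, and input order is preserved.
    """
    accepted, rejected = [], []
    seen = set()
    for raw in candidates:
        symbol = raw.strip()
        if not symbol or symbol in seen:
            continue
        seen.add(symbol)
        if symbol in reference:
            accepted.append(symbol)
        else:
            rejected.append(symbol)
    return accepted, rejected
```

For example, with a reference of `{"MYH7", "TTN"}`, the candidates `["MYH7", " TTN ", "LOE", "MYH7"]` split into `["MYH7", "TTN"]` accepted and `["LOE"]` rejected, so a bare clinical abbreviation such as LOE never reaches the release as a gene.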
The repository is intentionally data-heavy because it ships the refreshed guideline PDFs used for the next HeartBioPortal guideline extraction run.
HCG is the source-document and structured-extraction layer for cardiovascular guideline evidence in HeartBioPortal 3.0. HCG-KG builds graph and query artifacts from parsed guideline JSON, and DataHub can consume guideline-derived outputs for HBP search dossiers.
Related HBP 3.0 repositories:
- HeartBioPortal organization: https://github.com/HeartBioPortal
- Live site: https://heartbioportal.org/
- DataHub: https://github.com/HeartBioPortal/DataHub
- HCG-KG: https://github.com/HeartBioPortal/HCG-KG
This repository supports the HeartBioPortal 3.0 NAR Database Issue manuscript release (v3.0.0-nar). Release-support files include citation metadata, source and output manifests, provenance documentation, release notes, and checksum tooling.
Clinical guideline outputs are intended to expose guideline context. They should not be interpreted as medical advice, automated clinical recommendations, or direct clinical actionability.
No controlled individual-level human data should be committed. Do not commit API keys, credentials, protected data, tokens, or restricted source data. Source-specific licensing controls redistribution of guideline PDFs, snippets, and other third-party content; if redistribution rights are uncertain, document the source in GUIDELINE_SOURCES.tsv rather than adding new raw data.