HeartBioPortal DataHub is the data integration and publishing layer behind HeartBioPortal. It standardizes heterogeneous cardiovascular genomics and omics datasets, preserves provenance, performs raw-level integration, and emits both legacy-compatible analyzed outputs and newer serving artifacts.
Comprehensive documentation lives under docs/ and can also be served as a documentation website.
- Start with: docs/index.md
- Script standards: SCRIPT_MANIFESTO.md
- Architecture guide: docs/architecture/
- Pipeline guides: docs/pipelines/
- Extension/contributor guides: docs/extending/ and docs/contributing.md
Local docs preview:
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[docs]"
mkdocs serve
Production docs site:
- mkdocs.yml defines the site.
- .github/workflows/docs.yml builds and deploys to GitHub Pages.
- Enable GitHub Pages in the repository and choose GitHub Actions as the source.
git clone https://github.com/HeartBioPortal/DataHub.git
cd DataHub
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e ".[test]"
python -m pytest
For script-only environments, requirements.txt is still available. The package metadata in pyproject.toml is the canonical development install path.
pip install -r requirements.txt
Current runtime requirements:
- jsonschema
- jsonschema2md
- PyGithub
- pandas
- duckdb
- requests
Test dependencies live under the test optional extra in pyproject.toml.
- scripts/prepare_association_raw.py
- scripts/build_legacy_association.py
- scripts/run_ingestion.py
- scripts/run_structural_variant_ingestion.py
- scripts/dataset_specific_scripts/mvp/run_mvp_pipeline.py
- scripts/dataset_specific_scripts/unified/run_unified_pipeline.py
- scripts/dataset_specific_scripts/unified/run_secondary_analyses.py
- scripts/dataset_specific_scripts/unified/run_gene_profile_pipeline.py
- scripts/dataset_specific_scripts/unified/build_dbsnp_frequency_index.py
- scripts/dataset_specific_scripts/unified/build_dbsnp_frequency_parquet.py
- scripts/dataset_specific_scripts/unified/canonicalize_variant_viewer_artifacts.py
- scripts/report_artifact_qa.py
Editable installs also expose console commands such as:
- datahub-run-ingestion
- datahub-run-unified-pipeline
- datahub-ingest-mvp-duckdb-fast
- datahub-publish-unified-from-duckdb
- datahub-build-serving-duckdb
- datahub-run-secondary-analyses
- datahub-run-gene-profile-pipeline
- datahub-build-dbsnp-frequency-index
- datahub-build-dbsnp-frequency-parquet
- datahub-report-artifact-qa
- src/datahub/: reusable pipeline, adapter, config, validation, storage, and publisher code
- config/: profiles, manifests, runtime configs, phenotype hierarchy, output contracts, and export manifests
- raw_data/: small checked-in standalone source files organized by source ID
- analyzed_data/: curated analyzed artifacts and merge/metadata seed payloads organized by source ID
- scripts/: operational entrypoints for preparation, ingest, publish, and orchestration
- tests/: focused coverage for adapters, manifests, publishers, runners, and serving builders
- docs/: contributor-facing documentation published at https://heartbioportal.github.io/DataHub/
Config JSON files are validated by JSON Schemas in config/schemas/.
- Keep biological/analytical logic in DataHub, not in downstream application layers.
- Preserve provenance as early as possible and avoid throwing detail away during normalization.
- Make source-specific behavior explicit through config and adapters, not hidden conditionals.
- Keep published outputs stable for consumers while allowing additive metadata evolution.
- Separate concerns between raw preparation, canonical ingestion, analyzed publication, and serving artifacts.
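To make the adapter and provenance principles above concrete, here is a minimal sketch of a source-adapter registry. It is illustrative only: `ADAPTERS`, `register_adapter`, `normalize`, the `example_source` ID, and every field name are hypothetical and are not taken from src/datahub/.

```python
from typing import Callable, Dict

# Hypothetical registry: source ID -> function that normalizes one raw row.
ADAPTERS: Dict[str, Callable[[dict], dict]] = {}


def register_adapter(source_id: str):
    """Register a normalizer for one source ID (illustrative only)."""
    def decorator(func):
        ADAPTERS[source_id] = func
        return func
    return decorator


@register_adapter("example_source")  # hypothetical source ID
def normalize_example(row: dict) -> dict:
    # Source-specific field mapping lives here, not in shared pipeline code.
    return {"gene": row["GENE_SYMBOL"].upper(), "p_value": float(row["P"])}


def normalize(source_id: str, row: dict, source_file: str) -> dict:
    """Dispatch to the registered adapter and attach provenance to the result."""
    if source_id not in ADAPTERS:
        raise KeyError(f"no adapter registered for source {source_id!r}")
    record = ADAPTERS[source_id](row)
    # Keep the untouched raw row so no detail is lost during normalization.
    record["provenance"] = {
        "source_id": source_id,
        "source_file": source_file,
        "raw": dict(row),
    }
    return record
```

An unknown source ID fails loudly instead of falling through a hidden conditional, and provenance is added as an extra key, so existing consumers of the normalized fields are unaffected.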
DataHub still supports legacy HeartBioPortal-compatible analyzed payloads, but the codebase now also maintains a newer serving-artifact path based on DuckDB. The legacy path exists for compatibility; the unified DuckDB-first path is the strategic direction.
DataHub is the canonical HBP 3.0 data-owner repository. It prepares source manifests, normalizes source-specific fields, publishes association and secondary-analysis artifacts, and builds serving datamarts consumed by the HeartBioPortal backend and frontend.
Related HBP 3.0 repositories:
- HeartBioPortal organization: https://github.com/HeartBioPortal
- Live site: https://heartbioportal.org/
- HCG guideline extraction resource: https://github.com/HeartBioPortal/HCG
- HCG-KG guideline knowledge graph resource: https://github.com/HeartBioPortal/HCG-KG
This repository supports the HeartBioPortal 3.0 NAR Database Issue manuscript release (v3.0.0-nar). The release-support files in this repository describe source provenance, licensing constraints, generated artifacts, reproducibility expectations, and files that should or should not be included in a public archive.
Release metadata and manifests:
- CITATION.cff
- .zenodo.json
- RELEASE_NOTES.md
- MANIFEST.md
- DATA_SOURCES.tsv
- DATA_SOURCES.md
- ARTIFACT_MANIFEST.tsv
- BUILD_METADATA.json
- LICENSES.md
- PROVENANCE_SCHEMA.md
- docs/schemas/*.md
- scripts/generate_checksums.sh
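A quick presence check for these release-support files could look like the sketch below. The file names come from the list above; the function name `missing_release_files` and the rule that docs/schemas/ must contain at least one .md file are assumptions made for illustration.

```python
from pathlib import Path

# File names copied from the release metadata list; the docs/schemas/*.md
# glob is checked separately below.
RELEASE_FILES = [
    "CITATION.cff", ".zenodo.json", "RELEASE_NOTES.md", "MANIFEST.md",
    "DATA_SOURCES.tsv", "DATA_SOURCES.md", "ARTIFACT_MANIFEST.tsv",
    "BUILD_METADATA.json", "LICENSES.md", "PROVENANCE_SCHEMA.md",
    "scripts/generate_checksums.sh",
]


def missing_release_files(repo_root: str) -> list:
    """Return the expected release-support paths that do not exist under repo_root."""
    root = Path(repo_root)
    missing = [name for name in RELEASE_FILES if not (root / name).is_file()]
    # Assumption: at least one schema document should exist under docs/schemas/.
    if not any(root.glob("docs/schemas/*.md")):
        missing.append("docs/schemas/*.md")
    return missing
```

Calling `missing_release_files(".")` from the repository root returns an empty list when every file is in place.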
Use this checklist before creating a GitHub release or Zenodo archive:
- Confirm the release branch and commit:
  git status --short --branch
  git rev-parse HEAD
- Validate the code and docs in the target environment:
  python -m pytest
  mkdocs build --strict
- Regenerate or review manifest files:
  - Review DATA_SOURCES.tsv and DATA_SOURCES.md against the current source configs and pipeline inputs.
  - Review ARTIFACT_MANIFEST.tsv against production QA reports, generated artifact directories, and serving DB tables.
  - Update BUILD_METADATA.json with the final release commit, build date, schema version, and verified production metrics.
- Verify counts from production artifacts where available:
  datahub-report-artifact-qa --help
  Counts that cannot be verified from committed local artifacts should remain TBD; verify from production QA.
- Generate release checksums for release-relevant static files:
  scripts/generate_checksums.sh
- Include in Zenodo:
  - repository source code
  - config schemas and manifests
  - documentation
  - small examples or seed metadata that are redistributable
  - generated release manifests and checksum files
- Do not include in Zenodo unless redistribution has been confirmed:
  - controlled individual-level human data
  - API keys, credentials, tokens, or secrets
  - raw DrugBank full database files
  - large source datasets with unclear redistribution rights
  - controlled-access or license-restricted third-party source files
  - massive generated artifacts unless they are intended, permitted, and documented for the release package
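The checksum step in the checklist can be approximated in Python for environments without a shell. This is a sketch, not a description of what scripts/generate_checksums.sh does: the function names and the sha256sum-style `<digest>  <path>` line format are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(repo_root: str, relative_paths: list) -> list:
    """Return '<digest>  <path>' lines, sorted by path, for the files that exist."""
    root = Path(repo_root)
    lines = []
    for rel in sorted(relative_paths):
        target = root / rel
        if target.is_file():  # silently skip paths that are absent
            lines.append(f"{sha256_of(target)}  {rel}")
    return lines
```

Sorting the paths keeps the output deterministic, so two runs over the same files produce byte-identical checksum listings.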
No controlled individual-level human data should be committed to this repository. Do not commit API keys, credentials, protected data, tokens, or restricted source data. Source-specific licensing controls redistribution of third-party data; if redistribution rights are uncertain, document the source in DATA_SOURCES.tsv or LICENSES.md rather than committing the data.
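A name-based screen can catch some obvious mistakes before a commit or archive. The sketch below is an assumption-laden illustration, not a project policy: `BLOCKED_PATTERNS` and `flag_paths` are hypothetical, and matching on file names cannot detect secrets or restricted data inside files.

```python
import fnmatch

# Hypothetical patterns; adjust to the secrets and restricted sources
# relevant to a given release.
BLOCKED_PATTERNS = ["*.pem", "*.key", ".env", "*credentials*", "*drugbank*full*"]


def flag_paths(paths: list) -> list:
    """Return the paths whose file name matches any blocked pattern (case-insensitive)."""
    flagged = []
    for path in paths:
        # Compare only the final path component, lowercased.
        name = path.rsplit("/", 1)[-1].lower()
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in BLOCKED_PATTERNS):
            flagged.append(path)
    return flagged
```

One way to use it is to pass the lines printed by `git diff --cached --name-only` and stop the commit if the returned list is non-empty.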
See LICENSE.