cellgeni/nf-autoannotate

nf-autoannotate

nf-autoannotate is a Nextflow pipeline for annotating a query .h5ad dataset with three independent methods:

  • CellTypist
  • scANVI
  • PanHumanPy / Pan-Human Azimuth

The pipeline writes the method outputs back into the original query AnnData object and produces a cluster-level summary table. It does not create a consensus label.

Workflow

The workflow is implemented in main.nf, with supporting Python scripts under scripts/.

prepare_scanpy_dataset.py is a utility script for creating local demo reference/query .h5ad files from a Scanpy dataset or an existing AnnData file.

The default workflow stages are:

  1. Validate the reference and query AnnData files, then write validation_manifest.json and shared_genes.txt.
  2. Train or load a CellTypist model.
  3. Train or load scVI/scANVI reference models.
  4. Annotate the query with CellTypist.
  5. Annotate the query with scANVI query mapping.
  6. Annotate the query with PanHumanPy.
  7. Merge method outputs into the query .h5ad and write a cluster annotation summary.

A training-only entry workflow is also available with -entry TRAIN_MODELS. It runs validation and model training, publishes CellTypist plus scVI/scANVI artifacts, and skips query annotation, PanHumanPy, and merge outputs.

Required Inputs

  • --ref_h5ad: reference AnnData file.
  • --query_h5ad: query AnnData file to annotate.
  • --project_tag: output filename prefix.
  • --ref_label_col: reference obs column containing cell type labels.

Optional reference filtering:

  • --ref_filter_col: reference obs column used to filter the reference before model training.
  • --ref_filter_values: comma-separated values to keep from --ref_filter_col, for example E12,E14 or treated,control.

--ref_filter_col and --ref_filter_values must be supplied together.
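
The pairing rule and the comma-separated value format can be sketched in Python as follows. This is only an illustration of the documented behavior, not the pipeline's own code; the function name `parse_ref_filter` and the whitespace trimming are assumptions.

```python
def parse_ref_filter(ref_filter_col=None, ref_filter_values=None):
    """Return (column, values), or (None, None) when no filter is requested.

    Hypothetical helper: mirrors the documented rule that both
    parameters must be supplied together.
    """
    has_col = bool(ref_filter_col)
    has_values = bool(ref_filter_values)
    if has_col != has_values:
        raise ValueError(
            "--ref_filter_col and --ref_filter_values must be supplied together"
        )
    if not has_col:
        return None, None
    # Split "E12,E14" into ["E12", "E14"]; trimming spaces is an assumption.
    values = [v.strip() for v in ref_filter_values.split(",") if v.strip()]
    if not values:
        raise ValueError("--ref_filter_values did not contain any values")
    return ref_filter_col, values


column, values = parse_ref_filter("stage", "E12,E14")
```

Here `stage` is a made-up column name; the reference cells to keep would be those whose `obs[column]` value is in `values`.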

Run parameters are normally defined in a YAML file passed with -params-file. Start from examples/autoannotate.params.yml, edit the values for your run, and launch the workflow with that file. Parameter defaults and runtime settings live in nextflow.config, and user-facing parameters are described and validated by nextflow_schema.json.

Tool-Specific Parameters

The pipeline selects model behavior from artifact flags instead of explicit mode flags:

  • CellTypist trains from the reference unless --celltypist_model is supplied.
  • scANVI loads --scanvi_model_dir when supplied; otherwise it trains a reference scANVI model from labeled reference cells, optionally initialized from --scvi_reference_model_dir. In either case the query is then mapped with scANVI query mapping, which uses the scArches-style transfer-learning support in scvi-tools.
  • PanHumanPy always annotates the query directly.

Required when using pretrained or external artifacts:

  • --celltypist_model: required to reuse a pretrained CellTypist model.
  • --scanvi_model_dir: required to reuse a pretrained scANVI model directory.
  • --scvi_reference_model_dir: required only when initializing scANVI training from a pretrained scVI reference model. Do not combine this with --scanvi_model_dir.
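
The artifact-driven selection described above can be pictured as a small decision function. This is a sketch under stated assumptions: the function names and the returned mode strings are hypothetical and do not come from main.nf.

```python
def resolve_celltypist_mode(celltypist_model=None):
    """CellTypist trains from the reference unless a model file is supplied."""
    return "load_model" if celltypist_model else "train_from_reference"


def resolve_scanvi_mode(scanvi_model_dir=None, scvi_reference_model_dir=None):
    """Pick the scANVI path from which artifact parameters are set."""
    if scanvi_model_dir and scvi_reference_model_dir:
        # The README says these two parameters must not be combined.
        raise ValueError(
            "Do not combine --scvi_reference_model_dir with --scanvi_model_dir"
        )
    if scanvi_model_dir:
        return "load_scanvi_model"
    if scvi_reference_model_dir:
        return "train_scanvi_from_pretrained_scvi"
    return "train_scvi_then_scanvi"


mode = resolve_scanvi_mode(scvi_reference_model_dir="/path/to/scvi_model_dir")
```

Keeping the decision in one place makes the mutually exclusive combination fail early, before any training starts.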

CellTypist optional controls:

  • --celltypist_training_mode: standard or detailed. Default: detailed.
  • --celltypist_balance_cells_per_label: maximum reference cells per label in detailed mode. Default: 500.
  • --celltypist_feature_selection: run tutorial-style feature selection in detailed mode. Default: true.
  • --celltypist_feature_selection_top_genes: top genes per label for feature selection. Default: 100.
  • --celltypist_feature_selection_max_iter: quick SGD feature-selection iterations. Default: 5.
  • --celltypist_max_iter: final CellTypist fit iterations. Default: 100.

CellTypist model reuse checks that at least 80% of model genes are present in the query. This catches gene identifier mismatches early, instead of silently filling most model features with zeros.
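
A minimal sketch of such an overlap check, assuming the model genes and query genes are available as plain lists of identifiers. The function name and error message are illustrative, not the pipeline's own.

```python
def check_model_gene_overlap(model_genes, query_genes, min_fraction=0.8):
    """Return the fraction of model genes found in the query.

    Raises ValueError when the fraction is below min_fraction.
    The 0.8 default reflects the 80% rule described above.
    """
    model = set(model_genes)
    if not model:
        raise ValueError("model has no genes")
    fraction = len(model & set(query_genes)) / len(model)
    if fraction < min_fraction:
        raise ValueError(
            f"only {fraction:.0%} of model genes are present in the query; "
            "check that both use the same gene identifier scheme"
        )
    return fraction


fraction = check_model_gene_overlap(
    ["CD3E", "MS4A1", "CD14", "NKG7", "LYZ"],
    ["CD3E", "MS4A1", "CD14", "NKG7", "GNLY"],
)
```

In this toy example four of five model genes are present, so the check passes at exactly the 80% boundary.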

scANVI optional controls:

  • --batch_key: optional column present in both reference and query obs. When omitted, the value stays a real None/null. Use this when batches should be modeled.
  • --scvi_layer: optional AnnData layer passed to SCVI.setup_anndata.
  • --scanvi_categorical_covariate_keys: optional comma-separated additional categorical covariate columns present in both objects.
  • --scanvi_continuous_covariate_keys: optional comma-separated continuous covariate columns present in both objects.
  • --scanvi_hvg_n_top_genes: HVGs for new scVI/scANVI training. Set 0 to keep all shared genes. Default: 2500.
  • --scvi_n_hidden, --scvi_n_latent, --scvi_n_layers, --scvi_dropout_rate: optional scVI architecture controls. Unset values use scvi-tools defaults.
  • --scvi_dispersion, --scvi_gene_likelihood, --scvi_latent_distribution: optional scVI model distribution controls.
  • --scvi_train_max_epochs, --scvi_train_batch_size, --scvi_train_train_size, --scvi_train_validation_size, --scvi_train_early_stopping, --scvi_train_early_stopping_patience: optional SCVI.train controls.
  • --scanvi_train_max_epochs, --scanvi_train_batch_size, --scanvi_train_train_size, --scanvi_train_validation_size, --scanvi_train_early_stopping, --scanvi_train_early_stopping_patience: optional reference SCANVI.train controls.
  • --scanvi_query_max_epochs, --scanvi_query_batch_size, --scanvi_query_train_size, --scanvi_query_validation_size, --scanvi_query_early_stopping, --scanvi_query_early_stopping_patience, --scanvi_query_check_val_every_n_epoch, --scanvi_query_plan_weight_decay: optional query-mapping SCANVI.train controls. --scanvi_query_max_epochs defaults to 100; the other values preserve current script defaults when unset.
  • --unknown_label: unlabeled category used by scANVI during query mapping. Default: Unknown.

There are two ways to configure scVI/scANVI training:

  1. Use native nf-autoannotate params, usually through -params-file examples/autoannotate.params.yml. This is the recommended mode for new runs.
  2. Use an nf-scVI-metrics-style Python config with --nf_scvi_metrics_config config.py. This compatibility mode is for reusing existing nf-scVI-metrics config files without adding that Python file path to the YAML params file.

--nf_scvi_metrics_config accepts the same top-level model_input and param_input dictionaries used by nf-scVI-metrics/examples/inputs.py. Because nf-autoannotate runs one model configuration per pipeline run, each model/training parameter must resolve to one value. Single-item lists such as n_latent = [30] are accepted and unwrapped; sweep lists such as n_layers = [1, 2] are rejected during input validation.
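
The single-value rule can be sketched as below. The helper name `resolve_single_value` is hypothetical, and rejecting empty lists is an assumption that the text above does not state.

```python
def resolve_single_value(name, value):
    """Unwrap a single-item list; reject sweep lists with several values."""
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            # Assumption: empty lists are rejected along with sweep lists.
            raise ValueError(
                f"{name} must resolve to one value, got {list(value)!r}"
            )
        return value[0]
    return value


n_latent = resolve_single_value("n_latent", [30])  # unwrapped to 30
```

A sweep such as `resolve_single_value("n_layers", [1, 2])` raises `ValueError`, which matches the rejection during input validation described above.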

Supported model_input keys are n_hidden, n_latent, n_layers, dropout_rate, dispersion, gene_likelihood, latent_distribution, max_epochs, accelerator, devices, train_size, validation_size, batch_size, early_stopping, early_stopping_patience, and counts_layer. counts_layer is mapped to the scvi-tools layer setup argument. nf-scVI-metrics dataset selectors such as adata_path are ignored because nf-autoannotate uses --ref_h5ad and --query_h5ad.

Supported param_input keys are batch_key, categorical_covariate_keys, and continuous_covariate_keys. layer is also accepted for native scvi-tools naming. categorical_covariate_keys and continuous_covariate_keys may be lists because they are single scvi-tools parameter values, not sweep lists.

Two optional nf-autoannotate-only sections are also supported: scanvi_train_input for reference scANVI fine-tuning, and query_train_input for query-mapping SCANVI.train. If scanvi_train_input is omitted, the model_input training parameters are reused. In query_train_input, use plan_weight_decay for query-training weight decay.
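
As a rough illustration, a compatibility config file might look like the sketch below. The dictionary names and most keys are taken from the text above. All values are made up, and the keys shown inside scanvi_train_input and query_train_input other than plan_weight_decay are assumptions. Check nf-scVI-metrics/examples/inputs.py for the authoritative layout.

```python
# config.py: hypothetical values, passed with --nf_scvi_metrics_config config.py

model_input = {
    "n_hidden": 128,
    "n_latent": [30],          # single-item list: accepted and unwrapped
    "n_layers": 2,             # a sweep such as [1, 2] would be rejected
    "max_epochs": 200,
    "early_stopping": True,
    "counts_layer": "counts",  # mapped to the scvi-tools layer argument
}

param_input = {
    "batch_key": "donor",
    # These lists are single parameter values, not sweeps.
    "categorical_covariate_keys": ["sample"],
    "continuous_covariate_keys": [],
}

# Optional nf-autoannotate-only sections (key choices here are assumptions).
scanvi_train_input = {"max_epochs": 50}
query_train_input = {"max_epochs": 100, "plan_weight_decay": 0.0}
```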

When --nf_scvi_metrics_config is supplied, individual Nextflow params for setup/model/train/query-train settings are intentionally ignored so the Python config file is the single source of truth for scVI/scANVI settings. The config file is checked during input validation, before scVI/scANVI training starts. The old --scvi_training_config parameter is no longer supported.

PanHumanPy optional controls:

  • --panhumanpy_feature_names_col: optional query var column containing gene symbols when query.var_names are not gene symbols.

Output naming and merge behavior:

  • --outdir: output directory. Default: autoannotate-results-<project_tag>.
  • --scanvi_obs_col: scANVI label column name in output. Default: scanvi_label.
  • --scanvi_score_col: scANVI confidence column name in output. Default: scanvi_confidence.
  • --query_cluster_col: existing query obs cluster column used for cluster summaries. Default: leiden.
  • --compute_missing_clusters: compute Leiden clusters in the merge step if --query_cluster_col is absent. Default: false.
  • --marker_genes_n_top: number of marker genes to report per cluster. Default: 50.
  • --marker_genes_method: differential-expression test for marker ranking, either wilcoxon or t-test. Default: wilcoxon.
  • --marker_dotplot: write a marker-gene dotplot PNG from the ranked marker table. Default: true.
  • --marker_dotplot_n_top: number of ranked marker genes per cluster to include in the dotplot. Default: 5.

When missing clusters are explicitly computed, the merge step reuses existing neighbors or X_pca if present. Otherwise it runs PCA/neighbors/Leiden on a normalized, log-transformed copy so the final query matrix is not modified.

Parameter defaults live in nextflow.config; user-facing parameter validation lives in nextflow_schema.json.

Validation

Before model steps run, the pipeline checks that:

  • input files exist
  • ref_label_col exists in ref_h5ad.obs
  • batch_key, if supplied, exists in both reference and query obs
  • all scanvi_categorical_covariate_keys, if supplied, exist in both objects
  • all scanvi_continuous_covariate_keys, if supplied, exist in both objects
  • panhumanpy_feature_names_col, if supplied, exists in query_h5ad.var
  • query_h5ad.X contains raw non-negative integer counts
  • query_cluster_col exists in query_h5ad.obs, unless compute_missing_clusters is enabled
  • reference filters are valid and keep at least one reference cell
  • reference and query cell indices are unique
  • reference and query gene indices are unique
  • reference and query gene identifiers use the same broad scheme, such as symbols or Ensembl IDs
  • reference and query share at least one gene
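
The gene-index checks in this list (unique indices, matching identifier scheme, at least one shared gene) could be sketched as follows. This is not the pipeline's validation script. The Ensembl regular expression and the majority-vote scheme detection are assumed heuristics, and all function names are hypothetical.

```python
import re

# Matches Ensembl gene IDs such as ENSG00000167286 or ENSMUSG00000000001.2.
ENSEMBL_GENE_RE = re.compile(r"^ENS[A-Z]*G\d+(\.\d+)?$")


def gene_id_scheme(genes):
    """Classify an index as 'ensembl' or 'symbol' by majority (heuristic)."""
    genes = list(genes)
    n_ensembl = sum(bool(ENSEMBL_GENE_RE.match(g)) for g in genes)
    return "ensembl" if genes and n_ensembl / len(genes) >= 0.5 else "symbol"


def find_duplicates(values):
    """Return the sorted values that occur more than once."""
    seen, dupes = set(), set()
    for value in values:
        (dupes if value in seen else seen).add(value)
    return sorted(dupes)


def shared_genes(ref_genes, query_genes):
    """Validate both gene indices and return shared genes in reference order."""
    for name, genes in (("reference", ref_genes), ("query", query_genes)):
        dupes = find_duplicates(genes)
        if dupes:
            raise ValueError(f"{name} gene index has duplicates: {dupes[:5]}")
    ref_scheme, query_scheme = gene_id_scheme(ref_genes), gene_id_scheme(query_genes)
    if ref_scheme != query_scheme:
        raise ValueError(
            f"reference uses {ref_scheme} IDs but query uses {query_scheme} IDs"
        )
    query_set = set(query_genes)
    shared = [g for g in ref_genes if g in query_set]
    if not shared:
        raise ValueError("reference and query share no genes")
    return shared


shared = shared_genes(["CD3E", "MS4A1", "CD14"], ["MS4A1", "CD3E", "NKG7"])
```

With AnnData inputs, the gene lists would come from `ref.var_names` and `query.var_names`.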

The merge step also validates that every annotation CSV has unique cell_id values matching query.obs_names exactly. Missing or extra cells fail the merge instead of silently becoming NaN values.
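
A sketch of that cell_id check, assuming the IDs from one annotation CSV and the query obs_names are available as lists. The function name is hypothetical, and the comparison here is by set membership, so it makes no assumption about row order.

```python
def check_annotation_cell_ids(cell_ids, query_obs_names, source="annotation"):
    """Fail loudly on duplicate, missing, or extra cell IDs."""
    cell_ids = list(cell_ids)
    observed = set(cell_ids)
    if len(observed) != len(cell_ids):
        raise ValueError(f"{source}: cell_id values are not unique")
    expected = set(query_obs_names)
    missing = sorted(expected - observed)
    extra = sorted(observed - expected)
    if missing or extra:
        raise ValueError(
            f"{source}: {len(missing)} query cells missing, "
            f"{len(extra)} unexpected cells (e.g. {(missing + extra)[:3]})"
        )


# Passes: same cells in a different order.
check_annotation_cell_ids(["c2", "c1", "c3"], ["c1", "c2", "c3"], "celltypist")
```

Raising here is what prevents a plain left join from quietly introducing NaN labels for unmatched cells.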

Outputs

Published files:

  • <project_tag>.annotated.h5ad
  • <project_tag>.cluster_annotation_summary.csv
  • <project_tag>.cluster_marker_genes.tsv
  • <project_tag>.cluster_marker_dotplot.png when --marker_dotplot true and at least two clusters are available

The annotated .h5ad is the query object with additional obs columns and scANVI latent representations in obsm.

Method columns copied into query.obs:

  • celltypist_predicted_label: raw per-cell CellTypist label before majority voting.
  • celltypist_majority_voting: CellTypist label after majority voting.
  • celltypist_confidence: probability for the majority-voting label when available; otherwise the row maximum.
  • <scanvi_obs_col>: scANVI predicted label. Default: scanvi_label.
  • <scanvi_score_col>: maximum scANVI class probability. Default: scanvi_confidence.
  • panhumanpy_full_hierarchical_label: PanHumanPy full hierarchy label.
  • panhumanpy_level_zero_label: PanHumanPy broadest label.
  • panhumanpy_final_level_label: PanHumanPy final selected label.
  • panhumanpy_confidence: PanHumanPy final-level softmax probability.
  • panhumanpy_azimuth_broad, panhumanpy_azimuth_medium, panhumanpy_azimuth_fine: refined PanHumanPy labels when returned.

Latent representations copied into query.obsm:

  • X_scanvi: latent representation returned by scvi-tools SCANVI.get_latent_representation during query annotation.
  • X_scvi: compatibility alias of the same scANVI latent representation for downstream tools that expect the scVI-style key.

The latent arrays are written by annotate_scanvi.py to a separate scanvi_latent.h5ad artifact and merged into the final annotated .h5ad; the scanvi_predictions.csv table stays limited to per-cell labels and confidence scores.

Comparison columns created in query.obs:

  • celltypist_scanvi_agree: per-cell boolean; true only when both methods have non-missing labels and celltypist_majority_voting equals the scANVI label.
  • celltypist_scanvi_cluster_agreement_fraction: cluster-level fraction copied onto each cell in the cluster; the fraction of cells where celltypist_scanvi_agree is true.

The cluster summary is grouped by --query_cluster_col and includes:

  • <query_cluster_col>: cluster identifier.
  • n_cells: number of cells in the cluster.
  • celltypist_majority_top_label: most frequent CellTypist majority-voting label.
  • celltypist_majority_top_fraction: fraction of non-missing CellTypist labels assigned to that top label.
  • scanvi_top_label: most frequent scANVI label.
  • scanvi_top_fraction: fraction of non-missing scANVI labels assigned to that top label.
  • panhumanpy_top_label: most frequent PanHumanPy final-level label, reported independently.
  • panhumanpy_top_fraction: fraction of non-missing PanHumanPy labels assigned to that top label.
  • celltypist_scanvi_cluster_modal_match: whether the CellTypist and scANVI modal labels match in the cluster.
  • celltypist_scanvi_cluster_agreement_fraction: fraction of cells in the cluster where CellTypist and scANVI agree.

celltypist_scanvi_agree and celltypist_scanvi_cluster_modal_match are intentionally different summaries. A cluster can have matching modal labels while many individual cells disagree. The modal-match boolean is therefore kept in the cluster summary, while the more interpretable agreement fraction is copied to query.obs.
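
The difference between the two summaries is easiest to see on a toy table. The pandas sketch below follows the column definitions above, using made-up labels. It is an illustration rather than the merge script itself, and the helper name `top_label_and_fraction` and its tie-breaking behavior are assumptions.

```python
import pandas as pd

obs = pd.DataFrame(
    {
        "leiden": ["0", "0", "0", "0", "1", "1"],
        "celltypist_majority_voting": ["T", "T", "T", "B", "B", "B"],
        "scanvi_label": ["T", "T", "B", "T", "B", None],
    }
)

ct, sv = obs["celltypist_majority_voting"], obs["scanvi_label"]
# Per-cell agreement: true only when both labels are present and equal.
obs["celltypist_scanvi_agree"] = ct.notna() & sv.notna() & (ct == sv)
# Cluster-level fraction copied onto every cell of the cluster.
obs["celltypist_scanvi_cluster_agreement_fraction"] = obs.groupby("leiden")[
    "celltypist_scanvi_agree"
].transform("mean")


def top_label_and_fraction(labels):
    """Most frequent non-missing label and its share of non-missing labels."""
    counts = labels.dropna().value_counts()
    if counts.empty:
        return None, float("nan")
    return counts.index[0], counts.iloc[0] / counts.sum()


rows = []
for cluster, group in obs.groupby("leiden"):
    ct_label, ct_frac = top_label_and_fraction(group["celltypist_majority_voting"])
    sv_label, sv_frac = top_label_and_fraction(group["scanvi_label"])
    rows.append(
        {
            "leiden": cluster,
            "n_cells": len(group),
            "celltypist_majority_top_label": ct_label,
            "celltypist_majority_top_fraction": ct_frac,
            "scanvi_top_label": sv_label,
            "scanvi_top_fraction": sv_frac,
            "celltypist_scanvi_cluster_modal_match": ct_label == sv_label,
            "celltypist_scanvi_cluster_agreement_fraction": group[
                "celltypist_scanvi_agree"
            ].mean(),
        }
    )
summary = pd.DataFrame(rows).set_index("leiden")
```

In cluster 0 both methods have the modal label T, so the modal match is true, yet only two of the four cells agree, giving an agreement fraction of 0.5.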

The marker-gene table is grouped by --query_cluster_col and written to <project_tag>.cluster_marker_genes.tsv. Marker genes are ranked with Scanpy rank_genes_groups on a normalized, log-transformed copy of the query object, so marker calculation does not modify the matrix written to <project_tag>.annotated.h5ad. The table includes:

  • cluster: cluster identifier.
  • rank: one-based marker rank within the cluster.
  • gene: gene identifier from query.var_names.
  • score: Scanpy rank score for the selected method.
  • logfoldchange: estimated log fold change for the cluster against the remaining cells.
  • p_value: nominal p-value.
  • p_value_adj: adjusted p-value.

If the query contains fewer than two clusters, the marker-gene TSV is still written with headers but no marker rows because there is no between-cluster comparison. The marker dotplot is skipped in that case.

When --marker_dotplot true, the merge step writes <project_tag>.cluster_marker_dotplot.png with Scanpy dotplot using the top --marker_dotplot_n_top ranked genes per cluster from the same marker calculation. Set --marker_dotplot false to skip the PNG output while keeping the marker-gene TSV unchanged.

The default workflow publishes final annotation outputs and model-training artifacts, including:

  • <project_tag>.annotated.h5ad
  • <project_tag>.cluster_annotation_summary.csv
  • <project_tag>.cluster_marker_genes.tsv
  • <project_tag>.cluster_marker_dotplot.png when marker dotplots are enabled
  • validation_manifest.json
  • shared_genes.txt
  • celltypist_model.pkl
  • celltypist_model_metadata.json
  • scvi_model/
  • scvi_model_metadata.json
  • scanvi_model/
  • scanvi_model_metadata.json

Intermediate annotation files remain in Nextflow work directories, including celltypist_predictions.csv, scanvi_predictions.csv, scanvi_latent.h5ad, and panhumanpy_predictions.csv.

Nextflow timeline, report, and trace files are written under autoannotate-reports-<project_tag>/.

Demo Data

To create local reference and query inputs, run:

python scripts/prepare_scanpy_dataset.py --outdir data/demo

By default this downloads the full PBMC68K object into data/demo/raw/, writes an 80/20 reference/query split, and records split metadata:

  • data/demo/reference.h5ad
  • data/demo/query.h5ad
  • data/demo/dataset_split_metadata.json

Use the generated files for a demo run. First copy the example params file:

cp examples/autoannotate.params.yml demo.params.yml

Then edit demo.params.yml so that it sets:

ref_h5ad: data/demo/reference.h5ad
query_h5ad: data/demo/query.h5ad
project_tag: demo
ref_label_col: bulk_labels
query_cluster_col: louvain
outdir: results/demo

Finally, launch the workflow with that file:

module load cellgen/nextflow
nextflow run main.nf -params-file demo.params.yml

The helper also supports other sources:

python scripts/prepare_scanpy_dataset.py --dataset pbmc68k_reduced --outdir data/pbmc68k_reduced_demo
python scripts/prepare_scanpy_dataset.py --dataset pbmc3k --outdir data/pbmc3k_demo
python scripts/prepare_scanpy_dataset.py --input-h5ad /path/to/input.h5ad --outdir data/custom_demo

--dataset pbmc68k is the default and downloads the larger Figshare-hosted PBMC68K object into the selected output directory before splitting it. --dataset pbmc68k_reduced and --dataset pbmc3k use datasets provided directly by Scanpy.

Examples

Train CellTypist and scANVI from the reference:

module load cellgen/nextflow

nextflow run main.nf \
  --ref_h5ad /path/to/reference.h5ad \
  --query_h5ad /path/to/query.h5ad \
  --project_tag test_run \
  --ref_label_col cell_type \
  --batch_key donor \
  --query_cluster_col leiden

Train CellTypist plus scVI/scANVI model artifacts only:

module load cellgen/nextflow

nextflow run main.nf -entry TRAIN_MODELS \
  --ref_h5ad /path/to/reference.h5ad \
  --query_h5ad /path/to/query.h5ad \
  --project_tag model_training \
  --ref_label_col cell_type \
  --batch_key donor \
  --query_cluster_col leiden \
  --outdir results/model_training

Reuse a pretrained CellTypist model while training scANVI:

module load cellgen/nextflow

nextflow run main.nf \
  --ref_h5ad /path/to/reference.h5ad \
  --query_h5ad /path/to/query.h5ad \
  --project_tag test_run \
  --ref_label_col cell_type \
  --batch_key donor \
  --celltypist_model /path/to/celltypist_model.pkl

Reuse pretrained CellTypist and scANVI artifacts:

module load cellgen/nextflow

nextflow run main.nf \
  --ref_h5ad /path/to/reference.h5ad \
  --query_h5ad /path/to/query.h5ad \
  --project_tag test_run \
  --ref_label_col cell_type \
  --celltypist_model /path/to/celltypist_model.pkl \
  --scanvi_model_dir /path/to/scanvi_model_dir

Initialize scANVI training from a pretrained scVI reference model:

module load cellgen/nextflow

nextflow run main.nf \
  --ref_h5ad /path/to/reference.h5ad \
  --query_h5ad /path/to/query.h5ad \
  --project_tag test_run \
  --ref_label_col cell_type \
  --batch_key donor \
  --scvi_reference_model_dir /path/to/scvi_model_dir

Allow the merge step to compute missing Leiden clusters:

module load cellgen/nextflow

nextflow run main.nf \
  --ref_h5ad /path/to/reference.h5ad \
  --query_h5ad /path/to/query.h5ad \
  --project_tag test_run \
  --ref_label_col cell_type \
  --compute_missing_clusters true

Assumptions

  • Input .h5ad files should already be preprocessed appropriately for the chosen annotation methods. Note that validation requires query_h5ad.X to contain raw non-negative integer counts.
  • Gene alignment for CellTypist uses the reference-query intersection from validation. Fresh scANVI training starts from the shared gene set to build a reference model; query mapping follows the saved scANVI model registry.
  • PanHumanPy receives the original query feature space, optionally using --panhumanpy_feature_names_col for symbols.
  • CellTypist, scANVI, and PanHumanPy outputs are kept separate.
  • PanHumanPy is not compared with CellTypist or scANVI because it uses a different reference.
