nf-autoannotate is a Nextflow pipeline for annotating a query .h5ad dataset with three independent methods:
- CellTypist
- scANVI
- PanHumanPy / Pan-Human Azimuth
The pipeline writes the method outputs back into the original query AnnData object and produces a cluster-level summary table. It does not create a consensus label.
The workflow is implemented in main.nf, with supporting Python scripts under scripts/.
prepare_scanpy_dataset.py is a utility script for creating local demo reference/query .h5ad files from a Scanpy dataset or an existing AnnData file.
The default workflow stages are:
- Validate the reference and query AnnData files, then write validation_manifest.json and shared_genes.txt.
- Train or load a CellTypist model.
- Train or load scVI/scANVI reference models.
- Annotate the query with CellTypist.
- Annotate the query with scANVI query mapping.
- Annotate the query with PanHumanPy.
- Merge method outputs into the query .h5ad and write a cluster annotation summary.
A training-only entry workflow is also available with -entry TRAIN_MODELS. It runs validation and model training, publishes CellTypist plus scVI/scANVI artifacts, and skips query annotation, PanHumanPy, and merge outputs.
- --ref_h5ad: reference AnnData file.
- --query_h5ad: query AnnData file to annotate.
- --project_tag: output filename prefix.
- --ref_label_col: reference obs column containing cell type labels.
Optional reference filtering:
- --ref_filter_col: reference obs column used to filter the reference before model training.
- --ref_filter_values: comma-separated values to keep from --ref_filter_col, for example E12,E14 or treated,control.
--ref_filter_col and --ref_filter_values must be supplied together.
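The pairing rule above can be illustrated with a minimal sketch. The function name parse_ref_filter is hypothetical and is not taken from the pipeline's scripts; it only shows one way to enforce "both or neither" and split the comma-separated values.

```python
from __future__ import annotations

from typing import Optional


def parse_ref_filter(col: Optional[str], values: Optional[str]) -> Optional[tuple[str, list[str]]]:
    """Return (column, values_to_keep), or None when no filter was requested."""
    if col is None and values is None:
        return None
    if col is None or values is None:
        # Mirrors the documented rule: the two options only make sense together.
        raise ValueError("--ref_filter_col and --ref_filter_values must be supplied together")
    # Split on commas and drop surrounding whitespace and empty entries.
    keep = [v.strip() for v in values.split(",") if v.strip()]
    if not keep:
        raise ValueError("--ref_filter_values contains no usable values")
    return col, keep


if __name__ == "__main__":
    print(parse_ref_filter("stage", "E12,E14"))
```

How the real validation step reports this error is not shown here; the sketch only captures the constraint.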
Run parameters are normally defined in a YAML file passed with -params-file. Start from examples/autoannotate.params.yml, edit the values for your run, and launch the workflow with that file. Parameter defaults and runtime settings live in nextflow.config, and user-facing parameters are described and validated by nextflow_schema.json.
The pipeline selects model behavior from artifact flags instead of explicit mode flags:
- CellTypist trains from the reference unless --celltypist_model is supplied.
- scANVI loads --scanvi_model_dir when supplied; otherwise it trains a reference scANVI model from labeled reference cells, optionally initialized from --scvi_reference_model_dir, then maps the query with scANVI query mapping. This query-mapping step uses the scArches-style transfer-learning support in scvi-tools.
- PanHumanPy always annotates the query directly.
Required when using pretrained or external artifacts:
- --celltypist_model: required to reuse a pretrained CellTypist model.
- --scanvi_model_dir: required to reuse a pretrained scANVI model directory.
- --scvi_reference_model_dir: required only when initializing scANVI training from a pretrained scVI reference model. Do not combine this with --scanvi_model_dir.
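As a rough illustration of how artifact flags could map to model behavior, the sketch below takes a plain dict of params and returns an action per method. The function name select_model_actions and the action strings are assumptions made for this example; main.nf may express the same decisions differently.

```python
from __future__ import annotations


def select_model_actions(params: dict) -> dict:
    """Derive per-method actions from which artifact paths are present."""
    celltypist = "load" if params.get("celltypist_model") else "train"

    scanvi_dir = params.get("scanvi_model_dir")
    scvi_ref_dir = params.get("scvi_reference_model_dir")
    if scanvi_dir and scvi_ref_dir:
        # Documented restriction: these two options are mutually exclusive.
        raise ValueError("do not combine --scvi_reference_model_dir with --scanvi_model_dir")

    if scanvi_dir:
        scanvi = "load"
    elif scvi_ref_dir:
        scanvi = "train_from_pretrained_scvi"
    else:
        scanvi = "train"

    # PanHumanPy has no train/load branch; it always annotates the query directly.
    return {"celltypist": celltypist, "scanvi": scanvi, "panhumanpy": "annotate"}


if __name__ == "__main__":
    print(select_model_actions({"celltypist_model": "/path/to/celltypist_model.pkl"}))
```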
CellTypist optional controls:
- --celltypist_training_mode: standard or detailed. Default: detailed.
- --celltypist_balance_cells_per_label: maximum reference cells per label in detailed mode. Default: 500.
- --celltypist_feature_selection: run tutorial-style feature selection in detailed mode. Default: true.
- --celltypist_feature_selection_top_genes: top genes per label for feature selection. Default: 100.
- --celltypist_feature_selection_max_iter: quick SGD feature-selection iterations. Default: 5.
- --celltypist_max_iter: final CellTypist fit iterations. Default: 100.
When a CellTypist model is reused, the pipeline checks that at least 80% of the model genes are present in the query. This catches gene identifier mismatches early, before most model features would be silently filled with zeros.
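The overlap check can be sketched as a set comparison. This is a minimal illustration under the assumption that the fraction is computed over the model's gene list; the helper names are hypothetical and the pipeline's actual check may differ in detail.

```python
from __future__ import annotations

from typing import Iterable


def model_gene_overlap(model_genes: Iterable[str], query_genes: Iterable[str]) -> float:
    """Fraction of model genes that are also present in the query."""
    model = set(model_genes)
    if not model:
        raise ValueError("model has no genes")
    return len(model & set(query_genes)) / len(model)


def check_model_gene_overlap(model_genes, query_genes, min_fraction: float = 0.8) -> float:
    """Raise when too few model genes are found in the query."""
    fraction = model_gene_overlap(model_genes, query_genes)
    if fraction < min_fraction:
        raise ValueError(
            f"only {fraction:.0%} of model genes are present in the query "
            f"(minimum {min_fraction:.0%}); check gene identifier schemes"
        )
    return fraction


if __name__ == "__main__":
    print(check_model_gene_overlap(["CD3E", "MS4A1", "NKG7", "LYZ", "GNLY"],
                                   ["CD3E", "MS4A1", "NKG7", "LYZ", "PPBP"]))
```

A model trained on gene symbols and a query indexed by Ensembl IDs would give an overlap near zero and fail this check immediately.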
scANVI optional controls:
- --batch_key: optional column present in both reference and query obs; omitted values stay as real None/null. Use this when batches should be modeled.
- --scvi_layer: optional AnnData layer passed to SCVI.setup_anndata.
- --scanvi_categorical_covariate_keys: optional comma-separated additional categorical covariate columns present in both objects.
- --scanvi_continuous_covariate_keys: optional comma-separated continuous covariate columns present in both objects.
- --scanvi_hvg_n_top_genes: HVGs for new scVI/scANVI training. Set 0 to keep all shared genes. Default: 2500.
- --scvi_n_hidden, --scvi_n_latent, --scvi_n_layers, --scvi_dropout_rate: optional scVI architecture controls. Unset values use scvi-tools defaults.
- --scvi_dispersion, --scvi_gene_likelihood, --scvi_latent_distribution: optional scVI model distribution controls.
- --scvi_train_max_epochs, --scvi_train_batch_size, --scvi_train_train_size, --scvi_train_validation_size, --scvi_train_early_stopping, --scvi_train_early_stopping_patience: optional SCVI.train controls.
- --scanvi_train_max_epochs, --scanvi_train_batch_size, --scanvi_train_train_size, --scanvi_train_validation_size, --scanvi_train_early_stopping, --scanvi_train_early_stopping_patience: optional reference SCANVI.train controls.
- --scanvi_query_max_epochs, --scanvi_query_batch_size, --scanvi_query_train_size, --scanvi_query_validation_size, --scanvi_query_early_stopping, --scanvi_query_early_stopping_patience, --scanvi_query_check_val_every_n_epoch, --scanvi_query_plan_weight_decay: optional query-mapping SCANVI.train controls. --scanvi_query_max_epochs defaults to 100; the other values preserve current script defaults when unset.
- --unknown_label: unlabeled category used by scANVI during query mapping. Default: Unknown.
There are two ways to configure scVI/scANVI training:
- Use native nf-autoannotate params, usually through -params-file examples/autoannotate.params.yml. This is the recommended mode for new runs.
- Use an nf-scVI-metrics-style Python config with --nf_scvi_metrics_config config.py. This compatibility mode is for reusing existing nf-scVI-metrics config files without adding that Python file path to the YAML params file.
--nf_scvi_metrics_config accepts the same top-level model_input and param_input dictionaries used by nf-scVI-metrics/examples/inputs.py. Because nf-autoannotate runs one model configuration per pipeline run, each model/training parameter must resolve to one value. Single-item lists such as n_latent = [30] are accepted and unwrapped; sweep lists such as n_layers = [1, 2] are rejected during input validation.
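The single-value rule for model and training parameters can be sketched as follows. The helper names are hypothetical and the error wording is illustrative; the sketch assumes that an empty list is also rejected, which the text above does not state.

```python
from __future__ import annotations

from typing import Any


def resolve_single_value(name: str, value: Any) -> Any:
    """Unwrap single-item lists; reject lists that describe a parameter sweep."""
    if isinstance(value, (list, tuple)):
        if len(value) == 1:
            return value[0]
        raise ValueError(
            f"{name} must resolve to one value for a single pipeline run, got {list(value)!r}"
        )
    return value


def resolve_single_values(section: dict) -> dict:
    """Apply the single-value rule to every key of a model/training section."""
    return {key: resolve_single_value(key, value) for key, value in section.items()}


if __name__ == "__main__":
    print(resolve_single_values({"n_latent": [30], "n_hidden": 128}))
```

This rule would apply only to model and training parameters. Keys whose value is itself a list, such as covariate key lists, need to be passed through unchanged.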
Supported model_input keys are n_hidden, n_latent, n_layers, dropout_rate, dispersion, gene_likelihood, latent_distribution, max_epochs, accelerator, devices, train_size, validation_size, batch_size, early_stopping, early_stopping_patience, and counts_layer. counts_layer is mapped to the scvi-tools layer setup argument. nf-scVI-metrics dataset selectors such as adata_path are ignored because nf-autoannotate uses --ref_h5ad and --query_h5ad.
Supported param_input keys are batch_key, categorical_covariate_keys, and continuous_covariate_keys. layer is also accepted for native scvi-tools naming. categorical_covariate_keys and continuous_covariate_keys may be lists because they are single scvi-tools parameter values, not sweep lists.
Two optional nf-autoannotate-only sections are also supported: scanvi_train_input for reference scANVI fine-tuning, and query_train_input for query-mapping SCANVI.train. If scanvi_train_input is omitted, the model_input training parameters are reused. In query_train_input, use plan_weight_decay for query-training weight decay.
When --nf_scvi_metrics_config is supplied, individual Nextflow params for setup/model/train/query-train settings are intentionally ignored so the Python config file is the single source of truth for scVI/scANVI settings. The config file is checked during input validation, before scVI/scANVI training starts. The old --scvi_training_config parameter is no longer supported.
PanHumanPy optional controls:
- --panhumanpy_feature_names_col: optional query var column containing gene symbols when query.var_names are not gene symbols.
Output naming and merge behavior:
- --outdir: output directory. Default: autoannotate-results-<project_tag>.
- --scanvi_obs_col: scANVI label column name in output. Default: scanvi_label.
- --scanvi_score_col: scANVI confidence column name in output. Default: scanvi_confidence.
- --query_cluster_col: existing query obs cluster column used for cluster summaries. Default: leiden.
- --compute_missing_clusters: compute Leiden clusters in the merge step if --query_cluster_col is absent. Default: false.
- --marker_genes_n_top: number of marker genes to report per cluster. Default: 50.
- --marker_genes_method: differential-expression test for marker ranking, either wilcoxon or t-test. Default: wilcoxon.
- --marker_dotplot: write a marker-gene dotplot PNG from the ranked marker table. Default: true.
- --marker_dotplot_n_top: number of ranked marker genes per cluster to include in the dotplot. Default: 5.
When missing clusters are explicitly computed, the merge step reuses existing neighbors or X_pca if present. Otherwise it runs PCA/neighbors/Leiden on a normalized, log-transformed copy so the final query matrix is not modified.
Parameter defaults live in nextflow.config; user-facing parameter validation lives in nextflow_schema.json.
Before model steps run, the pipeline checks that:
- input files exist
- ref_label_col exists in ref_h5ad.obs
- batch_key, if supplied, exists in both reference and query obs
- all scanvi_categorical_covariate_keys, if supplied, exist in both objects
- all scanvi_continuous_covariate_keys, if supplied, exist in both objects
- panhumanpy_feature_names_col, if supplied, exists in query_h5ad.var
- query_h5ad.X contains raw non-negative integer counts
- query_cluster_col exists in query_h5ad.obs, unless compute_missing_clusters is enabled
- reference filters are valid and keep at least one reference cell
- reference and query cell indices are unique
- reference and query gene indices are unique
- reference and query gene identifiers use the same broad scheme, such as symbols or Ensembl IDs
- reference and query share at least one gene
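A few of the checks above can be sketched with plain Python. Everything here is illustrative: the function names are hypothetical, the Ensembl pattern and the 50% threshold used to guess the identifier scheme are assumptions, and the pipeline's validation script may use different heuristics.

```python
from __future__ import annotations

import re
from collections import Counter

# Assumption: Ensembl gene IDs look like ENSG00000141510 or ENSMUSG00000000001,
# optionally with a version suffix such as ".12".
ENSEMBL_GENE = re.compile(r"^ENS[A-Z]*G\d+(\.\d+)?$")


def duplicated(names) -> list[str]:
    """Names that occur more than once in an index."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


def gene_id_scheme(genes, threshold: float = 0.5) -> str:
    """Classify an index as 'ensembl' or 'symbol' by the share of Ensembl-like IDs."""
    genes = list(genes)
    if not genes:
        raise ValueError("empty gene index")
    fraction = sum(bool(ENSEMBL_GENE.match(g)) for g in genes) / len(genes)
    return "ensembl" if fraction >= threshold else "symbol"


def looks_like_raw_counts(values) -> bool:
    """True when every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)


def validate_gene_indices(ref_genes, query_genes) -> list[str]:
    """Check uniqueness, matching ID scheme, and a non-empty intersection."""
    for label, genes in (("reference", ref_genes), ("query", query_genes)):
        dups = duplicated(genes)
        if dups:
            raise ValueError(f"{label} gene index is not unique: {dups[:5]}")
    ref_scheme, query_scheme = gene_id_scheme(ref_genes), gene_id_scheme(query_genes)
    if ref_scheme != query_scheme:
        raise ValueError(f"gene ID schemes differ: reference={ref_scheme}, query={query_scheme}")
    query_set = set(query_genes)
    shared = [g for g in ref_genes if g in query_set]  # keeps reference order
    if not shared:
        raise ValueError("reference and query share no genes")
    return shared


if __name__ == "__main__":
    print(validate_gene_indices(["CD3E", "MS4A1", "NKG7"], ["NKG7", "CD3E", "LYZ"]))
```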
The merge step also validates that every annotation CSV has unique cell_id values matching query.obs_names exactly. Missing or extra cells now fail the merge instead of becoming silent NaN values.
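The cell ID check can be sketched as a uniqueness test followed by a set comparison. The function name check_cell_ids is hypothetical, and the sketch assumes "matching exactly" means the same set of IDs regardless of row order.

```python
from __future__ import annotations


def check_cell_ids(cell_ids, obs_names, source: str = "annotation CSV") -> None:
    """Fail loudly when an annotation table does not cover the query cells exactly."""
    ids = list(cell_ids)
    if len(set(ids)) != len(ids):
        raise ValueError(f"{source}: cell_id values are not unique")
    expected = set(obs_names)
    missing = expected - set(ids)   # query cells without an annotation row
    extra = set(ids) - expected     # annotation rows for unknown cells
    if missing or extra:
        raise ValueError(
            f"{source}: {len(missing)} query cells missing, {len(extra)} unexpected cell_id values"
        )


if __name__ == "__main__":
    check_cell_ids(["c2", "c1", "c3"], ["c1", "c2", "c3"])
    print("cell IDs match")
```

Failing here avoids a join that would otherwise leave NaN labels for unmatched cells.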
Published files:
- <project_tag>.annotated.h5ad
- <project_tag>.cluster_annotation_summary.csv
- <project_tag>.cluster_marker_genes.tsv
- <project_tag>.cluster_marker_dotplot.png when --marker_dotplot true and at least two clusters are available
The annotated .h5ad is the query object with additional obs columns and scANVI latent representations in obsm.
Method columns copied into query.obs:
- celltypist_predicted_label: raw per-cell CellTypist label before majority voting.
- celltypist_majority_voting: CellTypist label after majority voting.
- celltypist_confidence: probability for the majority-voting label when available; otherwise the row maximum.
- <scanvi_obs_col>: scANVI predicted label. Default: scanvi_label.
- <scanvi_score_col>: maximum scANVI class probability. Default: scanvi_confidence.
- panhumanpy_full_hierarchical_label: PanHumanPy full hierarchy label.
- panhumanpy_level_zero_label: PanHumanPy broadest label.
- panhumanpy_final_level_label: PanHumanPy final selected label.
- panhumanpy_confidence: PanHumanPy final-level softmax probability.
- panhumanpy_azimuth_broad, panhumanpy_azimuth_medium, panhumanpy_azimuth_fine: refined PanHumanPy labels when returned.
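The fallback described for celltypist_confidence can be sketched for a single cell. This is one possible reading: it assumes "when available" means the majority-voting label appears among the classes in that cell's probability row. The function name and the dict-based row are hypothetical.

```python
from __future__ import annotations


def celltypist_confidence(row_probabilities: dict[str, float], majority_label: str | None) -> float:
    """Probability of the majority-voting label, or the row maximum as a fallback."""
    if majority_label is not None and majority_label in row_probabilities:
        return row_probabilities[majority_label]
    # Fallback: no usable majority-voting label for this row.
    return max(row_probabilities.values())


if __name__ == "__main__":
    row = {"T cell": 0.55, "NK cell": 0.40, "B cell": 0.05}
    print(celltypist_confidence(row, "NK cell"))
    print(celltypist_confidence(row, None))
```

Under this reading, the confidence can be lower than the row maximum whenever majority voting overrides the per-cell prediction.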
Latent representations copied into query.obsm:
- X_scanvi: latent representation returned by scvi-tools SCANVI.get_latent_representation during query annotation.
- X_scvi: compatibility alias of the same scANVI latent representation for downstream tools that expect the scVI-style key.
The latent arrays are written by annotate_scanvi.py to a separate scanvi_latent.h5ad artifact and merged into the final annotated .h5ad; the scanvi_predictions.csv table stays limited to per-cell labels and confidence scores.
Comparison columns created in query.obs:
- celltypist_scanvi_agree: per-cell boolean; true only when both methods have non-missing labels and celltypist_majority_voting equals the scANVI label.
- celltypist_scanvi_cluster_agreement_fraction: cluster-level fraction copied onto each cell in the cluster; the fraction of cells where celltypist_scanvi_agree is true.
The cluster summary is grouped by --query_cluster_col and includes:
- <query_cluster_col>: cluster identifier.
- n_cells: number of cells in the cluster.
- celltypist_majority_top_label: most frequent CellTypist majority-voting label.
- celltypist_majority_top_fraction: fraction of non-missing CellTypist labels assigned to that top label.
- scanvi_top_label: most frequent scANVI label.
- scanvi_top_fraction: fraction of non-missing scANVI labels assigned to that top label.
- panhumanpy_top_label: most frequent PanHumanPy final-level label, reported independently.
- panhumanpy_top_fraction: fraction of non-missing PanHumanPy labels assigned to that top label.
- celltypist_scanvi_cluster_modal_match: whether the CellTypist and scANVI modal labels match in the cluster.
- celltypist_scanvi_cluster_agreement_fraction: fraction of cells in the cluster where CellTypist and scANVI agree.
celltypist_scanvi_agree and celltypist_scanvi_cluster_modal_match are intentionally different summaries. A cluster can have matching modal labels while many individual cells disagree. The modal-match boolean is therefore kept in the cluster summary, while the more interpretable agreement fraction is copied to query.obs.
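A small worked example makes the difference concrete. The sketch below recomputes both summaries from plain lists. The function name and dictionary keys are illustrative, missing labels are represented as None, and the agreement fraction is taken over all cells in the cluster, as described above.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def summarize_agreement(clusters, celltypist, scanvi):
    """Return per-cell agreement flags and a per-cluster summary dict."""
    per_cell = [
        ct is not None and sv is not None and ct == sv
        for ct, sv in zip(celltypist, scanvi)
    ]
    members = defaultdict(list)
    for index, cluster in enumerate(clusters):
        members[cluster].append(index)

    def top_label(labels):
        present = [label for label in labels if label is not None]
        return Counter(present).most_common(1)[0][0] if present else None

    summary = {}
    for cluster, idx in members.items():
        ct_top = top_label(celltypist[i] for i in idx)
        sv_top = top_label(scanvi[i] for i in idx)
        summary[cluster] = {
            "n_cells": len(idx),
            "modal_match": ct_top is not None and ct_top == sv_top,
            "agreement_fraction": sum(per_cell[i] for i in idx) / len(idx),
        }
    return per_cell, summary


if __name__ == "__main__":
    clusters = ["0"] * 5
    celltypist = ["T", "T", "T", "B", "NK"]
    scanvi = ["T", "B", "NK", "T", "T"]
    # Both methods call "T" most often, yet only the first cell agrees.
    print(summarize_agreement(clusters, celltypist, scanvi)[1])
```

In this toy cluster the modal labels match while the agreement fraction is only 0.2, which is why the two columns are reported separately.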
The marker-gene table is grouped by --query_cluster_col and written to <project_tag>.cluster_marker_genes.tsv. Marker genes are ranked with Scanpy rank_genes_groups on a normalized, log-transformed copy of the query object, so marker calculation does not modify the matrix written to <project_tag>.annotated.h5ad. The table includes:
- cluster: cluster identifier.
- rank: one-based marker rank within the cluster.
- gene: gene identifier from query.var_names.
- score: Scanpy rank score for the selected method.
- logfoldchange: estimated log fold change for the cluster against the remaining cells.
- p_value: nominal p-value.
- p_value_adj: adjusted p-value.
If the query contains fewer than two clusters, the marker-gene TSV is still written with headers but no marker rows because there is no between-cluster comparison. The marker dotplot is skipped in that case.
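The header-only behavior can be sketched with the standard csv module. The column order follows the table description above; the function name write_marker_table and the row dictionaries are assumptions for illustration, not the merge script's actual code.

```python
from __future__ import annotations

import csv
import io

MARKER_COLUMNS = ["cluster", "rank", "gene", "score", "logfoldchange", "p_value", "p_value_adj"]


def write_marker_table(handle, rows, n_clusters: int) -> int:
    """Write the marker TSV; with fewer than two clusters only the header is written."""
    writer = csv.DictWriter(handle, fieldnames=MARKER_COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    if n_clusters < 2:
        # No between-cluster comparison is possible, so no marker rows.
        return 0
    written = 0
    for row in rows:
        writer.writerow(row)
        written += 1
    return written


if __name__ == "__main__":
    buffer = io.StringIO()
    write_marker_table(buffer, [], n_clusters=1)
    print(buffer.getvalue(), end="")
```

Writing the header even when there are no rows keeps the file schema stable for downstream readers.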
When --marker_dotplot true, the merge step writes <project_tag>.cluster_marker_dotplot.png with Scanpy dotplot using the top --marker_dotplot_n_top ranked genes per cluster from the same marker calculation. Set --marker_dotplot false to skip the PNG output while keeping the marker-gene TSV unchanged.
The default workflow publishes final annotation outputs and model-training artifacts, including:
- <project_tag>.annotated.h5ad
- <project_tag>.cluster_annotation_summary.csv
- <project_tag>.cluster_marker_genes.tsv
- <project_tag>.cluster_marker_dotplot.png when marker dotplots are enabled
- validation_manifest.json
- shared_genes.txt
- celltypist_model.pkl
- celltypist_model_metadata.json
- scvi_model/
- scvi_model_metadata.json
- scanvi_model/
- scanvi_model_metadata.json
Intermediate annotation files remain in Nextflow work directories, including celltypist_predictions.csv, scanvi_predictions.csv, scanvi_latent.h5ad, and panhumanpy_predictions.csv.
Nextflow timeline, report, and trace files are written under autoannotate-reports-<project_tag>/.
To create local reference and query inputs, run:
python scripts/prepare_scanpy_dataset.py --outdir data/demo

By default this downloads the full PBMC68K object into data/demo/raw/, writes an 80/20 reference/query split, and records split metadata:
- data/demo/reference.h5ad
- data/demo/query.h5ad
- data/demo/dataset_split_metadata.json
Use the generated files for a demo run by editing a params file:
cp examples/autoannotate.params.yml demo.params.yml

ref_h5ad: data/demo/reference.h5ad
query_h5ad: data/demo/query.h5ad
project_tag: demo
ref_label_col: bulk_labels
query_cluster_col: louvain
outdir: results/demo

module load cellgen/nextflow
nextflow run main.nf -params-file demo.params.yml

The helper also supports other sources:
python scripts/prepare_scanpy_dataset.py --dataset pbmc68k_reduced --outdir data/pbmc68k_reduced_demo
python scripts/prepare_scanpy_dataset.py --dataset pbmc3k --outdir data/pbmc3k_demo
python scripts/prepare_scanpy_dataset.py --input-h5ad /path/to/input.h5ad --outdir data/custom_demo

--dataset pbmc68k is the default and downloads the larger Figshare-hosted PBMC68K object into the selected output directory before splitting it. --dataset pbmc68k_reduced and --dataset pbmc3k use datasets provided directly by Scanpy.
Train CellTypist and scANVI from the reference:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--batch_key donor \
--query_cluster_col leiden

Train CellTypist plus scVI/scANVI model artifacts only:
module load cellgen/nextflow
nextflow run main.nf -entry TRAIN_MODELS \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag model_training \
--ref_label_col cell_type \
--batch_key donor \
--query_cluster_col leiden \
--outdir results/model_training

Reuse a pretrained CellTypist model while training scANVI:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--batch_key donor \
--celltypist_model /path/to/celltypist_model.pkl

Reuse pretrained CellTypist and scANVI artifacts:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--celltypist_model /path/to/celltypist_model.pkl \
--scanvi_model_dir /path/to/scanvi_model_dir

Initialize scANVI training from a pretrained scVI reference model:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--batch_key donor \
--scvi_reference_model_dir /path/to/scvi_model_dir

Allow the merge step to compute missing Leiden clusters:
module load cellgen/nextflow
nextflow run main.nf \
--ref_h5ad /path/to/reference.h5ad \
--query_h5ad /path/to/query.h5ad \
--project_tag test_run \
--ref_label_col cell_type \
--compute_missing_clusters true

- Input .h5ad files should already be normalized/preprocessed appropriately for the chosen annotation methods.
- Gene alignment for CellTypist uses the reference-query intersection from validation. Fresh scANVI training starts from the shared gene set to build a reference model; query mapping follows the saved scANVI model registry.
- PanHumanPy receives the original query feature space, optionally using --panhumanpy_feature_names_col for symbols.
- CellTypist, scANVI, and PanHumanPy outputs are kept separate.
- PanHumanPy is not compared with CellTypist or scANVI because it uses a different reference.