NVD is a Nextflow pipeline focused on finding human-infecting viruses in environmental metagenome samples. It leverages a fast in silico enrichment approach together with de novo assembly and BLAST verification to maximize recall and precision. It supports Illumina short-read as well as Nanopore long-read inputs and offers a configurable deduplication approach tuned for shotgun sequencing datasets.
NVD was designed from the ground up to handle enormous datasets and performs particularly well with complex Illumina deep sequencing datasets like those from wastewater sewersheds. To perform well with these kinds of datasets, it must:
- Handle highly fragmented genome recovery.
- Be resilient to wild fluctuations in depth-of-coverage.
- Resolve ambiguities between closely related organisms with high sequence identity.
Many pipelines for classifying mixtures of organisms exist, but none met these criteria well enough for human viruses. Hence, NVD was born!
This pipeline is in its 3rd major version, which brings with it a helpful CLI and pipeline control system, vast performance improvements over its predecessor, and a tighter focus on BLAST-searching viral hits. This means better and faster results for viruses, but worse results for other taxa like bacteria. We recommend users try jhuapl-bio/taxtriage or nf-core/mag if they're interested in more than viruses.
NVD set-up has a few phases, including dependency setup, reference database setup, sample data setup, and run command construction. These phases can be handled manually, but we recommend users use our installer script to get started.
NVD includes an interactive install script to help configure database paths and establish user settings:
curl -fsSL https://raw.githubusercontent.com/dholab/nvd/main/install.sh | bash
Or download and inspect first:
curl -fsSL https://raw.githubusercontent.com/dholab/nvd/main/install.sh -o install.sh
chmod +x install.sh
./install.sh
Prerequisites (must be installed separately):
- Java 11 or newer
- Nextflow
- Docker, Apptainer/Singularity, or Pixi (for containerized execution)
The script will help you:
- Check that prerequisites are installed
- Detect available execution environments (Docker, Apptainer)
- Configure database paths
- Optionally download reference databases
- Create a configuration file at ~/.nvd/user.config
See the Installation Guide for more details.
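The first two installer steps listed above (checking prerequisites and detecting execution environments) can be approximated by hand. The following is a minimal sketch, not the installer's actual logic; the function name `missing_tools` is hypothetical, and it only checks that a command is on your PATH, not its version.

```shell
# Hypothetical helper: print each named tool that is NOT found on PATH.
# Prints nothing when every tool is available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, `missing_tools java nextflow` lists any missing prerequisites, and `missing_tools docker apptainer singularity pixi` shows which execution environments are absent. Only one of the latter group needs to be present.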
Required dependencies:
- Java 11 or newer (required by Nextflow)
- Nextflow
- Container runtime: Docker, Apptainer/Singularity, or Pixi (for local execution)
With Nextflow and a container runtime installed, the pipeline can run using containerized dependencies. No additional software installation needed.
For Conda-based dependencies (optional): The pipeline includes a pyproject.toml and pixi.lock for reproducible Conda environments via Pixi. Note that Pixi is only needed for managing Conda dependencies - containerized execution (Docker/Apptainer) is generally preferred and doesn't require Pixi.
# If using Pixi for local execution
pixi shell --frozen
Note also that new versions of our container image are automatically built and pushed to Docker Hub on each NVD release.
NVD is predominantly designed to be run offline, which means users must download the required reference files ahead of time. Since version 3.0, the list of required files is much shorter: users just need our Deacon index of human-infecting viruses, which is less than half a gigabyte, and the BLAST core_nt database, which is about 230 gigabytes. We provide both via our LabKey server as follows.
Important
Prefer the installer for reference setup. The installer can download, verify, arrange, and place the reference artifacts for you, including extracting the BLAST database archive and checking that your system has enough available space for it. Manual download is still supported, but it is easier to get paths, checksums, or extraction layout wrong by hand.
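If you do download by hand, a free-space check similar in spirit to the installer's can be sketched as follows. This is an illustration only: the function name `has_free_space` is hypothetical, and the amount of space you actually need is an assumption you must set yourself, since the archive and its extracted copy may both be on disk at once.

```shell
# Hypothetical helper: succeed if the filesystem holding directory $1
# has at least $2 GiB available.
has_free_space() {
    local avail_kb required_kb
    # POSIX df output: the 4th column of the 2nd line is available 1024-byte blocks.
    avail_kb="$(df -Pk "$1" | awk 'NR == 2 { print $4 }')"
    required_kb=$(( $2 * 1024 * 1024 ))
    [ "$avail_kb" -ge "$required_kb" ]
}
```

A call such as `has_free_space /path/to/references 250 || echo "not enough space"` would then guard the download, where 250 is a placeholder value rather than a documented requirement.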
Both of these reference datasets are publicly available via wget or curl from the O'Connor Laboratory's LabKey server, like so:
# download the BLAST database archive
wget https://dholk.primate.wisc.edu/_webdav/dho/projects/lungfish/InfinitePath/public/%40files/release-v3.0.0/v3.0.0/blast_db_v3_0.tar.gz
# download the deacon human-infecting virus index
wget https://dholk.primate.wisc.edu/_webdav/dho/projects/lungfish/InfinitePath/public/%40files/release-v3.0.0/v3.0.0/human_infecting_viruses.k31w1.idx
(curl -fSsL -O can be substituted for wget in the above commands if desired; the -O flag makes curl write the download to a file instead of standard output)
If you download manually, verify the files against the checksum manifest:
wget https://dholk.primate.wisc.edu/_webdav/dho/projects/lungfish/InfinitePath/public/%40files/release-v3.0.0/v3.0.0/checksums_v3_0.txt
Also at that endpoint, if desired, is a pre-built Apptainer image file for use on HPC clusters or other Linux environments:
wget https://dholk.primate.wisc.edu/_webdav/dho/projects/lungfish/InfinitePath/public/%40files/release-v3.0.0/v3.0.0/nvd-v3.0.0.sif
The installer extracts the BLAST tarball for you. If you download manually, extract blast_db_v3_0.tar.gz before use and point blast_db at the extracted directory. The deacon index is already a single .idx file and does not need extraction.
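Manual verification might look like the sketch below. It rests on an assumption this README does not confirm: that the checksum manifest uses the format read by `sha256sum -c` (one `<hash>  <filename>` pair per line). If the manifest uses another algorithm or layout, substitute the matching tool. The function name `verify_downloads` is hypothetical.

```shell
# Hypothetical helper: verify files in directory $1 against manifest $2.
# ASSUMPTION: the manifest is in sha256sum format ("<hash>  <filename>").
verify_downloads() {
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$1" && sha256sum -c "$2" )
}
```

With the downloads in one directory, `verify_downloads /path/to/references checksums_v3_0.txt` should report OK for each file before you extract the BLAST archive with `tar -xzf blast_db_v3_0.tar.gz`. Note that a manifest listing files you chose not to download, such as the SIF, will report those entries as failures.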
To reduce sequence search space before assembly and BLAST verification, NVD uses the aforementioned minimizer index representing human-infecting viruses. This index is derived from NCBI's STAT dense tree-of-life minimizer database, filtered to NVD's curated human-virus taxid list, and converted into deacon's .idx format. It is used as a fast screening step, not as the final taxonomic classifier.
You can pass it through a params file as virus_index, or with the CLI:
nvd run --params-file run.yaml --virus-index /path/to/human_infecting_viruses.k31w1.idx
For the rationale, source STAT database, curated taxid list, conversion script, build parameters, and reproducibility notes, see the Virus Enrichment Index Guide.
After the initial installer has cloned NVD and installed the Pixi environment, you can re-run the setup portion directly with:
nvd setup
nvd setup handles the following:
- detects whether you are on an O'Connor Lab CHTC access point or a generic system
- detects your shell and can install the nvd setup shell-hook line into your shell startup file
- writes or refreshes ~/.nvd/setup.conf, including the stable repo path used by the wrapper
- on CHTC, records the default profile, shared taxonomy location, and shared preset store location
- on CHTC, creates or refreshes ~/.nvd/user.config from the CHTC template if you approve overwriting it
- installs or refreshes the lightweight ~/.local/bin/nvd wrapper script if you approve overwriting it
- generates shell completions
- on CHTC, offers to download or re-download the release SIF into ~/.nvd
Those conveniences are intentionally separate from run-specific configuration. BLAST database paths, the deacon virus index path, samplesheets, output locations, preprocessing choices, and LabKey settings should still be provided through params files, presets, explicit CLI flags, or direct Nextflow config when you really need it.
Useful setup variants include:
nvd setup --force # overwrite existing setup-managed files without repeated prompts
nvd setup --skip-container # skip the CHTC SIF download prompt
nvd setup --skip-shell-hook # leave shell startup files untouched
nvd setup --config-dir ~/.nvd # choose the NVD config directory explicitly
With the environment and source code set up, next you'll need to organize your input reads into a simple CSV samplesheet. It must look like this:
sample_id,srr,platform,fastq1,fastq2
nanopore_test,,ont,nanopore.fastq.gz,
illumina_test,,illumina,illumina_R1.fastq.gz,illumina_R2.fastq.gz
sra_test,SRR33296246,,
Note that this example samplesheet is provided in the repo's assets directory for convenience.
Again, while manual setup is supported, it's also more error-prone. We instead recommend users take advantage of the nvd samplesheet CLI. It can scan a directory of FASTQ files, infer paired-end samples by filename, preview the generated table, write the CSV, and validate the result before you run the pipeline.
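To make "infer paired-end samples by filename" concrete, here is a hypothetical illustration of the idea. It is not the logic that nvd samplesheet uses: the `_R1.fastq.gz` / `_R2.fastq.gz` naming convention and the function name `pair_fastqs` are assumptions chosen for the example.

```shell
# Hypothetical illustration only: pair FASTQ files in directory $1 by an
# assumed _R1/_R2 suffix and print samplesheet rows for platform $2.
pair_fastqs() {
    local dir="$1" platform="$2" r1 r2 sample
    echo "sample_id,srr,platform,fastq1,fastq2"
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue   # unmatched glob stays literal; skip it
        sample="$(basename "$r1" _R1.fastq.gz)"
        r2="$dir/${sample}_R2.fastq.gz"
        if [ -e "$r2" ]; then
            echo "${sample},,${platform},${r1},${r2}"
        else
            # No mate found: leave the fastq2 column empty.
            echo "${sample},,${platform},${r1},"
        fi
    done
}
```

In practice, prefer the CLI commands that follow, since they also preview and validate the result.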
For local FASTQ files:
nvd samplesheet generate --from-dir ./fastqs --platform illumina --output samplesheet.csv
For Nanopore/ONT files, use --platform ont:
nvd samplesheet generate --from-dir ./nanopore-fastqs --platform ont --output samplesheet.csv
For public SRA inputs, put one accession per line in a text file:
SRR33296246
SRR33296247
Then generate a samplesheet from that accession list:
nvd samplesheet generate --from-sra accessions.txt --platform illumina --output samplesheet.csv
If you want to inspect what NVD would write before touching the filesystem, use dry-run mode:
nvd samplesheet generate --from-dir ./fastqs --platform illumina --dry-run
You can also validate any existing samplesheet explicitly:
nvd samplesheet validate samplesheet.csv
Like other NVD subcommands, the samplesheet commands also have shorter aliases for command line wizards:
nvd samplesheet gen -d ./fastqs -p illumina -o samplesheet.csv
nvd samplesheet val samplesheet.csv
nvd ss gen -d ./fastqs -p illumina -o samplesheet.csv
nvd ss val samplesheet.csv
If you have reference databases downloaded and a samplesheet generated, you're ready to run NVD!
The impetus for the nvd command line interface was to make complex Nextflow runs easier to launch, validate, resume, and repeat. The nvd run subcommand is the recommended entry point for normal use. It accepts the same kinds of runtime values you would otherwise pass to Nextflow, but it also handles params-file merging, preset lookup, config discovery, CHTC setup integration, taxonomy environment handling, and friendlier dry-run behavior.
For reproducibility, we recommend putting run settings in a YAML or JSON params file rather than building one long shell command. The generated params files include schema references, so editors like VS Code and Neovim can offer autocomplete and inline validation when YAML or JSON language support is installed.
Start by generating a params file and editing it for your environment:
nvd params init run.yaml
At minimum, a first run usually needs a samplesheet, an experiment ID, BLAST database settings, a deacon enrichment index, and a taxonomy directory. Those can all live in run.yaml; paths should point to locations visible to the machine or cluster worker that will execute the pipeline.
Before launching, validate the params file:
nvd params check run.yaml
If you are preparing the file on a laptop but the paths only exist on CHTC or another cluster, skip local path checks:
nvd params check run.yaml --no-check-paths
Then launch the run:
nvd run --params-file run.yaml --profile docker
You can preview the generated Nextflow command without starting the pipeline:
nvd run --params-file run.yaml --profile docker --dry-run
If you'd rather spell out your params in the command line, that's supported too:
nvd run \
--samplesheet samplesheet.csv \
--experiment-id exp001 \
--blast-db /path/to/blast_db \
--blast-db-prefix core_nt \
--virus-index /path/to/human_infecting_viruses.k31w1.idx \
--taxonomy-dir /path/to/taxdump \
--profile docker
If your group has a shared preset, use it for common reference paths and defaults, then keep run-specific values in your params file:
nvd run --preset chtc-defaults --params-file run.yaml
CLI flags override both presets and params files, which is useful for small one-off changes:
nvd run --preset chtc-defaults --params-file run.yaml --results ./results/exp001
Resume with the same inputs by adding --resume, or ask NVD to resume the last cached command:
nvd run --params-file run.yaml --resume
nvd resume
For more on authoring params, managing presets, samplesheet helpers, secrets, and taxonomy setup, see the NVD CLI Guide.
Note: Prepend commands with pixi run when not in an active environment shell, for example pixi run nvd run --params-file run.yaml.
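The layering of presets, params files, and CLI flags described above can be pictured with a small sketch. It illustrates the precedence rule only and is not how nvd run is implemented. The README states that CLI flags override both other sources; that a params-file value overrides a preset value is an assumption here, inferred from the advice to keep run-specific values in the params file. The function name `resolve_param` is hypothetical.

```shell
# Hypothetical illustration of layered precedence for a single setting.
# $1: value from a CLI flag, $2: value from the params file,
# $3: value from the preset. An empty string means "not set".
resolve_param() {
    local cli="$1" params_file="$2" preset="$3"
    # Take the first non-empty value, highest priority first.
    echo "${cli:-${params_file:-$preset}}"
}
```

For instance, `resolve_param "./results/exp001" "./results/default" ""` prints `./results/exp001`, mirroring the one-off `--results` override shown earlier.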
Direct Nextflow execution is still available, but we recommend treating it as a lower-level fallback for debugging workflow behavior. The nvd run CLI handles parameter merging, preset lookup, config discovery, CHTC setup integration, taxonomy environment handling, and friendlier validation before it launches Nextflow.
If you do need to run Nextflow directly, pass the same explicit runtime values that nvd run would pass for you. In v3 there is no --tools selector as in v2; the main workflow runs the current NVD pipeline, and optional behavior is controlled by params.
nextflow run . \
-profile docker \
--samplesheet $YOUR_SAMPLESHEET \
--experiment_id github_readme_test \
--blast_db $YOUR_REFERENCE_PATH/blast_db \
--blast_db_prefix core_nt \
--virus_index $YOUR_REFERENCE_PATH/human_infecting_viruses.k31w1.idx \
--taxonomy_dir $YOUR_TAXDUMP_PATH
This command assumes YOUR_SAMPLESHEET points to your samplesheet CSV, YOUR_REFERENCE_PATH points to the directory where the BLAST database directory and deacon index live, and YOUR_TAXDUMP_PATH points to an NCBI taxdump directory visible to the machine or worker running the pipeline.
If you are using a generated params file, direct Nextflow can consume it too:
nextflow run . -profile docker -params-file run.yaml
That said, prefer this equivalent CLI form unless you specifically need to bypass the wrapper:
nvd run --params-file run.yaml --profile docker
NVD is designed so heavyweight references are prepared before pipeline execution. The BLAST database, deacon virus index, and NCBI taxonomy dump should be available before launching a serious run, especially on CHTC or other distributed systems where worker jobs may not have the same filesystem view or outbound network access as the login node.
Use nvd taxonomy ensure or the installer/setup flow to prepare taxonomy, and use nvd taxonomy status to inspect it. On CHTC, nvd setup records the shared taxonomy location in NVD_TAXONOMY_DB, and nvd run passes that through to the pipeline.
The installer can download and arrange reference artifacts, but it does not silently persist BLAST or deacon paths as hidden runtime state. Put those paths in a params file, preset, CLI flags, or explicit Nextflow config.
NVD uses NCBI taxonomy files for BLAST annotation and LCA resolution. The taxonomy directory should contain the standard taxdump files:
- nodes.dmp, names.dmp, merged.dmp - Raw NCBI taxonomy files
- taxonomy.sqlite - Indexed database for fast lookups
For local CLI taxonomy commands, NVD can manage a local taxonomy directory. For pipeline runs, especially distributed runs, the taxonomy directory should be explicit and visible to worker jobs.
Useful commands:
nvd taxonomy ensure --taxonomy-dir /path/to/taxdump
nvd taxonomy status --taxonomy-dir /path/to/taxdump
You can also set:
export NVD_TAXONOMY_DB=/shared/path/taxdump
When using nvd run, this environment variable is materialized into the pipeline's --taxonomy_dir parameter if you do not pass --taxonomy-dir directly.
For offline or restricted-network environments, pre-populate the BLAST database, deacon virus index, and taxonomy directory before launching the pipeline. To prevent taxonomy helpers from attempting network refreshes, set:
export NVD_TAXONOMY_OFFLINE=1
Offline mode requires the taxonomy directory to already contain the expected NCBI taxdump files.
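Before switching to offline mode, you may want to confirm that the taxdump files named above are actually in place. The sketch below is one way to do that by hand. It is not a substitute for nvd taxonomy status, the function name `check_taxdump` is hypothetical, and it checks only the three raw .dmp files, not taxonomy.sqlite.

```shell
# Hypothetical pre-flight check for a taxonomy directory.
# $1: explicit directory (optional); falls back to NVD_TAXONOMY_DB.
# Returns 0 if nodes.dmp, names.dmp, and merged.dmp all exist and are non-empty.
check_taxdump() {
    local dir="${1:-${NVD_TAXONOMY_DB:-}}" missing=0 f
    if [ -z "$dir" ]; then
        echo "no taxonomy directory given and NVD_TAXONOMY_DB is unset" >&2
        return 2
    fi
    for f in nodes.dmp names.dmp merged.dmp; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A guarded form such as `check_taxdump /shared/path/taxdump && export NVD_TAXONOMY_OFFLINE=1` then enables offline mode only when the files are present.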
- NVD CLI Guide - Comprehensive guide to the nvd command-line tool, params files, and presets (recommended)
- Installation Guide - Detailed installation instructions using install.sh
- Direct Nextflow Examples - Traditional Nextflow command examples
- Virus Enrichment Index Guide - Notes on the deacon human-infecting virus enrichment index
- Contributor Guide - Development guidelines and best practices
- Multi-platform sequencing support: Seamlessly processes both Illumina and Oxford Nanopore data with platform-specific optimizations
- Smart contig assembly: Automatically assembles reads with SPAdes and filters contigs for optimal classification accuracy
- Two-phase BLAST verification: Uses both megablast and blastn with intelligent filtering to minimize false positives
- Least Common Ancestor (LCA) resolution: Resolves ambiguous BLAST hits by computing taxonomic consensus—either using dominant taxid assignment when one organism has strong support (>80% bitscore weight), or calculating the LCA for near-tie cases to avoid over-specificity when multiple closely-scoring hits disagree at the species level
- Advanced taxonomic filtering: Sophisticated lineage-based filtering with adjustable stringency for precise organism identification
- Human read scrubbing: Built-in capability to remove human sequences for privacy-compliant public data sharing
- LIMS data integration: Native LabKey LIMS integration with WebDAV file uploads and structured metadata management
- Comprehensive quality control: Read counting, contig metrics, and BLAST hit validation throughout the pipeline
- Flexible workflow orchestration: Mix-and-match subworkflows (nvd, gottcha, clumpify) based on research needs
- Production-ready deployment: Docker/Apptainer containerization with Pixi environment management for reproducible execution
- Intelligent error handling: Robust retry logic and graceful failure modes for reliable high-throughput processing
- SRA integration: Direct processing of NCBI SRA datasets alongside local FASTQ files
- Real-time validation: Pre-flight checks for database integrity, API connectivity, and experiment ID uniqueness
- Multi-format output: Generates taxonomic reports, FASTA sequences, and structured CSV files for downstream analysis
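The dominant-taxid branch of the LCA resolution feature listed above can be sketched with awk. This is a hedged illustration, not NVD's implementation. It assumes that "bitscore weight" means each taxid's share of the summed bitscores across a contig's hits, and that hits arrive as tab-separated `taxid<TAB>bitscore` lines. Both the input format and the function name `dominant_taxid` are hypothetical. The LCA computation itself is not shown, only flagged.

```shell
# Hypothetical illustration: read "taxid<TAB>bitscore" lines on stdin and
# print the taxid whose share of total bitscore exceeds threshold $1
# (for example 0.8). Print "needs_lca" when no taxid clears the threshold.
dominant_taxid() {
    awk -F '\t' -v thr="$1" '
        { sum[$1] += $2; total += $2 }
        END {
            if (total <= 0) { print "none"; exit }
            best = ""
            for (t in sum)
                if (best == "" || sum[t] > sum[best]) best = t
            if (sum[best] / total > thr) print best
            else print "needs_lca"
        }'
}
```

With hits of 100 and 90 for one taxid and 10 for another, the first taxid holds 190 of 200 total bitscore and is reported as dominant at a 0.8 threshold. Two taxids with equal scores would instead fall through to the LCA case.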
Coming soon!
- Marc Johnson and Shelby O'Connor, our partners in innovative pathogen monitoring from environmental samples.
- Kenneth Katz, NCBI, for developing NCBI STAT, maintaining pre-built databases for STAT, and helpful discussions
- C. Titus Brown, for helpful discussions of using kmer classifiers as part of metagenomic workflows
- Development funded by Inkfish
See LICENSE for more information. NVD2's predecessor used the copyleft GPLv3 license, which means we have to as well. This means you’re welcome to use it, share it, and modify it, as long as any changes you distribute are also shared under the same license.
By contributing to this project, you agree that your code will be released under GPLv3. This means:
- Share alike – if you share modified versions of this project, they must also be under GPLv3.
- Freedom to use – anyone can use the code for personal, educational, or commercial purposes, as long as they respect the license.
- Source availability – if you distribute the software (original or modified), you must also make the source code available under GPLv3.
- Community contributions – your pull requests and patches automatically become part of the GPLv3-licensed project.
- Commercial use - You’re free to use NVD2 code in commercial contexts, but you can’t combine it with proprietary software without open-sourcing that software under GPLv3 too.
- No Warranty and Liability - The software is provided as-is, without any warranty or liability.