Skip to content

MultipathogenGenomics/dampa

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

107 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Dampa

Dampa: Diversity Aware Metagenomic Panel Assignment

Targeted metagenomics is an approach that uses sets of oligonucleotide probes (up to 100,000s) to selectively sequence targeted loci or complete genomes from hundreds of pathogen species, removing the need to identify or isolate the pathogen first. When designing a probeset for targeted metagenomics, the optimal solution includes the minimum number of probes required to effectively enrich the desired set of target organisms or loci. DAMPA (Diversity Aware Metagenomic Panel Assignment), is a user-friendly probeset design pipeline that combines novel target selection algorithms with existing probe design tools to minimise probe number while maintaining unbiased coverage of target genomes. The key approach used by DAMPA is the generation of a pangenome graph to describe the diversity of targeted loci. A pangenome graph is a representation of all unique sequences in a species, or in this case, a dataset. Crucially, conserved regions are collapsed while diverse regions are separated. The graph therefore presents the ideal target for probe design as it will ensure that probes have sufficient identity to bind to all genomic regions regardless of their diversity and natively incorporates support for recombination. DAMPA applies existing probe design tools to the graph to design probes while also ensuring all regions of input genomes are covered.

The core design pipeline:

  • input genome filtering
  • pangraph to generate graph
  • linear chains of graph nodes merged
  • consensus of each node in graph shredded to generate 120bp probes
  • refine probe set relative to input genomes with probetools
  • generate summary plots and statistics

dampa pipeline

Dependencies

  • python >=3.10
  • biopython >=1.84
  • pandas >=2.0
  • matplotlib >= 3.10
  • seaborn >=0.13
  • blast >=2.16
  • vsearch >=2

Installation

conda

conda will install all dependencies for you

conda install -c bioconda -c conda-forge dampa

pip (must have python/pip installed)

using pip you must install blast and vsearch yourself

git clone https://github.com/MultipathogenGenomics/dampa.git

cd dampa

pip install .

Anaconda-Server Badge Anaconda-Server Badge DOI

Usage

Quickstart

dampa design

Design probes for genomes in multifasta

dampa design -g <genomes.fasta> -o <output_folder_path> -p <output_prefix_for_files>

dampa eval

Evaluate probes in multifasta against genomes in multifasta

dampa eval -g <genomes.fasta> -q <probes_to_eval.fasta> -o <output_folder_path> -p <output_prefix_for_files>

Design Probes

dampa design is the core module of dampa used to generate probe sets

dampa design [options]

Input/Output options

  • -g, --input: Either folder containing individual genome fasta files OR a single fasta file containing all genomes (files must end in .fna, .fa or .fasta) (required)
  • --maxnonspandard: maximum proportion of genome that can be non ATGC (0-1) (default: 0.01)
  • -o, --outputfolder: path to output folder (default: current working directory)
  • -p, --outputprefix: prefix for all output files and folders (default: dampa_run)

General settings

  • -l, --probelen: length of output probes (default: 120)
  • -s, --probestep: step of probes (for no overlap set to same as probelen) (default: 120)
  • --skipsubambig: do NOT substitute ambiguous nucleotides (by default N or other ambiguous nucleotides are substituted for ATGC in a random selection weighted by proportions in input genomes)
  • --shannonthresh: minimum shannon entropy of probes and reported coverage regions (filters out low complexity probes/regions) (default: 1.5)

Pangraph settings - these modify the generation of the underlying pangenome graph used to generate probes

  • --pangraphident: Pangenome percentage identity setting allowable values are 5,10 or 20 (default: 20)
  • --pangraphalpha: Energy cost for splitting a block during alignment merger. Controls graph fragmentation (default: 0)
  • --pangraphbeta: Energy cost for diversity in the alignment. A high value prevents merging of distantly-related sequences in the same block (default: 6.666)
  • --pangraphlen: Minimum length of a node to allow in pangenome graph (default: 90)
  • --pangraphstrict: enable the -S strict identity option which limits merges to 1/pangraphbeta divergence
  • --pangraphdepth: Minimum depth of a node to allow in pangenome graph. Nodes with depth below this will be removed from the graph. Set to 0 to not remove any nodes based on depth (default: 0)

Probetools settings - these modify the settings for probetools which is used to fill in gaps of genome coverage after pangenome probe design

  • --probetoolsidentity: Minimum identity in probe match to target to call probe binding (default: 80)
  • --probetoolsalignmin: Minimum length (bp) of probe-target binding to allow call of binding (default: 90)
  • --probetools0covnmin: Minimum length (bp) of 0 coverage region in input genomes to trigger design of additional probes (default: 20)
  • --maxambig: The maximum number of ambiguous bases allowed in a probe (default: 10)
  • --nodust: Do not run low complexity filter in BLAST (within probetools)

cdhit preclustering settings - these settings modify the use of cdhit to precluster input genomes before pangraph generation this can significantly speed up computation but can risk missing small divergent regions.

  • --clustering_inputno_trigger: if number of input sequences exceeds this number then cdhit will be used to deduplcate genomes above 99.9 percent identity. Set to 0 to always cluster (default: 5000)
  • --clusterident: Minimum identity to cluster genomes (default: 0.999)
  • --clustercov: Minimum coverage of genomes over which clusterident must apply (0-1) (default: 1)

outlier removal settings - these settings facilitate the identification and removal of outlier genomes that may represent contamination, mislabelling or artificially modified genomes.

  • --remove_outliers: perform initial high threshold clustering followed by low threshold clustering of representatives, remove low level clusters composed of one high level cluster with <= outliersizelimit members
  • --outliersizelimit: In two step clustering if a cluster is <=outliersizelimit at both high and low identity then it is treated as an outlier (default: 1)
  • --outlierclusterident: outlier identity threshold, i.e. if a cluster is <outlierclusterident and <=outliersizelimit members it is treated as an outlier (default: 0.85)
  • --outlierclustercov: Minimum coverage of genomes over which clusterident must apply (0-1) (default: 1.0)

target generation settings - these settings modify generation of target fasta files for use in downstram analyses of targeted metagenomic data (with castanet)

  • --generate_targets: generate target sequences from pangenome graph nodes
  • --subnodeidentitythresh: identity threshold at which sequences within a pangenome node are clustered to generate targets (default: 0.95)
  • --target_unioncov: minimise pangenome graph size for target generation by removing nodes represented elsewhere in pieces
  • --graphid: ID to be used in naming graph in output file headers
  • --nthresh: proportion of Ns allowed in any given graph node to be included in targets (default: 0.1)

small seq padding settings

A pangraph node of length 110 will be padded with 10 Ts to make it 120bp

  • --skip_padding: do not generate additional probes for pangenome nodes between pangraphlen and probelen in length (i.e. if pangraphlen is 100 abd probelen is 120 a pangraph node of length 110 will be padded with 10 Ts to make it 120bp)
  • --padding_nuc: nucleotide to use for padding probes to args.probelen (choices: A, T, C, G) (default: T)
  • --minlenforpadding: minimum length for a pancontig for it to be padded (default: 90)

summary descriptions settings

  • --skip_summaries: do NOT run visualisation generation of dampa probes relative to input genomes
  • --maxdepth_describe: Maximum depth of probe coverage to describe separately (default: 1)
  • --report0covperc: threshold above which genomes are reported as having too much of their genome not covered by any probes (default: 1)

Additional settings

  • --skip_probetoolsfinal: do NOT run final probe design step (only produce probes purely from the pangenome)
  • --unioncov: minimise pangenome graph size by removing nodes represented elsewhere in sections
  • -t, --threads: number of threads (default: 1)
  • --keeplogs: keep logs containing output from pangraph and probetools
  • --keeptmp: keep intermediate files from pangraph and probetools
  • --version: print version and exit
  • --maxdiv: use new maxdiv pangraph version

Evaluate Probes

dampa eval will evaluate the performance of a probe set against a set of genomes and generate summary statistics and visualisations

dampa eval [options]

Input/Output options

  • -g, --input: Genomes to check probe coverage. If genomes either folder containing individual genome fasta files OR a single fasta file containing all genomes (files must end in .fna, .fa or .fasta). If capture file then a pt file from a previous pangraph design or pangraph eval run (required)
  • --inputtype: input type, either genomes or capture (choices: genomes, capture) (default: genomes)
  • -q, --probes: Fasta file containing probes to evaluate (files must end in .fna, .fa or .fasta) (required)
  • --filtnonstandard: remove genomes with non standard nucleotides i.e. not A, T, G, C or N
  • -o, --outputfolder: path to output folder (default: current working directory)
  • -p, --outputprefix: prefix for all output files and folders (default: probebench_run)

Probetools settings

  • --probetoolsidentity: Minimum identity in probe match to target to call probe binding (default: 80)
  • --probetoolsalignmin: Minimum length (bp) of probe-target binding to allow call of binding (default: 90)
  • --probetools0covnmin: Minimum length (bp) of 0 coverage region in input genomes to trigger design of additional probes (default: 20)
  • --nodust: Do not run low complexity filter in BLAST (within probetools)
  • -l, --probelen: length of output probes (default: 120)
  • --shannonthresh: minimum shannon entropy of probes and reported coverage regions (filters out low complexity probes/regions) (default: 1.5)

Additional settings

  • -t, --threads: number of threads (default: 1)
  • --keeplogs: keep logs containing output from pangraph and probetools
  • --keeptmp: keep intermediate files from pangraph and probetools
  • --maxdepth_describe: Maximum depth of probe coverage to describe separately (default: 1)
  • --report0covperc: threshold above which genomes are reported as having too much of their genome not covered by any probes (default: 1)
  • --version: print version and exit

Generate Targets

dampa targets will generate target sequences from a previously generated dampa pangenome graph json file

dampa targets [options]

General settings

  • -j, --inputjson: path to dampa design json arguments file (required)
  • -o, --outputfolder: path to output folder (default: current working directory)
  • -p, --outputprefix: prefix for all output files and folders (default: probebench_run)
  • -l, --probelen: length of output probes (default: 120)
  • --nthresh: proportion of Ns allowed in any given graph node to be included in targets (default: 0.1)
  • --keeplogs: keep logs containing output from pangraph and probetools
  • -t, --threads: number of threads (default: 1)
  • --subnodeidentitythresh: identity threshold at which sequences within a pangenome node are clustered to generate targets (default: 0.95)
  • --target_unioncov: minimise pangenome graph size for target generation by removing nodes represented elsewhere in pieces
  • --graphid: ID to be used in naming graph in output file headers

Outputs

Design

_probes.fasta

The final probe set designed by dampa design

_pergenome_stats_of_perc_coverage.csv

For each the proportion of the genome with no probe coverage (0), exactly 1 probe depth (1) and more than one prove depth (>1), also the mean coverage depth across the genome. By default coverage is calculated by finding blast hits of probes in the genome with over 90bp length and 85% identity.

_summary_stats_of_perc_coverage.csv

Min, Max, Mean and Median of statistics in _pergenome_stats_of_perc_coverage. Additionally a count of total probe numbers and number of genomes that "failed" according to the missing coverage theshold defined with --report0covperc

_perc_genome_cov_boxplots.pdf

A plot of each genomes proportion of 0, 1 and >1 genome coverage including a boxplot for each category.

_genomecov_lennorm.pdf

A line plot where each strain is a line that shows probe depth across the genome using a sliding 120bp window. Genomes are normalised for length but not aligned (Truncated genomes may look odd).

Eval

Outputs are the same as dampa design without _probes.fasta

About

Dampa: Diversity Aware Metagenomic Panel Assignment

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages