DupCaller is a universal tool for calling somatic mutations and calculating somatic mutational burden from barcoded error-corrected next generation sequencing (ecNGS) data with matched normal (e.x. NanoSeq).
- Prerequisites
- Installation
- Pipeline
- Results
- End-to-End Examples
- Resources
- Citation
- Copyright
- Contact
DupCaller requires python>=3.10 to run. Earlier versions may be sufficient to run DupCaller but have not been tested. The complete DupCaller pipeline also requires the following tools for data preprocessing. The versions are used by the developer and other versions may or may not work.
- BWA version 0.7.17 (https://bio-bwa.sourceforge.net)
- GATK version 4.2.6 (https://github.com/broadinstitute/gatk/releases)
- Tabix for indexing compressed genomic files (recommended installation:
conda install bioconda::tabix)
The tool uses pip for installing scripts and prerequisites. We recommend creating a new environment to install DupCaller:
conda create -n DupCaller python=3.12 bioconda::tabixTo install DupCaller, simply clone this repository and install via pip:
conda activate DupCaller
git clone https://github.com/AlexandrovLab/DupCaller.git
cd DupCaller
pip install .A pre-built Docker image is available on Docker Hub at yuhecheng62/dupcaller:1.1.0-amd64.
Pull and run with Singularity:
Pull the image from Docker Hub (only needed once):
singularity pull dupcaller-1.1.0.sif docker://yuhecheng62/dupcaller:1.1.0-amd64For quick verification:
singularity exec dupcaller-1.1.0.sif DupCaller.py --helpFor installation-free execution of DupCaller commands, run all DupCaller.py commands with singularity exec and binding of current directories:
singularity exec --bind $(pwd):$(pwd) dupcaller-1.1.0.sif DupCaller.py {your commands}DupCaller uses a numpyrized reference genome to perform memory-efficient reference fetching and trinucleotide context indexing. Indexing requires a tabix-indexed BED file of short tandem repeat (STR) regions, which is used to annotate STR loci for improved indel calling near repetitive regions. To index the reference genome, run:
DupCaller.py index -f reference.fa -s str_regions.bed.gzThe command will generate three h5 files in the same folder as the reference: {reference}.ref.h5, {reference}.tn.h5 and {reference}.hp.h5, which are numpyrized reference sequences, trinucleotide contexts, and homopolymer/STR annotations, respectively. Make sure that when running other DupCaller utilities, the three files are within the same folder as the reference genome.
The STR BED file must be tabix-indexed (.bed.gz with a .tbi index).
For human reference genome hg38 and mouse reference genome mm39, we provided pre-built indexes and resource files in the Resources.
| Short | Long | Description |
|---|---|---|
| -f | --reference | Reference genome fasta file (required) |
| -s | --strbed | Tabix-indexed BED file of STR regions (required) |
We have built indexes for human reference genome GRCh38/hg38 and mouse reference genome GRCm39/mm39, and can be downloaded at:
For other reference genomes, a BED file of simple repeats is needed to build the index. The file can be obtained from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) with the following steps:
- For Genome, select the reference genome.
- For Group, select "Repeats","Variation and Repeats".
- For Table, select "repeat masker" or similar table. (Tips: the actual path to repeat masker may be different for different reference genome. Look for a table named "rmsk")
- Add filter: "repClass Does match Simple_repeat".
- For Output format, select "BED - browser extensible data".
- Input desired filename (e.g.
mm39_str.bed) and download the output BED file. - Sort the output bed file with your reference genome (Use bedtools or similar) and compress and index with bgzip:
sortBed -i str.bed -faidx reference.fa.fai > str_sorted.bed
bgzip str_sorted.bed
tabix str_sorted.bed.gzDupCaller.py trim extracts 5-prime barcodes from paired-end FASTQs:
DupCaller.py trim -i read1.fq -i2 read2.fq -p barcode_pattern -o sample_nameread1.fq/read2.fq— FASTQ files from read 1 and read 2. Both unzipped and gzip-compressed files are supported.barcode_pattern— pattern of barcodes starting from the 5-prime end, withNrepresenting a barcode base andXrepresenting a skipped base (similar notation to UMI-tools). For example, NanoSeq uses a 3-base barcode followed by 4 constant bases, so the pattern should beNNNXXXX.sample_name— prefix of output paired FASTQs. After the run completes,{sample_name}_1.fastqand{sample_name}_2.fastqwill be generated. Barcodes will be recorded in each read name as{original_read_name}:{read1_barcode}+{read2_barcode}.
If the matched normal is prepared in the same way as the sample, apply trimming with the same scheme to the matched normal FASTQs. For traditional bulk normal, trimming is not needed.
Use a DNA NGS aligner such as BWA-MEM to align the trimmed FASTQs of both sample and matched normal. GATK requires read group fields ID, SM, and PL, so adding those tags during BWA alignment is recommended. FASTQ tags must be kept — for bwa mem this requires the -C option.
bwa mem -C -t {threads} -R "@RG\tID:{sample_name}\tSM:{sample_name}\tPL:ILLUMINA" \
reference.fa {sample_name}_1.fastq {sample_name}_2.fastq \
| samtools sort -@ {threads} > {sample_name}.bam
samtools index -@ {threads} {sample_name}.bamthreads— number of cores used for aligningreference.fa— reference genome FASTA file{sample_name}_1.fastq/{sample_name}_2.fastq— trimmed FASTQ files from Step 2
Run GATK MarkDuplicates on sample and matched-normal BAMs. Optical and PCR duplicates must be treated differently in ecNGS variant calling:
- Set
--TAGGING_POLICY OpticalOnlyto differentiate optical from PCR duplicates. - Set
--DUPLEX_UMI truefor duplex UMI handling. - Set
--READ_NAME_REGEXas shown below, because the read names were modified in Step 2.
Note: Outdated versions of GATK do not have the
--DUPLEX_UMIflag. Please update GATK to the latest version if this occurs.
gatk MarkDuplicates \
-I sample.bam -O sample.mkdped.bam -M sample.mkdp_metrics.txt \
--READ_NAME_REGEX "(?:.*:)?([0-9]+)[^:]*:([0-9]+)[^:]*:([0-9]+)[^:]*$" \
--DUPLEX_UMI --TAGGING_POLICY OpticalOnly --BARCODE_TAG DBAfter preprocessing, run DupCaller.py call to call somatic mutations. Usage depends on your experimental design:
Whole genome / whole exome / reduced genome (e.g. NanoSeq) with a matched normal:
DupCaller.py call -b ${sample}.mkdped.bam -f reference.fa -o {output_prefix} \
-p {threads} -n {normal.bam} -g germline.vcf.gz -m noise_mask.bed.gzMutagenesis panel without a matched normal:
DupCaller.py call -b ${sample}.mkdped.bam -f reference.fa -o {output_prefix} \
-p {threads} -g germline.vcf.gz -m noise_mask.bed.gz -maf 0.1Note: DupCaller partitions jobs by genomic region; multithreading is less effective for small targeted panels. In this case, use at most one thread per distinct targeted region.
Input validation: DupCaller checks for the existence of all required files (BAM, reference, h5 index files, and optional files) before starting analysis, providing clear error messages for missing files.
Multi-threading: Coverage files from different threads are automatically merged post-processing with region-aware boundary detection and tabix indexing.
See the Results section for descriptions of all output files.
| Short | Long | Description |
|---|---|---|
| -b | --bam | BAM file of ecNGS data |
| -f | --reference | Reference genome FASTA file |
| -o | --output | Prefix of the output files |
These options should be understood and customized accordingly.
| Short | Long | Description | Default |
|---|---|---|---|
| -r | --regions | Contigs to consider for variant calling. For non-human species, set accordingly (e.g. for mouse: -r chr{1..19} chrX) |
chr{1..22} chrX |
| -g | --germline | Indexed germline VCF with AF field | None |
| -p | --threads | Number of threads | 1 |
| -n | --normalBams | BAM file(s) of matched normals. When unavailable, set -maf to an appropriate value (e.g. 0.1) |
None |
| -m | --noise | BED interval file(s) masking noisy positions | None |
| -R | --regionfile | Inclusive BED file specifying target regions | None |
| -maf | --maxAF | Maximum allele fraction to call a somatic mutation. Must be set when matched normal (-n) is unavailable |
1 |
| -tt | --trimF | Ignore mutations less than n bp from template ends | 7 |
| -tr | --trimR | Ignore mutations less than n bp from read ends | 7 |
The effect of changing these parameters should be evaluated before implementation.
| Short | Long | Description | Default |
|---|---|---|---|
| --naf | Maximum VAF in matched normal for a mutation to be called | 0.01 | |
| --rescue | Output discarded variants with reason in the FILTER field | False | |
| -nm | --nmflt | Filter reads with edit distance larger than this value | 5 |
| -ax | --minMeanASXS | Minimum mean AS-XS alignment score difference for a read group to be considered | 50 |
| -gaf | --germlineAfCutoff | Skip positions with germline AF above this threshold | 0.001 |
| -d | --minNdepth | Minimum coverage in normal for called variants | 10 |
These are variant calling model parameters; adjustment is unnecessary for general use.
| Short | Long | Description | Default |
|---|---|---|---|
| -AS | --amperrfile | Pre-learned error profile for amplification SBS error | None |
| -AI | --amperrfileindel | Pre-learned error profile for amplification indel error | None |
| -DS | --dmgerrfile | Pre-learned error profile for SBS damage | None |
| -DI | --dmgerrfileindel | Pre-learned error profile for indel damage | None |
| -ts | --thresholdSnv | Log likelihood ratio threshold for SNV calls | 8.5 |
| -ti | --thresholdIndel | Log likelihood ratio threshold for indel calls. Three values corresponds to the threshold for 1. homopolymer length less than 5; 2. homopolymer length 6 to 10; 3. homopolymer length more than 10 or are in STR regions. | 10.3,8.5,7 |
| -mq | --mapq | Minimum MAPQ for an alignment to be considered | 40 |
| -w | --windowSize | Genomic window size for coverage calculation and BAM partitioning | 100000 |
| -bq | --minBq | Bases with quality below this value will be set to 6 | 18 |
| -aq | --minAltQual | Minimum consensus quality of alt allele in a read group | 60 |
| --minRef | Minimum consensus quality of ref allele in a read group | 2 | |
| --minAlt | Minimum consensus quality of alt allele in a read group | 2 | |
| -z | --maxZeroQualFrac | Maximum fraction of zero-quality bases in a read family. Set to 0.1 if a noise mask is not available | 0.5 |
| -id | --indelBed | Indel enhanced Panel of Normals (ePoN) for indel calling | None |
The -m option accepts BED files and will ignore any mutations overlapping an excluded locus. Any custom BED file can be used as input.
After mutation calling, run burden estimation from the output folder:
DupCaller.py estimate -i sample -f reference.fa -r chr{1..22} chrXAdjust the -r regions according to the reference genome used.
For dNdScv coverage correction, DupCaller can output mean duplex depth per gene using the -gb option:
DupCaller.py estimate -i sample -f reference.fa -r chr{1..22} chrX -gb {target}.bedThe gene BED should have the fourth column formatted as {gene_name}_{exon_number} (e.g. tp53_1).
To re-estimate trinucleotide-corrected mutational burden in specific regions without re-running variant calling, use -rb:
DupCaller.py estimate -i sample -f reference.fa -r chr{1..22} chrX -rb {re_estimate}.bed| Short | Long | Description |
|---|---|---|
| -i | --prefix | Input prefix of results from call command |
| Short | Long | Description | Default |
|---|---|---|---|
| -f | --reference | FASTA file of reference genome | None |
| -ft | --refTrinuc | Precomputed trinucleotide composition of reference genome | None |
| Short | Long | Description | Default |
|---|---|---|---|
| -r | --regions | Contigs to consider for trinucleotide calculation | chr{1..22} chrX |
| -ot | --outTrinuc | Output the computed trinucleotide composition file for future use | None |
| -c | --clonal | Treat mutations detected in more than one molecule as one mutation | False |
| -d | --dilute | Archived — not needed in the latest version. Previously used when sample and matched normal originated from the same starting DNA material | False |
| -gb | --genebed | Gene BED file for per-gene coverage calculation | None |
| -rb | --reestimatebed | BED file for burden re-estimation in specific regions | None |
After running estimate on all samples, use DupCaller.py summarize to collate burden metrics and SBS96 profiles into multi-sample tables:
DupCaller.py summarize -i sample1 sample2 sample3 -o cohort_summary.txtThis reads {sample}_stats.txt, {sample}_sbs_burden.txt, {sample}_indel_burden.txt, and {sample}_sbs_96_corrected.txt from each sample folder and writes four output files:
| File | Description |
|---|---|
cohort_summary.txt |
One row per sample with burden metrics and library statistics |
cohort_summary_SBS96_uncorrected.txt |
96-context raw mutation counts across all samples (SigProfiler-compatible format) |
cohort_summary_SBS96_corrected.txt |
96-context trinucleotide-corrected counts across all samples |
cohort_summary_SBS96_genome.txt |
96-context estimated mutations per genome across all samples |
All SBS96 file can be directly input into SigProfilerPlotting
| Short | Long | Description |
|---|---|---|
| -i | --input | One or more sample folders (space-separated) |
| -o | --output | Output filename for the summary table (e.g. cohort_summary.txt) |
For detailed column-by-column and field-by-field descriptions of every output file, see the docs/ folder:
docs/call_outputs.md— all files fromDupCaller.py calldocs/estimate_outputs.md— all files fromDupCaller.py estimate
| File | Description |
|---|---|
{sample}_snv.vcf |
VCF of detected SNVs and MNVs |
{sample}_indel.vcf |
VCF of detected short indel mutations |
{sample}_coverage.bed.gz |
Duplex coverage depths across genomic positions. For multi-threaded runs, files from different threads are automatically merged with tabix indexing |
{sample}_coverage.bed.gz.tbi |
Tabix index for the coverage BED file |
{sample}_trinuc_by_duplex_group.txt |
Trinucleotide context counts grouped by duplex read number, used for burden estimation |
{sample}_duplex_family_strand_composition.txt |
Strand composition statistics for duplex read families |
{sample}_duplex_family_strand_composition_heatmap.pdf |
Heatmap visualization of duplex family strand composition |
{sample}_call_params.log |
Full record of all resolved parameters used for the call run |
{sample}_stats.txt |
Overall sequencing and analysis metrics |
| File | Description |
|---|---|
{sample}.amp.tn.txt |
Amplification SBS error profile by trinucleotide context |
{sample}.amp.id.txt |
Amplification indel error profile |
{sample}.dmg.tn.txt |
Damage SBS error profile by trinucleotide context |
{sample}.dmg.id.txt |
Damage indel error profile |
| File | Description |
|---|---|
{sample}_sbs_burden.txt |
SBS burden with uncorrected and corrected estimates and 95% confidence intervals |
{sample}_indel_burden.txt |
Indel burden with 95% confidence intervals, including masked and unmasked calculations |
{sample}_sbs_96_corrected.txt |
Corrected SBS counts across 96 trinucleotide contexts for signature analysis |
{sample}_sbs_burden_by_min_read_group_size.txt |
Burden estimates stratified by minimum read group size |
{sample}_duplex_allele_counts.txt |
Duplex depths and allele counts for each unique mutation |
| File | Description |
|---|---|
{sample}_sbs_96.pdf |
96-context mutational signature plots (3 pages: uncorrected counts, corrected counts, estimated mutations per genome) |
{sample}_sbs_burden_by_min_read_group_size.png |
Burden estimates across minimum read group sizes |
| File | Condition | Description |
|---|---|---|
{sample}_gene_coverage.txt |
-gb option |
Mean duplex coverage per gene for dNdScv correction |
{sample}_sbs_burden_re_estimate.txt |
-rb option |
Re-estimated burden for specific regions |
{sample}_sbs_96_corrected_re_estimate.pdf |
-rb option |
Signature plots for re-estimated regions |
The following examples assume the reference genome has already been indexed (DupCaller.py index) and that the germline VCF and noise mask are available. Replace all placeholder paths with your actual files.
SAMPLE=sample1
NORMAL=normal1
REF=/path/to/hg38.fa
GERMLINE=/path/to/gnomad.hg38.vcf.gz
NOISE=/path/to/noise_mask.bed.gz
THREADS=16
# 1. Trim barcodes
DupCaller.py trim -i ${SAMPLE}_R1.fq.gz -i2 ${SAMPLE}_R2.fq.gz -p NNNXXXX -o ${SAMPLE}
DupCaller.py trim -i ${NORMAL}_R1.fq.gz -i2 ${NORMAL}_R2.fq.gz -p NNNXXXX -o ${NORMAL}
# 2. Align
bwa mem -C -t ${THREADS} -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
${REF} ${SAMPLE}_1.fastq ${SAMPLE}_2.fastq | samtools sort -@ ${THREADS} > ${SAMPLE}.bam
samtools index -@ ${THREADS} ${SAMPLE}.bam
bwa mem -C -t ${THREADS} -R "@RG\tID:${NORMAL}\tSM:${NORMAL}\tPL:ILLUMINA" \
${REF} ${NORMAL}_1.fastq ${NORMAL}_2.fastq | samtools sort -@ ${THREADS} > ${NORMAL}.bam
samtools index -@ ${THREADS} ${NORMAL}.bam
# 3. Mark duplicates
gatk MarkDuplicates -I ${SAMPLE}.bam -O ${SAMPLE}.mkdped.bam -M ${SAMPLE}.mkdp_metrics.txt \
--READ_NAME_REGEX "(?:.*:)?([0-9]+)[^:]*:([0-9]+)[^:]*:([0-9]+)[^:]*$" \
--DUPLEX_UMI --TAGGING_POLICY OpticalOnly --BARCODE_TAG DB
samtools index ${SAMPLE}.mkdped.bam
gatk MarkDuplicates -I ${NORMAL}.bam -O ${NORMAL}.mkdped.bam -M ${NORMAL}.mkdp_metrics.txt \
--READ_NAME_REGEX "(?:.*:)?([0-9]+)[^:]*:([0-9]+)[^:]*:([0-9]+)[^:]*$" \
--DUPLEX_UMI --TAGGING_POLICY OpticalOnly --BARCODE_TAG DB
samtools index ${NORMAL}.mkdped.bam
# 4. Call variants
DupCaller.py call -b ${SAMPLE}.mkdped.bam -f ${REF} -o ${SAMPLE} \
-n ${NORMAL}.mkdped.bam -g ${GERMLINE} -m ${NOISE} -p ${THREADS}
# 5. Estimate mutational burden
DupCaller.py estimate -i ${SAMPLE} -f ${REF} -r chr{1..22} chrXWhen no matched normal is available, omit -n and set -maf to cap the maximum allele frequency of called variants. A value of 0.1 is appropriate for most somatic applications to exclude common germline variants.
SAMPLE=sample1
REF=/path/to/hg38.fa
GERMLINE=/path/to/gnomad.hg38.vcf.gz
NOISE=/path/to/noise_mask.bed.gz
THREADS=16
# 1. Trim barcodes
DupCaller.py trim -i ${SAMPLE}_R1.fq.gz -i2 ${SAMPLE}_R2.fq.gz -p NNNXXXX -o ${SAMPLE}
# 2. Align
bwa mem -C -t ${THREADS} -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
${REF} ${SAMPLE}_1.fastq ${SAMPLE}_2.fastq | samtools sort -@ ${THREADS} > ${SAMPLE}.bam
samtools index -@ ${THREADS} ${SAMPLE}.bam
# 3. Mark duplicates
gatk MarkDuplicates -I ${SAMPLE}.bam -O ${SAMPLE}.mkdped.bam -M ${SAMPLE}.mkdp_metrics.txt \
--READ_NAME_REGEX "(?:.*:)?([0-9]+)[^:]*:([0-9]+)[^:]*:([0-9]+)[^:]*$" \
--DUPLEX_UMI --TAGGING_POLICY OpticalOnly --BARCODE_TAG DB
samtools index ${SAMPLE}.mkdped.bam
# 4. Call variants (no matched normal; filter by maximum allele frequency)
DupCaller.py call -b ${SAMPLE}.mkdped.bam -f ${REF} -o ${SAMPLE} \
-g ${GERMLINE} -m ${NOISE} -maf 0.1 -p ${THREADS}
# 5. Estimate mutational burden
DupCaller.py estimate -i ${SAMPLE} -f ${REF} -r chr{1..22} chrXPre-built reference indexes, germline VCFs, and noise masks are available for download:
| Resource | Link/Source |
|---|---|
| Reference genome | TCGA hg38 reference file |
| Reference index | Pre-built hg38 DupCaller reference |
| Germline VCF | af-only-gnomad.hg38.vcf.gz file from the legacy GATK resource bundle. A copy of the file can be found here. |
| Noise mask | NanoSeq noise mask from Abascal et. al.. Can be obtained with approved access to the dataset |
| Resource | Link |
|---|---|
| Reference genome | UCSC mm39 reference file |
| Reference index | Pre-built mm39 DupCaller reference |
| Germline VCF | mgp_strains — includes VCF for SNPs in popular mouse strains. Use the vcf file for your strain. |
| Noise mask | NOISE.mm39.bed.gz, in-house noise mask from mouse duplex sequencing data |
Cheng, Y. et al. Improved Mutation Detection in Duplex Sequencing Data with Sample-Specific Error Profiles. bioRxiv (2025). https://doi.org/10.1101/2025.07.13.664565
Copyright (c) 2024, Yuhe Cheng [Alexandrov Lab] All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Yuhe Cheng (yuc211@ucsd.edu)