DupCaller

DupCaller is a universal tool for calling somatic mutations and calculating somatic mutational burden from barcoded error-corrected next generation sequencing (ecNGS) data with matched normal (e.x. NanoSeq).

Prerequisites

DupCaller requires python>=3.10 to run. Earlier versions may be sufficient to run DupCaller but have not been tested. The complete DupCaller pipeline also requires the following tools for data preprocessing. The versions are used by the developer and other versions may or may not work.

BWA version 0.7.17 (https://bio-bwa.sourceforge.net)
GATK version 4.2.6 (https://github.com/broadinstitute/gatk/releases)
Tabix for indexing compressed genomic files (recommended installation: conda install bioconda::tabix)

Installation

pip Installation

The tool uses pip for installing scripts and prerequisites. We recommend creating a new environment to install DupCaller:

conda create -n DupCaller python=3.12 bioconda::tabix

To install DupCaller, simply clone this repository and install via pip:

conda activate DupCaller
git clone https://github.com/AlexandrovLab/DupCaller.git
cd DupCaller
pip install .

Docker / Singularity

A pre-built Docker image is available on Docker Hub at yuhecheng62/dupcaller:1.1.0-amd64.

Pull and run with Singularity:

Pull the image from Docker Hub (only needed once):

singularity pull dupcaller-1.1.0.sif docker://yuhecheng62/dupcaller:1.1.0-amd64

For quick verification:

singularity exec dupcaller-1.1.0.sif DupCaller.py --help

For installation-free execution of DupCaller commands, run all DupCaller.py commands with singularity exec and binding of current directories:

singularity exec --bind $(pwd):$(pwd) dupcaller-1.1.0.sif DupCaller.py {your commands}

Pipeline

Step 1: Index Reference Genome

DupCaller uses a numpyrized reference genome to perform memory-efficient reference fetching and trinucleotide context indexing. Indexing requires a tabix-indexed BED file of short tandem repeat (STR) regions, which is used to annotate STR loci for improved indel calling near repetitive regions. To index the reference genome, run:

DupCaller.py index -f reference.fa -s str_regions.bed.gz

The command will generate three h5 files in the same folder as the reference: {reference}.ref.h5, {reference}.tn.h5 and {reference}.hp.h5, which are numpyrized reference sequences, trinucleotide contexts, and homopolymer/STR annotations, respectively. Make sure that when running other DupCaller utilities, the three files are within the same folder as the reference genome.

The STR BED file must be tabix-indexed (.bed.gz with a .tbi index).

For human reference genome hg38 and mouse reference genome mm39, we provided pre-built indexes and resource files in the Resources.

Parameters

Short	Long	Description
-f	--reference	Reference genome fasta file (required)
-s	--strbed	Tabix-indexed BED file of STR regions (required)

Pre-built Indexes

We have built indexes for human reference genome GRCh38/hg38 and mouse reference genome GRCm39/mm39, and can be downloaded at:

Short Tandem Repeat File

For other reference genomes, a BED file of simple repeats is needed to build the index. The file can be obtained from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) with the following steps:

For Genome, select the reference genome.
For Group, select "Repeats","Variation and Repeats".
For Table, select "repeat masker" or similar table. (Tips: the actual path to repeat masker may be different for different reference genome. Look for a table named "rmsk")
Add filter: "repClass Does match Simple_repeat".
For Output format, select "BED - browser extensible data".
Input desired filename (e.g. mm39_str.bed) and download the output BED file.
Sort the output bed file with your reference genome (Use bedtools or similar) and compress and index with bgzip:

sortBed -i str.bed -faidx reference.fa.fai > str_sorted.bed
bgzip str_sorted.bed
tabix str_sorted.bed.gz

Step 2: Trim Barcodes

DupCaller.py trim extracts 5-prime barcodes from paired-end FASTQs:

DupCaller.py trim -i read1.fq -i2 read2.fq -p barcode_pattern -o sample_name

read1.fq / read2.fq — FASTQ files from read 1 and read 2. Both unzipped and gzip-compressed files are supported.
barcode_pattern — pattern of barcodes starting from the 5-prime end, with N representing a barcode base and X representing a skipped base (similar notation to UMI-tools). For example, NanoSeq uses a 3-base barcode followed by 4 constant bases, so the pattern should be NNNXXXX.
sample_name — prefix of output paired FASTQs. After the run completes, {sample_name}_1.fastq and {sample_name}_2.fastq will be generated. Barcodes will be recorded in each read name as {original_read_name}:{read1_barcode}+{read2_barcode}.

If the matched normal is prepared in the same way as the sample, apply trimming with the same scheme to the matched normal FASTQs. For traditional bulk normal, trimming is not needed.

Step 3: Align Reads

Use a DNA NGS aligner such as BWA-MEM to align the trimmed FASTQs of both sample and matched normal. GATK requires read group fields ID, SM, and PL, so adding those tags during BWA alignment is recommended. FASTQ tags must be kept — for bwa mem this requires the -C option.

bwa mem -C -t {threads} -R "@RG\tID:{sample_name}\tSM:{sample_name}\tPL:ILLUMINA" \
    reference.fa {sample_name}_1.fastq {sample_name}_2.fastq \
    | samtools sort -@ {threads} > {sample_name}.bam
samtools index -@ {threads} {sample_name}.bam

threads — number of cores used for aligning
reference.fa — reference genome FASTA file
{sample_name}_1.fastq / {sample_name}_2.fastq — trimmed FASTQ files from Step 2

Step 4: Mark Duplicates

Run GATK MarkDuplicates on sample and matched-normal BAMs. Optical and PCR duplicates must be treated differently in ecNGS variant calling:

Set --TAGGING_POLICY OpticalOnly to differentiate optical from PCR duplicates.
Set --DUPLEX_UMI true for duplex UMI handling.
Set --READ_NAME_REGEX as shown below, because the read names were modified in Step 2.

Note: Outdated versions of GATK do not have the --DUPLEX_UMI flag. Please update GATK to the latest version if this occurs.

gatk MarkDuplicates \
    -I sample.bam -O sample.mkdped.bam -M sample.mkdp_metrics.txt \
    --READ_NAME_REGEX "(?:.*:)?([0-9]+)[^:]*:([0-9]+)[^:]*:([0-9]+)[^:]*$" \
    --DUPLEX_UMI --TAGGING_POLICY OpticalOnly --BARCODE_TAG DB

Step 5: Call Variants

After preprocessing, run DupCaller.py call to call somatic mutations. Usage depends on your experimental design:

Whole genome / whole exome / reduced genome (e.g. NanoSeq) with a matched normal:

DupCaller.py call -b ${sample}.mkdped.bam -f reference.fa -o {output_prefix} \
    -p {threads} -n {normal.bam} -g germline.vcf.gz -m noise_mask.bed.gz

Mutagenesis panel without a matched normal:

DupCaller.py call -b ${sample}.mkdped.bam -f reference.fa -o {output_prefix} \
    -p {threads} -g germline.vcf.gz -m noise_mask.bed.gz -maf 0.1

Note: DupCaller partitions jobs by genomic region; multithreading is less effective for small targeted panels. In this case, use at most one thread per distinct targeted region.

Input validation: DupCaller checks for the existence of all required files (BAM, reference, h5 index files, and optional files) before starting analysis, providing clear error messages for missing files.

Multi-threading: Coverage files from different threads are automatically merged post-processing with region-aware boundary detection and tabix indexing.

See the Results section for descriptions of all output files.

Parameters

Required

Short	Long	Description
-b	--bam	BAM file of ecNGS data
-f	--reference	Reference genome FASTA file
-o	--output	Prefix of the output files

Short	Long	Description	Default
-r	--regions	Contigs to consider for variant calling. For non-human species, set accordingly (e.g. for mouse: `-r chr{1..19} chrX`)	chr{1..22} chrX
-g	--germline	Indexed germline VCF with AF field	None
-p	--threads	Number of threads	1
-n	--normalBams	BAM file(s) of matched normals. When unavailable, set `-maf` to an appropriate value (e.g. 0.1)	None
-m	--noise	BED interval file(s) masking noisy positions	None
-R	--regionfile	Inclusive BED file specifying target regions	None
-maf	--maxAF	Maximum allele fraction to call a somatic mutation. Must be set when matched normal (`-n`) is unavailable	1
-tt	--trimF	Ignore mutations less than n bp from template ends	7
-tr	--trimR	Ignore mutations less than n bp from read ends	7

Optional

The effect of changing these parameters should be evaluated before implementation.

Short	Long	Description	Default
--naf		Maximum VAF in matched normal for a mutation to be called	0.01
--rescue		Output discarded variants with reason in the FILTER field	False
-nm	--nmflt	Filter reads with edit distance larger than this value	5
-ax	--minMeanASXS	Minimum mean AS-XS alignment score difference for a read group to be considered	50
-gaf	--germlineAfCutoff	Skip positions with germline AF above this threshold	0.001
-d	--minNdepth	Minimum coverage in normal for called variants	10

Advanced

These are variant calling model parameters; adjustment is unnecessary for general use.

Short	Long	Description	Default
-AS	--amperrfile	Pre-learned error profile for amplification SBS error	None
-AI	--amperrfileindel	Pre-learned error profile for amplification indel error	None
-DS	--dmgerrfile	Pre-learned error profile for SBS damage	None
-DI	--dmgerrfileindel	Pre-learned error profile for indel damage	None
-ts	--thresholdSnv	Log likelihood ratio threshold for SNV calls	8.5
-ti	--thresholdIndel	Log likelihood ratio threshold for indel calls. Three values corresponds to the threshold for 1. homopolymer length less than 5; 2. homopolymer length 6 to 10; 3. homopolymer length more than 10 or are in STR regions.	10.3,8.5,7
-mq	--mapq	Minimum MAPQ for an alignment to be considered	40
-w	--windowSize	Genomic window size for coverage calculation and BAM partitioning	100000
-bq	--minBq	Bases with quality below this value will be set to 6	18
-aq	--minAltQual	Minimum consensus quality of alt allele in a read group	60
--minRef		Minimum consensus quality of ref allele in a read group	2
--minAlt		Minimum consensus quality of alt allele in a read group	2
-z	--maxZeroQualFrac	Maximum fraction of zero-quality bases in a read family. Set to 0.1 if a noise mask is not available	0.5
-id	--indelBed	Indel enhanced Panel of Normals (ePoN) for indel calling	None

Germline and Noise Masks

The -m option accepts BED files and will ignore any mutations overlapping an excluded locus. Any custom BED file can be used as input.

Step 6: Estimate Mutational Burden

After mutation calling, run burden estimation from the output folder:

DupCaller.py estimate -i sample -f reference.fa -r chr{1..22} chrX

Adjust the -r regions according to the reference genome used.

Per-Gene Coverage

For dNdScv coverage correction, DupCaller can output mean duplex depth per gene using the -gb option:

DupCaller.py estimate -i sample -f reference.fa -r chr{1..22} chrX -gb {target}.bed

The gene BED should have the fourth column formatted as {gene_name}_{exon_number} (e.g. tp53_1).

Re-estimation for Specific Regions

To re-estimate trinucleotide-corrected mutational burden in specific regions without re-running variant calling, use -rb:

DupCaller.py estimate -i sample -f reference.fa -r chr{1..22} chrX -rb {re_estimate}.bed

Parameters

Required

Short	Long	Description
-i	--prefix	Input prefix of results from `call` command

Reference (choose one)

Short	Long	Description	Default
-f	--reference	FASTA file of reference genome	None
-ft	--refTrinuc	Precomputed trinucleotide composition of reference genome	None

Optional

Short	Long	Description	Default
-r	--regions	Contigs to consider for trinucleotide calculation	chr{1..22} chrX
-ot	--outTrinuc	Output the computed trinucleotide composition file for future use	None
-c	--clonal	Treat mutations detected in more than one molecule as one mutation	False
-d	--dilute	Archived — not needed in the latest version. Previously used when sample and matched normal originated from the same starting DNA material	False
-gb	--genebed	Gene BED file for per-gene coverage calculation	None
-rb	--reestimatebed	BED file for burden re-estimation in specific regions	None

Step 7: Summarize Across Samples

After running estimate on all samples, use DupCaller.py summarize to collate burden metrics and SBS96 profiles into multi-sample tables:

DupCaller.py summarize -i sample1 sample2 sample3 -o cohort_summary.txt

This reads {sample}_stats.txt, {sample}_sbs_burden.txt, {sample}_indel_burden.txt, and {sample}_sbs_96_corrected.txt from each sample folder and writes four output files:

File	Description
`cohort_summary.txt`	One row per sample with burden metrics and library statistics
`cohort_summary_SBS96_uncorrected.txt`	96-context raw mutation counts across all samples (SigProfiler-compatible format)
`cohort_summary_SBS96_corrected.txt`	96-context trinucleotide-corrected counts across all samples
`cohort_summary_SBS96_genome.txt`	96-context estimated mutations per genome across all samples

All SBS96 file can be directly input into SigProfilerPlotting

Parameters

Short	Long	Description
-i	--input	One or more sample folders (space-separated)
-o	--output	Output filename for the summary table (e.g. `cohort_summary.txt`)

Results

For detailed column-by-column and field-by-field descriptions of every output file, see the docs/ folder:

docs/call_outputs.md — all files from DupCaller.py call
docs/estimate_outputs.md — all files from DupCaller.py estimate

Core Output Files

File	Description
`{sample}_snv.vcf`	VCF of detected SNVs and MNVs
`{sample}_indel.vcf`	VCF of detected short indel mutations
`{sample}_coverage.bed.gz`	Duplex coverage depths across genomic positions. For multi-threaded runs, files from different threads are automatically merged with tabix indexing
`{sample}_coverage.bed.gz.tbi`	Tabix index for the coverage BED file
`{sample}_trinuc_by_duplex_group.txt`	Trinucleotide context counts grouped by duplex read number, used for burden estimation
`{sample}_duplex_family_strand_composition.txt`	Strand composition statistics for duplex read families
`{sample}_duplex_family_strand_composition_heatmap.pdf`	Heatmap visualization of duplex family strand composition
`{sample}_call_params.log`	Full record of all resolved parameters used for the `call` run
`{sample}_stats.txt`	Overall sequencing and analysis metrics

Error Profile Files

File	Description
`{sample}.amp.tn.txt`	Amplification SBS error profile by trinucleotide context
`{sample}.amp.id.txt`	Amplification indel error profile
`{sample}.dmg.tn.txt`	Damage SBS error profile by trinucleotide context
`{sample}.dmg.id.txt`	Damage indel error profile

Burden Estimation Files

File	Description
`{sample}_sbs_burden.txt`	SBS burden with uncorrected and corrected estimates and 95% confidence intervals
`{sample}_indel_burden.txt`	Indel burden with 95% confidence intervals, including masked and unmasked calculations
`{sample}_sbs_96_corrected.txt`	Corrected SBS counts across 96 trinucleotide contexts for signature analysis
`{sample}_sbs_burden_by_min_read_group_size.txt`	Burden estimates stratified by minimum read group size
`{sample}_duplex_allele_counts.txt`	Duplex depths and allele counts for each unique mutation

Visualization Files

File	Description
`{sample}_sbs_96.pdf`	96-context mutational signature plots (3 pages: uncorrected counts, corrected counts, estimated mutations per genome)
`{sample}_sbs_burden_by_min_read_group_size.png`	Burden estimates across minimum read group sizes

Optional / Conditional Files

File	Condition	Description
`{sample}_gene_coverage.txt`	`-gb` option	Mean duplex coverage per gene for dNdScv correction
`{sample}_sbs_burden_re_estimate.txt`	`-rb` option	Re-estimated burden for specific regions
`{sample}_sbs_96_corrected_re_estimate.pdf`	`-rb` option	Signature plots for re-estimated regions

End-to-End Examples

The following examples assume the reference genome has already been indexed (DupCaller.py index) and that the germline VCF and noise mask are available. Replace all placeholder paths with your actual files.

With Matched Normal

SAMPLE=sample1
NORMAL=normal1
REF=/path/to/hg38.fa
GERMLINE=/path/to/gnomad.hg38.vcf.gz
NOISE=/path/to/noise_mask.bed.gz
THREADS=16

# 1. Trim barcodes
DupCaller.py trim -i ${SAMPLE}_R1.fq.gz -i2 ${SAMPLE}_R2.fq.gz -p NNNXXXX -o ${SAMPLE}
DupCaller.py trim -i ${NORMAL}_R1.fq.gz -i2 ${NORMAL}_R2.fq.gz -p NNNXXXX -o ${NORMAL}

# 2. Align
bwa mem -C -t ${THREADS} -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
    ${REF} ${SAMPLE}_1.fastq ${SAMPLE}_2.fastq | samtools sort -@ ${THREADS} > ${SAMPLE}.bam
samtools index -@ ${THREADS} ${SAMPLE}.bam

bwa mem -C -t ${THREADS} -R "@RG\tID:${NORMAL}\tSM:${NORMAL}\tPL:ILLUMINA" \
    ${REF} ${NORMAL}_1.fastq ${NORMAL}_2.fastq | samtools sort -@ ${THREADS} > ${NORMAL}.bam
samtools index -@ ${THREADS} ${NORMAL}.bam

# 3. Mark duplicates
gatk MarkDuplicates -I ${SAMPLE}.bam -O ${SAMPLE}.mkdped.bam -M ${SAMPLE}.mkdp_metrics.txt \
    --READ_NAME_REGEX "(?:.*:)?([0-9]+)[^:]*:([0-9]+)[^:]*:([0-9]+)[^:]*$" \
    --DUPLEX_UMI --TAGGING_POLICY OpticalOnly --BARCODE_TAG DB
samtools index ${SAMPLE}.mkdped.bam

gatk MarkDuplicates -I ${NORMAL}.bam -O ${NORMAL}.mkdped.bam -M ${NORMAL}.mkdp_metrics.txt \
    --READ_NAME_REGEX "(?:.*:)?([0-9]+)[^:]*:([0-9]+)[^:]*:([0-9]+)[^:]*$" \
    --DUPLEX_UMI --TAGGING_POLICY OpticalOnly --BARCODE_TAG DB
samtools index ${NORMAL}.mkdped.bam

# 4. Call variants
DupCaller.py call -b ${SAMPLE}.mkdped.bam -f ${REF} -o ${SAMPLE} \
    -n ${NORMAL}.mkdped.bam -g ${GERMLINE} -m ${NOISE} -p ${THREADS}

# 5. Estimate mutational burden
DupCaller.py estimate -i ${SAMPLE} -f ${REF} -r chr{1..22} chrX

Without Matched Normal

When no matched normal is available, omit -n and set -maf to cap the maximum allele frequency of called variants. A value of 0.1 is appropriate for most somatic applications to exclude common germline variants.

SAMPLE=sample1
REF=/path/to/hg38.fa
GERMLINE=/path/to/gnomad.hg38.vcf.gz
NOISE=/path/to/noise_mask.bed.gz
THREADS=16

# 1. Trim barcodes
DupCaller.py trim -i ${SAMPLE}_R1.fq.gz -i2 ${SAMPLE}_R2.fq.gz -p NNNXXXX -o ${SAMPLE}

# 2. Align
bwa mem -C -t ${THREADS} -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
    ${REF} ${SAMPLE}_1.fastq ${SAMPLE}_2.fastq | samtools sort -@ ${THREADS} > ${SAMPLE}.bam
samtools index -@ ${THREADS} ${SAMPLE}.bam

# 3. Mark duplicates
gatk MarkDuplicates -I ${SAMPLE}.bam -O ${SAMPLE}.mkdped.bam -M ${SAMPLE}.mkdp_metrics.txt \
    --READ_NAME_REGEX "(?:.*:)?([0-9]+)[^:]*:([0-9]+)[^:]*:([0-9]+)[^:]*$" \
    --DUPLEX_UMI --TAGGING_POLICY OpticalOnly --BARCODE_TAG DB
samtools index ${SAMPLE}.mkdped.bam

# 4. Call variants (no matched normal; filter by maximum allele frequency)
DupCaller.py call -b ${SAMPLE}.mkdped.bam -f ${REF} -o ${SAMPLE} \
    -g ${GERMLINE} -m ${NOISE} -maf 0.1 -p ${THREADS}

# 5. Estimate mutational burden
DupCaller.py estimate -i ${SAMPLE} -f ${REF} -r chr{1..22} chrX

Resources

Pre-built reference indexes, germline VCFs, and noise masks are available for download:

Human (GRCh38/hg38)

Resource	Link/Source
Reference genome	TCGA hg38 reference file
Reference index	Pre-built hg38 DupCaller reference
Germline VCF	af-only-gnomad.hg38.vcf.gz file from the legacy GATK resource bundle. A copy of the file can be found here.
Noise mask	NanoSeq noise mask from Abascal et. al.. Can be obtained with approved access to the dataset

Mouse (GRCm39/mm39)

Resource	Link
Reference genome	UCSC mm39 reference file
Reference index	Pre-built mm39 DupCaller reference
Germline VCF	mgp_strains — includes VCF for SNPs in popular mouse strains. Use the vcf file for your strain.
Noise mask	NOISE.mm39.bed.gz, in-house noise mask from mouse duplex sequencing data

Citation

Cheng, Y. et al. Improved Mutation Detection in Duplex Sequencing Data with Sample-Specific Error Profiles. bioRxiv (2025). https://doi.org/10.1101/2025.07.13.664565

Copyright

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Contact

Yuhe Cheng (yuc211@ucsd.edu)

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.github/workflows		.github/workflows
docs		docs
nextflow		nextflow
src		src
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.prettierignore		.prettierignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

DupCaller

Table of Contents

Prerequisites

Installation

pip Installation

Docker / Singularity

Pipeline

Step 1: Index Reference Genome

Parameters

Pre-built Indexes

Short Tandem Repeat File

Step 2: Trim Barcodes

Step 3: Align Reads

Step 4: Mark Duplicates

Step 5: Call Variants

Parameters

Required

Recommended

Optional

Advanced

Germline and Noise Masks

Step 6: Estimate Mutational Burden

Per-Gene Coverage

Re-estimation for Specific Regions

Parameters

Required

Reference (choose one)

Optional

Step 7: Summarize Across Samples

Parameters

Results

Core Output Files

Error Profile Files

Burden Estimation Files

Visualization Files

Optional / Conditional Files

End-to-End Examples

With Matched Normal

Without Matched Normal

Resources

Human (GRCh38/hg38)

Mouse (GRCm39/mm39)

Citation

Copyright

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages