ARGprep Pipeline

This repository provides a Snakemake workflow for processing AnchorWave MAFs directly into per-contig site outputs. The workflow emits all-sites VCFs, variant-only VCFs, and BED masks from the alignments. Written with the aid of Codex and Claude. Note that v1.0 was a major rewrite from v0.4, and no longer uses Tassel or GATK.

If you use this please cite:

Ross-Ibarra, J. 2026. ARGprep: A pipeline to prepare pairwise whole-genome alignments for ancestral recombination graph estimation. doi: 10.5281/zenodo.19655050

If your use case is pairwise variant discovery (SNPs, large indels, inversions) rather than ARG-ready all-sites output, wgatools is a potential alternative. See WGATOOLS_COMPARISON.md for a detailed comparison of the two approaches.

Requirements

Conda
The environment defined in argprep.yml

Setup

conda env create -f argprep.yml
conda activate argprep

Configure

Create or edit a config file such as options.yaml. The workflow requires --configfile and will fail if it is omitted.

Required keys:

maf_dir: directory containing *.maf or *.maf.gz
reference_fasta: reference FASTA path

Optional path keys:

results_dir: output directory (default: results)

Optional controls (defaults shown):

max_missing_count - no default; see missingness thresholds below
max_missing_fraction - no default; see missingness thresholds below
mask_indels: true - mask reference positions overlapped by a deletion in any sample (see NOTES.md for exact semantics)
mask_indel_adjacent_snps: true - also mask SNPs immediately flanking an indel (no effect when mask_indels: false)
allow_multiallelic_snps: true - retain sites with more than two alleles
add_ref: false - append a synthetic REF sample (genotype 0) to both VCFs
summary_window_bp: 100000 - window size in bp for binned per-contig plots in summary.html (this does not affect the per-MAF tables)

SLURM resource overrides (for the direct_maf_sites rule):

maf_threads: 2
maf_mem_mb: 48000
maf_time: "24:00:00"

SLURM profile keys (required when using --profile profiles/slurm):

slurm_account
slurm_partition

Advanced override:

contigs: restrict the run to specific contigs instead of using the automatic shared-contig behavior
samples: restrict the run to specific sample basenames instead of using all *.maf / *.maf.gz files in maf_dir

Contig and sample selection behavior:

If samples is omitted, samples are auto-discovered from both *.maf and *.maf.gz in maf_dir.
If both <sample>.maf and <sample>.maf.gz exist, <sample>.maf is used.
If contigs is omitted, the workflow uses the intersection of contigs present in all selected MAFs.
Requested contigs are matched to reference .fai contigs with normalization (for example chr01 can map to 1 when unambiguous).
Requested contigs that are unmatched or ambiguous after normalization are skipped.
The workflow errors only if no contigs remain after resolution.

Example CLI override:

snakemake -j 8 --configfile options.yaml --config samples='["sampleA","sampleB"]' contigs='["chr1","chr2"]'

Missingness thresholds:

max_missing_count is an absolute number of missing samples allowed at a retained site.
max_missing_fraction is a fraction of samples allowed to be missing.
If both are set, the workflow uses the stricter threshold.
The fraction is converted to a count with downward truncation. For example, with 10 samples, 0.15 allows 1 missing sample.
If neither is set, the default is 0 - any site where even one sample is unaligned or missing is masked. Set one of these options explicitly if you want to retain sites with partial coverage.
A sample counts as missing at a site if it has no alignment block covering that position, carries a gap (-), an N, or any other non-ACGT character. Gaps always contribute to the missing-sample count regardless of the mask_indels setting.

Indel masking is documented in detail in NOTES.md, including how mask_indels interacts with the missingness filter.

Reference-sample behavior:

add_ref: true appends a synthetic REF sample (genotype 0 at every retained site) to both final VCFs.

Run

Local:

snakemake -j 8 --configfile options.yaml

SLURM:

snakemake --profile profiles/slurm --configfile options.yaml

When using the SLURM profile, set slurm_account and slurm_partition in your config file. Slurm defaults for other resources are defined in profiles/slurm/config.yaml. Parsing the MAFs is the most computationally expensive step in the pipeline, and direct-maf rule resources can be overridden in options.yaml (maf_threads, maf_mem_mb, maf_time).

Try it on the bundled example

example_data/ ships with a small simulated dataset (example.maf/, example.reference.fa) and a matching options.yaml. From the repo root:

snakemake -j 4 --configfile example_data/options.yaml

Outputs land in example_results/. To regenerate the example data from scratch, see the Simulation Helper section.

Outputs

Outputs are written under results/ by default (or under results_dir if provided):

sites/combined.<contig>.all_sites.vcf — every retained site (invariant + variant) that passed all filters; INFO=SC=invariant|variant distinguishes the two
sites/combined.<contig>.vcf — variant-only subset of all_sites.vcf
sites/combined.<contig>.mask.bed — merged BED intervals for masked positions
sites/combined.<contig>.site_summary.tsv — per-contig counts (see table below)
sites/combined.<contig>.<sample>.missing.bed — per-sample missing regions used by per-MAF summary stats
summary.html — genome-wide overview plus per-MAF tables and per-contig per-MAF breakdowns

Both VCFs share the same header and use a single haploid GT per sample (0 for the REF allele, 1/2/... for ALTs in ALT order, . for missing). INFO carries NS (non-missing samples), MS (missing samples), and SC (invariant or variant). All retained sites are emitted with FILTER=PASS; filtered-out positions appear in the BED mask, not the VCFs.

The site_summary.tsv contains one metric per row with columns metric and value:

metric	description
`contig`	contig name
`contig_length`	contig length in bp
`samples`	number of samples
`allowed_missing`	effective missing-sample threshold used
`all_sites`	retained sites (invariant + variant)
`variants`	retained variant sites
`invariant`	retained invariant sites
`masked_total`	total masked positions
`masked_intervals`	number of merged BED intervals in the mask
`masked_missingness`	positions masked due to too many missing samples
`masked_indel`	positions masked due to indel overlap or adjacency
`masked_multiallelic`	positions masked due to more than two alleles
`masked_no_alignment`	positions masked because at least one sample had no alignment
`masked_ref_non_acgt`	reference positions with non-ACGT bases (always masked)

The pipeline still validates that retained sites plus the mask span each contig exactly, but that check is now internal and is no longer written as a separate coverage.txt file.

Testing

pytest -q

Simulation Helper

The repository includes scripts/simulate_msprime_indels.py for generating haploid test datasets with msprime SNP variation plus branch-based indels on the tree sequence. Note that these simulations are not intended to be evolutionarily accurate, but simply to give a reasonable example data.

Example:

python scripts/simulate_msprime_indels.py \
  --sequence-length 1000000 \
  --num-samples 8 \
  --theta 0.01 \
  --rho 0.01 \
  --ne 10000 \
  --indel-rate 1e-8 \
  --indel-lambda 0.001 \
  --seed 8675309 \
  --out-prefix example_data/example

Outputs:

<prefix>.reference.fa
<prefix>.samples.fa
<prefix>.indels.tsv
<prefix>.summary.tsv
<prefix>.maf/

Summary fields include:

seed
sequence_length
reference_bp_with_indel_in_ge1_sample
total_snps
snps_without_overlapping_indel

Name		Name	Last commit message	Last commit date
Latest commit History 311 Commits
example_data		example_data
profiles/slurm		profiles/slurm
scripts		scripts
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
LICENSE		LICENSE
NOTES.md		NOTES.md
README.md		README.md
Snakefile		Snakefile
WGATOOLS_COMPARISON.md		WGATOOLS_COMPARISON.md
argprep.yml		argprep.yml
changelog.md		changelog.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ARGprep Pipeline

Requirements

Setup

Configure

Run

Try it on the bundled example

Outputs

Testing

Simulation Helper

About

Uh oh!

Releases 4

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ARGprep Pipeline

Requirements

Setup

Configure

Run

Try it on the bundled example

Outputs

Testing

Simulation Helper

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages