CNVkit Pipeline

The pipeline breaks cnvkit.py batch into its individual steps so that each step can be controlled in detail.

Pipeline overview

The pipeline consists of the following steps:

cnvkit.py access to generate a BED file of accessible regions if accfile is not given
Guess baits from the BAM files if baitfile is not given
cnvkit.py autobin to generate target and antitarget files
cnvkit.py coverage to generate coverage files for target region
cnvkit.py coverage to generate coverage files for antitarget region
cnvkit.py reference to generate a reference.cnn file using normal samples (or a "flat" reference file if no normal samples are given)
cnvkit.py fix to combine the uncorrected target and antitarget coverage tables (.cnn) and correct for biases in regional coverage and GC content, according to the given reference.
cnvkit.py segment to infer discrete copy number segments from the given coverage table
cnvkit.py call to call copy number alterations from the given segments file
cnvkit.py scatter to generate scatter plots of log2 ratios
cnvkit.py diagram to generate a diagram of copy number alterations on all chromosomes
cnvkit.py heatmap to generate a heatmap of segment-level log2 ratios
cnvkit.py heatmap to generate a heatmap of bin-level log2 ratios
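The steps above can be sketched as a sequence of cnvkit.py invocations for a single sample. The snippet below only builds the argument lists and does not run anything. The helper name build_commands and every file name (access.bed, targets.bed, reference.cnn, and so on) are hypothetical, and the bait-guessing step is left out; the pipeline itself may name files and pass options differently.

```python
def build_commands(sample, bam, ref_fa, baits, normal_cnns=()):
    """Return cnvkit.py argument lists mirroring the steps listed above.

    All file names are illustrative assumptions, not the pipeline's own.
    """
    access = "access.bed"
    targets, antitargets = "targets.bed", "antitargets.bed"
    tcov = f"{sample}.targetcoverage.cnn"
    acov = f"{sample}.antitargetcoverage.cnn"
    cnr, cns = f"{sample}.cnr", f"{sample}.cns"
    reference_cnn = "reference.cnn"

    if normal_cnns:
        # Pooled reference built from the coverage files of normal samples.
        reference = ["cnvkit.py", "reference", *normal_cnns,
                     "-f", ref_fa, "-o", reference_cnn]
    else:
        # "Flat" reference: no normal samples, only the bin definitions.
        reference = ["cnvkit.py", "reference", "-f", ref_fa,
                     "-t", targets, "-a", antitargets, "-o", reference_cnn]

    return [
        ["cnvkit.py", "access", ref_fa, "-o", access],
        ["cnvkit.py", "autobin", bam, "-t", baits, "-g", access,
         "--target-output-bed", targets,
         "--antitarget-output-bed", antitargets],
        ["cnvkit.py", "coverage", bam, targets, "-o", tcov],
        ["cnvkit.py", "coverage", bam, antitargets, "-o", acov],
        reference,
        ["cnvkit.py", "fix", tcov, acov, reference_cnn, "-o", cnr],
        ["cnvkit.py", "segment", cnr, "-o", cns],
        ["cnvkit.py", "call", cns, "-o", f"{sample}.call.cns"],
        ["cnvkit.py", "scatter", cnr, "-s", cns, "-o", f"{sample}.scatter.pdf"],
        ["cnvkit.py", "diagram", cnr, "-s", cns, "-o", f"{sample}.diagram.pdf"],
        ["cnvkit.py", "heatmap", cns, "-o", "heatmap_cns.pdf"],
        ["cnvkit.py", "heatmap", cnr, "-o", "heatmap_cnr.pdf"],
    ]


# Print the commands for one hypothetical tumor sample with a flat reference.
for cmd in build_commands("D123", "D123.tumor.bam", "hg19.fa", "baits.bed"):
    print(" ".join(cmd))
```

Passing coverage files of normal samples as normal_cnns switches the reference step from the flat form to the pooled form.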

Input files

metafile: a tab-separated file (see the next section) containing sample information
baitfile: Potentially targeted genomic regions, e.g. all possible exons in the reference genome. This is optional when method is wgs.
accfile: The accessible genomic regions. If not given, cnvkit.py access is used to generate one. The details can be controlled through the configuration items under [CNVkitAccess.envs].

Configurations

Special configurations

access_excludes: File(s) with regions to be excluded for cnvkit.py access.
guessbaits_guided: Whether to use guided mode for guessing baits.
metacols: The column names for each type of information in metafile
group: The column name in the metafile that indicates the sample group. Default: Group
purity: The column name in the metafile that indicates the sample purity. Default: Purity
snpvcf: The column name in the metafile that indicates the path to the SNP VCF file. Default: SnpVcf
bam: The column name in the metafile that indicates the path to the BAM file. Default: Bam
vcf_sample_id: The column name in the metafile that indicates the sample ID in the VCF file. Default: VcfSampleId
vcf_normal_id: The column name in the metafile that indicates the normal sample ID in the VCF file. Default: VcfNormalId
sex: The column name in the metafile that indicates the sample sex. Default: Sex
guess_baits: The column name in the metafile that indicates whether to guess the bait file from the bam files. Default: GuessBaits
guessbaits: Guess the bait file from the bam files, either guided or unguided. If False, baitfile is used. Otherwise, if baitfile is given, use it (guided), otherwise use accfile (unguided). The bam files with metacols.guess_baits column set to True, TRUE, true, 1, Yes, YES, or yes will be used to guess the bait file.
case: The group name of samples in metacols.group to call CNVs for. If not specified, use all samples. In such a case, control must not be specified, as we are using a flat reference.
control: The group name of samples in metacols.group to use as the reference. If not specified, a flat reference is used.
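The guess_baits flag values and the case/control rules described above can be illustrated with a small sketch. The function names bams_for_guessing and split_groups are hypothetical and the rows are assumed to be dictionaries keyed by the default column names; this is one reading of the rules, not the pipeline's own code.

```python
# Values of the guess_baits column that mark a bam file for bait guessing,
# as listed in the description of the guessbaits option.
TRUTHY = {"True", "TRUE", "true", "1", "Yes", "YES", "yes"}


def bams_for_guessing(rows, bam_col="Bam", flag_col="GuessBaits"):
    """Return the bam paths whose guess_baits column holds a truthy value."""
    return [row[bam_col] for row in rows if row.get(flag_col, "") in TRUTHY]


def split_groups(rows, case=None, control=None, group_col="Group"):
    """Split rows into (cases, controls).

    An empty controls list stands for "use a flat reference".
    """
    if case is None:
        if control is not None:
            raise ValueError("control must not be specified when case is not")
        # No case given: call CNVs for all samples against a flat reference.
        return list(rows), []
    cases = [row for row in rows if row[group_col] == case]
    controls = [row for row in rows
                if control is not None and row[group_col] == control]
    return cases, controls


# Hypothetical metafile rows.
rows = [
    {"Bam": "/data/D123.tumor.bam", "Group": "Tumor", "GuessBaits": "yes"},
    {"Bam": "/data/D123.normal.bam", "Group": "Normal", "GuessBaits": "no"},
]
cases, controls = split_groups(rows, case="Tumor", control="Normal")
```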

Global configurations

The options that are used by multiple processes (can be overridden individually by [<proc>.envs.xxx]):

cnvkit: the path to the cnvkit.py executable, defaults to config.exe.cnvkit from ./.biopipen.toml or ~/.biopipen.toml.
rscript: Path to the Rscript executable used for running R code. Requires DNAcopy to be installed in R. Defaults to config.lang.rscript
samtools: Path to samtools, used for guessing bait file.
convert: The Linux convert command, used to convert PDF to PNG so that the plots can be embedded in the HTML report.
ncores: number of cores to use, defaults to config.misc.ncores
reffa: the reference genome (e.g. hg19.fa), defaults to config.ref.reffa. Used by CNVkitAccess, CNVkitAutobin and CNVkitReference
annotate: Use gene models from this file to assign names to the target regions. Format: UCSC refFlat.txt or ensFlat.txt file (preferred), or BED, interval list, GFF, or similar. Defaults to config.ref.refflat
short_names: Reduce multi-accession bait labels to be short and consistent
method: Sequencing protocol: hybridization capture ('hybrid'), targeted amplicon sequencing ('amplicon'), or whole genome sequencing ('wgs'). Determines whether and how to use antitarget bins.
male_reference: Use or assume a male reference (i.e. female samples will have +1 log-CNR of chrX; otherwise male samples would have -1 chrX). Used by CNVkitReference, CNVkitCall, CNVkitHeatmapCns and CNVkitHeatmapCnr.
drop_low_coverage: Drop very-low-coverage bins before segmentation to avoid false-positive deletions in poor-quality tumor samples. Used by CNVkitSegment and CNVkitCall
no_gc: Skip GC correction for cnvkit.py reference/fix.
no_edge: Skip edge-effect correction for cnvkit.py reference/fix.
no_rmask: Skip RepeatMasker correction for cnvkit.py reference/fix. The no_* options are used by CNVkitReference and CNVkitFix
min_variant_depth: Minimum read depth for a SNV to be displayed in the b-allele frequency plot. Used by CNVkitSegment and CNVkitCall
zygosity_freq: Ignore VCF's genotypes (GT field) and instead infer zygosity from allele frequencies. Used by CNVkitSegment and CNVkitCall
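The chrX shifts mentioned for male_reference follow directly from the definition of a log2 copy ratio. A brief arithmetic check, assuming the usual copy numbers (two X copies in female samples, one in male samples); the function name is hypothetical:

```python
import math


def chrx_log2_ratio(sample_x_copies, reference_x_copies):
    """log2 ratio of chrX copies in the sample versus the reference."""
    return math.log2(sample_x_copies / reference_x_copies)


# Male reference (1 X copy): a female sample (2 X copies) sits at +1.
female_vs_male_reference = chrx_log2_ratio(2, 1)

# Female reference (2 X copies): a male sample (1 X copy) sits at -1.
male_vs_female_reference = chrx_log2_ratio(1, 2)

print(female_vs_male_reference, male_vs_female_reference)
```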

Process-specific configurations

The options that are used by a single process. See the process-specific documentation for details. You can configure them by [<proc>.envs.xxx] in the config file.
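The relationship between global options and per-process overrides can be pictured as a simple merge in which the process-level value wins. This is only a sketch of that precedence, assuming a plain dictionary merge; the helper resolve_envs and the option values are hypothetical, and how the pipeline merges them internally is not described here.

```python
def resolve_envs(global_opts, proc_envs, proc):
    """Return the effective options for one process.

    Global options are the baseline; anything set under the process's
    own envs replaces the global value.
    """
    merged = dict(global_opts)
    merged.update(proc_envs.get(proc, {}))
    return merged


# Hypothetical global options shared by several processes.
global_opts = {"ncores": 4, "method": "hybrid", "drop_low_coverage": False}

# Hypothetical per-process settings, as would be given under
# [CNVkitSegment.envs] in the config file.
proc_envs = {"CNVkitSegment": {"drop_low_coverage": True}}

segment_envs = resolve_envs(global_opts, proc_envs, "CNVkitSegment")
call_envs = resolve_envs(global_opts, proc_envs, "CNVkitCall")
```

Here CNVkitSegment sees drop_low_coverage set to True, while CNVkitCall keeps the global value.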

The metafile

The metafile should contain the following columns:

Sample: The sample_id used for target/antitarget files. If not provided, the sample_id will be the first part of the basename of the bam file. For example: D123.tumor.bam -> D123
<bam>: The path to the bam file, preferably an absolute path.
<group>: The type of the sample, defining the tumor/normal samples.
<sex>: The sex of the sample. If not given, it is guessed for each sample from the coverage of the X and Y chromosomes.
<purity>: Estimated tumor cell fraction, a.k.a. purity or cellularity.
<snpvcf>: The name of the file containing variants for segmentation by allele frequencies.
<vcf_sample_id>: Sample ID in the VCF file.
<vcf_normal_id>: Normal sample ID in the VCF file.
<guess_baits>: Whether to use this bam file for guessing the bait file.
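As an illustration of the layout above, the following sketch reads a small tab-separated metafile and derives the sample ID from the bam file name when no Sample column is present. The file content, paths, group names and helper names are hypothetical, and the default column names (Bam, Group, Sex, GuessBaits) are assumed.

```python
import csv
import io
from pathlib import PurePosixPath

# A hypothetical metafile without a Sample column.
METAFILE_TEXT = (
    "Bam\tGroup\tSex\tGuessBaits\n"
    "/data/D123.tumor.bam\tTumor\tfemale\tyes\n"
    "/data/D124.normal.bam\tNormal\tmale\tno\n"
)


def read_metafile(handle):
    """Read a tab-separated metafile into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def sample_id(row, bam_col="Bam"):
    """Use the Sample column if present and non-empty; otherwise take the
    first dot-separated part of the bam file's basename."""
    explicit = row.get("Sample")
    if explicit:
        return explicit
    return PurePosixPath(row[bam_col]).name.split(".")[0]


rows = read_metafile(io.StringIO(METAFILE_TEXT))
sample_ids = [sample_id(row) for row in rows]
print(sample_ids)
```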