runAbsoluteCN: Run PureCN implementation of ABSOLUTE

Description

This function takes as input tumor and normal control coverage and allelic fractions of germline variants and somatic mutations. Coverage data is provided in GATK DepthOfCoverage format, allelic fractions in VCF format (e.g. obtained by MuTect). The normal control does not need to be matched (from the same patient). If the VCF does not contain somatic status, it should contain dbSNP and, optionally, COSMIC annotation. The function returns purity and ploidy combinations, sorted by likelihood score, and provides copy number and LOH data by both gene and genomic region.

Usage

runAbsoluteCN(gatk.normal.file = NULL, gatk.tumor.file, log.ratio = NULL,
    seg.file = NULL, seg.file.sdev = 0.4, vcf.file = NULL, genome = "hg19",
    sex = c("?", "F", "M"), fun.filterVcf = filterVcfMuTect,
    args.filterVcf = list(), fun.setPriorVcf = setPriorVcf,
    args.setPriorVcf = list(), fun.segmentation = segmentationCBS,
    args.segmentation = list(), fun.focal = findFocal, args.focal = list(),
    sampleid = NULL, min.ploidy = 1, max.ploidy = 6, test.num.copy = 0:7,
    test.purity = seq(0.05, 0.95, by = 0.01),
    prior.purity = rep(1, length(test.purity))/length(test.purity),
    max.candidate.solutions = 15, candidates = NULL, coverage.cutoff = 15,
    max.non.clonal = 0.2, max.homozygous.loss = 0.1, iterations = 30,
    log.ratio.calibration = 0.25, gc.gene.file = NULL,
    filter.lowhigh.gc.exons = 0.001, filter.targeted.base = 4,
    max.logr.sdev = 0.75, max.segments = 200, plot.cnv = TRUE,
    verbose = TRUE, post.optimize = FALSE, ...)

Arguments

gatk.normal.file

GATK coverage file of normal control. Optional if log.ratio is provided, in which case it is only used to filter low-coverage exons. Should already be GC-normalized. Needs to be either a file name or data read with the readCoverageGatk function.

gatk.tumor.file

GATK coverage file of tumor. Should already be GC-normalized. Needs to be either a file name or data read with the readCoverageGatk function.

log.ratio

Copy number log-ratios for all exons in the coverage files. If NULL, calculated based on coverage files.
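
As a rough illustration of how log-ratios can be derived from two coverage vectors, the sketch below uses simulated per-exon coverage and a simple correction for overall sequencing depth. The variable names, the simulated data and the normalization are assumptions made for illustration; they are not necessarily what runAbsoluteCN does internally.

```r
# Simulated mean coverage per exon (hypothetical data, not read from GATK files)
set.seed(42)
normal.coverage <- rpois(1000, lambda = 100)
tumor.coverage  <- rpois(1000, lambda = 150)

# log2 ratio of tumor over normal, corrected for the difference in
# overall sequencing depth between the two samples
log.ratio <- log2(tumor.coverage / normal.coverage) +
    log2(mean(normal.coverage) / mean(tumor.coverage))

summary(log.ratio)
```

A vector like this, with one value per exon in the coverage files, is the kind of input the log.ratio argument describes.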

seg.file

Segmented data. Optional, to support matched SNP6 data. If NULL, use coverage files or log.ratio to segment the data.

seg.file.sdev

If seg.file provided, the log-ratio standard deviation, used to model likelihood of sub-clonal copy number events.

vcf.file

VCF file, tested with MuTect output files. Optional, but typically needed to select between local optima of similar likelihood. Can also be a CollapsedVCF, read with the readVcf function. Requires a DB info flag for dbSNP membership. The default fun.setPriorVcf function will also look for a Cosmic.CNT slot, containing the hits in the COSMIC database. Again, do not expect very useful results without a VCF file.

genome

Genome version, required for the readVcf function.

sex

Sex of sample. If "?", the sex is detected automatically.

fun.filterVcf

Function for filtering variants. Expected output is a list with elements vcf (CollapsedVCF), flag (TRUE/FALSE) and flag_comment (string). The flags will be added to the output data and can be used to warn users, for example when samples look too noisy. The default filter will remove variants flagged by MuTect, but will keep germline variants. If run in matched normal mode, it will by default use the somatic status of variants and filter non-somatic calls with allelic fraction significantly different from 0.5 in normal.
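
The return contract described above can be sketched with a minimal custom filter. Everything in this example is hypothetical: the function name myFilterVcf, the extra argument min.variants and the flag text are made up for illustration, and a plain integer vector stands in for a CollapsedVCF so the snippet runs without PureCN.

```r
# Hypothetical custom filter honouring the documented return contract.
# The first four arguments are the ones runAbsoluteCN sets automatically;
# min.variants is an illustrative extra argument, not part of PureCN.
myFilterVcf <- function(vcf, tumor.id.in.vcf, coverage.cutoff, verbose,
                        min.variants = 100, ...) {
    too.few <- length(vcf) < min.variants
    list(
        vcf = vcf,                       # variants kept (unchanged here)
        flag = too.few,                  # TRUE marks the sample as suspicious
        flag_comment = if (too.few) "LOW NUMBER OF VARIANTS" else NA_character_
    )
}

# Stand-in object; in a real run this would be a CollapsedVCF
res <- myFilterVcf(vcf = seq_len(20), tumor.id.in.vcf = 1,
                   coverage.cutoff = 15, verbose = FALSE)
str(res)
```

Under these assumptions, such a function would be supplied as fun.filterVcf = myFilterVcf, with any extra arguments passed through args.filterVcf, for example args.filterVcf = list(min.variants = 50).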

args.filterVcf

Arguments for variant filtering function. Arguments vcf, tumor.id.in.vcf, coverage.cutoff and verbose are required in the filter function and are automatically set (do NOT set them here again).

fun.setPriorVcf

Function to set prior for somatic status for each variant in the VCF.

args.setPriorVcf

Arguments for somatic prior function.

fun.segmentation

Function for segmenting the copy number log-ratios. Expected return value is a list with elements seg (the segmentation) and size (the size in bp for all segments).

args.segmentation

Arguments for segmentation function. Arguments normal, tumor, log.ratio, plot.cnv, coverage.cutoff, sampleid, vcf, tumor.id.in.vcf, verbose are required in the segmentation function and automatically set (do NOT set them here again).

fun.focal

Function for identifying focal amplifications.

args.focal

Arguments for focal amplification function.

sampleid

Sample id, provided in output files etc.

min.ploidy

Minimum ploidy to be considered.

max.ploidy

Maximum ploidy to be considered.

test.num.copy

Copy numbers tested in the grid search. Note that focal amplifications can have much higher copy numbers, but they will be labeled as subclonal (because they do not fit the integer copy numbers).

test.purity

Considered tumor purity values.

prior.purity

Priors for purity if they are available. Only change when you know what you are doing.
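
The defaults for test.purity and prior.purity in the usage section amount to a uniform prior over a grid of purity values. The sketch below reproduces them and then builds a non-uniform alternative; the informative prior (a normal density centered at 0.6) is purely an illustrative assumption, not a recommendation.

```r
# Default purity grid and uniform prior, as in the function signature
test.purity <- seq(0.05, 0.95, by = 0.01)
prior.purity <- rep(1, length(test.purity)) / length(test.purity)

# Hypothetical informative prior, e.g. reflecting an external purity
# estimate of about 0.6; normalized so that it sums to 1
informative.prior <- dnorm(test.purity, mean = 0.6, sd = 0.15)
informative.prior <- informative.prior / sum(informative.prior)

length(test.purity)
sum(prior.purity)
sum(informative.prior)
```

Whatever prior is used, it needs one value per entry of test.purity.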

max.candidate.solutions

Number of local optima considered in optimization and variant fitting steps. If there are too many local optima, it will use the specified number of top candidate solutions, but will also include all optima close to diploid, because silent genomes often have many local optima.

candidates

Candidates to optimize from a previous run (return.object$candidates). If NULL, do 2D grid search and find local optima.

coverage.cutoff

Minimum exon coverage in both normal and tumor. Exons with lower coverage are ignored. The cutoff choice depends on the expected purity and overall coverage. High purity samples might need a lower cutoff to call homozygous deletions. If an exon.weigh.file (below) is NOT specified, it is recommended to set a higher cutoff (e.g. 20) to remove noise from unreliable exon measurements.
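
The effect of the cutoff can be sketched on toy data. The coverage values are made up, and treating coverage exactly equal to the cutoff as passing is an assumption of this sketch rather than documented behavior.

```r
coverage.cutoff <- 15

# Hypothetical mean coverage for five exons
normal.coverage <- c(120, 8, 60, 14, 200)
tumor.coverage  <- c(90, 40, 5, 30, 150)

# An exon is kept only if it reaches the cutoff in both samples
keep <- normal.coverage >= coverage.cutoff & tumor.coverage >= coverage.cutoff
which(keep)
```

In this toy example, exon 3 is dropped because of low tumor coverage alone, which is the situation a homozygous deletion in a high purity sample can produce.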

max.non.clonal

Maximum genomic fraction assigned to a subclonal copy number state.

max.homozygous.loss

Maximum genomic fraction assigned to homozygous loss. This is set to a fairly high default value to not exclude correct solutions, especially in noisy segmentations.

iterations

Maximum number of iterations in the Simulated Annealing copy number fit optimization.

log.ratio.calibration

Re-calibrate log-ratios in the window sd(log.ratio)*log.ratio.calibration.

gc.gene.file

A mapping file that assigns GC content and gene symbols to each exon in the coverage files. Used for generating gene level calls. First column in format CHR:START-END. Second column GC content (0 to 1). Third column gene symbol.
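
To make the three-column layout concrete, the sketch below builds a small table in that shape and splits the first column into chromosome, start and end. The column names (Target, gc_bias, Gene), the coordinates and the gene symbols are all placeholders chosen for illustration; the description above only fixes the column order and content.

```r
# Toy mapping table in the documented column order (made-up values)
gc.gene <- data.frame(
    Target  = c("chr1:1000-1250", "chr1:5000-5180", "chr2:300-420"),
    gc_bias = c(0.43, 0.61, 0.38),
    Gene    = c("GENE_A", "GENE_A", "GENE_B"),
    stringsAsFactors = FALSE
)

# Split CHR:START-END into its three components
parts <- regmatches(gc.gene$Target,
                    regexec("^(.+):([0-9]+)-([0-9]+)$", gc.gene$Target))
intervals <- data.frame(
    chr   = vapply(parts, `[`, character(1), 2),
    start = as.integer(vapply(parts, `[`, character(1), 3)),
    end   = as.integer(vapply(parts, `[`, character(1), 4)),
    stringsAsFactors = FALSE
)

cbind(intervals, gc.gene[, c("gc_bias", "Gene")])
```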

filter.lowhigh.gc.exons

Quantile q (defines lower q and upper 1-q) for removing exons with outlier GC profile, on the assumption that GC correction might not have worked for those exons. Requires gc.gene.file.
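
The quantile rule can be sketched on simulated GC values. This follows the description above (remove exons below the q quantile or above the 1-q quantile); the simulated data and variable names are assumptions, and this is not PureCN's own code.

```r
# Simulated GC content per exon (hypothetical data)
set.seed(1)
gc <- rbeta(10000, 5, 5)

q <- 0.001
limits <- quantile(gc, probs = c(q, 1 - q))

# Keep exons whose GC content lies within the two quantiles
keep <- gc >= limits[1] & gc <= limits[2]
sum(!keep)   # number of exons removed as GC outliers
```

With the default q of 0.001, only a very small fraction of exons at each extreme is removed.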

filter.targeted.base

Exclude exons with targeted base (size) smaller than this cutoff. This is useful when the same interval file was used to calculate GC content. For such small exons, the GC content is likely very different from the true GC content of the probes.

max.logr.sdev

Flag noisy samples with segment log-ratio standard deviation larger than this. Assay specific and needs to be calibrated.

max.segments

Flag noisy samples with a large number of segments. Assay specific and needs to be calibrated.

plot.cnv

Generate segmentation plots.

verbose

Verbose output.

post.optimize

Optimize purity using final SCNA-fit and SNVs. This might take a long time when lots of SNVs need to be fitted, but will typically result in a slightly more accurate purity, especially for rather silent genomes or very low purities. Otherwise, it will just use the purity determined via the SCNA-fit.

...

Additional parameters passed to the segmentation function.

Value

candidates: Results of the grid search.
results: All local optima, sorted by final rank.
input: The input data.

Examples

gatk.normal.file <- system.file("extdata", "example_normal.txt", 
    package="PureCN")
gatk.tumor.file <- system.file("extdata", "example_tumor.txt", 
    package="PureCN")
vcf.file <- system.file("extdata", "example_vcf.vcf", 
    package="PureCN")
gc.gene.file <- system.file("extdata", "example_gc.gene.file.txt", 
    package="PureCN")

# Speed up the runAbsoluteCN call by using the stored grid search
# (purecn.example.output$candidates).
data(purecn.example.output)

# The max.candidate.solutions parameter is set to a very low value only to
# speed up this example. This is not a good idea for real samples.

ret <- runAbsoluteCN(gatk.normal.file=gatk.normal.file, 
    gatk.tumor.file=gatk.tumor.file, 
    candidates=purecn.example.output$candidates, max.candidate.solutions=2,
    vcf.file=vcf.file, sampleid="Sample1", gc.gene.file=gc.gene.file)
