Learn R Programming

preprint: https://www.biorxiv.org/content/early/2018/07/30/380097

superFreq

SuperFreq is an R package that analyses cancer exomes.

What does it do?

SuperFreq analyses and filters somatic SNVs and short indels, calls copy numbers and tracks clones over multiple samples from the same individual. It identifies the copy number alterations and point mutations in each clone, and highlights potentially causing mutations through variant annotation and COSMIC.

How do I run it?

Start R

#first install superFreq if running first time
#load devtools that allows installation from github (may need to install devtools first with install.packages("devtools"))
library(devtools)

#there has been conflicts between install_github and bioconductor in different version
#so safer to manually install bioconductor dependencies first.
source("https://bioconductor.org/biocLite.R")
biocLite("GenomeInfoDb")
biocLite("GenomicFeatures")
biocLite("VariantAnnotation")

#then install superFreq
install_github('ChristofferFlensburg/superFreq')

#load and test superFreq
library(superFreq)
?superFreq

A typical analysis first sets the parameters and then calls the superFreq() function.

library(superFreq)

#maximum number of threads.
cpus=6

#this is the meta data input. See ?superFreq for how to set it up.
metaDataFile = 'metaData.tsv'

#an bed file with the capture regions of the exome.
captureRegionsFile = '~/resources/captureRegions/myCaptureInThisBatch.bed'
#if left empty (which is the default), then superFreq uses the ensembl exons for the specified genome assembly.
captureRegionsFile = ''

#This directory with (links to) the reference normals needs to be created and set up. See ?superFreq
normalDirectory = '~/resources/superFreq/referenceNormals/myCaptureInThisBatch'

#The reference fasta and name. Only hg19, hg38 and mm10 available atm.
reference = '~/resources/reference/hg19/hg19.fa'
genome = 'hg19'

#the resource directory with preprocessed data from dbSNP, ExAC, COSMIC, gene annotation and ClinVar.
#Will be created and resources downlaoded from the WEHI server. If you point to an existing resource directory
#then it'll be reused, and resources won't have to be redownloaded and duplicated.
resourceDirectory = "superFreqResources"

#The directory where the log file and saved .Rdata is stored. Will be created.
Rdirectory = 'R'
#The directory where all the plots and tables from the analysis go. Will be created.
plotDirectory = 'plots'

#superFreq reuses saved data if available. This setting can force it to redo part of the analysis.
#default forceRedoNothing() means that saved information is used whenever available.
#forceRedoEverything() ignores any saved data from this batch and overwrites any previous results and plots.
forceRedo = forceRedoNothing()

#a measure on how much large-scale biases are expected in the coverage.
#this controls the sensitivity vs accuracy of the coverage part of copy number calls.
#0.02 is default and works for exomes. Genomes might be able to go lower.
#For RNA-Seq data, 0.1 is better starting point.
systematicVariance=0.02
#a measure on how much biases (such as PCR duplication) is expected in the VAFs.
#this controls the sensitivity vs accuracy of the heterozygous SNP part of copy number calls.
maxCov=150

#The format of the quality scores of the base calls. Almost always 33 these days.
BQoffset = 33

#The mode. Default 'exome' is for exomes, while 'RNA' has some minor changes when running on RNA.
#There is also a "genome" mode for genomes although it is still being developed and can be very slow.
mode = 'exome'

#This setting runs each individual separately (as indicated in the metadata).
#will create subdirectories in the plotDirectory and Rdirectory.
#This is suggested whenever there is more than one individual in the batch.
splitRun = T

#this performs the actual analysis. output goes to Rdirectory and plotDirectory.
#runtime is typically less than 6 hours at 4 cpus for a cancer-normal exome, but can vary significantly depending on input.
#For a typical cancer-normal exome, 5-10GB of memory is used per cpus, but again, can vary significantly depending on input.
#later runs typically a bit faster as the setup and part of the analysis on the reference normals can be reused.
data =
    superFreq(metaDataFile, captureRegions=captureRegionsFile, normalDirectory=normalDirectory,
              Rdirectory=Rdirectory, plotDirectory=plotDirectory, reference=reference, genome=genome,
              BQoffset=BQoffset, cpus=cpus, forceRedo=forceRedo, systematicVariance=systematicVariance,
              maxCov=maxCov, mode=mode, splitRun=splitRun, resourceDirectory=resourceDirectory)

The most convenient way to set this up is to create a runSuperFreq.R file with the above code, and then run it from a screen with

Rscript runSuperFreq.R

Currently in development (please help if you can!), you can generate a .html to explore the data by adding this line after the superFreq() call. Currently no more than a draft.

printHTML(metaDataFile=metaDataFile, outputFile=paste0(plotDirectory, '/superFreq.html'))

More information is in the manual.

What is the input?

You need the aligned bam files of the exomes, and a preliminary (liberal) variant calling through for example varScan, mutect, multiSNV or any other similar software. superFreq is not sensitive to false positives in these VCFs. The details of this doesn't affect output much as long as all the somatic and germline variants are present. In house we typically use samtools and varscan:

samtools mpileup -d 1000 -q 15 -Q 15 -A -f reference/theAlignedToReference.fa  bam/mySample.bam | varscan mpileup2cns - --variants --strand-filter 0 --p-value 0.01 --min-var-freq 0.02 --output-vcf 1 > vcf/mySample.vcf

SuperFreq also requires a set of (at least 2, 5+ is better) reference normal samples from the same sequencing technology and exome capture. Preferably sequenced in the same lab. These samples do not have to be related to the analysed cancer exomes.

You also need some meta data:

  • the fasta file you aligned to
  • a tab separated file with information about the samples

What is the output?

Plots (some shown here), tables, and R objects for downstream analysis. Analysis results as well as diagnostic quality control. Some examples:

  • Scatter plots of variant frequencies between pairs of samples
  • heatmaps of somatic variant frequencies
  • CNA plots for each samples
  • river plots showing clonal evolution
  • summary plots showing SNVs and CNAs over all samples
  • top tables of annotated somatic variants.

dependencies

  • R.
  • a bunch of R packages.
  • samtools 1.x

Acknowledgements

We wish to thanks all the organisations sharing data and resources openly, which allows preprocessing and redistribution. This allows superFreq to depend on only a single connection (which is the WEHI servers atm) and a minimum amount of data downloaded. So this limits the risk of 'server X could not be found' type of errors, and it limits the download size of the preprocessed data to hundreds of MBs, rather than hundreds of GBs for the unprocessed original data. It also saves time for everyone by removing user parsing of external resources from the analysis.

  • dbSNP
  • COSMIC
  • ExAC
  • ClinVar through NCBI
  • Ensembl
  • UCSC

Copy Link

Version

Version

1.2.3

License

MIT

Maintainer

Christoffer Flensburg

Last Published

April 2nd, 2019

Functions in superFreq (1.2.3)

addBindingStrength

Adds binding strength (essentailly same as GC content) to a bedfile.
defaultSuperSettings

Returns default settings
analyse

Analyse exomes
defaultSuperParameters

Returns default parameters
defaultSuperRuntimeSettings

Returns default runtime settings
VAseverityToConsequence

coding' is used by the location run to cover all the coding variants, and will be replaced
getSettings

Wrapper for getting settings, containing defaults for missing values.
getStories

Combines variants and CNAs to clonal evolution
makeHeatmap

plots a heatmap
mcri

Turns colours into similar colours from the Murdoch Childrens Research Institute palette.
chrToX

convert chromsome and position to the single genomic coordinate x
downloadTemplate

Downloads resources and template for a standard analysis
compareGroups

Compares sets of individuals for reccuring mutations
cohortAnalyseBatchContrast

Analyse individuals for reccuring mutations
cohortAnalyseBatch

Analyse individuals for reccuring mutations
importSampleMetaData

Imports metadata about the samples
moreVEPnames

internal function
runDE

Run differential coverage analysis.
mergeBatches

Merges data from several batches for cohort analysis
loadMethods

Not needed in the package, ignore this function.
dumpInput

internal function for debugging purposes.
runVEP

Run VEP on the provided variants
forceRedoNothing

returns input that uses saved data if present.
loadData

Loads saved data
get104profile

returns the 104 signature from a superFreq q variant object
getCosmicCounts

Retrieves information about cosmic variant IDs
postAnalyseVEP

Runs VEP on a R and plot directory that superFreq has been run on.
forceRedoEverything

returns input that uses saved data if present.
plotStories

plots line plots of clonalities
qualityScatter

plots a scatter of the VAF between two samples
requireFileExists

checks if a file exists, and creates an error if it doesnt.
vafScatter

plots a scatter of the VAF between two samples
getMoreVEPinfo

Import more information about the variants
printHTML

generates an HTML to navigate superFreq output
superVersion

Return the current version of superFreq.
typeToSeverity

Ranks the variant effects
qToGRanges

Transforms a variant q data frame into granges
writeToVCF

exports variants to VCF
makeFitPlots

Plots results from differential coverage analysis
getVariantsByIndividual

Quality controls the variants in the vcfs, using the bams.
superInputFiles

sets up and checks the input files for superFreq
superOutputDirectories

Sets up and checks the output directories.
plotMA

a plotting function for MA plots.
plotRiver

plots a river plot
importCaptureRegions

Imports capture regions from a bed file with GC information.
plotCR

plots a copy number profile
superCohort

looks for reccuring events across all individuals in a batch run
plotColourScatter

A better version of plot
superFreq

Wrapper to run default superFreq analysis
xToPos

Extract position of genomic coordinate
severityToType

The variant effect as function of severity
shrug

shruggie
xToChr

Extract chromsome of genomic coordinate
xToGene

find the gene closest to the genomic coordinate x
chrLengths

The lengths of the chromsomes
bamToPileup

Gets pileup from bam over positions.
annotateQ

Run VariantAnnotation on the provided variants
annotateSomaticQs

Run VariantAnnotation on the provided samples
VAconsequenceToSeverityRank

coding' is used by the location run to cover all the coding variants, and will be replaced