filter.adjustedCADD: Variant filtering based on frequency and median adjusted CADD by CADD regions

Description

Filter rare variants based on a MAF threshold, a given number of SNPs or a given cumulative MAF per genomic region, and the median of the adjusted CADD scores for each CADD region

Usage

filter.adjustedCADD(x, SNVs.scores = NULL, indels.scores = NULL,
                    ref.level = NULL, 
                    filter=c("whole", "controls", "any"), 
                    maf.threshold=0.01, min.nb.snps = 2, 
                    min.cumulative.maf = NULL, 
                    group = NULL, cores = 10, path.data, verbose = T,
                    build = c("b37", "b38"))

Value

A bed.matrix with filtered variants

Arguments

x: A bed.matrix annotated with CADD regions using set.CADDregions
SNVs.scores: A dataframe containing the ADJUSTED CADD scores of the SNVs (Optional, useful to gain in computation time if the adjusted CADD scores of variants in the study are available)
indels.scores: A dataframe containing the CADD PHREDv1.4 scores of the indels - Compulsory if indels are present in x
ref.level: The level corresponding to the controls group, only needed if filter=="controls"
filter: On which group the filter will be applied
maf.threshold: The MAF threshold used to define a rare variant, set at 0.01 by default
min.nb.snps: The minimum number of variants needed to keep a CADD region, set at 2 by default
min.cumulative.maf: The minimum cumulative MAF of variants needed to keep a CADD region
group: A factor indicating the group of each individual, only needed if filter = "controls" or filter = "any". If missing, x@ped$pheno is taken
cores: How many cores to use, set at 10 by default
path.data: The directory where the RAVA-FIRST data are stored, or to which they will be downloaded from https://lysine.univ-brest.fr/RAVA-FIRST/
verbose: Whether to display information about the function actions
build: The build of the data, either "b37" or "b38". The CADD Regions in the corresponding build will be considered
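
The region-level arguments min.nb.snps and min.cumulative.maf can be illustrated with a minimal base R sketch on a toy table. This is not the package's internal code: the data frame 'snps', its 'maf' values and the use of '>=' for both comparisons are assumptions made for illustration only.

```r
# Toy stand-in for the variant table (all values are made up)
snps <- data.frame(
  genomic.region = factor(c("R1", "R1", "R1", "R2", "R3", "R3")),
  maf            = c(0.001, 0.002, 0.004, 0.003, 0.005, 0.006)
)

# Keep regions with at least min.nb.snps variants (assumed '>=')
min.nb.snps  <- 2
n.per.region <- table(snps$genomic.region)
keep.nb      <- names(n.per.region)[as.vector(n.per.region) >= min.nb.snps]

# Alternatively, keep regions whose summed MAF reaches min.cumulative.maf
min.cumulative.maf <- 0.01
cum.maf            <- tapply(snps$maf, snps$genomic.region, sum)
keep.cum           <- names(cum.maf)[as.vector(cum.maf) >= min.cumulative.maf]

# Variants remaining after the count-based region filter
snps.kept <- droplevels(snps[snps$genomic.region %in% keep.nb, ])
```

In this toy example, the count-based filter retains regions R1 and R3, whereas the cumulative-MAF filter retains only R3.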

Details

Variants are annotated with the adjusted CADD scores directly in the function, using the file "AdjustedCADD_v1.4_202108.tsv.gz" downloaded from https://lysine.univ-brest.fr/RAVA-FIRST/ into the directory of the Ravages package. Alternatively, the scores of the variants can be provided to SNVs.scores to save computation time. This data frame should contain 5 columns: the chromosome ('chr'), position ('pos'), reference allele ('A1'), alternative allele ('A2') and adjusted CADD scores ('adjCADD'). As CADD scores are only available for SNVs, only SNVs will be kept in the analysis.
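
As a minimal sketch of the expected layout of the scores table, the data frame below uses the five column names listed above. The positions, alleles and scores are made-up toy values, not real adjusted CADD scores, and the column types are an assumption.

```r
# Hypothetical toy scores; real values would come from the RAVA-FIRST files
SNVs.scores <- data.frame(
  chr     = c(2, 2, 2),
  pos     = c(1000L, 2000L, 3000L),
  A1      = c("C", "G", "A"),
  A2      = c("T", "A", "G"),
  adjCADD = c(12.3, 4.7, 21.0),
  stringsAsFactors = FALSE
)

# Check that the five expected columns are present before passing the table on
expected     <- c("chr", "pos", "A1", "A2", "adjCADD")
missing.cols <- setdiff(expected, colnames(SNVs.scores))
if (length(missing.cols) > 0) {
  stop("Missing columns: ", paste(missing.cols, collapse = ", "))
}
```

A table built this way could then be passed as the SNVs.scores argument.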

If a column 'adjCADD' is already present in x@snps, no annotation will be performed and the filtering will be applied directly to this column.

To use this function, a factor 'genomic.region' corresponding to the CADD regions and a vector 'adjCADD.Median' should be present in the slot x@snps. To obtain those two, use the function set.CADDregions.

Only variants with an adjusted CADD score higher than the median value of their CADD region are kept in the analysis. This is the filtering strategy applied in the RAVA.FIRST() pipeline.
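
The median-based rule can be sketched in base R on a toy table that mimics the 'adjCADD' and 'adjCADD.Median' columns of x@snps. The values are made up, and the strict '>' comparison and the removal of variants without a score are assumptions rather than a description of the package internals.

```r
# Toy stand-in for x@snps (all values are made up)
snps <- data.frame(
  id             = paste0("snv", 1:6),
  genomic.region = factor(c("R1", "R1", "R1", "R2", "R2", "R2")),
  adjCADD        = c(5, 15, 25, 8, 9, 30),
  adjCADD.Median = c(10, 10, 10, 9, 9, 9),
  stringsAsFactors = FALSE
)

# Keep variants whose score is strictly above the median of their region;
# variants without a score (NA) are dropped (assumption)
keep      <- !is.na(snps$adjCADD) & snps$adjCADD > snps$adjCADD.Median
snps.kept <- snps[keep, ]

snps.kept$id
```

Here snv2, snv3 and snv6 are retained; snv5, whose score equals the regional median, is dropped under the strict comparison.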

If filter="whole", only the variants having a MAF lower than the threshold in the entire sample are kept.

If filter="controls", only the variants having a MAF lower than the threshold in the controls group are kept.

If filter="any", only the variants having a MAF lower than the threshold in any of the groups are kept.
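
The three filter modes can be compared on a toy table of per-variant MAFs. This is only a sketch: the table 'maf', its values, the group names and the strict '<' comparison are assumptions, and "any" is read here as "below the threshold in at least one group".

```r
# Toy per-variant MAFs (made-up values): whole sample and each group
maf <- data.frame(
  id       = paste0("snv", 1:4),
  whole    = c(0.004, 0.020, 0.008, 0.030),
  controls = c(0.002, 0.005, 0.015, 0.040),
  cases    = c(0.006, 0.035, 0.001, 0.020),
  stringsAsFactors = FALSE
)
maf.threshold <- 0.01

# filter = "whole": rare in the entire sample
keep.whole <- maf$whole < maf.threshold

# filter = "controls": rare in the reference (controls) group
keep.controls <- maf$controls < maf.threshold

# filter = "any": rare in at least one of the groups
keep.any <- maf$controls < maf.threshold | maf$cases < maf.threshold

ids.whole    <- maf$id[keep.whole]
ids.controls <- maf$id[keep.controls]
ids.any      <- maf$id[keep.any]
```

With these toy values, "whole" keeps snv1 and snv3, "controls" keeps snv1 and snv2, and "any" keeps snv1, snv2 and snv3, which shows that "any" is the most permissive of the three modes.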

It is recommended to use this function chromosome by chromosome for large datasets.

Examples

#Import 1000 Genomes data from the region around the LCT gene (b37)
#x37 <- read.bed.matrix( system.file("extdata", "LCT.EUR.b37.bed", package="Ravages") )

#Group variants within CADD regions and genomic categories
#x <- set.CADDregions(x37, build = "b37")

#Annotate variants with adjusted CADD score
#and filter on frequency and median
#x.median <- filter.adjustedCADD(x, maf.threshold = 0.025, 
#                                min.nb.snps = 2, build = "b37")