QC_plots: QQ and Manhattan plots

Description

This function creates the most important graphs of the QC: the QQ plots and the Manhattan plot. It also calculates lambda, and determines the effect of the filters.

Usage

QC_plots(dataset,
         plot_QQ = TRUE, plot_Man = TRUE,
         FRQfilter_values = NULL, FRQfilter_NA = filter_NA,
         HWEfilter_values = NULL, HWEfilter_NA = filter_NA,
         calfilter_values = NULL, calfilter_NA = filter_NA,
         impfilter_values = NULL, impfilter_NA = filter_NA,
         impfilter_min = min(dataset$IMP_QUALITY, na.rm = TRUE),
         manfilter_FRQ = NULL, manfilter_HWE = NULL,
         manfilter_cal = NULL, manfilter_imp = NULL,
         filter_NA = TRUE,
         plot_cutoff_p = 0.05, plot_names = FALSE,
         QQ_colors = c("red", "blue", "orange", "green3", "yellow"),
         plot_QQ_bands = FALSE,
         save_name = "dataset", save_dir = getwd(),
         header_translations, use_log = FALSE,
         check_impstatus = FALSE, ignore_impstatus = FALSE,
         T_strings = c("1", "TRUE", "yes", "YES", "y", "Y"),
         F_strings = c("0", "FALSE", "no", "NO", "n", "N"),
         NA_strings = c(NA, "NA", ".", "-"))

Arguments

dataset

vector of p-values or a data frame containing the p-value column and (depending on the settings) columns for chromosome number, position, the quality parameters, sample size and imputation status.

plot_QQ, plot_Man

logical; should QQ and Manhattan plots be saved?

FRQfilter_values, HWEfilter_values, calfilter_values, impfilter_values

numeric vectors; the threshold values for the QQ plot filters. The filters are for allele-frequency, HWE p-values, callrate and imputation-quality parameters, respectively. A maximum of five values can be specified per parameter.

Set to NULL to disable the QQ filter for that parameter.
To filter missing values only, set the filter value to NA and the corresponding filter_NA argument to TRUE.
The allele-frequency filter is two-sided: for a filter-value of x, it will exclude entries with freq < x and freq > 1 - x.
Values >= 1 will be divided by the SNP's sample size. This allows sample-size dependent filtering of allele frequencies. Note that this uses the sample size reported in the sample-size column for that specific SNP. SNPs without sample size will be excluded if the corresponding filter_NA argument is TRUE and ignored if it is FALSE.

FRQfilter_NA, HWEfilter_NA, calfilter_NA, impfilter_NA, filter_NA

logical; should the filters exclude (TRUE) or ignore (FALSE) missing values? filter_NA is the default setting, the others allow variable-specific settings.

impfilter_min

numeric; the lowest possible value for imputation-quality. This argument is currently redundant, as it is calculated automatically.

manfilter_FRQ, manfilter_HWE, manfilter_cal, manfilter_imp

single, numeric values; the filter-settings for allele-frequency, HWE p-values, callrate and imputation quality respectively, for the Manhattan plot. The arguments are passed to HQ_filter. To filter missing values only, set to NA and the corresponding filter_NA argument to TRUE. To disable filtering entirely, set to NULL.

plot_cutoff_p

numeric; the threshold of p-values to be shown in the QQ & Manhattan plots. Higher (less significant) p-values are excluded from the plot. The default setting is 0.05, which excludes 95% of data-points. It's not recommended to increase the value above 0.05, as this may dramatically increase running time and memory usage.

plot_names

argument currently redundant.

QQ_colors

vector of R color-values; the color of the QQ filter-plots. The unfiltered data is black by default. This argument sets the colors of the least (first value) to most (last value) stringent filters. (For this setting, filter values >= 1 (i.e. sample-size based filtering) are considered less stringent than values < 1.)

plot_QQ_bands

logical; should probability bands be added to the QQ plot?

save_name

character string; the filename, without extension, for the graphs.

save_dir

character string; the directory where the graphs are saved. Note that R uses forward slash (/) where Windows uses the backslash (\).

header_translations

translation table for column names. See translate_header for more information. If the argument is left empty, dataset is assumed to use the standard column-names of QC_GWAS.

use_log

argument used by QC_GWAS; redundant when QC_plots is used separately.

check_impstatus

logical; should the imputation-status column be passed to convert_impstatus?

ignore_impstatus

logical; if FALSE, HWE p-value and callrate filters are applied only to genotyped SNPs, and imputation quality filters only to imputed SNPs. If TRUE, the filters are applied to all SNPs regardless of the imputation status.

T_strings, F_strings, NA_strings

arguments passed to convert_impstatus.

Value

An object of class 'list' with the following components:

lambda

vector of the lambda values of all SNPs, genotyped SNPs and imputed SNPs, respectively.

ignore_impstatus

logical value indicating whether imputation status was used when applying the filters.

FRQfilter_names, HWEfilter_names, calfilter_names, impfilter_names

character vectors naming the specified QQ filters.

FRQfilter_N, HWEfilter_N, calfilter_N, impfilter_N

numeric vectors; the number of SNPs removed by the specified filters. Note that the filters are sorted before being applied, so the order may not match that of the input. Check the filter_names output to see the order that was used inside QC_plots.

Manfilter_N

numeric; the number of SNPs removed by the Manhattan filter. This does not include those SNPs removed because they lacked p or chromosome/position-values, or failed the p-cutoff threshold.

Details

The function QC_plots grew out of phase 4 of QC_GWAS. It carries out three functions, hence the vague name: it calculates lambda, it applies the QQ filters, and it creates the QQ and Manhattan plots (a separate function is available for creating regional-association plots: see below). The function schematic is as follows:

Preparing the dataset: this step involves translating the dataset header to the standard column-names (by identify_column) and converting imputation status (by convert_impstatus). Both steps are optional, and are disabled by default. If the function cannot identify the imputation status column, it will generate a warning message and disable the imputation-status dependent filters.
Calculating the QC stats: here it generates the filters an calculates how many SNPs are removed. Lambda is also calculated at this point.
Creating a QQ graph of every variable for which filters have been specified. Every graph contains an unfiltered plot, plus plots for every effective filter. ("Effective" means "excludes more SNPs than the previous, less-stringent filter".)
Creating the Manhattan plot. The default Manhattan plot covers chromosomes 1 to 23 (X). Fields for XY, Y and M are added when such SNPs are present.

Examples

Run this code

# NOT RUN {
  
# }
# NOT RUN {
  data("gwa_sample")

  QC_plots(dataset = gwa_sample,
    plot_QQ = TRUE, plot_QQ_bands = TRUE, plot_Man = TRUE,
    FRQfilter_values = c(NA, 0.01, 0.05, 3),
    calfilter_values = c(NA, 0.95, 0.99),
    manfilter_FRQ = 0.05, manfilter_cal = 0.95,
    filter_NA = TRUE, save_name = "sample_plots")
    
# }

Run the code above in your browser using DataLab