EWAS_QC: Automated Quality Control of EWAS results files

Description

The main function of the QCEWAS package. EWAS_QC accepts a single EWAS results file and runs a thorough quality check (QC), optionally applies various filters and generates QQ, Volcano and Manhattan plots. The function EWAS_series can be used to process multiple results files sequentially.

Usage

EWAS_QC(data,
        map,
        outputname,
        header_translations,
        threshold_outliers = c(NA, NA),
        markers_to_exclude,
        exclude_outliers = FALSE,
        exclude_X = FALSE, exclude_Y = FALSE,
        save_final_dataset = TRUE, gzip_final_dataset = TRUE,
        header_final_dataset = "standard",
        high_quality_plots = FALSE,
        return_beta = FALSE, N_return_beta = 500000L,
        ...)

Value

The main outputs of EWAS_QC are the cleaned results file, the log file and the QC graphs. However, the function also returns a list with 9 elements:

data_input: the file name of the input file, if loaded from a file. If not, this will be an empty character string.
file: the filename of the cleaned results file.
QC_success: logical, indicates whether EWAS_QC was able to run a full QC on the file. Note that a TRUE value does not mean that no problems were encountered, merely that the full QC was executed.
lambda: the lambda value of reported p-values in the cleaned dataset.
p_cor: the correlation between reported and expected (based on effect size and standard error) p values.
N: a named integer vector reporting how many markers were in the original dataset, how many had missing values, how many were on chromosomes X and Y, how many were outliers, how many were removed and how many are in the final, cleaned dataset. Has no relation to the N argument of EWAS_series.
SE_median: a numeric value: the median of the standard errors in the cleaned dataset.
mean_methylation: always NULL; this functionality has not been implemented yet.
effect_size: if return_beta is TRUE, this is a numeric vector of length N_return_beta, containing a random selection of effect sizes from the filtered dataset. If FALSE, this will be NULL.
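
The lambda, p_cor and SE_median elements can be illustrated with a short base-R sketch. This is not the package's own code. It assumes that lambda is the conventional genomic inflation factor (median observed 1-df chi-square statistic divided by the null median), that the expected p-values come from a two-sided Wald z-test on BETA / SE, and that the correlation is taken on the untransformed p-value scale. The simulated data and all variable names are hypothetical.

```r
# Simulated stand-in for a cleaned EWAS results table (hypothetical data,
# generated under the null hypothesis of no association).
set.seed(1)
n    <- 10000
se   <- runif(n, min = 0.5, max = 1.5)
beta <- rnorm(n, mean = 0, sd = se)
p    <- 2 * pnorm(-abs(beta / se))   # "reported" p-values for this example

# Genomic inflation factor (assumed definition): convert p-values to
# 1-df chi-square statistics and compare their median to the null median.
chisq_obs <- qchisq(p, df = 1, lower.tail = FALSE)
lambda    <- median(chisq_obs, na.rm = TRUE) / qchisq(0.5, df = 1)

# Expected p-values recomputed from effect size and standard error
# (assumption: two-sided Wald z-test), correlated with the reported ones.
p_expected <- 2 * pnorm(-abs(beta / se))
p_cor      <- cor(p, p_expected, use = "complete.obs")

# Median of the standard errors, as described for SE_median.
se_median <- median(se, na.rm = TRUE)

print(c(lambda = lambda, p_cor = p_cor, SE_median = se_median))
```

Under these assumptions, a lambda close to 1 suggests no systematic inflation of the test statistics, and a p_cor well below 1 would point to reported p-values that do not match the reported effect sizes and standard errors.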

Arguments

data: a data frame with EWAS results, or the name of a file containing the same. The table must include the columns PROBEID, BETA, SE, and P_VAL. Other columns may be included but will be ignored. If the column names differ from the above, the argument header_translations can be used to translate them. If a filename is entered in this argument, it will be imported via the read.table function. read.table can handle a variety of formats, including files compressed in the .gz format. EWAS_QC will pass any named, unknown arguments to read.table, so you can specify the column separator and NA string with the usual read.table arguments. (Note that this only applies to importing the EWAS results, and not the map or translation files.)
map: a data frame with chromosome and position values of the probes, or the name of a file containing the same. This argument is optional: if no map is specified, EWAS_QC will skip the Manhattan plot and chromosome filters. map must include the columns TARGETID, CHR (chromosome), and MAPINFO (position), using those exact names. Other columns may be included but will be ignored. If a filename is entered in this argument, it will be imported via the read.table function. read.table can handle a variety of formats, including files compressed in the .gz format.
outputname: a character string specifying the intended filename for the output. This includes not only the cleaned results file and the log, but also any graphs created. Do not include an extension; EWAS_QC adds these automatically.
header_translations: a translation table for the column names of the input file, or the name of a file containing the same. This argument is optional: if not specified, EWAS_QC assumes the default column names are used. See translate_header for information on the format.
threshold_outliers: a numeric vector of length two. This defines which effect sizes will be treated as outliers. The first value specifies the lower limit (i.e. markers with effect sizes below this value are considered outliers), the second the upper limit. The check for low or high outliers is skipped if the respective value is set to NA. To skip the check entirely, set this argument to c(NA, NA).
markers_to_exclude: Either a vector or data frame containing a list of CpG IDs that need to be excluded before starting the QC (in case of a data frame only the first column will be processed), or the name of a file containing the same. This argument is optional: if not specified, no exclusions are made. Note that when a single value (a vector of length 1) is passed to this argument, EWAS_QC will treat it as a filename even when no such file can be found. If you want to remove a single CpG, either pass it to this argument via a file, or add a dummy value to the vector to give it length 2 (e.g. c("cg02198983", "dummy") ).
exclude_outliers: a logical value determining how outliers are treated. If TRUE, they are excluded from the final dataset. If FALSE, they are merely counted.
exclude_X, exclude_Y: logical values determining whether markers on the X and Y chromosomes, respectively, are excluded from the final dataset. This requires providing a map to EWAS_QC via the map argument.
save_final_dataset: logical determining whether the cleaned dataset will be saved.
gzip_final_dataset: logical determining whether the saved dataset will be compressed in the .gz format.
header_final_dataset: either a character vector or a table determining the header names used in the final dataset, or the name of a file containing the same. If "original", the final dataset will use the same column names as the original input file. If "standard", it will use the default EWAS_QC column names. If a table, it will be passed to translate_header to convert the column names; in that case the default column names (PROBEID, BETA, SE, and P_VAL) must be in the second column, and the desired column names in the first.
high_quality_plots: logical. Setting this to TRUE will save the graphs as high-resolution tiff images.
return_beta, N_return_beta: arguments used by EWAS_series. These are not important for users and can be ignored. For the sake of completeness: return_beta is a logical value; if TRUE, the function return value includes a vector of effect sizes. N_return_beta defines the length of the vector.
...: arguments passed to read.table for importing the EWAS results file.
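
A few of the arguments above can be illustrated in base R. The sketch below mimics the documented threshold_outliers semantics with a hypothetical helper, flag_outliers (not part of QCEWAS), builds a markers_to_exclude vector for a single CpG using the dummy-value workaround, and lays out a header_final_dataset table with the desired names in the first column and the default names in the second. The probe IDs, effect sizes and desired column names are invented, and the column names of the table itself (desired, default) are placeholders: see translate_header for the exact format it expects.

```r
# Hypothetical effect sizes for six probes.
beta <- c(cg0001 = -25.0, cg0002 = -3.2, cg0003 = 0.4,
          cg0004 = 7.9, cg0005 = 31.5, cg0006 = NA)

# Hypothetical helper mirroring the documented behaviour: values below the
# first limit or above the second are outliers; an NA limit skips that side.
flag_outliers <- function(beta, threshold = c(NA, NA)) {
  stopifnot(length(threshold) == 2L,
            is.numeric(threshold) || all(is.na(threshold)))
  low  <- if (is.na(threshold[1])) rep(FALSE, length(beta)) else beta < threshold[1]
  high <- if (is.na(threshold[2])) rep(FALSE, length(beta)) else beta > threshold[2]
  out <- low | high
  out[is.na(out)] <- FALSE   # assumption: missing effect sizes are not outliers
  out
}

n_both  <- sum(flag_outliers(beta, c(-20, 20)))  # both limits checked
n_upper <- sum(flag_outliers(beta, c(NA, 20)))   # lower check skipped
n_none  <- sum(flag_outliers(beta, c(NA, NA)))   # check skipped entirely

# Excluding a single CpG: pad the vector to length 2 so that it is not
# interpreted as a filename.
markers <- c("cg02198983", "dummy")

# Table for header_final_dataset: desired names first, default names second.
header_table <- data.frame(
  desired = c("CpG", "Effect", "StdErr", "Pvalue"),   # hypothetical names
  default = c("PROBEID", "BETA", "SE", "P_VAL"),
  stringsAsFactors = FALSE
)

print(c(both = n_both, upper_only = n_upper, none = n_none))
```

The objects markers and header_table could then be supplied as the markers_to_exclude and header_final_dataset arguments, respectively.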

Details

QCEWAS includes a Quick-Start guide in the doc folder of the library. This guide will explain how to run a QC and how to interpret the results. The start-up message when loading QCEWAS will indicate where it can be found on your computer. In brief, the QC consists of the following 5 stages:

1. Checking data integrity

The values inside the EWAS results are tested for validity. If impossible p-values, effect sizes, etc. are encountered, EWAS_QC generates a warning in the R console and sets them to NA.

2. Filtering for outliers and sex chromosomes (optional)

Counts the number of outlying markers, as well as chromosome X and Y markers, and deletes them if specified. The markers named in markers_to_exclude are removed here as well.

3. Generating QC plots

A histogram of beta and standard error distribution is plotted.

The p-values are checked by correlating and plotting them against p-values calculated from the effect size and standard error.

A QQ plot is generated to test for over/undersignificance.

A Manhattan plot is generated to see where the signals (if any) are located.

A Volcano plot is generated to check the distribution of effect sizes vs. p-values.

4. Creating a QC log

The log contains notes about any problems encountered during the QC, as well as several tables describing the data.

5. Saving the cleaned dataset (optional)
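
The data-integrity stage described above can be pictured with the following base-R sketch. It is an illustration only, not the package's implementation: the documentation does not list which values count as impossible, so the rules used here (non-finite effect sizes, standard errors that are not strictly positive, p-values outside [0, 1]) are assumptions, and the helper set_invalid_to_na and the example data are hypothetical.

```r
# Hypothetical results table containing some invalid entries.
ewas <- data.frame(
  PROBEID = c("cg0001", "cg0002", "cg0003", "cg0004", "cg0005"),
  BETA    = c(0.12, -0.40, Inf, 0.05, -0.33),
  SE      = c(0.05, 0.10, 0.08, -0.02, 0.00),
  P_VAL   = c(0.016, 0.00006, 0.2, 1.4, -0.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: set non-missing entries that fail a validity test
# to NA and report how many were changed.
set_invalid_to_na <- function(x, is_valid, label) {
  bad <- !is.na(x) & !is_valid(x)
  if (any(bad)) message(sum(bad), " invalid ", label, " value(s) set to NA")
  x[bad] <- NA
  x
}

# Assumed validity rules (not taken from the package).
ewas$BETA  <- set_invalid_to_na(ewas$BETA,  is.finite, "effect size")
ewas$SE    <- set_invalid_to_na(ewas$SE,
                                function(x) is.finite(x) & x > 0,
                                "standard error")
ewas$P_VAL <- set_invalid_to_na(ewas$P_VAL,
                                function(x) x >= 0 & x <= 1,
                                "p-value")

# Rows that still have a complete set of statistics after the check.
n_complete <- sum(complete.cases(ewas[, c("BETA", "SE", "P_VAL")]))
print(ewas)
```

In this sketch the invalid entries are kept as NA rather than dropped, matching the description of stage 1. Whether such rows are later removed is a separate decision.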

Examples

# For use in this example, the 2 sample files in the
# extdata folder of the QCEWAS library will be copied
# to your current R working directory. Running the QC
# generates 7 new files in your working directory:
# a cleaned, post-QC dataset, a log file, and 5 graphs.
# Consult the Quick-Start guide for more information on
# how to interpret these.
if (FALSE) {
file.copy(from = file.path(system.file("extdata", package = "QCEWAS"),
                           "sample_map.txt.gz"),
          to = getwd(), overwrite = FALSE, recursive = FALSE)
file.copy(from = file.path(system.file("extdata", package = "QCEWAS"),
                           "sample1.txt.gz"),
          to = getwd(), overwrite = FALSE, recursive = FALSE)

QC_results <- EWAS_QC(data = "sample1.txt.gz",
                      map = "sample_map.txt.gz",
                      outputname = "sample_output",
                      threshold_outliers = c(-20, 20),
                      exclude_outliers = FALSE,
                      exclude_X = TRUE, exclude_Y = FALSE,
                      save_final_dataset = TRUE, gzip_final_dataset = FALSE)
}