The main function of the QCEWAS package.
EWAS_QC accepts a single EWAS results file, runs a
thorough quality check (QC),
optionally applies various filters, and generates QQ, Volcano
and Manhattan plots. The function EWAS_series
can be used to process multiple results files sequentially.
EWAS_QC(data,
map,
outputname,
header_translations,
threshold_outliers = c(NA, NA),
markers_to_exclude,
exclude_outliers = FALSE,
exclude_X = FALSE, exclude_Y = FALSE,
save_final_dataset = TRUE, gzip_final_dataset = TRUE,
header_final_dataset = "standard",
high_quality_plots = FALSE,
return_beta = FALSE, N_return_beta = 500000L,
...)
The main outputs of EWAS_QC are the cleaned results
file, the log file and the QC graphs. However, the function also
returns a list with 9 elements:
the file name of the input file, if loaded from a file. If not, this will be an empty character string.
the filename of the cleaned results file.
logical, indicates whether EWAS_QC
was able to run a full QC on the file. Note that a TRUE
value does not mean that no problems were encountered,
merely that the full QC was executed.
the lambda value of reported p-values in the cleaned dataset.
the correlation between reported and expected (based on effect size and standard error) p values.
a named integer vector reporting how many markers
were in the original dataset, how many had missing values,
how many were on chromosomes X and Y, how many were outliers,
how many were removed and how many are in the final, cleaned
dataset. Has no relation to the N argument of
EWAS_series.
a numeric value: the median of the standard errors in the cleaned dataset.
a NULL: this functionality
has not been implemented yet.
if return_beta is TRUE, this
is a numeric vector of length N_return_beta,
containing a random selection of effect sizes from the
filtered dataset. If FALSE, this will be NULL.
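The lambda value and the p-value correlation described above can be approximated by hand. The sketch below is not the QCEWAS implementation: it assumes that the expected p-values come from a two-sided Wald test on BETA/SE and that lambda is the median observed chi-square statistic (1 df) divided by its expected median. The simulated data and all variable names are illustrative.

```r
# Illustrative sketch only; QCEWAS may compute these statistics differently.
set.seed(1)
n    <- 10000
se   <- runif(n, min = 0.01, max = 0.05)
beta <- rnorm(n, mean = 0, sd = se)        # simulated null effects

# Expected p-values from effect size and standard error (assumed Wald z-test)
p_expected <- 2 * pnorm(-abs(beta / se))

# Stand-in for the p-values reported in a results file
p_reported <- p_expected

# Lambda: median observed chi-square (1 df) over the expected median
chisq_obs <- qchisq(p_reported, df = 1, lower.tail = FALSE)
lambda    <- median(chisq_obs) / qchisq(0.5, df = 1)

# Correlation between reported and expected p-values
p_cor <- cor(p_reported, p_expected)
```

With well-behaved null data, lambda should sit close to 1 and the correlation close to 1; clear departures from either are the kind of signal the QC is meant to surface.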
data: a data frame with EWAS results, or the name of a file
containing the same. The table must include the columns
PROBEID, BETA, SE, and P_VAL.
Other columns may be included but will be ignored. If the
column names differ from the above, the argument
header_translations can be used to translate them.
If a filename is entered in this argument, it
will be imported via the read.table function.
read.table can handle a variety of formats,
including files compressed in the .gz format. EWAS_QC
will pass any named, unknown arguments to
read.table, so you can specify the column
separator and NA string with the usual
read.table arguments. (Note that this only
applies to importing the EWAS results, and not the map or
translation files.)
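As a rough illustration of the input format, the sketch below builds a toy results table with the four required columns, writes it as a gzipped, tab-delimited file, and reads it back with read.table. The file name, probe rows and the "." missing-value code are made up for the example; the EWAS_QC call is not run and only shows how sep and na.strings might be forwarded.

```r
# Illustrative sketch: a toy results table with the required columns.
ewas <- data.frame(
  PROBEID = c("cg00000029", "cg00000108", "cg00000109"),
  BETA    = c(0.012, -0.004, NA),
  SE      = c(0.005, 0.003, 0.004),
  P_VAL   = c(0.016, 0.18, 0.62),
  stringsAsFactors = FALSE
)

# Write a gzipped, tab-delimited file; "." marks missing values here.
path <- file.path(tempdir(), "toy_ewas.txt.gz")   # hypothetical file name
con  <- gzfile(path, open = "w")
write.table(ewas, con, sep = "\t", quote = FALSE, row.names = FALSE, na = ".")
close(con)

# read.table detects the gzip compression by itself.
ewas_in <- read.table(path, header = TRUE, sep = "\t", na.strings = ".",
                      stringsAsFactors = FALSE)

if (FALSE) {
  # Not run: requires QCEWAS. "toy_output" is a made-up output name.
  library(QCEWAS)
  EWAS_QC(data = path, outputname = "toy_output",
          sep = "\t", na.strings = ".")
}
```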
map: a data frame with chromosome and position values of the
probes, or the name of a file containing the same. This
argument is optional: if no map is specified,
EWAS_QC will skip the Manhattan plot and chromosome
filters. map must include the columns TARGETID,
CHR (chromosome), and MAPINFO (position), using
those exact names. Other columns may be included but will be
ignored. If a filename is entered in this argument, it
will be imported via the read.table function.
read.table can handle a variety of formats,
including files compressed in the .gz format.
outputname: a character string specifying the intended filename for the
output. This includes not only the cleaned results file and
the log, but also any graphs created. Do not include an
extension; EWAS_QC adds these automatically.
header_translations: a translation table for the column names of the input file,
or the name of a file containing the same. This argument is
optional: if not specified, EWAS_QC assumes the
default column names are used. See
translate_header for information on the
format.
threshold_outliers: a numeric vector of length two. This defines which effect
sizes will be treated as outliers. The first value specifies
the lower limit (i.e. markers with effect sizes below this
value are considered outliers), the second the upper limit.
The check for low or high outliers is skipped if the
respective value is set to NA. To skip the check
entirely, set this argument to c(NA, NA).
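The behaviour of the two limits can be pictured with a small helper. This is a minimal sketch of the logic described above, not the package's own code; flag_outliers is a hypothetical function name and the effect sizes are invented.

```r
# Illustrative sketch of NA-aware outlier limits; not the QCEWAS source code.
flag_outliers <- function(beta, threshold_outliers = c(NA, NA)) {
  stopifnot(length(threshold_outliers) == 2)
  low  <- threshold_outliers[1]
  high <- threshold_outliers[2]
  # A limit set to NA switches off the corresponding check.
  is_low  <- if (is.na(low))  rep(FALSE, length(beta)) else !is.na(beta) & beta < low
  is_high <- if (is.na(high)) rep(FALSE, length(beta)) else !is.na(beta) & beta > high
  is_low | is_high
}

beta <- c(-25, -3, 0.5, 18, 40, NA)
flag_outliers(beta, c(-20, 20))   # both limits checked
flag_outliers(beta, c(NA, 20))    # only the upper limit checked
flag_outliers(beta, c(NA, NA))    # check skipped entirely
```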
markers_to_exclude: either a vector or data frame containing a list of CpG IDs
that need to be excluded before starting the QC (in case of
a data frame only the first column will be processed), or
the name of a file containing the same. This argument is
optional: if not specified, no exclusions are made. Note
that when a single value (a vector of length 1) is
passed to this argument, EWAS_QC will treat it as a
filename even when no such file can be found. If you want
to remove a single CpG, either pass it to this argument
via a file, or add a dummy value to the vector to give it
length 2 (e.g. c("cg02198983", "dummy") ).
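The length-1 workaround can be wrapped in a small helper so that it is applied consistently. pad_single_marker is a hypothetical convenience function, not part of QCEWAS; the EWAS_QC call is not run and reuses the sample file name from the package example.

```r
# Illustrative sketch: pad a single CpG ID so it is not mistaken for a filename.
pad_single_marker <- function(ids) {
  if (length(ids) == 1) c(ids, "dummy") else ids
}

exclude_ids <- pad_single_marker("cg02198983")
exclude_ids

if (FALSE) {
  # Not run: requires QCEWAS and the sample file in the working directory.
  library(QCEWAS)
  EWAS_QC(data = "sample1.txt.gz", outputname = "sample_output",
          markers_to_exclude = exclude_ids)
}
```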
exclude_outliers: a logical value determining how outliers are treated. If
TRUE, they are excluded from the final dataset. If
FALSE, they are merely counted.
exclude_X, exclude_Y: logical values determining whether markers on the X and Y
chromosomes, respectively, are excluded from the final dataset.
This requires providing a map to EWAS_QC via the
map argument.
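The reason a map is required is that the results file itself carries no chromosome information. The sketch below shows one way such a lookup could work, using toy tables; it assumes the sex chromosomes are coded as "X" and "Y" in the CHR column, which may not match every map file, and it is not the package's own filtering code.

```r
# Illustrative sketch: look up each probe's chromosome in the map and filter.
ewas <- data.frame(PROBEID = c("cg01", "cg02", "cg03", "cg04"),   # toy IDs
                   BETA    = c(0.01, -0.02, 0.03, 0.00),
                   stringsAsFactors = FALSE)
map  <- data.frame(TARGETID = c("cg01", "cg02", "cg03", "cg04"),
                   CHR      = c("1", "X", "Y", "7"),              # assumed coding
                   MAPINFO  = c(15865L, 2590001L, 6778695L, 1234567L),
                   stringsAsFactors = FALSE)

# Match probes to map rows; probes missing from the map get NA.
chr  <- map$CHR[match(ewas$PROBEID, map$TARGETID)]
on_X <- !is.na(chr) & chr == "X"
on_Y <- !is.na(chr) & chr == "Y"

exclude_X <- TRUE
exclude_Y <- FALSE
keep    <- !(exclude_X & on_X) & !(exclude_Y & on_Y)
cleaned <- ewas[keep, ]
```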
save_final_dataset: logical determining whether the cleaned dataset will be saved.
gzip_final_dataset: logical determining whether the saved dataset will be compressed in the .gz format.
header_final_dataset: either a character vector or a table determining the header
names used in the final dataset, or the name of a file
containing the same. If "original", the
final dataset will use the same column names as the original
input file. If "standard", it will use the default
EWAS_QC column names. If a table, it will be passed
to translate_header to convert the column
names; the default column names (PROBEID,
BETA, SE, and P_VAL) must be in the
second column, and the desired column names in the first.
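The column layout of such a table can be shown with a short renaming example. This is a sketch of the mapping only, written in base R rather than with translate_header; the desired column names are invented for the example.

```r
# Illustrative sketch: desired names in column 1, default names in column 2.
header_table <- data.frame(
  desired = c("CpG", "Effect", "StdErr", "Pvalue"),   # made-up target names
  default = c("PROBEID", "BETA", "SE", "P_VAL"),
  stringsAsFactors = FALSE
)

cleaned <- data.frame(PROBEID = "cg01", BETA = 0.01, SE = 0.004, P_VAL = 0.012,
                      stringsAsFactors = FALSE)

# Rename every column that appears among the default names.
idx <- match(names(cleaned), header_table$default)
names(cleaned)[!is.na(idx)] <- header_table$desired[idx[!is.na(idx)]]
names(cleaned)
```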
high_quality_plots: logical. Setting this to TRUE will save the graphs as high-resolution tiff images.
return_beta, N_return_beta: arguments used by EWAS_series. These are not
important for users and can be ignored. For the sake of
completeness: return_beta is a logical value; if
TRUE, the function return value includes a vector of
effect sizes. N_return_beta defines the length of the
vector.
...: arguments passed to read.table for importing
the EWAS results file.
QCEWAS includes a Quick-Start guide in the doc
folder of the library. This guide will explain how to
run a QC and how to interpret the results.
The start-up message when loading
QCEWAS will indicate where it can be found on your
computer. In brief, the QC consists of the following 5 stages:
1. Checking data integrity
   The values in the EWAS results are tested for validity.
   If impossible p-values, effect sizes, etc. are encountered,
   EWAS_QC generates a warning in the R console and sets
   them to NA.
2. Filtering for outliers and sex chromosomes (optional)
   Counts the number of outlying markers, as well as chromosome
   X and Y markers, and deletes them if specified. The markers
   named in markers_to_exclude are removed here as well.
3. Generating QC plots
   - Histograms of the effect size (beta) and standard error distributions are plotted.
   - The p-values are checked by correlating and plotting them against p-values calculated from the effect size and standard error.
   - A QQ plot is generated to test for over- or undersignificance.
   - A Manhattan plot is generated to see where the signals (if any) are located.
   - A Volcano plot is generated to check the distribution of effect sizes vs. p-values.
4. Creating a QC log
   The log contains notes about any problems encountered during the QC, as well as several tables describing the data.
5. Saving the cleaned dataset (optional)
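The first stage can be pictured as follows. The exact validity rules applied by EWAS_QC are not listed here, so the conditions in this sketch (non-finite effect sizes, non-positive standard errors, p-values outside 0 to 1) are assumptions, and the toy data are invented. The sketch prints a message where EWAS_QC is described as raising a warning.

```r
# Illustrative sketch of a validity check; the rules below are assumptions.
ewas <- data.frame(
  PROBEID = c("cg01", "cg02", "cg03", "cg04"),   # toy IDs
  BETA    = c(0.01, Inf, -0.02, 0.03),
  SE      = c(0.004, 0.005, -0.001, 0.006),
  P_VAL   = c(0.012, 0.5, 0.3, 1.7),
  stringsAsFactors = FALSE
)

# Flag values that cannot be valid, leaving existing NAs alone.
bad_beta <- !is.na(ewas$BETA)  & !is.finite(ewas$BETA)
bad_se   <- !is.na(ewas$SE)    & (!is.finite(ewas$SE) | ewas$SE <= 0)
bad_p    <- !is.na(ewas$P_VAL) & (ewas$P_VAL < 0 | ewas$P_VAL > 1)

if (any(bad_beta | bad_se | bad_p)) {
  message("Invalid values found; setting them to NA")
}

ewas$BETA[bad_beta] <- NA
ewas$SE[bad_se]     <- NA
ewas$P_VAL[bad_p]   <- NA
```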
See EWAS_series for running a QC over multiple
files.
See EWAS_plots and P_correlation
for carrying out specific steps of the QC.
# For use in this example, the 2 sample files in the
# extdata folder of the QCEWAS library will be copied
# to your current R working directory. Running the QC
# generates 7 new files in your working directory:
# a cleaned, post-QC dataset, a log file, and 5 graphs.
# Consult the Quick-Start guide for more information on
# how to interpret these.
if (FALSE) {
  file.copy(from = file.path(system.file("extdata", package = "QCEWAS"),
                             "sample_map.txt.gz"),
            to = getwd(), overwrite = FALSE, recursive = FALSE)
  file.copy(from = file.path(system.file("extdata", package = "QCEWAS"),
                             "sample1.txt.gz"),
            to = getwd(), overwrite = FALSE, recursive = FALSE)

  QC_results <- EWAS_QC(data = "sample1.txt.gz",
                        map = "sample_map.txt.gz",
                        outputname = "sample_output",
                        threshold_outliers = c(-20, 20),
                        exclude_outliers = FALSE,
                        exclude_X = TRUE, exclude_Y = FALSE,
                        save_final_dataset = TRUE, gzip_final_dataset = FALSE)
}