This function was created as a convenient way to automate the
removal of low-quality and non-autosomal SNPs. It
includes the same formatting options as QC_GWAS.
filter_GWAS(ini_file,
GWAS_files, output_names,
gzip_output = TRUE,
dir_GWAS = getwd(), dir_output = dir_GWAS,
FRQ_HQ = NULL, HWE_HQ = NULL,
cal_HQ = NULL, imp_HQ = NULL,
FRQ_NA = TRUE, HWE_NA = TRUE,
cal_NA = TRUE, imp_NA = TRUE,
ignore_impstatus = FALSE,
remove_X = FALSE, remove_Y = FALSE,
remove_XY = FALSE, remove_M = FALSE,
header_translations,
check_impstatus = FALSE,
imputed_T = c("1", "TRUE", "yes", "YES", "y", "Y"),
imputed_F = c("0", "FALSE", "no", "NO", "n", "N"),
imputed_NA = NULL,
column_separators = c("\t", " ", "", ",", ";"),
header = TRUE, nrows = -1, nrows_test = 1000,
comment.char = "", na.strings = c("NA", "."),
out_header = "original", out_quote = FALSE,
out_sep = "\t", out_eol = "\n", out_na = "NA",
out_dec = ".", out_qmethod = "escape",
out_rownames = FALSE, out_colnames = TRUE, ...)
(the filename of) a table listing the files to be processed and the filters to be applied. See 'Details'.
character vector: when no ini_file is
provided, this identifies the files to be processed. See
'Details'.
character vector: the filenames for the
output files. The default option is to use the input
filenames. Note that, unlike with other QCGWAS
functions, the file extensions should be included. (However,
the function will automatically add ".gz" when the
files are compressed.)
logical; should the output files be compressed?
character-strings specifying the directory address of the folders for the input files and the output, respectively. Note that R uses forward slash (/) where Windows uses backslash (\).
Numeric vectors. When no ini_file is provided, these
arguments specify the filter threshold-values for allele
frequency, HWE p-value, callrate and imputation quality,
respectively. Passed to HQ_filter.
Logical vectors. When no ini_file is provided, these
arguments specify whether missing values (of allele frequency,
HWE p-value, callrates and imputation quality, respectively)
are excluded (TRUE) or ignored (FALSE).
Passed to HQ_filter.
Logical vector. When no ini_file is provided, this
argument specifies whether imputation status is taken into
account when applying the filters. If FALSE, HWE p-value
and callrate filters are applied only to genotyped SNPs, and
imputation quality filters only to imputed SNPs. If
TRUE, the filters are applied to all SNPs regardless
of the imputation status.
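The interplay between the thresholds, the x_NA arguments and
ignore_impstatus can be sketched in base R. This is not the
HQ_filter implementation: passes_filters is a hypothetical helper,
the column names HWE_PVAL, IMP_QUALITY and IMPUTED (coded 0 =
genotyped, 1 = imputed) are assumed, and so are the conventions that
values below a threshold fail and that missing values are only
checked when a threshold is set.

```r
# Hypothetical helper, not part of QCGWAS: TRUE for SNPs that pass.
passes_filters <- function(dat, HWE_HQ = NULL, imp_HQ = NULL,
                           HWE_NA = TRUE, imp_NA = TRUE,
                           ignore_impstatus = FALSE) {
  n <- nrow(dat)
  keep <- rep(TRUE, n)
  # Which SNPs each filter applies to (assumed coding: 0/1).
  genotyped <- if (ignore_impstatus) rep(TRUE, n) else dat$IMPUTED %in% 0
  imputed   <- if (ignore_impstatus) rep(TRUE, n) else dat$IMPUTED %in% 1
  if (!is.null(HWE_HQ)) {
    # Missing values fail only when HWE_NA is TRUE.
    fail <- ifelse(is.na(dat$HWE_PVAL), HWE_NA, dat$HWE_PVAL < HWE_HQ)
    keep[genotyped & fail] <- FALSE
  }
  if (!is.null(imp_HQ)) {
    fail <- ifelse(is.na(dat$IMP_QUALITY), imp_NA, dat$IMP_QUALITY < imp_HQ)
    keep[imputed & fail] <- FALSE
  }
  keep
}

# Toy data: three genotyped and two imputed SNPs.
dat <- data.frame(
  IMPUTED     = c(0, 0, 1, 1, 0),
  HWE_PVAL    = c(0.5, 1e-8, 1e-8, NA, NA),
  IMP_QUALITY = c(NA, NA, 0.9, 0.2, 0.1)
)
by_status  <- passes_filters(dat, HWE_HQ = 1e-6, imp_HQ = 0.3)
all_markers <- passes_filters(dat, HWE_HQ = 1e-6, imp_HQ = 0.3,
                              ignore_impstatus = TRUE)
```

In this toy data the third SNP (imputed, low HWE p-value, good
imputation quality) passes when imputation status is respected but
fails once ignore_impstatus is TRUE, because the HWE filter then
applies to it as well.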
logical; whether X-chromosome, Y-chromosome,
pseudo-autosomal and mitochondrial SNPs, respectively, are
removed. Note: these arguments accept only a single TRUE or
FALSE value.
Unlike the above settings, it's not possible to specify
them independently for every dataset.
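As an illustration of what these four switches amount to, the
following base R sketch drops non-autosomal SNPs from a data frame.
drop_chromosomes is a hypothetical helper, and the chromosome column
name CHR and the numeric coding (23 = X, 24 = Y, 25 = XY, 26 = M)
are assumptions rather than something stated on this page.

```r
# Hypothetical helper, not part of QCGWAS.
# Assumed coding: 23 = X, 24 = Y, 25 = XY (pseudo-autosomal), 26 = M.
drop_chromosomes <- function(dat, remove_X = FALSE, remove_Y = FALSE,
                             remove_XY = FALSE, remove_M = FALSE) {
  # Select the codes whose switch is TRUE.
  unwanted <- c(23, 24, 25, 26)[c(remove_X, remove_Y, remove_XY, remove_M)]
  dat[!(dat$CHR %in% unwanted), , drop = FALSE]
}

snps <- data.frame(MARKER = paste0("rs", 1:5),
                   CHR = c(1, 23, 24, 25, 26))
autosomal_only <- drop_chromosomes(snps, remove_X = TRUE, remove_Y = TRUE,
                                   remove_XY = TRUE, remove_M = TRUE)
```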
translation table for column names.
See translate_header for more information. If
the argument is left empty, the dataset is assumed to use
the standard column-names used by QC_GWAS.
logical; should convert_impstatus be called to convert the
imputation-status column into standard values?
arguments passed to convert_impstatus.
character string or vector; specifies
the values used as column delimiter in the GWAS file(s). The
argument is passed to load_test; see the
description of that function for more information.
integer; the number of rows used for
"trial-loading". Before loading the entire dataset, the
function load_test is called to determine the
dataset's file-format by reading the top x lines, where
x is nrows_test. Setting nrows_test to a low number
(e.g. 150) means quick testing, but runs the risk of
missing problems in lower rows. To test the entire dataset,
set it to -1.
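The idea of trial-loading can be sketched with read.table alone.
This is only an approximation of what load_test is described as
doing, not its implementation: guess_separator is a hypothetical
helper, and accepting the first separator that yields more than one
column is an assumed criterion.

```r
# Hypothetical helper, not part of QCGWAS: try each candidate
# separator on the first nrows_test rows and return the first one
# that splits the file into more than one column.
guess_separator <- function(file,
                            seps = c("\t", " ", "", ",", ";"),
                            nrows_test = 1000) {
  for (sep in seps) {
    trial <- tryCatch(
      read.table(file, sep = sep, header = TRUE, nrows = nrows_test,
                 comment.char = "", na.strings = c("NA", "."),
                 stringsAsFactors = FALSE),
      error = function(e) NULL  # unreadable with this separator
    )
    if (!is.null(trial) && ncol(trial) > 1L) return(sep)
  }
  NA_character_
}

# Small comma-separated example file in a temporary location.
example_file <- tempfile(fileext = ".csv")
writeLines(c("MARKER,CHR,EFFECT", "rs1,1,0.1", "rs2,2,."), example_file)
found_sep <- guess_separator(example_file, nrows_test = 150)
```

A negative nrows_test is handed to read.table unchanged, which then
reads the whole file; this mirrors the -1 setting described above.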
arguments passed to read.table when importing
the dataset.
Translation table for the column names of
the output file. This argument is the opposite of
header_translations: it translates the standard
column-names of QC_GWAS to user-defined ones.
out_header can be one of three things:
A user-specified table similar to the one used by
translate_header. However, as this
translates standard names into non-standard ones, the
standard names should be in the right column, and the
desired ones in the left. There is also no requirement
for the names in the left column to be uppercase.
The name of a file in dir_GWAS containing such a table.
A character string specifying a standard form. See
QC_GWAS, section 'Output header', for the options.
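A user-specified table of the first kind might look like the sketch
below, which also shows the renaming it implies. The column labels
desired and standard, the output names, and the standard names
MARKER, CHR and EFFECT are illustrative assumptions; only the
left/right layout comes from the description above.

```r
# Left column: desired output names; right column: standard names.
out_header <- data.frame(
  desired  = c("snp", "chromosome", "beta"),
  standard = c("MARKER", "CHR", "EFFECT"),
  stringsAsFactors = FALSE
)

dat <- data.frame(MARKER = "rs1", CHR = 1, EFFECT = 0.12, STDERR = 0.03)

# Rename every column that has an entry in the table; columns
# without an entry (here STDERR) keep their current name.
hit <- match(colnames(dat), out_header$standard)
colnames(dat)[!is.na(hit)] <- out_header$desired[hit[!is.na(hit)]]
```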
arguments passed to write.table when saving the final dataset.
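The sketch below shows how the default out_ values from the usage
section would look as a direct write.table call, combined with the
".gz" suffix that is added for compressed output. The mapping of
each out_ argument to the write.table argument of the same name
(out_sep to sep, and so on) is an assumption, and the file name and
data are made up.

```r
result <- data.frame(MARKER = c("rs1", "rs2"), EFFECT = c(0.12, NA),
                     stringsAsFactors = FALSE)

gzip_output <- TRUE
out_name <- "cohortA_filtered.txt"  # hypothetical output name
out_path <- file.path(tempdir(),
                      if (gzip_output) paste0(out_name, ".gz") else out_name)

# Write through a (compressed) connection with the default settings.
con <- if (gzip_output) gzfile(out_path, "w") else file(out_path, "w")
write.table(result, con, quote = FALSE, sep = "\t", eol = "\n",
            na = "NA", dec = ".", qmethod = "escape",
            row.names = FALSE, col.names = TRUE)
close(con)

# read.table opens gzip-compressed files transparently.
check <- read.table(out_path, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
```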
An invisible logical vector, indicating which files were successfully filtered.
The easiest way to use filter_GWAS is by passing an ini
file to the ini_file argument.
The ini file can be generated by running QC_series
with the save_filtersettings argument set to TRUE.
The output will include a file 'Check_filtersettings.txt',
describing the (high-quality) filter settings used for each
file (taking into account whether there was enough data, i.e.
whether the use_threshold was met, to apply the filters).
The ini_file argument accepts either a table
or the name of a file in dir_GWAS or the
current R working directory.
If no ini_file is specified, the function will use the
GWAS_files, x_HQ, x_NA and ignore_impstatus
arguments to construct such a table.
GWAS_files can either be a character vector or a single
value. If a single string, all filenames containing the string
will be processed. The other arguments can also be a vector or
a single value; if the latter, they will be recycled to create
a vector of the correct length.
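The matching and recycling described above can be pictured with a
few lines of base R. The file names, the literal (non-regex)
matching, and the column layout of the resulting table are
assumptions for illustration; the table that filter_GWAS builds
internally may be organised differently.

```r
# Hypothetical directory listing; in practice this could come from
# list.files(dir_GWAS).
available <- c("cohortA_results.txt", "cohortB_results.txt", "notes.txt")

GWAS_files <- "results"  # a single string: select by match
files <- if (length(GWAS_files) == 1L) {
  grep(GWAS_files, available, fixed = TRUE, value = TRUE)
} else {
  GWAS_files
}

# Single values are recycled; vectors supply one value per file.
settings <- data.frame(
  file             = files,
  FRQ_HQ           = rep_len(0.01, length(files)),
  HWE_HQ           = rep_len(c(1e-6, 1e-4), length(files)),
  ignore_impstatus = rep_len(FALSE, length(files)),
  stringsAsFactors = FALSE
)
```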
If neither ini_file nor GWAS_files is specified,
the function will look for a file Check_filtersettings.txt
in dir_GWAS and the current R working directory.
Note that ini_file overrules the other filter settings,
i.e. one cannot adjust ini_file through the other
arguments.
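The two ways of calling the function can be sketched as follows.
Only arguments from the usage section are used; the file names and
threshold values are made up for illustration, run_filter is a
hypothetical wrapper, and the call is guarded so that nothing runs
unless the QCGWAS package and the input file are actually present.

```r
# Driven by an ini file: per-file filter settings come from the
# table, so no x_HQ, x_NA or ignore_impstatus arguments are given.
ini_args <- list(
  ini_file = "Check_filtersettings.txt",
  dir_GWAS = getwd(),
  remove_X = TRUE, remove_Y = TRUE, remove_XY = TRUE, remove_M = TRUE
)

# Without an ini file: files and thresholds are given directly.
manual_args <- list(
  GWAS_files   = "cohortA.txt",           # hypothetical input file
  output_names = "cohortA_filtered.txt",  # extension included
  dir_GWAS     = getwd(),
  FRQ_HQ = 0.01, HWE_HQ = 1e-6, cal_HQ = 0.95, imp_HQ = 0.3,
  gzip_output  = TRUE
)

# Hypothetical wrapper: returns NULL when QCGWAS is not installed.
run_filter <- function(args) {
  if (!requireNamespace("QCGWAS", quietly = TRUE)) return(NULL)
  do.call(QCGWAS::filter_GWAS, args)
}

input_present <- file.exists(file.path(manual_args$dir_GWAS,
                                       manual_args$GWAS_files))
filtered <- if (input_present) run_filter(manual_args) else NULL
```

When the call does run, the returned logical vector indicates which
files were filtered successfully.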