QC_GWAS: Automated Quality Control of GWAS Results files

Description

QC_GWAS runs a full quality control (QC) over a single GWAS results file. It removes missing and invalid data, checks the alleles and allele frequency with a reference, tests the reported p-value against both calculated and expected values, creates QQ and Manhattan plots and reports the distribution of the quality-parameters within the dataset, as well as various QC statistics.

QC_series does the same thing for multiple GWAS results files. It's mainly a wrapper that passes individual files to QC_GWAS, but it has a few extra features, such as making a checklist of important QC stats and creating several graphs to compare the QC'ed files.

Although the number of arguments in QC_GWAS may seem overwhelming, only three of them are required to run a basic QC. The name of the file to be QC'ed should be passed to the filename argument; the directory of said file to the dir_data argument; and a header-translation table to the header_translations argument. For a quick introduction to QCGWAS, read the quick reference guide that can be found in "R\library\QCGWAS\doc".

Usage

QC_GWAS(filename,
        filename_output = paste0("QC_", filename),
        dir_data = getwd(),
        dir_output = paste(dir_data, "QCGWASed", sep = "/"),
        dir_references = dir_data,
        header_translations,
        column_separators = c("\t", " ", "", ",", ";"),
        nrows = -1, nrows_test = 1000,
        header = TRUE, comment.char = "",
        na.strings = c("NA", "nan", "NaN", "."),
        imputed_T = c("1", "TRUE", "T"),
        imputed_F = c("0", "FALSE", "F"),
        imputed_NA = c(NA, "-"),
        save_final_dataset = TRUE,
        gzip_final_dataset = TRUE, order_columns = FALSE,
        spreadsheet_friendly_log = FALSE,
        out_header = "standard",
        out_quote = FALSE, out_sep = "\t", out_eol = "\n",
        out_na = "NA", out_dec = ".", out_qmethod = "escape",
        out_rownames = FALSE, out_colnames = TRUE,
        return_HQ_effectsizes = FALSE,
        remove_X = FALSE, remove_Y = FALSE,
        remove_XY = remove_Y, remove_M = FALSE,
        calculate_missing_p = FALSE,
        make_plots = TRUE, only_plot_if_threshold = TRUE,
        threshold_allele_freq_correlation = 0.95,
        threshold_p_correlation = 0.99,
        plot_intensity = FALSE,
        plot_histograms = make_plots, plot_QQ = make_plots,
        plot_QQ_bands = TRUE, plot_Manhattan = make_plots,
        plot_cutoff_p = 0.05,
        allele_ref_std, allele_name_std,
        allele_ref_alt, allele_name_alt,
        update_alt = FALSE, update_savename,
        update_as_rdata = FALSE, backup_alt = FALSE,
        remove_mismatches = TRUE,
        remove_mismatches_std = remove_mismatches,
        remove_mismatches_alt = remove_mismatches,
        threshold_diffEAF = 0.15, remove_diffEAF = FALSE,
        remove_diffEAF_std = remove_diffEAF,
        remove_diffEAF_alt = remove_diffEAF,
        check_ambiguous_alleles = FALSE,
        use_threshold = 0.1,
        useFRQ_threshold = use_threshold,
        useHWE_threshold = use_threshold,
        useCal_threshold = use_threshold,
        useImp_threshold = use_threshold,
        useMan_threshold = use_threshold,
        HQfilter_FRQ = 0.01, HQfilter_HWE = 10^-6,
        HQfilter_cal = 0.95, HQfilter_imp = 0.3,
        QQfilter_FRQ = c(NA, 0.01, 0.05),
        QQfilter_HWE = c(NA, 10^-6, 10^-4),
        QQfilter_cal = c(NA, 0.95, 0.99),
        QQfilter_imp = c(NA, 0.3, 0.5, 0.8),
        NAfilter = TRUE,
        NAfilter_FRQ = NAfilter, NAfilter_HWE = NAfilter,
        NAfilter_cal = NAfilter, NAfilter_imp = NAfilter,
        ignore_impstatus = FALSE,
        minimal_impQ_value = -0.5, maximal_impQ_value = 1.5,
        logI = 1L, logN = 1L, ...)
QC_series(data_files, datafile_tag, output_filenames,
          dir_data = getwd(),
          dir_output = paste(dir_data, "QCGWASed", sep = "/"),
          dir_references = dir_data,
          header_translations, out_header = "standard",
          allele_ref_std, allele_name_std,
          allele_ref_alt, allele_name_alt,
          update_alt = FALSE, update_savename,
          update_as_rdata = FALSE, backup_alt = FALSE,
          plot_effectsizes = TRUE, lim_effectsizes = NULL,
          plot_SE = TRUE, label_SE = TRUE,
          plot_SK = TRUE, label_SK = "outliers",
          save_filtersettings = FALSE, ...)

Value

The most important output of the QC is the log file. See the section 'QC output files' for more details. This section only describes the function return within R.

QC_series returns a single, invisible, logical value, indicating whether the alternative allele-reference has been updated.

QC_GWAS returns an object of class 'list'. If the QC was not successful, this list contains only five of the following components (QC_successful, filename_input, filename_output, all_ref_changed, effectsize_return). If it was, it will contain all of them:

QC_successful

logical; indicates whether the QC was completed. If FALSE, the function was either unable to load the dataset, encountered an unexpected datatype, or removed all SNPs during the QC. The log file and screen output will indicate what triggered the abort.

filename_input, filename_output

the filenames of the dataset pre- and post-QC respectively.

sample_size, sample_size_HQ

the highest reported sample size for all SNPs and high-quality SNPs only, respectively.

lambda, lambda_geno, lambda_imp

the lambda values of all, genotyped and imputed SNPs, respectively.

SNP_N_input

the number of SNPs in the original dataset.

SNP_N_input_monomorphic

the number of SNPs removed because they are monomorphic.

SNP_N_input_monomorphic_identic_alleles

the subset of above that had identical alleles, but allele-frequencies that were not 0 or 1.

SNP_N_input_chr

the number of SNPs removed because they were X-chromosomal, Y-chromosomal, pseudo-autosomal or mitochondrial (depends on the remove-arguments). If all remove arguments were set to FALSE, this returns NA.

SNP_N_preQC

the number of SNPs that entered phase 2b (i.e. after removal of the monomorphic and excluded-chromosome SNPs).

SNP_N_preQC_unusable

the number of SNPs removed in phase 2d, due to missing or invalid crucial variables.

SNP_N_preQC_invalid

the number of SNPs with invalid, non-crucial values in phase 2d.

SNP_N_preQC_min

the number of negative-strand SNPs in phase 2d.

SNP_N_midQC

the number of SNPs in the dataset during allele-matching (phase 3).

SNP_N_midQC_min

the number of negative strand SNPs in phase 3.

SNP_N_midQC_min_std, SNP_N_midQC_min_alt,SNP_N_midQC_min_new

the number of negative strand SNPs matched against, respectively, the standard allele reference, the alternative allele reference or neither.

SNP_N_midQC_strandswitch_std, SNP_N_midQC_strandswitch_alt

the number of SNPs that were strand-switched because of a mismatch with the standard and alternative allele reference, respectively.

SNP_N_midQC_strandswitch_std_min, SNP_N_midQC_strandswitch_alt_min

the subset of previous that were negative-strand SNPs. NOTE: at this point in the QC, negative-strand SNPs have already been converted to the positive strand, i.e. they should not appear in this category. If they do, there is a problem with the reported strand, or with the reference table.

SNP_N_midQC_mismatch

the number of SNPs that were still mismatching after the strand-switch.

SNP_N_midQC_mismatch_std, SNP_N_midQC_mismatch_alt

subset of previous that were matched with the standard and alternative allele reference, respectively.

SNP_N_midQC_mismatch_std_min, SNP_N_midQC_mismatch_alt_min

subset of previous that are negative-strand SNPs.

SNP_N_midQC_flip_std, SNP_N_midQC_flip_alt, SNP_N_midQC_flip_new

Number of SNPs that were flipped (had their alleles reversed) when matched against, respectively, the standard allele reference, the alternative allele reference or neither.

SNP_N_midQC_ambiguous

the number of ambiguous SNPs

SNP_N_midQC_ambiguous_std, SNP_N_midQC_ambiguous_alt, SNP_N_midQC_ambiguous_new

subset of ambiguous SNPs matched against, respectively, the standard allele reference, the alternative allele reference or neither.

SNP_N_midQC_suspect

the subset of ambiguous SNPs whose allele frequencies differ strongly from those in the reference.

SNP_N_midQC_suspect_std, SNP_N_midQC_suspect_alt

the subsets of previous matched against the standard and alternative allele reference, respectively.

SNP_N_midQC_diffEAF

the number of SNPs whose allele frequency differs strongly from the reference.

SNP_N_midQC_diffEAF_std, SNP_N_midQC_diffEAF_alt

subset of previous that were matched with the standard and alternative allele reference, respectively.

SNP_N_postQC

the number of SNPs in the final dataset.

SNP_N_postQC_geno, SNP_N_postQC_imp

the number of genotyped and imputed SNPs in the final dataset.

SNP_N_postQC_invalid

the number of SNPs with invalid values remaining in the final dataset. Note: any invalid values have already been changed to NA at this point. This merely counts how many of those SNPs are still in the dataset.

SNP_N_postQC_min

the number of negative-strand SNPs in the final dataset. Note: all SNPs have been switched to the positive strand at this point. This merely counts how many of those SNPs are still in the dataset.

SNP_N_postQC_HQ

the number of high-quality SNPs in the final dataset.

fixed_HWE, fixed_callrate, fixed_sampleN, fixed_impQ

logical or character string; are HWE p-values, callrates, sample-size and imputation quality values identical for all relevant SNPs? If TRUE, it indicates that the parameters have not been calculated and are dummy values. If a parameter fails the threshold test, this returns "insuf. data" ("no data" for sample size).

effect_25, effect_median, effect_75

the quartile values of the effect-size distribution.

effect_mean

the mean of the effect-size distribution.

SE_median, SE_median_HQ

the median standard error of all SNPs and high-quality SNPs only, respectively.

skewness, kurtosis

the skewness and kurtosis value of the dataset.

skewness_HQ, kurtosis_HQ

the skewness and kurtosis value for high-quality SNPs only.

all_ref_std_name, all_ref_alt_name

the names used for the standard and alternative allele-references in the output.

all_MAF_std_r, all_MAF_alt_r

allele-frequency correlation with the standard and alternative allele references.

all_ambiguous_MAF_std_r, all_ambiguous_MAF_alt_r

allele-frequency correlations for the subset of ambiguous SNPs in the standard and alternative allele references, respectively.

all_non_ambig_MAF_std_r, all_non_ambig_MAF_alt_r

allele-frequency correlations for the subset of non-ambiguous SNPs in the standard and alternative allele references, respectively.

all_ref_changed

logical; has an updated alternative allele reference been saved?

effectsize_return

logical; is a vector of high-quality effect-sizes returned in effectsizes_HQ?

effectsizes_HQ

if effectsize_return is TRUE, a vector of 1000 high-quality effect-sizes; if not, NULL. If a dataset contains less than 1000 high-quality SNPs, the vector is padded with NA's to bring it to 1000 values.

pvalue_r

the correlation between reported and calculated p-values.

visschers_stat, visschers_stat_HQ

the Visscher's statistic for all SNPs and high-quality SNPs only, respectively.

columns_std_missing

the names of any missing, standard columns: if no columns are missing, this returns 0.

columns_std_empty

the names of any empty, standard columns: if no columns are empty, this returns 0.

columns_unidentified

the names of any unidentified columns in the input dataset. If none, this returns 0.

outcome_useFRQ, outcome_useHWE, outcome_useCal, outcome_useImp, outcome_useMan

logical; indicates whether the variable passed the threshold test.

…

the remaining 'setting' components return the the actual filter settings used in the QC (i.e. taking into account whether the variables passed the threshold test).

QC output files

The results of the QC are reported in a variety of .txt and .png files saved in dir_output. The files use the same output name as the dataset, with an extension to indicate their contents (i.e. '_log.txt', '_graph_QQ.png'). The .txt files are tab-delimited and are best viewed in a spreadsheet program. The most important one is the log file. This file summarizes the results of the QC and the data inside the file.

The log file

The top of the file is table of log entries, reporting any (potential) problems encountered during the QC. Some of these are just routine updates; the removal of SNPs with missing data, for example. However, do check the other entries. These report important but non-fatal problems, relating to crucial missing data or invalid data. In such a case, and provided the QC did not abort, the affected SNPs will have been exported to a .txt file before being excluded, so the user can inspect them without having to reload the entire dataset. The .txt's are:

[filename_output]_SNPs_invalid_allele2.txt

[filename_output]_SNPs_duplicates.txt

[filename_output]_SNPs_removed.txt

[filename_output]_SNPs_improbable_values.txt

(Note: the names of the files are slightly confusing: the "SNPs_removed" file contains all SNPs removed in phase 2d. This does not include monomorphic SNPs, or SNPs from excluded chromosomes, as these are removed in phase 2a. Also, the "SNPs_improbable_values" file does not include SNPs with invalid values for crucial variables, as these are already in the "SNPs_removed" file.)

Another important but non-fatal problem is missing columns. QC_GWAS uses a translation table to determine the contents of a column. If the translation table is incomplete, or contains a typo, the function will be unable to translate (and therefor use) a column. If this involves, say, callrate, it merely means the function cannot apply the callrate filters, but the absence of p-values or imputation status will disable many features of the QC. If you know that a data column is present, yet the log reports it missing, check the translation table. The QC_series checklist output and the columns_unidentified value of the QC_GWAS return report the names of any unidentified columns in the dataset.

If the QC aborts, the log file should give some indication why this occurred. However, if it was successful, there will be several other tables in the log file.

The second table reports the number of SNPs in the dataset at various stages of the QC; as well as how many (and for what reason) SNPs were removed.

The third table gives an overview of the data itself. It reports how many values were missing and invalid per variable in both the pre- and post-QC datasets, as well as the quartile and mean values of the post-QC data. A few notes on the nomenclature: invalid values will have been removed (for crucial variables) or set to missing (for non-crucial variables) in stage 2d. The post-QC 'invalid' column merely records how many of these SNPs remain in the dataset. 'Unusable' values are the missing and invalid values combined (shown as percentage of the total data). Finally, pre-QC refers to the dataset during stage 2b-c, but after monomorphic SNPs and SNPs from excluded chromosomes (stage 2a) have been removed.

The fourth table reports on the allele-matching in phase 3. The concepts are explained in 'details' and the match_alleles function; here we just mention what the user needs to pay attention to. Strand-switching counts SNPs whose strand was switched because it mismatched with the reference. As many cohorts do not add strand data (or set every SNP to "+"), the presence of such SNPs is not a red flag by itself. However, if there are mismatching SNPs (the subset of strand-switched SNPs that could not be fixed), there is a problem with the allele data (or, possibly, a triallelic SNP). Check the [filename_output]_SNPs_mismatches-[ref-name].txt file to see the affected SNPs.

Another red flag is if there are strand-switched SNPs in a dataset that also contains negative-strand SNPs (i.e. the cohort included real strand data, rather just setting it to "+" for all SNPs). Negative-strand SNPs are converted to the positive strand beforehand, so they should not appear in this step (if they do, they are counted in the "double strand-switch" entry, but that is of minor importance). The real problem is that if a cohort includes negative-strand SNPs (i.e. real strand data), and there are still strand-switches, the strand data must be incorrect. Whether the strand-switches and the negative-strand SNPs overlap is unimportant.

The third possible problem is when a large proportion of ambiguous SNPs is suspect: it indicates that they are on the wrong strand.

Finally, a large number of SNPs with a deviating allele frequency indicates either that the frequency is reported for the wrong allele (see below) or that the dataset population does not match that of the reference.

The fifth table reports the QC outcome statistics. The p-value and allele-frequency correlations should be close to 1. An allele-frequency correlation of -1 means that the frequency was reported for the wrong (non-effect) allele. As for the p-value correlaction: in a typical GWAS dataset, the expected and observed p-values should correlate perfectly. If this isn't the case, it means either that a column was misidentified when loading the dataset or that the wrong values were used when generating the data.

The sixth table reports how many SNPs were removed by the various QQ plot filters.

The seventh table reports the chromosomes and alleles present in the final dataset.

The eighth table counts invalid values in the pre- and post-QC files for several variables. 'Extreme p' is a value that is only used when calculate_missing_p is TRUE. Any newly-calculated p-values that are < 1E-300 will be set to 1E-300, in order to exclude any values of 0 (1E-300 is close to the smallest numeric value that R can handle safely).

The final four tables contain the settings of the QC.

The QC_series output

QC_series saves a checklist, showing the most important QC stats of the various files side by side, and (depending on the plot-arguments) several graphs comparing effect-size distribution, precision and skewness vs. kurtosis of the QC'ed files.

Output header

The standard-format values used by out_header are:

"standard" retains the default column names used by QC_GWAS.
"original" restores the column names used in the input file.
"old" uses the default column names of pre-v1.0b versions of QCGWAS.
"GWAMA", "PLINK", "META" and "GenABEL" set the column names to those use by the respective programs. Note that META's alleleB corresponds to EFFECT_ALL in QC_GWAS.

Requirement for the input dataset

QC_GWAS will automatically unpack .gz and .zip files, provided the filename includes the extension of the packed file. For example, if the data file is named "data1.csv", the zip file should be "data1.csv.zip". If it is named "data1.zip", QC_GWAS won't be able to "call" the file inside.

QC_GWAS is flexible when it comes to file-format. By default, it can open datasets with a variety of column separators and NA values (the user can specify these via the column_separators and na.strings arguments). Read the documentation of the auto-loader function load_GWAS for more information.

Chromosome values can be coded as numeric or character: values of "X", "Y", "XY" and "M" will automatically be converted to 23, 24, 25 and 26, respectively.

By default, imputation status is coded as integers 0 (genotyped) and 1 (imputed). As of version 1.0-4, imputation status will always be translated using the imputed_T, imputed_F and imputed_NA arguments. This means that the user must specify values for these arguments, even when the dataset already uses the standard format. Because of the importance of imputation status, if the function is unable to translate character values, the QC will abort.

The minimal and maximal valid imputation quality values are determined by the minimal_impQ_value and maximal_impQ_value arguments.

Standard errors and p-values of 0 are considered invalid and removed in phase 2d, while values of -1 will be set to NA. Effect sizes of -1 are accepted, unless the standard error and/or the p value are also -1, in which case it is also set to NA.

Filter arguments

QC_GWAS has three sets arguments relating to filters: the arguments for the HQ (high-quality), QQ (plot) and NA (missing values) filters. The HQ and QQ arguments work mostly in the same way, but there are a few key differences.

The high-quality arguments accept single, numeric values that determine the minimal values of allele-frequency (FRQ), HWE p-value (HWE), callrate (cal) and imputation quality (imp) for a SNP to be of high-quality. The high-quality filter is used for the effect-size boxplot and the Manhattan plot, as well as several QC stats.

The QQ arguments accept a vector of max. 5 numeric values that are applied sequentially as filters in the QQ plots.

Both of these use the NA filter argument(s) to determine whether to exclude or ignore missing values.

Neither filter is used to remove SNPs; they merely exclude them from several QC tests. Both HQ and QQ filter criteria are only applied if the variable passed the threshold test, i.e. if there are sufficient non-missing, non-invalid values for the filter to be applied (see the use_threshold argument for details). It's pointless to filter an empty column.

If ignore_impstatus is FALSE (default), the imputation-quality criterion is only applied to imputed SNPs, and the HWE p-value and callrate criteria only to genotyped SNPs. If TRUE, the filters are applied to all SNPs, regardless of the imputation status.

The allele-frequency filters are two-sided: when set to value x, SNPs with frequency < x or > 1 - x are excluded.

To filter missing values only, set the filter to NA and the corresponding NA filter argument to TRUE. To disable entirely, set to NULL (this means the NA-filter setting is ignored as well).

The differences between the HQ and QQ filters are:

The HQ filter arguments accept a single value, the QQ filters can accept up to 5.

The HQ filter is a single filter: a SNP needs to meet all relevant criteria to be considered high-quality. The QQ filter values are applied separately.

The QQ filter has an additional feature: if passed a value x > 1, it will calculate a filter value of x / sample size. This is to allow size-based filtering of allele frequencies. Note that this filter uses the sample size listed for that specific SNP. If the sample size is missing, the relevant NA-filter setting is used to determine whether it should be excluded.

Updating the alternative reference

There are two drawbacks to the way the function updates the alternative reference file. One is a technical issue, but the other can affect the QC of subsequent files.

Firstly, the argument update_alt has a slightly misleading name: the alternative-reference file is updated, but the reference inside the R memory is not. If the user wants to do further QCs with the updated reference, (s)he will have to manually reload the updated file into R.

This is caused by the way R handles data-alterations that occur inside a function. Any changes made to data last only for the duration of the function. Once the function terminates, the memory reverts to its original state. In other words: the allele reference is updated inside QC_GWAS, but goes back to the pre-QC state once the QC is finished. QC_series deals with this by automatically reloading the reference file whenever it is updated, but, again, once the function terminates it will revert to its original state.

The second drawback is that the content of the alternative reference is arbitrary, depending on which file an unknown SNP is encountered in first. For example, suppose that SNP rs31 has alleles A and G, an allele frequency that varies around 0.5, and does not appear in the standard reference. When it is added to the alternative reference, the allele listed as the minor one depends entirely on the allele frequency in the first file it is encountered in.

More seriously, if the information in the first file is incorrect, the SNP may be strand-switched or excluded in subsequent files because it does not match the (incorrect) reference. This is another reason why it is important to check the log files: if there is a problem with a datafile's strand, alleles or allele-frequency, and the alternative reference was updated, the incorrect data may have been added to the reference. If so, one should go back to a previous reference file. The argument backup_alt is useful for this, though note that QC_series only does this the first time the reference is updated.

Also, if one wants to QC a large number of files for a meta-GWAS, one should use the same alternative allele reference file (and let QC_GWAS update it) for every file, otherwise it is possible that rs13 may have a different effect alleles in some post-QC files.

Details

The full quality-control carried out by QC_GWAS consists of 5 phases. The function takes a single dataset (or, rather, the location and filename of a single dataset) and runs it through the following phases:

1: Importing the dataset
2: Checking data integrity
3: Checking alleles
4: Generating QC statistics and graphs
5: Saving the output

Phase 1: importing the dataset

GWAS results files come in a variety of formats, so QC_GWAS is flexible about loading data. It uses an autoloader function (load_GWAS) to unpack .zip or .gz files and determine the column-separator used in the file. See the section 'Requirement for the input dataset' for more information.

Next, the function attempts to translate the dataset's column names (the header) to standard names, so that it knows what type of data a column contains. This is done by comparing the column names to a translation table (specified in the header_translations argument). See translate_header for more information.

Note that only the SNP ID, alleles, effect-size and standard error columns are required. The absence of other standard columns (chromosome, position, strand, allele frequency, HWE p-value, callrate, imputation quality, imputation status and used for imputation) will not cause the QC to abort. Instead, a warning is printed on screen and in the log file, and a dummy column filled with NA values is added to the dataset.

It is therefor important to check the log file: if a standard column is present but not identified (because it is missing or misspelled in the translation table) the QC will continue, but is unable to check/use the data inside. The unidentified column will be reported in the columns_unidentified value of the QC_GWAS return or in the "QC_checklist.txt" file generated by QC_series.

Phase 2: checking data integrity

The purpose of phase 2 is to ensure that the dataset can be QC'ed: that that all SNPs have the required data and that all columns contain only valid (or missing) values.

The first step is to remove SNPs that won't be used: monomorphic SNPs and (if specified by the arguments) allosomal, pseudo-autosomal and mitochondrial SNPs. The function considers SNPs monomorphic if they have a missing or invalid other (non-effect) allele, identical alleles or an allele-frequency of 1 or 0.

The second step is to check the imputation status column with the function convert_impstatus. See the section 'Requirement for the input dataset' for more information. Imputation status is one of the most important variables in the dataset: if unknown, the HWE p-value, callrate and imputation quality won't be used (unless ignore_impstatus is TRUE), as the function cannot determine which SNPs are imputed and which are not. For this reason, if convert_impstatus is unable to translate any character string in the column, the QC will abort.

The third step carries out three tests for all other standard variables:

Does the column contain the correct type of data (integer, numeric or character)?
How many values are missing (NA)?
How many values are invalid (= impossible)?

The exact nature of the three tests differs per variable: see the documentation file in "R\library\QCGWAS\doc" for more detail.

The presence of the wrong data-type will cause the QC to abort. Wrong data-type indicates either a problem in the file itself, or with the way it was imported (in which case it is most likely due to a mistranslated header).

The final step is the removal of the invalid values and of unusable SNPs. The variables MARKER, EFFECT_ALLELE, OTHER_ALLELE, EFFECT and STDERR are considered crucial. SNPs with missing or invalid values in any of these variables are removed the dataset. Missing values in the other variables are ignored, while invalid values are set to NA.

Phase 3: checking alleles

This phase has three functions:

To check if the correct alleles are reported for each SNP
To check if the allele-frequency is reported for the correct (effect) allele
To ensure that SNPs are aligned to the positive strand and use the same effect-allele in all post-QC datasets

This is achieved by comparing the data to a reference, using the function match_alleles. First, all SNPs are switched to the positive strand (the alleles are converted to their opposing base and the strand-value is set to "+"). If there are SNPs whose allele pair doesn't match the reference, match_alleles assumes the information in the strand column is absent or incorrect, and will also switch those SNPs to the other (presumably positive) strand. This step is referred to as strand-switching in QC output, and is independent from the negative-strand SNP conversion. It is therefor possible that a SNP is switched twice: once because the strand-column indicates it is on the negative strand, and twice because of a mismatch. This is referred to as double strand-switch in the output, and indicates either the wrong value in the strand column, or a mismatch with the reference. In the latter case, it will most likely be picked up in the next step.

If the strand-switch does not fix the mismatch, the SNPs are counted in the QC output as mismatches. Depending on the remove_mismatches arguments, the SNPs will either be removed from the dataset, or left in but excluded from the further tests of the allele-matching.

Next, match_alleles "flips" SNPs so that their effect allele matches the reference minor allele. This ensures that a SNP will have the same effect allele in all post-QC datasets.

match_alleles also counts the number of SNPs with a strand-independent allele configuration (A/T or C/G; these are designated as "ambiguous SNPs"), and the subset of those with an allele-frequency that is substantially different from the reference ("suspect SNPs"). If a substantial proportion of ambiguous SNPs is suspect, it indicates that the strand information is incorrect. In our experience, a regular, 2.5M SNP dataset usually consists of 15% ambiguous SNPs, of which a few dozen will be suspect.

The function also counts the number of SNPs whose allele frequency differs from the reference by more than a set amount (threshold_difEAF). If the relevant remove_diffEAF argument is TRUE, these SNP will be excluded from the dataset after the allele-matching is finished.

The final step is to correlate the reported allele-frequency against the reference. If allele-frequency is reported for the correct (effect) allele, the correlation should be close to 1. If the outcome is close to -1, the reported frequency is that of the other allele. Depending on the plot settings, a scatter plot of reported vs. reference frequency is made for all SNPs, and for the subsets of ambiguous and non-ambiguous SNPs.

The standard and alternative allele reference

The above steps describe what happens when the dataset is compared to a single reference. However, we found that many GWAS datasets contain SNPs not present in our standard HapMap reference, so we added a second, flexible reference that can be updated with any unknown SNPs the QC encounters.

SNPs that are not found in either reference are converted to the positive strand, and "flipped" if their allele frequency is > 0.50. If update_alt is TRUE, these SNPs are then added to the alternative reference and saved under the name update_savename. There are a few caveats to this system: see the section 'Updating the alternative reference' for details.

Phase 4: generating QC statistics and graphs

At this stage, no further changes will be made to the dataset (except, optionally, to recalculate missing p-values). The function will now start to calculate the QC statistics and generate the important graphs. These are:

Create histograms of variable distribution (optional)
Check p-values by correlating them to a p calculated from the effect-size and standard-error (via the check_P function).
Recalculate missing/invalid p-values (optional)
Calculate QC statistics
Create QQ and Manhattan plots (optional, see QC_plots function for more information).

Phase 5: saving the output

A series of tables is added to the bottom of the log file, reporting the QC statistics and the data distribution. If save_final_dataset is TRUE, the post-QC data will be exported as a .txt file. The column names and format of that file can be specified by the out arguments.

Examples

Run this code

# NOT RUN {
## For instructions on how to run QC_GWAS and QC_series
## check the quick start guide in /R/library/QCGWAS/doc
# }

Run the code above in your browser using DataLab