match_alleles: Check and correct alleles in GWAS result files

Description

This function checks the reported alleles and allele frequencies in GWAS results data by comparing them to a reference table. It will also uniformize the dataset by switching all SNPs to the positive strand and flipping the alleles so that the effect allele matches the reference minor allele.

Usage

match_alleles(dataset, ref_set, HQ_subset,
              dataname = "dataset", ref_name = "reference",
       unmatched_data = !all(dataset$MARKER %in% ref_set$SNP),
              check_strand = FALSE,
              save_mismatches = TRUE, delete_mismatches = FALSE,
              delete_diffEAF = FALSE, threshold_diffEAF = 0.15,
              check_FRQ = TRUE, check_ambiguous = FALSE,
              plot_FRQ = FALSE, plot_intensity = FALSE, 
              plot_if_threshold = FALSE,
              threshold_r = 0.95,
              return_SNPs = FALSE, return_ref_values = FALSE,
              header_translations, header_reference,
              save_name = dataname, save_dir = getwd(),
              use_log = FALSE, log_SNPall = nrow(dataset))

Arguments

dataset

table containing the allele data. dataset should always contain columns for the SNPID, the effect allele and the other allele. Strand and allele frequency may be required, depending upon the settings, while effect size is optional. match_alleles accepts non-standard column names, provided a translation table is specified in header_translations. The order of columns does not matter; nor does the presence of other columns.

ref_set

table containing the reference data. ref_set should always contain columns for the SNPID, the minor allele and the major allele. Minor allele frequency is only required if check_FRQ or check_ambiguous are TRUE. The standard column-names are "SNP", "MINOR", "MAJOR" and "MAF". Non-standard column names are accepted if a translation table is specified in header_reference. All SNPs must be aligned to the positive strand.

HQ_subset

an optional logical or numeric vector indicating the rows in dataset that contain high quality SNPs.

dataname, ref_name

character strings; the names of the dataset and reference, respectively. Used as identifiers in the output files.

unmatched_data

logical; are there SNPs in the dataset that do not appear in the reference? This argument is currently redundant: the function will determine it automatically.

check_strand

logical; should the function check for negative-strand SNPs? If FALSE, all SNPs are assumed to be on the positive strand.

save_mismatches

logical; should mismatching entries be exported to a .txt file before they are corrected?

delete_mismatches

logical; should mismatching SNPs (that could not be corrected by strand-switching) have their effect allele set to missing?

delete_diffEAF

logical; should SNPs that exceed the threshold_diffEAF have their effect allele set to missing?

threshold_diffEAF

numeric; the max. allowed difference between reported and reference allele frequency.

check_FRQ

logical; should the function correlate the reported allele-frequency with that of the reference?

check_ambiguous

logical; should the function do separate frequency correlations and create separate plots for SNPs with a strand-independent allele-pair (i.e. an A/T or C/G configuration)?

plot_FRQ

logical; should a scatterplot of reported vs. reference allele-frequency be made?

plot_intensity

logical; if TRUE, instead of a scatterplot an intensity plot is generated. This option is currently only partially implemented. Leave to FALSE for now.

plot_if_threshold

logical; if TRUE, the scatterplot is only made when frequency correlation is below the threshold specified by threshold_r.

threshold_r

numeric; the correlation threshold value.

return_SNPs

logical; should the return value include the relevant columns of dataset?

return_ref_values

logical; should the return-value include the matching entries in ref_set?

header_translations, header_reference

translation tables for converting the column names of dataset and ref_set to standard names, respectively. See translate_header for more information.

save_name

character string; the filename, without extension, for the various output files.

save_dir

character string; the directory where the output files are saved. Note that R uses forward slash (/) where Windows uses backslash (\).

use_log, log_SNPall

arguments used by QC_GWAS; redundant when match_alleles is used separately.

Value

An object of class 'list' with the following components:

FRQ_cor, FRQ_cor_ambiguous, FRQ_cor_nonambi

Allele-frequency correlations for all, ambiguous, and non-ambiguous SNPs respectively.

n_SNPs

Total number of SNPs in dataset

n_missing, n_missing_data, n_missing_ref

Number of SNPs with missing allele-data in either dataset or reference, dataset only, and reference only, respectively.

n_negative_strand, n_negative_switch, n_negative_mismatch

Number of negative-strand SNPs, the subset of negative-strand SNPs that were strand-switched twice because they did not match the reference, and the subset of double-switched SNPs that were still mismatching after the second strand-switch.

n_strandswitch, n_mismatch

Number of SNPs that was strand-switched because they did not match the reference, and the subset of those that still did not match after the strand-switch.

n_flipped

Number of SNPs whose alleles were flipped to align them with the reference.

n_ambiguous, n_suspect

Number of ambiguous SNPs, and the subset of those that had a large allele-frequency aberration.

n_diffEAF

Number of SNPs whose allele-frequency differs from the reference by more than threshold_diffEAF.

MARKER

When return_SNPs and/or return_ref_values is TRUE, this returns the column of dataset containing the SNP IDs. If not, this returns NULL.

EFFECT_ALL, OTHER_ALL, STRAND, EFFECT, EFF_ALL_FREQ

If return_SNPs is TRUE, these elements return the corrected columns of data-set. If FALSE, these return NULL. Note: match_alleles only returns those columns that were checked; if check_FRQ is FALSE, EFF_ALL_FREQ return NULL. The same goes for check_strand and STAND. EFFECT is only returned if present in dataset.

ref_MINOR, ref_MAJOR, ref_MAF

If return_ref_values is TRUE, these elements return the reference minor and major alleles and allele frequency column for the SNPs in MARKER.If FALSE, these return NULL. ref_MAF is only returned when check_FRQ is TRUE.

Interpreting the output

The output of match_alleles may seem a bit overwhelming at first, so here is a short explanation of what it means and what you should pay attention to.

The columns included in the return value when return_SNPs is TRUE are the post-matching dataset. This is only relevant if you want to continue working with the corrected dataset. Similarly, the output of return_ref_values is only used for comparing the post-matching dataset to the reference.

n_missing, n_missing_data and n_missing_ref report the prevalence of missing allele data, but are otherwise irrelevant.

The majority of return values serve to check whether strand-switching was performed correctly. n_strandswitch indicates how many SNPs were converted to the other strand because of a mismatch with the reference. In our experience, many cohorts do not include stand data, or simply set all SNPs to "+", so the presence of strand-switched SNPs isn't an indicator of problems by itself. However, if the strand-switching did not fix the mismatch, there may a problem. The subset of strand-switched SNPs that could not be fixed is reported as n_mismatch, and indicates incorrect allele data or, possibly, triallelic SNPs.

Depending on the argument save_mismatches, mismatching entries are exported as a .txt file, together with the reference data. This allows the user to see which SNPs are affected.

Another sign of trouble is when negative-strand SNPs (n_negative_strand) are present (i.e. the cohort included real strand data, rather just setting it to "+"), but strand-switching still occurred. Negative-strand SNPs are converted to the positive strand before their alleles are compared to the reference, so they should not appear here. If they do, it means that either the strand-column data is incorrect, or it is an ordinary mismatch (see above).

Negative-strand SNPs that are "strand-switched" will revert to their original allele configuration (but the strand column now reports them as being on positive strand). The output of QC_GWAS calls them double strand-switches, but here they are reported as n_negative_switch. The subset of those that could not be fixed is n_negative_mismatch.

Just to be clear: the relevant output is still n_strandswitch and n_negative_strand, not n_negative_switch and n_negative_mismatch. Whether it was the negative-strand SNPs that were switched or not is not important: the important thing is that there were negative-strand SNPs (i.e. the cohort included real strand data rather setting everything to "+"); yet strand-switches were still necessary and cannot be attributed to mismatch (i.e. the strand data is incorrect).

n_flipped counts how many SNPs had their alleles reversed to match the effect allele with the reference minor-allele. This is merely recorded for "administrative" purposes, and shouldn't concern the user.

n_ambiguous and n_suspect are another test of the strand information. Ambiguous SNPs are SNPs that have the same allele pair on the positive and negative strands (i.e. A/T or C/G). Matching them with the allele-reference therefor won't detected incorrect strand-information. In a normal-sized GWAS results file, about 15% of SNPs will be ambiguous.

Suspect SNPs are the subset of ambiguous SNPs whose allele frequency is significantly different from that in the reference ( < 0.35. vs > 0.65 or visa versa). In our experience, a GWAS results file with 2.5M SNPs will have only a few dozen suspect SNPs. However, if it's a sizable proportion of all ambiguous SNPs, it indicates that the ambiguous SNPs are listed for the wrong strand. This will also have resulted in the wrong SNPs being flipped in the previous step, so it should be visible in the allele-frequency correlation test as well.

n_diffEAF counts SNPs with significantly different allele-frequencies. A large number here indicates either that the allele-frequencies are incorrect or listed for the wrong allele (see below), or that the population used in the dataset does not match that of the reference.

The FRQ_cor value is the correlation between the reported and reference allele-frequencies. If allele frequency is correct, the correlation should be near 1. If it's close to -1, the listed frequency is that of the other (i.e. non-effect) allele.

the FRQ_cor_ambiguous and FRQ_cor_nonambi values are the same test for the subsets of ambiguous and non-ambiguous SNPs. If ambiguous SNPs are listed on the wrong strand, then they will have been flipped incorrectly, so allele-frequency correlation should also move towards -1.

Details

match_alleles is one of the more complicated functions of QCGWAS. However, what it does is quite simple:

Check for incorrect allele pairs
Uniformize the output so that all SNPs are on the positive strand, and identical SNPs will have the same effect allele and other allele in all datasets
Check the allele frequency

The complexity stems from the fact that these three tasks have to be carried out together and often overlap. So the actual function schematic looks like this:

Switch negative-strand SNPs (i.e. SNPs with "-" in the strand-column) to the positive strand. This step can be disabled by setting check-strand to FALSE.
Correct (if possible) mismatching alleles. A mismatch is when the reported allele-pair does not match that in the reference. match_alleles will attempt to fix the mismatch by "strand-switching" the alleles. The assumption is that dataset merely reported the wrong strand, so converting them to the opposing strand should solve the mismatch. If it does not, the SNPs are truly mismatches. If delete_mismatches is TRUE, the effect alleles are set to NA; if FALSE, they are restored to their original configuration and excluded from the allele-frequency test. If save_mismatches is TRUE, the entries are exported in a .txt before being changed. Note that save_mismatches only exports true mismatches (i.e. not those that were fixed after strand-switching).
Align the allele-pairs with the reference. In order to have the same effect allele with the same SNP in every dataset, SNPs are "flipped" so that the effect allele matches the reference minor allele. Flipped alleles will also have their allele frequency and effect size inverted.
Check for undetected strand-mismatch by counting the number of ambiguous and (if check_FRQ is TRUE) suspect SNPs. Ambiguous SNPs are SNPs with an allele-pair that is identical on both strands (i.e. A/T or C/G). Suspect SNPs are ambiguous SNPs whose allele-frequency differs strongly from that of the reference.
Check allele-frequencies by correlating and/or plotting them against the reference. If check_ambiguous is TRUE, additional scatterplots will be made for the subsets of ambiguous and non-ambiguous SNPs. If delete_diffEAF is TRUE, SNPs whose allele-frequency differs from the reference by more than threshold_diffEAF have their effect alleles set to NA as well. This entire step can be disabled by setting check_FRQ to FALSE.

Examples

Run this code

# NOT RUN {
# In order to keep the QCGWAS package small, no allele reference
# is included. Use the create_hapmap_reference function (see
# above) to create one.

# }
# NOT RUN {
  data("gwa_sample")
  hapmap_ref <- read.table("C:/new_hapmap/new_hapmap.txt",
                  header = TRUE, stringsAsFactors = FALSE)

  match_alleles(gwa_sample, hapmap_ref,
    dataname = "sample data", ref_name = "HapMap",
    save_name = "test_allele1", save_dir = "C:/new_hapmap",
    check_strand = TRUE, plot_FRQ = TRUE)

 HQ_SNPs <- HQ_filter(data = gwa_sample, filter_NA = TRUE,
                          filter_FRQ = 0.01, filter_cal = 0.95)
  match_alleles(gwa_sample, hapmap_ref,
    HQ_subset = HQ_SNPs,
    dataname = "sample data", ref_name = "HapMap",
    save_name = "test_allele2", save_dir = "C:/new_hapmap",
    check_strand = TRUE, plot_FRQ = TRUE)

  match_output <-
    match_alleles(gwa_sample, hapmap_ref,
      HQ_subset = HQ_SNPs,
      delete_mismatches = TRUE, return_SNPs = TRUE,
      delete_diffEAF = TRUE, threshold_diffEAF = 0.15,
      dataname = "sample data", ref_name = "HapMap",
      save_name = "test_allele3", save_dir = "C:/new_hapmap",
      check_strand = TRUE, plot_FRQ = TRUE)
  
  if(sum(match_output$n_negative_strand,
         match_output$n_strandswitch, match_output$n_mismatch,
         match_output$n_flipped, match_output$n_diffEAF) > 0){
    gwa_sample$EFFECT_ALL   <- match_output$EFFECT_ALL
    gwa_sample$OTHER_ALL    <- match_output$OTHER_ALL
    gwa_sample$STRAND       <- match_output$STRAND
    gwa_sample$EFFECT       <- match_output$EFFECT
    gwa_sample$EFF_ALL_FREQ <- match_output$EFF_ALL_FREQ
  }
  
# }

Run the code above in your browser using DataLab