Convert to tidy data frames: Convert vcfR objects to tidy data frames

Description

Convert the information in a vcfR object to a long-format data frame suitable for analysis or use with Hadley Wickham's packages, dplyr, tidyr, and ggplot2. These packages have been optimized for operation on large data frames, and, though they can bog down with very large data sets, they provide a good framework for handling and filtering large variant data sets. For some background on the benefits of such "tidy" data frames, see this article.

For some filtering operations, such as those where one wants to filter genotypes upon GT fields in combination with INFO fields, or more complex operations in which one wants to filter loci based upon the number of individuals having greater than a certain quality score, it will be advantageous to put all the information into a long format data frame and use dplyr to perform the operations. Additionally, a long data format is required for using ggplot2. These functions convert vcfR objects to long format data frames.

Usage

vcfR2tidy(
  x,
  info_only = FALSE,
  single_frame = FALSE,
  toss_INFO_column = TRUE,
  ...
)
extract_info_tidy(x, info_fields = NULL, info_types = TRUE, info_sep = ";")
extract_gt_tidy(
  x,
  format_fields = NULL,
  format_types = TRUE,
  dot_is_NA = TRUE,
  alleles = TRUE,
  allele.sep = "/",
  gt_column_prepend = "gt_",
  verbose = TRUE
)
vcf_field_names(x, tag = "INFO")

Arguments

an object of class vcfR

info_only

if TRUE return a list with only a fix component (a single data frame that has the parsed INFO information) and a meta component. Don't extract any of the FORMAT fields.

single_frame

return a single tidy data frame in list component dat rather returning it in components fix and/or gt.

toss_INFO_column

if TRUE (the default) the INFO column will be removed from output as its consituent parts will have been parsed into separate columns.

...

more options to pass to extract_info_tidy and extract_gt_tidy. See parameters listed below.

info_fields

names of the fields to be extracted from the INFO column into a long format data frame. If this is left as NULL (the default) then the function returns a column for every INFO field listed in the metadata.

info_types

named vector of "i" or "n" if you want the fields extracted from the INFO column to be converted to integer or numeric types, respectively. When set to NULL they will be characters. The names have to be the exact names of the fields. For example info_types = c(AF = "n", DP = "i") will convert column AF to numeric and DP to integer. If you would like the function to try to figure out the conversion from the metadata information, then set info_types = TRUE. Anything with Number == 1 and (Type == Integer or Type == Numeric) will then be converted accordingly.

info_sep

the delimiter used in the data portion of the INFO fields to separate different entries. By default it is ";", but earlier versions of the VCF standard apparently used ":" as a delimiter.

format_fields

names of the fields in the FORMAT column to be extracted from each individual in the vcfR object into a long format data frame. If left as NULL, the function will extract all the FORMAT columns that were documented in the meta section of the VCF file.

format_types

named vector of "i" or "n" if you want the fields extracted according to the FORMAT column to be converted to integer or numeric types, respectively. When set to TRUE an attempt to determine their type will be made from the meta information. When set to NULL they will be characters. The names have to be the exact names of the format_fields. Works equivalently to the info_types argument in extract_info_tidy, i.e., if you set it to TRUE then it uses the information in the meta section of the VCF to coerce to types as indicated.

dot_is_NA

if TRUE then a single "." in a character field will be set to NA. If FALSE no conversion is done. Note that "." in a numeric or integer field (according to format_types) with Number == 1 is always going to be set to NA.

alleles

if TRUE (the default) then this will return a column, gt_GT_alleles that has the genotype of the individual expressed as the alleles rather than as 0/1.

allele.sep

character which delimits the alleles in a genotype (/ or |) to be passed to extract.gt. Here this is not used for a regex (as it is in other functions), but merely for output formatting.

gt_column_prepend

string to prepend to the names of the FORMAT columns

verbose

logical to specify if verbose output should be produced in the output so that they do not conflict with any INFO columns in the output. Default is "gt_". Should be a valid R name. (i.e. don't start with a number, have a space in it, etc.)

tag

name of the lines in the metadata section of the VCF file to parse out. Default is "INFO". The only other one tested and supported, currently is, "FORMAT".

Value

An object of class tidy::data_frame or a list where every element is of class tidy::data_frame.

Details

The function vcfR2tidy is the main function in this series. It takes a vcfR object and converts the information to a list of long-format data frames. The user can specify whether only the INFO or both the INFO and the FORMAT columns should be extracted, and also which INFO and FORMAT fields to extract. If no specific INFO or FORMAT fields are asked for, then they will all be returned. If single_frame == FALSE and info_only == FALSE (the default), the function returns a list with three components: fix, gt, and meta as follows:

fix A data frame of the fixed information columns and the parsed INFO columns, and an additional column, ChromKey---an integer identifier for each locus, ordered by their appearance in the original data frame---that serves together with POS as a key back to rows in gt.
gt A data frame of the genotype-related fields. Column names are the names of the FORMAT fields with gt_column_prepend (by default, "gt_") prepended to them. Additionally there are columns ChromKey, and POS that can be used to associate each row in gt with a row in fix.
meta The meta-data associated with the columns that were extracted from the INFO and FORMAT columns in a tbl_df-ed data frame.

This is the default return object because it might be space-inefficient to return a single tidy data frame if there are many individuals and the CHROM names are long and/or there are many INFO fields. However, if single_frame = TRUE, then the results are returned as a list with component meta as before, but rather than having fix and gt as before, both those data frames have been joined into component dat and a ChromKey column is not returned, because the CHROM column is available.

If info_only == FALSE, then just the fixed columns and the parsed INFO columns are returned, and the FORMAT fields are not parsed at all. The return value is a list with components fix and meta. No column ChromKey appears.

The following functions are called by vcfR2tidy but are documented below because they may be useful individually.

The function extract_info_tidy let's you pass in a vector of the INFO fields that you want extracted to a long format data frame. If you don't tell it which fields to extract it will extract all the INFO columns detailed in the VCF meta section. The function returns a tbl_df data frame of the INFO fields along with with an additional integer column Key that associates each row in the output data frame with each row (i.e. each CHROM-POS combination) in the original vcfR object x.

The function extract_gt_tidy let's you pass in a vector of the FORMAT fields that you want extracted to a long format data frame. If you don't tell it which fields to extract it will extract all the FORMAT columns detailed in the VCF meta section. The function returns a tbl_df data frame of the FORMAT fields with an additional integer column Key that associates each row in the output data frame with each row (i.e. each CHROM-POS combination), in the original vcfR object x, and an additional column Indiv that gives the name of the individual.

The function vcf_field_names is a helper function that parses information from the metadata section of the VCF file to return a data frame with the metadata information about either the INFO or FORMAT tags. It returns a tbl_df-ed data frame with column names: "Tag", "ID", "Number","Type", "Description", "Source", and "Version".

Examples

Run this code

# NOT RUN {
# load the data
data("vcfR_test")
vcf <- vcfR_test


# extract all the INFO and FORMAT fields into a list of tidy
# data frames: fix, gt, and meta. Here we don't coerce columns
# to integer or numeric types...
Z <- vcfR2tidy(vcf)
names(Z)


# here is the meta data in a table
Z$meta


# here is the fixed info
Z$fix


# here are the GT fields.  Note that ChromKey and POS are keys
# back to Z$fix
Z$gt


# Note that if you wanted to tidy this data set even further
# you could break up the comma-delimited columns easily
# using tidyr::separate




# here we put the data into a single, joined data frame (list component
# dat in the returned list) and the meta data.  Let's just pick out a 
# few fields:
vcfR2tidy(vcf, 
          single_frame = TRUE, 
          info_fields = c("AC", "AN", "MQ"), 
          format_fields = c("GT", "PL"))


# note that the "gt_GT_alleles" column is always returned when any
# FORMAT fields are extracted.




# Here we extract a single frame with all fields but we automatically change
# types of the columns according to the entries in the metadata.
vcfR2tidy(vcf, single_frame = TRUE, info_types = TRUE, format_types = TRUE)




# for comparison, here note that all the INFO and FORMAT fields that were
# extracted are left as character ("chr" in the dplyr summary)
vcfR2tidy(vcf, single_frame = TRUE)





# Below are some examples with the vcfR2tidy "subfunctions"


# extract the AC, AN, and MQ fields from the INFO column into
# a data frame and convert the AN values integers and the MQ
# values into numerics.
extract_info_tidy(vcf, info_fields = c("AC", "AN", "MQ"), info_types = c(AN = "i", MQ = "n"))

# extract all fields from the INFO column but leave 
# them as character vectors
extract_info_tidy(vcf)

# extract all fields from the INFO column and coerce 
# types according to metadata info
extract_info_tidy(vcf, info_types = TRUE)

# get the INFO field metadata in a data frame
vcf_field_names(vcf, tag = "INFO")

# get the FORMAT field metadata in a data frame
vcf_field_names(vcf, tag = "FORMAT")



# }