vcfR (version 1.9.0)

extract.gt: Extract elements from vcfR objects

Description

Extract elements from the 'gt' slot, convert extracted genotypes to their allelic state, extract indels from the data structure or extract elements from the INFO column of the 'fix' slot.

Usage

extract.gt(
  x,
  element = "GT",
  mask = FALSE,
  as.numeric = FALSE,
  return.alleles = FALSE,
  IDtoRowNames = TRUE,
  extract = TRUE,
  convertNA = TRUE
)

extract.haps(x, mask = FALSE, unphased_as_NA = TRUE, verbose = TRUE)

is.indel(x)

extract.indels(x, return.indels = FALSE)

extract.info(x, element, as.numeric = FALSE, mask = FALSE)

Arguments

x

An object of class chromR or vcfR

element

element to extract from vcf genotype data. Common options include "DP", "GT" and "GQ"

mask

a logical indicating whether to apply the mask (TRUE) or return all variants (FALSE). Alternatively, a vector of logicals may be provided.

as.numeric

logical, should the matrix be converted to numerics

return.alleles

logical indicating whether to return the genotypes (0/1) or alleles (A/T)

IDtoRowNames

logical specifying whether to use the ID column from the FIX region as rownames

extract

logical indicating whether to return the extracted element or the remaining string

convertNA

logical indicating whether to convert "." to NA.

unphased_as_NA

logical specifying how to handle unphased genotypes

verbose

should verbose output be generated

return.indels

logical indicating whether to return indels or not

Details

The function extract.gt isolates elements from the 'gt' portion of vcf data. Fields available for extraction are listed in the FORMAT column of the 'gt' slot. Because different vcf producing software produce different fields the options will vary by software. The mask parameter allows the mask to be implemented when using a chromR object. The 'as.numeric' option will convert the results from a character to a numeric. Note that if the data is not actually numeric, it will result in a numeric result which may not be interpretable. The 'return.alleles' option allows the default behavior of numerically encoded genotypes (e.g., 0/1) to be converted to their nucleic acid representation (e.g., A/T). Note that this is not used for a regular expression as similar parameters are used in other functions. Extract allows the user to extract just the specified element (TRUE) or every element except the one specified.

Note that when 'as.numeric' is set to 'TRUE' but the data are not actually numeric, unexpected results will likely occur. For example, the genotype field will typically be populated with values such as "0/1" or "1|0". Although these may appear numeric, they contain a delimiter (the forward slash or the pipe) that is non-numeric. This means that there is no straight forward conversion to a numeric and unexpected values should be expected.

The function extract.haps uses extract.gt to isolate genotypes. It then uses the information in the REF and ALT columns as well as an allele delimiter (gt_split) to split genotypes into their allelic state. Ploidy is determined by the first non-NA genotype in the first sample.

The VCF specification allows for genotypes to be delimited with a '|' when they are phased and a '/' when unphased. This becomes important when dividing a genotype into two haplotypes. When the alleels are phased this is straight forward. When the alleles are unphased it presents a decision. The default is to handle unphased data by converting them to NAs. When unphased_as_NA is set to TRUE the alleles will be returned in the order they appear in the genotype. This does not assign each allele to it's correct chromosome. It becomes the user's responsibility to make informed decisions at this point.

The function is.indel returns a logical vector indicating which variants are indels (variants where an allele is greater than one character).

The function extract.indels is used to remove indels from SNPs. The function queries the 'REF' and 'ALT' columns of the 'fix' slot to see if any alleles are greater than one character in length. When the parameter return_indels is FALSE only SNPs will be returned. When the parameter return_indels is TRUE only indels will be returned.

The function extract.info is used to isolate elements from the INFO column of vcf data.

See Also

is.polymorphic

Examples

Run this code
# NOT RUN {
data(vcfR_test)
gt <- extract.gt(vcfR_test)
gt <- extract.gt(vcfR_test, return.alleles = TRUE)

data(vcfR_test)
is.indel(vcfR_test)


data(vcfR_test)
getFIX(vcfR_test)
vcf <- extract.indels(vcfR_test)
getFIX(vcf)
vcf@fix[nrow(vcf@fix),'ALT'] <- ".,A"
vcf <- extract.indels(vcf)
getFIX(vcf)

data(vcfR_test)
vcfR_test@fix[1,'ALT'] <- "<NON_REF>"
vcf <- extract.indels(vcfR_test)
getFIX(vcf)

data(vcfR_test)
extract.haps(vcfR_test, unphased_as_NA = FALSE)
extract.haps(vcfR_test)


# }

Run the code above in your browser using DataCamp Workspace