VariantAnnotation (version 1.18.1)

genotypeToSnpMatrix: Convert genotype calls from a VCF file to a SnpMatrix object

Description

Convert an array of genotype calls from the "GT", "GP", "GL" or "PL" FORMAT field of a VCF file to a SnpMatrix.

Usage

## S4 method for signature 'CollapsedVCF'
genotypeToSnpMatrix(x, uncertain=FALSE, ...)
## S4 method for signature 'array'
genotypeToSnpMatrix(x, ref, alt, ...)

Arguments

x
A CollapsedVCF object or an array of genotype data from the "GT", "GP", "GL" or "PL" FORMAT field of a VCF file. This array is created with a call to readVcf and can be accessed with geno().
uncertain
A logical indicating whether the genotypes to convert should come from the "GT" field (uncertain=FALSE) or the "GP", "GL" or "PL" field (uncertain=TRUE).
ref
A DNAStringSet of reference alleles.
alt
A DNAStringSetList of alternate alleles.
...
Additional arguments, passed to methods.

Value

A list with the following elements:
  • genotypes: The output genotype data as an object of class "SnpMatrix". The columns are SNPs and the rows are samples. See ?SnpMatrix for details of the class structure.
  • map: A DataFrame giving the SNP names and alleles at each locus. The ignore column indicates which variants were set to NA (see the NA criteria in the Details section).

Details

genotypeToSnpMatrix converts an array of genotype calls from the "GT", "GP", "GL" or "PL" FORMAT field of a VCF file into a SnpMatrix. The following caveats apply:
  • no distinction is made between phased and unphased genotypes
  • variants with >1 ALT allele are set to NA
  • only single nucleotide variants are included; others are set to NA
  • only diploid calls are included; others are set to NA

The genotype fields are defined as follows:
  • GT : genotype, encoded as allele values separated by either "/" or "|". The allele values are 0 for the reference allele and 1 for the alternate allele.
  • GL : genotype likelihoods, given as comma-separated floating point log10-scaled likelihoods for all possible genotypes. In the case of a reference allele A and a single alternate allele B, the likelihoods are ordered "A/A", "A/B", "B/B".
  • PL : the phred-scaled genotype likelihoods rounded to the closest integer. The ordering of values is the same as for the GL field.
  • GP : the phred-scaled genotype posterior probabilities for all possible genotypes; intended to store imputed genotype probabilities. The ordering of values is the same as for the GL field.

If uncertain=TRUE, the genotype probabilities derived from the "GP", "GL" or "PL" field are stored in the SnpMatrix with a byte-level encoding, so a genotype without a clearly dominant probability is reported as "Uncertain" rather than as a call.
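
The caveats above can be illustrated with a small base R sketch. The helper gt_to_code is hypothetical and is not part of VariantAnnotation; it assumes the usual SnpMatrix integer coding (0 = missing, 1 = "A/A", 2 = "A/B", 3 = "B/B") and only mimics the per-call rules, not the variant-level filtering that the real function performs.

```r
# Hypothetical helper, not part of VariantAnnotation: map GT strings to
# SnpMatrix-style integer codes (0 = missing, 1 = A/A, 2 = A/B, 3 = B/B).
gt_to_code <- function(gt) {
  # Treat phased ("|") and unphased ("/") separators identically.
  alleles <- strsplit(gt, "[/|]")
  vapply(alleles, function(a) {
    # Non-diploid calls, missing alleles ("."), and alleles other than
    # 0 or 1 are reported as missing.
    if (length(a) != 2L || !all(a %in% c("0", "1"))) return(0L)
    # Count copies of the alternate allele (0, 1 or 2) and shift by one.
    sum(a == "1") + 1L
  }, integer(1))
}

# Made-up calls covering each caveat.
calls <- c("0/0", "0|1", "1|0", "1/1", "./.", "1", "0/2")
codes <- gt_to_code(calls)
print(setNames(codes, calls))
```

Here "0|1" and "1|0" receive the same code because phasing is ignored, while the haploid call "1" and the call "0/2", which refers to a second alternate allele, are both treated as missing.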

References

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

See Also

readVcf, VCF, SnpMatrix

Examples

  ## ----------------------------------------------------------------
  ## Non-probability based snp encoding using "GT"
  ## ----------------------------------------------------------------
  fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") 
  vcf <- readVcf(fl, "hg19")

  ## This file has no "GL" or "GP" field so we use "GT".
  geno(vcf)

  ## Convert the "GT" FORMAT field to a SnpMatrix.
  mat <- genotypeToSnpMatrix(vcf)

  ## The result is a list of length 2.
  names(mat)

  ## Compare coding in the VCF file to the SnpMatrix.
  geno(vcf)$GT
  t(as(mat$genotypes, "character"))

  ## The 'ignore' column in 'map' indicates which variants 
  ## were set to NA. Variant rs6040355 was ignored because 
  ## it has multiple alternate alleles, microsat1 is not a 
  ## snp, and chr20:1230237 has no alternate allele.
  mat$map

  ## ----------------------------------------------------------------
  ## Probability-based encoding using "GL", "PL" or "GP"
  ## ----------------------------------------------------------------
  ## Read a vcf file with a "GL" field.
  fl <- system.file("extdata", "gl_chr1.vcf", package="VariantAnnotation") 
  vcf <- readVcf(fl, "hg19")
  geno(vcf)

  ## Convert the "GL" FORMAT field to a SnpMatrix
  mat <- genotypeToSnpMatrix(vcf, uncertain=TRUE)

  ## Only 3 of the 9 variants passed the filters.  The
  ## other 6 variants had no alternate alleles.
  mat$map

  ## Compare genotype representations for a subset of
  ## samples in variant rs180734498.
  ## Original called genotype
  geno(vcf)$GT["rs180734498", 14:16]

  ## Original genotype likelihoods
  geno(vcf)$GL["rs180734498", 14:16]

  ## Posterior probability (computed inside genotypeToSnpMatrix)
  GLtoGP(geno(vcf)$GL["rs180734498", 14:16, drop=FALSE])[1,]

  ## SnpMatrix coding.
  t(as(mat$genotypes, "character"))["rs180734498", 14:16]
  t(as(mat$genotypes, "numeric"))["rs180734498", 14:16]

  ## For samples NA11829 and NA11830, one probability is significantly
  ## higher than the others, so SnpMatrix calls the genotype.  These
  ## calls match the original coding: "0|1" -> "A/B", "0|0" -> "A/A".
  ## Sample NA11831 was originally called as "0|1" but the probability
  ## of "0|0" is only a factor of 3 lower, so SnpMatrix calls it as
  ## "Uncertain" with an appropriate byte-level encoding.
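
The step from genotype likelihoods to probabilities in the example above can be sketched in base R. The helpers gl_to_prob and pl_to_prob are hypothetical and assume a flat prior over the three genotypes, so each probability is 10^GL divided by the sum over genotypes; in real analyses the package's own GLtoGP, used above, should be preferred. The input values are made up and are not taken from gl_chr1.vcf.

```r
# Hypothetical helpers for illustration; they follow the GL and PL field
# definitions and are not the package's internal code.
gl_to_prob <- function(gl) {
  # gl: log10-scaled likelihoods ordered "A/A", "A/B", "B/B".
  lik <- 10^(gl - max(gl))  # subtract the maximum for numerical stability
  lik / sum(lik)            # normalise, assuming a flat prior
}

pl_to_prob <- function(pl) {
  # PL is phred-scaled, i.e. PL = -10 * GL, so convert back to log10 scale.
  gl_to_prob(-pl / 10)
}

# Made-up likelihoods where "A/B" is favoured only about 3-fold over "A/A".
gl <- c(-0.48, -0.01, -4.0)
p <- gl_to_prob(gl)
print(round(p, 3))

# Ratio between the two most probable genotypes.
print(p[2] / p[1])
```

When the leading probability exceeds the runner-up by only a small factor, as in this made-up case, a hard call would discard real uncertainty, which is the situation the "Uncertain" coding is meant to preserve.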
