genotypeMatrix-methods: Constructors for Creating `GenotypeMatrix` Objects

Description

Create a GenotypeMatrix object from a (sparse) matrix object and the positions of variants.

Usage

## S4 method for signature 'ANY,GRanges,missing'
genotypeMatrix(Z, pos, seqnames,
       ploidy=2, na.string=NULL, na.limit=1, MAF.limit=1,
       na.action=c("impute.major", "omit", "fail"),
       MAF.action=c("invert", "omit", "ignore", "fail"),
       sex=NULL)
## S4 method for signature 'ANY,numeric,character'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'ANY,character,missing'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'ANY,missing,missing'
genotypeMatrix(Z, pos, seqnames, subset,
       noIndels=TRUE, onlyPass=TRUE, sex=NULL, ...)
## S4 method for signature 'eSet,numeric,character'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'eSet,character,missing'
genotypeMatrix(Z, pos, seqnames, ...)
## S4 method for signature 'eSet,character,character'
genotypeMatrix(Z, pos, seqnames, ...)

Arguments

Z

an object of class dgCMatrix, a numeric matrix, a character matrix, an object of class VCF, or an object of class eSet (see details below)

pos

an object of class GRanges, a numeric vector, or a character vector (see details below)

seqnames

a character vector (see details below)

ploidy

determines the ploidy of the genome for the computation of minor allele frequencies (MAFs) and the possible inversion of columns with an MAF exceeding 0.5; the elements of Z may not exceed this value.

subset

a numeric vector with indices or a character vector with names of samples to restrict to

na.limit

all columns with a missing value ratio above this threshold will be omitted from the output object.

MAF.limit

all columns with an MAF above this threshold will be omitted from the output object.

na.action

if impute.major, all missing values will be imputed by major alleles in the output object. If omit, all columns containing missing values will be omitted in the output object. If fail, the function stops with an error if Z contains any missing values.

MAF.action

if invert, all columns with an MAF exceeding 0.5 will be inverted in the sense that all minor alleles will be replaced by major alleles and vice versa. For numerical Z, this is accomplished by subtracting the column from the ploidy value. If omit, all columns with an MAF greater than 0.5 are omitted in the output object. If ignore, no action is taken and MAFs greater than 0.5 are kept as they are. If fail, the function stops with an error if Z contains any column with an MAF greater than 0.5.

noIndels

if TRUE (default), only single nucleotide variants (SNVs) are considered and indels are skipped; only works if the ALT column is present in the VCF object Z, otherwise a warning is shown and the noIndels argument is ignored.

onlyPass

if TRUE (default), only variants are considered whose value in the FILTER column is PASS; only works if the FILTER column is present in the VCF object Z, otherwise a warning is shown and the onlyPass argument is ignored.

na.string

if not NULL, all "." entries in the character matrix or VCF genotype are replaced with this string before parsing the matrix.

sex

if NULL, all rows of Z are treated the same without any modifications; if sex is a factor with levels F (female) and M (male) that is as long as Z has rows, this argument is interpreted as the sex of the samples. In this case, the rows corresponding to male samples are doubled before further processing. This is designed for mixed-sex analyses of the X chromosome outside of the pseudoautosomal regions.

...

all additional arguments are passed on internally to the genotypeMatrix method with signature ANY,GRanges,missing.
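
The effect of na.action="impute.major" and MAF.action="invert" on a numeric Z can be pictured with the following base R sketch. It only illustrates the behavior described for these arguments and is not podkat's implementation; the toy matrix and the variable names (Zimp, Zinv, mafInv) are made up for this example, and the order in which imputation and inversion are applied is an assumption.

```r
## Hypothetical toy data: 4 samples (rows) x 3 variants (columns), diploid
ploidy <- 2
Z <- matrix(c(0, 1, NA, 0,    # variant 1: one missing value
              2, 2, 1, 2,     # variant 2: MAF above 0.5
              0, 0, 1, 0),    # variant 3: nothing to do
            nrow=4, ncol=3)

## na.action="impute.major": in the numeric encoding, 0 means
## "only major alleles", so missing entries are replaced by 0
Zimp <- Z
Zimp[is.na(Zimp)] <- 0

## MAF per column: column sum divided by the number of alleles
maf <- colSums(Zimp) / (nrow(Zimp) * ploidy)

## MAF.action="invert": subtract columns with MAF > 0.5 from the ploidy
inv <- maf > 0.5
Zinv <- Zimp
Zinv[, inv] <- ploidy - Zinv[, inv]
mafInv <- colSums(Zinv) / (nrow(Zinv) * ploidy)

maf
mafInv
```

After inversion, no column of this toy matrix has an MAF above 0.5, which is the situation MAF.action="invert" is meant to produce.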

Value

returns an object of class GenotypeMatrix

Details

This method provides different ways of constructing an object of class GenotypeMatrix from other types of objects. The typical case is when a matrix object is combined with positional information. The first three variants listed above work with Z being a dgCMatrix object, a numeric matrix, or a character matrix.

If Z is a dgCMatrix object or a matrix, rows are interpreted as samples and columns are interpreted as variants. For dgCMatrix objects and numeric matrices, matrix entries are interpreted as the numbers of minor alleles (with 0 meaning only major alleles). In this case, minor allele frequencies (MAFs) are computed as column sums divided by the number of alleles, i.e. the number of samples/rows multiplied by the ploidy parameter.

If Z is a character matrix, the matrix entries need to comply with the format of the GT field in VCF files. MAFs are then computed as the actual relative frequency of minor alleles among all alleles in a column. For a diploid genome, this results in the same MAF estimate as described above. However, some VCF readers, most importantly readVcf from the VariantAnnotation package, replace missing genotypes by a single "." even for non-haploid genomes, which would result in a wrong MAF estimate. To correct for this, the na.string parameter is available: if it is not NULL, all "." entries in the matrix are replaced by na.string before the matrix is parsed. The correct setting for a diploid genome is "./.".
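
Why a single "." distorts the estimate can be seen in a small base R sketch. The helper gtMAF below is hypothetical and not part of podkat. It assumes that non-reference alleles are the minor alleles and that missing alleles are counted as major alleles (as with na.action="impute.major"); podkat's actual parser may differ in these details.

```r
## Hypothetical helper (not part of podkat): relative frequency of
## non-reference alleles among all allele slots of one column of GT strings
gtMAF <- function(gt, na.string=NULL)
{
    ## optionally expand lone "." entries, e.g. to "./." for diploid data
    if (!is.null(na.string))
        gt[gt == "."] <- na.string

    ## split unphased ("/") and phased ("|") genotypes into single alleles
    alleles <- unlist(strsplit(gt, "[/|]"))

    ## "0" is the reference allele, "." is missing (counted as major here)
    sum(!(alleles %in% c("0", "."))) / length(alleles)
}

gt <- c("0/1", "0|0", ".", "1/1")

gtMAF(gt)                    # 3 minor alleles among 7 allele slots
gtMAF(gt, na.string="./.")   # 3 minor alleles among 8 allele slots

## the numeric encoding of the same column gives the second value
sum(c(1, 0, 0, 2)) / (4 * 2)
```

With the lone ".", the missing diploid genotype contributes only one allele slot to the denominator, so the estimate no longer matches the one obtained from the numeric encoding.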

Positional information can be passed to the function in three different ways:

by supplying a GRanges object as pos argument and omitting the seqnames argument,

by supplying a numeric vector of positions as pos argument and sequence/chromosome names as seqnames argument, or

by supplying a character vector with entries of the format seqname:pos as pos argument and omitting the seqnames argument.
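
The relationship between the second and the third way can be sketched in base R: strings of the format seqname:pos carry the same information as a pair of seqnames and pos vectors. The code below is only an illustration with made-up positions; it is not necessarily how podkat parses such strings internally.

```r
## Hypothetical positions in 'seqname:pos' format
spos <- c("chr1:1042", "chr1:20311", "chrX:77")

## split at the last colon, in case a sequence name itself contains a colon
seqnames <- sub(":[^:]*$", "", spos)
pos <- as.numeric(sub("^.*:", "", spos))

seqnames
pos

## pasting the two vectors back together restores the original strings
paste(seqnames, pos, sep=":")
```

Under these assumptions, passing spos alone as pos argument and passing pos together with seqnames describe the same variant positions.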

References

http://www.bioinf.jku.at/software/podkat

http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42

Obenchain, V., Lawrence, M., Carey, V., Gogarten, S., Shannon, P., and Morgan, M. (2014) VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics 30, 2076-2078.

Examples

## load the package
library(podkat)

## create a toy example
A <- matrix(rbinom(50, 2, prob=0.2), 5, 10)
sA <- as(A, "dgCMatrix")
pos <- sort(sample(1:10000, ncol(A)))
seqname <- "chr1"

## variant with 'GRanges' object
gr <- GRanges(seqnames=seqname, ranges=IRanges(start=pos, width=1))
gtm <- genotypeMatrix(A, gr)
gtm
as.matrix(gtm)
variantInfo(gtm)
MAF(gtm)

## variant with 'pos' and 'seqnames' object
genotypeMatrix(sA, pos, seqname)

## variant with 'seqname:pos' strings passed through 'pos' argument
spos <- paste(seqname, pos, sep=":")
spos
genotypeMatrix(sA, spos)

## read data from VCF file using 'readVcf()' from the 'VariantAnnotation'
## package
if (require(VariantAnnotation))
{
    vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
    sp <- ScanVcfParam(info=NA, geno="GT", fixed=c("ALT", "FILTER"))
    vcf <- readVcf(vcfFile, genome="hgA", param=sp)
    rowRanges(vcf)

    ## call constructor for 'VCF' object
    gtm <- genotypeMatrix(vcf)
    gtm
    variantInfo(gtm)

    ## alternatively, extract information from 'VCF' object and use
    ## variant with character matrix and 'GRanges' positions;
    ## note that, in 'VCF' objects, rows correspond to variants and
    ## columns correspond to samples, so the genotype matrix has to be
    ## transposed
    gt <- t(geno(vcf)$GT)
    gt[1:5, 1:5]
    gr <- rowRanges(vcf)
    gtm <- genotypeMatrix(gt, gr)
    as.matrix(gtm[1:20, 1:5, recomputeMAF=TRUE])
}
