vcfR (version 1.9.0)

ordisample: Ordinate a sample's data

Description

Ordinate information from a sample's GT region and INFO column.

Usage

ordisample(
  x,
  sample,
  distance = "bray",
  plot = TRUE,
  alpha = 88,
  verbose = TRUE,
  ...
)

Arguments

x

an object of class vcfR or chromR.

sample

a sample number where the first sample (column) is 2

distance

metric to be used for ordination, options are in vegdist

plot

logical specifying whether to plot the ordination

alpha

alpha channel (transparency) ranging from 0-255

verbose

logical specifying whether to produce verbose output

...

parameters to be passed to child processes

Value

A list consisting of two objects.

  • an object of class 'metaMDS' created by the function vegan::metaMDS

  • an object of class 'envfit' created by the function vegan::envfit

This list is returned invisibly.

Details

The INFO column of VCF data contains descriptors for each variant. Each sample typically includes several descriptors of each variant in the GT region as well. This can present an overwhelming amount of information. Ordination is used in this function to reduce this complexity.

The ordination procedure can be rather time consuming depending on how much data is used. I good recommendation is to always start with a small subset of your full dataset and slowly scale up. There are several steps in this function that attempt to eliminate variants or characters that have missing values in them. This that while starting with a small number is good, you will need to have a large enough number so that a substantial amount of the data make it to the ordination step. In the example I use 100 variants which appears to be a reasonable compromise.

The data contained in VCF files can frequently contain a large fraction of missing data. I advovate censoring data that does not meet quality control thresholds as missing which compounds the problem. An attempt is made to omit these missing data by querying the GT and INFO data for missingness and omitting the missing variants. The data may also include characters (columns) that contain all missing values which are omitted as well. When verbose == TRUE these omissions are reported as messages.

Some data may contain multiple values. For example, AD is the sequence depth for each observed allele. In these instances the values are sorted and the largest value is used.

Several of the steps of this ordination make distributional assumptions. That is, they assume the data to be normally distributed. There is no real reason to assume this assumption to be valid with VCF data. It has been my experience that this assumption is frequently violated with VCF data. It is therefore suggested to use this funciton as an exploratory tool that may help inform other decisions. These analyst may be able to address these issues through data transformation or other topics beyond the scope of this function. This function is intended to provide a rapid assessment of the data which may help determine if more elegant handling of the data may be required. Interpretation of the results of this function need to take into account that assumptions may have been violated.

See Also

metaMDS, vegdist, monoMDS, isoMDS

Examples

Run this code
# NOT RUN {
# Example of normally distributed, random data.
set.seed(9)
x1 <- rnorm(500)
set.seed(99)
y1 <- rnorm(500)
plot(x1, y1, pch=20, col="#8B451388", main="Normal, random, bivariate data")

data(vcfR_example)
ordisample(vcf[1:100,], sample = "P17777us22")

vars <- 1:100
myOrd <- ordisample(vcf[vars,], sample = "P17777us22", plot = FALSE)
names(myOrd)
plot(myOrd$metaMDS, type = "n")
points(myOrd$metaMDS, display = "sites", pch=20, col="#8B451366")
text(myOrd$metaMDS, display = "spec", col="blue")
plot(myOrd$envfit, col = "#008000", add = TRUE)
head(myOrd$metaMDS$points)
myOrd$envfit
pairs(myOrd$data1)

# Seperate heterozygotes and homozygotes.
gt <- extract.gt(vcf)
hets <- is_het(gt, na_is_false = FALSE)
vcfhe <- vcf
vcfhe@gt[,-1][ !hets & !is.na(hets)  ] <- NA
vcfho <- vcf
vcfho@gt[,-1][ hets & !is.na(hets) ] <- NA

myOrdhe <- ordisample(vcfhe[vars,], sample = "P17777us22", plot = FALSE)
myOrdho <- ordisample(vcfho[vars,], sample = "P17777us22", plot = FALSE)
pairs(myOrdhe$data1)
pairs(myOrdho$data1)
hist(myOrdho$data1$PL, breaks = seq(0,9000, by=100), col="#8B4513")
# }
# NOT RUN {

# }

Run the code above in your browser using DataCamp Workspace