Learn R Programming

vcfR (version 1.4.0)

vcfR2DNAbin: Convert vcfR to DNAbin

Description

Convert objects of class vcfR to objects of class ape::DNAbin

Usage

vcfR2DNAbin(x, extract.indels = TRUE, consensus = FALSE,
  extract.haps = TRUE, unphased_as_NA = TRUE, ref.seq = NULL,
  start.pos = NULL, verbose = TRUE)

Arguments

x

an object of class chromR or vcfR

extract.indels

logical, at present, the only option is TRUE

consensus

logical, at present, the only option is TRUE

extract.haps

logical specifying whether to separate each genotype into alleles based on a delimiting character

unphased_as_NA

logical indicating how to handle alleles in unphased genotypes

ref.seq

reference sequence (DNAbin) for the region being converted

start.pos

chromosomal position for the start of the ref.seq

verbose

logical specifying whether to produce verbose output

Details

Objects of class DNAbin, from the package ape, store nucleotide sequence information. Typically, nucleotide sequence information contains all the nucleotides within a region, for example, a gene. Because most sites are typically invariant, this results in a large amount of redundant data. This is why files in the vcf format only contain information on variant sites, it results in a smaller file. Nucleotide sequences can be generated which only contain variant sites. However, some applications require the invariant sites. For example, inference of phylogeny based on maximum likelihood or Bayesian methods requires invariant sites. The function vcfR2DNAbin therefore includes a number of options in attempt to accomodate various scenarios.

The presence of indels (insertions or deletions)in a sequence typically presents a data analysis problem. Mutation models typically do not accomodate this data well. For now, the only option is for indels to be omitted from the conversion of vcfR to DNAbin objects. The option extract.indels was included to remind us of this, and to provide a placeholder in case we wish to address this in the future.

The ploidy of the samples is inferred from the first non-missing genotype. All samples and all variants within each sample are assumed to be of the same ploid.

Conversion of haploid data is fairly straight forward. The options consensus and extract.haps are not relevant here. When vcfR2DNAbin encounters missing data in the vcf data (NA) it is coded as an ambiguous nucleotide (n) in the DNAbin object. When no reference sequence is provided (option ref.seq), a DNAbin object consisting only of variant sites is created. When a reference sequence and a starting position are provided the entire sequence, including invariant sites, is returned. The reference sequence is used as a starting point and variable sitees are added to this. Because the data in the vcfR object will be using a chromosomal coordinate system, we need to tell the function where on this chromosome the reference sequence begins.

Conversion of diploid data presents a number of scenarios. When the option consensus is TRUE, each genotype is split into two alleles and the two alleles are converted into their IUPAC ambiguity code. This results in one sequence for each diploid sample. This may be an appropriate path when you have unphased data. Note that functions called downstream of this choice may handle IUPAC ambiguity codes in unexpected manners. When extract.haps is set to TRUE, each genotype is split into two alleles. These alleles are inserted into two sequences. This results in two sequences per diploid sample. Note that this really only makes sense if you have phased data. The options ref.seq and start.pos are used as in halpoid data.

Conversion of polyploid data is currently not supported. However, I have made some attempts at accomodating polyploid data. If you have polyploid data and are interested in giving this a try, feel free. But be prepared to scrutinize the output to make sure it appears reasonable.

Creation of DNAbin objects from large chromosomal regions may result in objects which occupy large amounts of memory. If in doubt, begin by subsetting your data and the scale up to ensure you do not run out of memory.

See Also

ape

Examples

Run this code
# NOT RUN {
library(ape)
data(vcfR_test)

# Create an example reference sequence.
nucs <- c('a','c','g','t')
set.seed(9)
myRef <- as.DNAbin(matrix(nucs[round(runif(n=20, min=0.5, max=4.5))], nrow=1))

# Recode the POS data for a smaller example.
set.seed(99)
vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test))))

# Just vcfR
myDNA <- vcfR2DNAbin(vcfR_test)
seg.sites(myDNA)
image(myDNA)

# ref.seq, no start.pos
myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef)
seg.sites(myDNA)
image(myDNA)

# ref.seq, start.pos = 4.
# Note that this is the same as the previous example but the variants are shifted.
myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef, start.pos = 4)
seg.sites(myDNA)
image(myDNA)

# ref.seq, no start.pos, unphased_as_NA = FALSE
myDNA <- vcfR2DNAbin(vcfR_test, unphased_as_NA = FALSE, ref.seq = myRef)
seg.sites(myDNA)
image(myDNA)



# }

Run the code above in your browser using DataLab