vcfR2DNAbin(x, extract.indels = TRUE, consensus = FALSE,
extract.haps = TRUE, gt.split = "|", ref.seq = NULL, start.pos = NULL,
verbose = TRUE)
The presence of indels (insertions or deletions) in a sequence typically presents a data analysis problem. Mutation models typically do not accommodate this data well. For now, the only option is for indels to be omitted from the conversion of vcfR to DNAbin objects. The option extract.indels was included to remind us of this, and to provide a placeholder in case we wish to address this in the future.
The ploidy of the samples is inferred from the first non-missing genotype.
The option gt.split is used to split this genotype into alleles, and these are counted.
Values for gt.split are typically '|' for phased data or '/' for unphased data.
Note that this option is matched exactly and is not used as a regular expression, unlike the 'sep' parameter in vcfR2genind.
All samples and all variants within each sample are assumed to be of the same ploidy.
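The ploidy inference described above can be sketched in base R. This is only an illustration of the idea, not the code used inside vcfR2DNAbin: the helper name infer_ploidy is hypothetical, and the exact-match behaviour is imitated with fixed = TRUE in strsplit.

```r
# Hypothetical helper (not part of vcfR): infer ploidy from the first
# non-missing genotype by splitting on an exact, non-regex delimiter.
infer_ploidy <- function(gt, gt.split = "|") {
  first <- gt[!is.na(gt)][1]             # first non-missing genotype, or NA if none
  if (is.na(first)) return(NA_integer_)
  # fixed = TRUE treats gt.split literally, so "|" is not read as regex alternation
  length(strsplit(first, gt.split, fixed = TRUE)[[1]])
}

phased   <- c(NA, "0|1", "1|1")
unphased <- c("0/1/1", "1/1/1")

infer_ploidy(phased, gt.split = "|")     # two alleles counted
infer_ploidy(unphased, gt.split = "/")   # three alleles counted
```

Because only the first non-missing genotype is inspected, a data set of mixed ploidy would be misread by a sketch like this, which is why all samples and variants are assumed to share one ploidy.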
Conversion of haploid data is fairly straightforward.
The options consensus, extract.haps and gt.split are not relevant here.
When vcfR2DNAbin encounters missing data in the VCF data (NA), it is coded as an ambiguous nucleotide (n) in the DNAbin object.
When no reference sequence is provided (option ref.seq), a DNAbin object consisting only of variant sites is created.
When a reference sequence and a starting position are provided, the entire sequence, including invariant sites, is returned.
The reference sequence is used as a starting point and variable sites are added to this.
Because the data in the vcfR object will be using a chromosomal coordinate system, we need to tell the function where on this chromosome the reference sequence begins.
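The role of start.pos can be illustrated with a small base R sketch. All values below are made up for illustration, and the code only mimics the coordinate translation described above; it is not the implementation used by vcfR2DNAbin.

```r
# Hypothetical values for illustration only.
ref.seq   <- strsplit("acgtacgtac", "")[[1]]  # reference region, 10 bases
start.pos <- 101                               # chromosomal position of ref.seq[1]
pos       <- c(103, 108)                       # variant positions (POS) on the chromosome
allele    <- c("t", "a")                       # alleles carried by one haploid sample

# Translate chromosomal coordinates into indices within the reference region.
idx <- pos - start.pos + 1

# Start from the reference and overwrite the variable sites.
sample.seq <- ref.seq
sample.seq[idx] <- allele
paste(sample.seq, collapse = "")
```

If start.pos were wrong, the variants would be written to the wrong indices (or fall outside the reference), so it is worth checking that the reference allele at each translated index matches the REF column.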
Conversion of diploid data presents a number of scenarios.
When the option consensus is TRUE, each genotype is split into two alleles using gt.split and the two alleles are converted into their IUPAC ambiguity code.
This results in one sequence for each diploid sample.
This may be an appropriate path when you have unphased data.
Note that functions called downstream of this choice may handle IUPAC ambiguity codes in unexpected ways.
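As a rough illustration of the consensus path, the following base R sketch collapses a diploid genotype into a single IUPAC code. The helper consensus_code and the lookup table are hypothetical and written for this example; the sketch assumes single-nucleotide alleles and treats only an NA genotype as missing.

```r
# IUPAC codes for pairs of nucleotides (lower case, keys sorted alphabetically).
iupac <- c(aa = "a", cc = "c", gg = "g", tt = "t",
           ac = "m", ag = "r", at = "w",
           cg = "s", ct = "y", gt = "k")

# Hypothetical helper: collapse one diploid genotype into a single code.
consensus_code <- function(gt, ref, alt, gt.split = "/") {
  if (is.na(gt)) return("n")                 # missing genotype becomes 'n'
  alleles <- c(ref, alt)                     # allele 0 = REF, 1 and up = ALT
  idx <- as.integer(strsplit(gt, gt.split, fixed = TRUE)[[1]]) + 1
  key <- paste(sort(alleles[idx]), collapse = "")
  unname(iupac[key])
}

consensus_code("0/1", ref = "a", alt = "g")  # heterozygote a/g
consensus_code("1/1", ref = "a", alt = "g")  # homozygote for the alternate allele
```

A heterozygous a/g genotype becomes the ambiguity code r, so phase information is discarded, which is why this path suits unphased data.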
When extract.haps is set to TRUE, each genotype is split into two alleles using gt.split.
These alleles are inserted into two sequences.
This results in two sequences per diploid sample.
Note that this really only makes sense if you have phased data.
The options ref.seq and start.pos are used as in haploid data.
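The haplotype path can be sketched in base R as follows. The data and the helpers split_gt and to_seq are hypothetical; the sketch assumes biallelic sites and phased genotypes, and it is meant only to show how one diploid sample yields two sequences.

```r
# Hypothetical data: three variant sites for one phased diploid sample.
ref <- c("a", "c", "g")
alt <- c("g", "t", "a")
gt  <- c("0|1", "1|1", NA)

# Split each genotype on the exact delimiter; a missing genotype gives NA alleles.
split_gt <- function(g, gt.split = "|") {
  if (is.na(g)) return(c(NA_integer_, NA_integer_))
  as.integer(strsplit(g, gt.split, fixed = TRUE)[[1]])
}
alleles <- vapply(gt, split_gt, integer(2), USE.NAMES = FALSE)  # 2 x n.sites matrix

# Build one sequence per haplotype: 0 selects REF, 1 selects ALT, NA becomes 'n'.
to_seq <- function(a) {
  s <- ifelse(a == 0, ref, alt)
  s[is.na(a)] <- "n"
  paste(s, collapse = "")
}
haps <- apply(alleles, 1, to_seq)
haps
```

Each row of the allele matrix corresponds to one haplotype, so the first allele of every genotype ends up in the first sequence and the second allele in the second. With unphased data that assignment is arbitrary, which is why this option really only makes sense for phased data.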
Conversion of polyploid data is currently not supported. However, I have made some attempts at accommodating polyploid data. If you have polyploid data and are interested in giving this a try, feel free. But be prepared to scrutinize the output to make sure it appears reasonable.
Creation of DNAbin objects from large chromosomal regions may result in objects which occupy large amounts of memory. If in doubt, begin by subsetting your data and then scale up to ensure you do not run out of memory.
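A back-of-the-envelope estimate can help decide how far to scale up. The sketch below assumes one byte per nucleotide, which is how ape stores DNAbin data (as raw values); the sample count and region length are made-up numbers, and real objects carry some additional overhead.

```r
# Hypothetical scenario: 100 diploid samples, haplotypes extracted,
# reference supplied for a 5 Mbp region.
n.samples  <- 100
ploidy     <- 2
region.len <- 5e6

# One byte per nucleotide, one sequence per haplotype.
bytes <- n.samples * ploidy * region.len
bytes / 1024^2                     # approximate size in MiB

# The per-base cost can be checked on a small raw vector.
as.numeric(object.size(raw(1e6)))  # roughly 1e6 bytes plus a small header
```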