write.Structure: Write Genotypes in Structure 2.3 Format

Description

Given genotypes in the form of a two-dimensional list of vectors, write.Structure produces a text file of the genotypes in a format readable by Structure 2.2 and higher. The user specifies the overall ploidy of the file as well as the ploidy of each sample.

Usage

write.Structure(gendata, ploidy, file="",
samples=dimnames(gendata)[[1]], loci=dimnames(gendata)[[2]],
indploidies=rep(ploidy,times=length(samples)),
extracols=NULL, missing=-9)

Arguments

gendata

Genotypes stored as a two-dimensional list of vectors, in the standard polysat format.

ploidy

The value of the PLOIDY parameter for Structure, i.e. the number of rows to write per individual.

file

A character string giving the path where the file should be written.

samples

An optional character vector listing the names of samples to be written to the file.

loci

An optional character vector listing the names of the loci to be written to the file.

indploidies

An integer vector containing the ploidy of each sample. names(indploidies) should correspond to samples, or if the vector is unnamed it is assumed to be in the same order as samples.

extracols

An array, with the first dimension names corresponding to samples, of PopData, PopFlag, LocData, Phenotype, or other values to be included in the extra columns in the file.

missing

The number used to indicate missing data.

Value

No value is returned, but instead a file is written at the path specified.

Details

Structure 2.2 and higher can process polyploid microsatellite data, although 2.3.3 or higher is recommended for its improved handling of polyploids. The input format of Structure requires that each locus take up one column and that each individual take up as many rows as the parameter PLOIDY. Because of the multiple rows per sample, each sample name must be duplicated, as well as any population, location, or phenotype data. Partially heterozygous genotypes must also have one arbitrary allele duplicated up to the ploidy of the sample, and samples that have a lower ploidy than that used in the file (for mixed polyploid data sets) must have a missing data symbol inserted to fill in the extra rows. Additionally, if some samples have more alleles than PLOIDY (if you are using a lower PLOIDY to save processing time, or if there are extra alleles from scoring errors), some alleles must be randomly removed from the data. write.Structure performs this duplication, insertion, and random deletion of data.

The argument samples contains all of the sample names to be written to the file, and is used to index gendata, indploidies, and extracols. These sample names will also be used as row names in the Structure file. Each sample name should appear only once in the vector samples, because write.Structure will duplicate the sample names as many times as dictated by ploidy. Likewise, indploidies and extracols only need to contain data for each sample once. If samples isn't specified by the user, it will be extracted from gendata.

In writing genotypes to the file, write.Structure compares the number of alleles in the genotype, the ploidy of the sample as stored in indploidies, and the ploidy of the file as stored in ploidy, and does one of six things (for a given sample x and locus loc):

1) If indploidies[x] is greater than or equal to ploidy, and length(gendata[[x,loc]]) is equal to ploidy, the genotype data is used as is.

2) If indploidies[x] is greater than or equal to ploidy, and length(gendata[[x,loc]]) is less than ploidy, the first allele is duplicated as many times as necessary for there to be as many alleles as ploidy.

3) If indploidies[x] is greater than or equal to ploidy, and length(gendata[[x,loc]]) is greater than ploidy, a random sample of the alleles, without replacement, is used as the genotype.

4) If indploidies[x] is less than ploidy, and length(gendata[[x,loc]]) is equal to indploidies[x], the genotype data is used as is and missing data symbols are inserted in the extra rows.

5) If indploidies[x] is less than ploidy, and length(gendata[[x,loc]]) is less than indploidies[x], the first allele is duplicated as many times as necessary for there to be as many alleles as indploidies[x], and missing data symbols are inserted in the extra rows.

6) If indploidies[x] is less than ploidy, and length(gendata[[x,loc]]) is greater than indploidies[x], a random sample, without replacement, of indploidies[x] alleles is used, and missing data symbols are inserted in the extra rows. (Alleles are removed even though there is room for them in the file.)

Two of the header rows that are optional for Structure are written by write.Structure. These are Marker Names, containing the names of loci supplied in gendata, and Recessive Alleles, which contains the missing data symbol once for each locus. This indicates to the program that all alleles are codominant with copy number ambiguity.

The output file requires a few small modifications, done in a text editor or spreadsheet software, before it can be read by Structure. In the upper left corner, the words rowlabel and missing should be deleted. Likewise, if the extracols argument was used, the first and second rows of every non-locus column should be deleted. These cells contain the second-dimension names of extracols and zeros, respectively.
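As a rough illustration of the six cases listed in Details, the following base R sketch fills the rows for one sample at one locus. The helper fill_genotype is hypothetical and is not part of polysat. The position of the duplicated alleles and of the missing data symbols within the output is an assumption, and the actual write.Structure source may differ.

```r
# Hypothetical helper (not a polysat function): returns the `ploidy` values
# that would be written for one sample at one locus.
fill_genotype <- function(alleles, ploidy, indploidy, missing = -9) {
  # Number of rows that hold real alleles for this sample
  n <- min(indploidy, ploidy)
  if (length(alleles) > n) {
    # Cases 3 and 6: keep a random subset of n alleles, without replacement
    alleles <- alleles[sample.int(length(alleles), n)]
  } else if (length(alleles) < n) {
    # Cases 2 and 5: duplicate the first allele until there are n alleles
    alleles <- c(alleles, rep(alleles[1], n - length(alleles)))
  }
  # Cases 1 and 4 reach this point unchanged.
  # Cases 4 to 6: fill the remaining rows with the missing data symbol
  c(alleles, rep(missing, ploidy - n))
}

# Case 2: hexaploid sample, two alleles scored, file ploidy of 6
fill_genotype(c(102, 110), ploidy = 6, indploidy = 6)
# 102 110 102 102 102 102

# Case 5: tetraploid sample, three alleles scored, file ploidy of 6
fill_genotype(c(104, 108, 110), ploidy = 6, indploidy = 4)
# 104 108 110 104 -9 -9
```

Because cases 3 and 6 draw alleles at random, repeated calls on a genotype with too many alleles can give different results unless the random seed is fixed with set.seed.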

References

http://pritch.bsd.uchicago.edu/structure_software/release_versions/v2.3.3/structure_doc.pdf

Hubisz, M. J., Falush, D., Stephens, M. and Pritchard, J. K. (2009) Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources 9, 1322-1332.

Falush, D., Stephens, M. and Pritchard, J. K. (2007) Inferences of population structure using multilocus genotype data: dominant markers and null alleles. Molecular Ecology Notes 7, 574-578.

Examples

# input genotype data (this is usually done by reading a file)
mygendata <- array(list(c(100,102,106,108,114,118),c(102,110),
                      c(98,100,104,108,110,112,116),c(102,106,112,118),
                      c(104,108,110),c(-9),
                      c(204),c(206,208,210,212,220,224,226),
                      c(202,206,208,212,214,218),c(200,204,206,208,212),
                      c(-9),c(202,206)),
                 dim=c(6,2), dimnames=list(c("ind1","ind2","ind3",
                                             "ind4","ind5","ind6"),
                                           c("locus1","locus2")))

# create a vector of sample names to be used.  Note that this excludes
#  ind6.
# Also note that this could be obtained as dimnames(mygendata)[[1]][1:5].
mysamples <- c("ind1","ind2","ind3","ind4","ind5")

# create a vector of the ploidy of each sample.
# Note that some of the above genotypes have more or fewer alleles than
# the ploidy of the sample.
myploidies <- c(6,6,6,4,4)
names(myploidies) <- mysamples

# Create an array containing data for additional columns to be written
# to the file.  You might also prefer to just read this and the ploidies
# in from a file.
myexcols <- array(data=c(1,2,1,2,1,1,1,0,0,0),dim=c(5,2),
                  dimnames=list(mysamples, c("PopData","PopFlag")))

# Write the Structure file, with six rows per individual.
# Since file="", the data will be written to the console instead of a file.
write.Structure(mygendata, 6, "", samples=mysamples, indploidies=myploidies,
                extracols=myexcols)
