read.cross: Read data for a QTL experiment

Description

Data for a QTL experiment is read from a set of files and converted into an object of class cross. The comma-delimited format (csv) is recommended. All formats require chromosome assignments for the genetic markers, and assume that markers are in their correct order.

Usage

read.cross(format=c("csv", "csvr", "csvs", "csvsr", "mm", "qtx",
                    "qtlcart", "gary", "karl"),
           dir="", file, genfile, mapfile, phefile, chridfile,
           mnamesfile, pnamesfile, na.strings=c("-","NA"),
           genotypes=c("A","H","B","D","C"), estimate.map=TRUE,
           convertXdata=TRUE, ...)

Arguments

format

Specifies the format of the data.

dir

Directory in which the data files will be found. In Windows, use forward slashes ("/") or double backslashes ("\\") to specify directory trees.

file

The main input file for formats csv, csvr and mm.

genfile

File with genotype data (formats karl and gary only).

mapfile

File with marker position information (all formats except csv and csvr).

phefile

File with phenotype data (formats karl and gary only).

chridfile

File with chromosome ID for each marker (gary format only).

mnamesfile

File with marker names (gary format only).

pnamesfile

File with phenotype names (gary format only).

na.strings

A vector of strings which are to be interpreted as missing values (csv, csvr, and gary formats only). For the csv and csvr formats, these are interpreted globally for the entire

genotypes

A vector of character strings specifying the genotype codes (csv and csvr formats only). Generally this is a vector of length 5, with the elements corresponding to AA, AB, BB, not BB (i.e., AA or AB), and not AA (i

estimate.map

For formats csv, csvr, qtx, mm, and gary only: if TRUE and marker positions are not included in the input files, the genetic map is estimated using the function

convertXdata

If TRUE, any X chromosome genotype data is converted to the internal standard, using columns sex and pgm in the phenotype data if they are available or by inference if they are not. If FALSE, the X chromosome data is read

...

Additional arguments, passed to the function read.table in the case of csv and csvr formats. In particular, one may use the argument sep to sp

Value

An object of class cross, which is a list with two components:
geno

This is a list with elements corresponding to chromosomes. names(geno) contains the names of the chromosomes. Each chromosome is itself a list, and is given class A or X according to whether it is autosomal or the X chromosome. There are two components for each chromosome: data, a matrix whose rows are individuals and whose columns are markers, and map, either a vector of marker positions (in cM) or a matrix of dim (2 x n.mar) where the rows correspond to marker positions in female and male genetic distance, respectively.

The genotype data for a backcross is coded as follows: NA = missing, 1 = AA, 2 = AB.

For an F2 intercross, the coding is NA = missing, 1 = AA, 2 = AB, 3 = BB, 4 = not BB (i.e., AA or AB; D in mapmaker/qtl), 5 = not AA (i.e., AB or BB; C in mapmaker/qtl).

For a 4-way cross, the mother and father are assumed to have genotypes AB and CD, respectively. The genotype data for the progeny is assumed to be phase-known, with the following coding scheme: NA = missing, 1 = AC, 2 = BC, 3 = AD, 4 = BD, 5 = A = AC or AD, 6 = B = BC or BD, 7 = C = AC or BC, 8 = D = AD or BD, 9 = AC or BD, 10 = AD or BC.

pheno

A data.frame of size (n.ind x n.phe) containing the phenotypes.
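
As a rough illustration of this layout, the sketch below builds a small mock object by hand in base R and inspects it. The marker names, genotype values and phenotype column are made up for the example, and a real object returned by read.cross may carry additional classes or attributes that are not shown here.

```r
# Hand-built mock that mirrors the layout described above. It is NOT produced
# by read.cross(); all names and values are hypothetical.
chr1 <- list(
  data = matrix(c(1, 2, NA, 2,    # marker m1 for individuals 1..4
                  1, 2, 2, 1),    # marker m2 for individuals 1..4
                nrow = 4, dimnames = list(NULL, c("m1", "m2"))),
  map = c(m1 = 0, m2 = 10)        # marker positions in cM
)
class(chr1) <- "A"                # autosome

mock <- list(
  geno  = list("1" = chr1),
  pheno = data.frame(pheno = c(10.2, 11.5, 9.8, 12.1))
)
class(mock) <- "cross"

names(mock$geno)                  # chromosome names
class(mock$geno[["1"]])           # "A" for autosome, "X" for X chromosome
dim(mock$geno[["1"]]$data)        # individuals x markers

# Decode the backcross coding (1 = AA, 2 = AB, NA = missing) for marker m1.
bc_labels <- c("AA", "AB")
decoded <- bc_labels[mock$geno[["1"]]$data[, "m1"]]
print(decoded)
```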

X chromosome

The genotypes for the X chromosome require special care!

The phenotype data should contain a column named "sex" which indicates the sex of each individual, either coded as 0=female and 1=male, or as a factor with levels female/male or f/m. Case will be ignored both in the name and in the factor levels. If no such phenotype column is included, it will be assumed that all individuals are of the same sex.

In the case of an intercross, the phenotype data may also contain a column named "pgm" (for "paternal grandmother") indicating the direction of the cross. It should be coded as 0/1 with 0 indicating the cross (AxB)x(AxB) or (BxA)x(AxB) and 1 indicating the cross (AxB)x(BxA) or (BxA)x(BxA). If no such phenotype column is included, it will be assumed that all individuals come from the same direction of cross.

The internal storage of X chromosome data is quite different from that of autosomal data. Males are coded 1=AA and 2=BB; females with pgm==0 are coded 1=AA and 2=AB; and females with pgm==1 are coded 1=BB and 2=AB. If the argument convertXdata is TRUE, conversion to this format is made automatically; if FALSE, no conversion is done, summary.cross will likely return a warning, and most analyses will not work properly.
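
The internal X chromosome coding stated above can be sketched as a small lookup in base R. The helper decode_x is hypothetical and not part of the package; it assumes sex is already coded 0=female, 1=male and pgm is coded 0/1, and it simply restates the coding rather than reproducing what read.cross does internally.

```r
# Hypothetical helper (not an R/qtl function): translate an internal
# X-chromosome code (1 or 2) into a genotype label, given sex and pgm.
decode_x <- function(code, sex, pgm) {
  ifelse(sex == 1,
         c("AA", "BB")[code],                  # males: 1 = AA, 2 = BB
         ifelse(pgm == 0,
                c("AA", "AB")[code],           # females, pgm == 0
                c("BB", "AB")[code]))          # females, pgm == 1
}

# One row per combination of sex, pgm and internal code (toy values).
x_example <- data.frame(
  sex  = c(1, 1, 0, 0, 0, 0),
  pgm  = c(0, 0, 0, 0, 1, 1),
  code = c(1, 2, 1, 2, 1, 2)
)
x_example$genotype <- decode_x(x_example$code, x_example$sex, x_example$pgm)
print(x_example)
```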

CSV format

The input file is a comma-delimited text file (a different field separator may be specified via the argument sep which will be passed to the function read.table).

The first line should contain the phenotype names followed by the marker names. At least one phenotype must be included; for example, include a numerical index for each individual.

The second line should contain blanks in the phenotype columns, followed by chromosome identifiers for each marker in all other columns. If a chromosome has the identifier X or x, it is assumed to be the X chromosome; otherwise, it is assumed to be an autosome.

An optional third line should contain blanks in the phenotype columns, followed by marker positions, in cM.

Marker order is taken from the cM positions, if provided; otherwise, it is taken from the column order.

Subsequent lines should give the data, with one line for each individual, and with phenotypes followed by genotypes. If possible, phenotypes are made numeric; otherwise they are converted to factors.

The cross is determined to be a backcross if only the first two elements of the genotypes vector are found in the data; otherwise, it is assumed to be an intercross.
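
The file layout described above can be sketched by writing a toy file from R and reading it back. All data values, marker names and the file name toy_bc.csv are made up for illustration, and the read.cross call runs only if the qtl package happens to be installed.

```r
# Toy backcross data (all values hypothetical): 4 individuals, 1 phenotype,
# 4 markers on 2 autosomes.
csv_lines <- c(
  "pheno,m1,m2,m3,m4",  # line 1: phenotype names, then marker names
  ",1,1,2,2",           # line 2: blank phenotype cell, then chromosome IDs
  ",0,10,0,15",         # line 3 (optional): blank phenotype cell, then cM positions
  "10.2,A,H,A,H",       # remaining lines: one individual per line
  "11.5,H,H,H,A",
  "9.8,A,-,H,H",        # "-" is one of the default na.strings
  "12.1,H,A,A,A"
)
csv_path <- file.path(tempdir(), "toy_bc.csv")
writeLines(csv_lines, csv_path)

# Only A and H (the first two genotype codes) occur, so by the rule above
# this file should be treated as a backcross.
if (requireNamespace("qtl", quietly = TRUE)) {
  toy <- qtl::read.cross(format = "csv", dir = tempdir(), file = "toy_bc.csv",
                         genotypes = c("A", "H", "B", "D", "C"),
                         na.strings = c("-", "NA"))
  print(class(toy))
}
```

A data set this small is only useful for checking that the file is laid out correctly; it is far too small for any analysis.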

CSVr format

This is just like the csv format, but rotated (or really transposed), so that rows are columns and columns are rows.
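
One way to picture the rotation is to transpose a csv-layout table in base R. The sketch below uses made-up data and a hypothetical output file name; it only illustrates the transposition and does not call read.cross.

```r
# Hypothetical toy data in the csv layout (phenotypes first, then markers).
csv_lines <- c(
  "pheno,m1,m2",
  ",1,1",
  ",0,10",
  "10.2,A,H",
  "11.5,H,H"
)

# Read every cell as text so blank cells and genotype codes survive unchanged.
cells <- as.matrix(read.csv(text = csv_lines, header = FALSE,
                            colClasses = "character"))
rotated <- t(cells)  # rows become columns and columns become rows

csvr_path <- file.path(tempdir(), "toy_bc_rot.csv")
write.table(rotated, csvr_path, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Show the rotated file: one row per phenotype or marker.
rotated_lines <- readLines(csvr_path)
writeLines(rotated_lines)
```

In the rotated file each row starts with a phenotype or marker name, followed by the chromosome ID and position cells (blank for phenotypes), and then one value per individual.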

CSVs format

This is like the csv format, but with separate files for the genotype and phenotype data.

The first column in the genotype data must specify individuals' identifiers, and there must be a column in the phenotype data with precisely the same information, and the individuals must be in precisely the same order in the two files.

In the genotype data file, the second row gives the chromosome IDs. The cell in the second row, first column, must be blank. A third row giving cM positions of markers may be included, in which case the cell in the third row, first column, must be blank.

There need be no blank rows in the phenotype data file.
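
The two-file layout can be sketched as follows. The file names, the ID column name id (used in both files here as an assumption) and all data values are hypothetical. The base-R check confirms that the identifiers agree before the files are passed to read.cross, which is attempted only if the qtl package is installed.

```r
# Genotype file (toy values): the first column holds individual IDs; rows 2
# and 3 give chromosome IDs and cM positions, each with a blank first cell.
gen_lines <- c(
  "id,m1,m2,m3,m4",
  ",1,1,2,2",
  ",0,10,0,15",
  "1,A,H,A,H",
  "2,H,H,H,A",
  "3,A,-,H,H",
  "4,H,A,A,A"
)
# Phenotype file: the same IDs in the same order, and no blank rows.
phe_lines <- c(
  "id,pheno",
  "1,10.2",
  "2,11.5",
  "3,9.8",
  "4,12.1"
)
gen_path <- file.path(tempdir(), "toy_gen.csv")
phe_path <- file.path(tempdir(), "toy_phe.csv")
writeLines(gen_lines, gen_path)
writeLines(phe_lines, phe_path)

# Base-R sanity check: the IDs must agree exactly and be in the same order.
gen <- read.csv(gen_path, colClasses = "character")
phe <- read.csv(phe_path, colClasses = "character")
gen_ids <- gen$id[-(1:2)]  # drop the chromosome-ID and position rows
ids_match <- identical(gen_ids, phe$id)
print(ids_match)

if (ids_match && requireNamespace("qtl", quietly = TRUE)) {
  toy <- qtl::read.cross(format = "csvs", dir = tempdir(),
                         genfile = "toy_gen.csv", phefile = "toy_phe.csv")
  print(class(toy))
}
```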

CSVsr format

This is just like the csvs format, but rotated (or really transposed), so that rows are columns and columns are rows.

Details

The available formats are comma-delimited (csv), rotated comma-delimited (csvr), comma-delimited with separate files for genotype and phenotype data (csvs), rotated comma-delimited with separate files for genotype and phenotype data (csvsr), Mapmaker (mm), Map Manager QTX (qtx), Gary Churchill's format (gary) and Karl Broman's format (karl). The required files and their specifications for each format appear below. The comma-delimited format is recommended. Note that these formats work only for backcross and intercross data.

The sampledata directory in the package distribution contains sample data files in all formats except Gary's.
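
A minimal sketch for locating that directory in an installed copy of the package is shown below. Whether the directory is shipped depends on the installed version, and the file name listeria.csv is an assumption about its contents, so both are checked before anything is read.

```r
sample_dir <- ""
if (requireNamespace("qtl", quietly = TRUE)) {
  # system.file() returns "" when the directory is not present.
  sample_dir <- system.file("sampledata", package = "qtl")
}

if (nzchar(sample_dir)) {
  print(list.files(sample_dir))

  # Assumed file name; skipped silently if this version does not ship it.
  if (file.exists(file.path(sample_dir, "listeria.csv"))) {
    listeria <- qtl::read.cross(format = "csv", dir = sample_dir,
                                file = "listeria.csv")
    print(class(listeria))
  }
} else {
  message("No sampledata directory found in the installed qtl package.")
}
```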