readRAW: Creates a BGData Object From a .raw File or a .ped-Like File.

Description

Creates a BGData object from a .raw file (generated with --recodeA in PLINK). Other text-based file formats are also supported by adjusting some of the parameters, as long as the records of individuals are in rows, and phenotypes, covariates and markers are in columns.

Usage

readRAW(fileIn, header = TRUE, dataType = integer(), n = NULL,
  p = NULL, sep = "", na.strings = "NA", nColSkip = 6L,
  idCol = c(1L, 2L), nNodes = NULL, linked.by = "rows",
  folderOut = paste0("BGData_", sub("\\.[[:alnum:]]+$", "",
  basename(fileIn))), outputType = "byte", dimorder = if (linked.by ==
  "rows") 2L:1L else 1L:2L, verbose = FALSE)
readRAW_matrix(fileIn, header = TRUE, dataType = integer(), n = NULL,
  p = NULL, sep = "", na.strings = "NA", nColSkip = 6L,
  idCol = c(1L, 2L), verbose = FALSE)
readRAW_big.matrix(fileIn, header = TRUE, dataType = integer(),
  n = NULL, p = NULL, sep = "", na.strings = "NA", nColSkip = 6L,
  idCol = c(1L, 2L), folderOut = paste0("BGData_",
  sub("\\.[[:alnum:]]+$", "", basename(fileIn))), outputType = "char",
  verbose = FALSE)

Arguments

fileIn

The path to the plaintext file.

header

Whether fileIn contains a header. Defaults to TRUE.

dataType

The coding type of genotypes in fileIn. Use integer() or double() for numeric coding. Alpha-numeric coding is currently not supported for readRAW() and readRAW_big.matrix(): use the --recodeA option of PLINK to convert the .ped file into a .raw file. Defaults to integer().

n

The number of individuals. Auto-detected if NULL. Defaults to NULL.

p

The number of markers. Auto-detected if NULL. Defaults to NULL.

sep

The field separator character. Values on each line of the file are separated by this character. If sep = "" (the default for readRAW()), the separator is "white space", that is, one or more spaces, tabs, newlines or carriage returns.

na.strings

The character string used in the plaintext file to denote missing values. Defaults to "NA".

nColSkip

The number of columns to be skipped to reach the genotype information in the file. Defaults to 6.

idCol

The index of the ID column. If more than one index is given, both columns will be concatenated with "_". Defaults to c(1, 2), i.e. a concatenation of the first two columns.

nNodes

The number of nodes to create. Auto-detected if NULL. Defaults to NULL.

linked.by

If "columns", a column-linked matrix (LinkedMatrix::ColumnLinkedMatrix) is created; if "rows", a row-linked matrix (LinkedMatrix::RowLinkedMatrix) is created. Defaults to "rows".

folderOut

The path to the folder where to save the binary files. Defaults to the name of the input file (fileIn) without extension prefixed with "BGData_".

outputType

The vmode for ff objects and the type for bigmemory::big.matrix objects. Defaults to "byte" for ff and "char" for bigmemory::big.matrix objects.

dimorder

The physical layout of the underlying ff object of each node.

verbose

Whether progress updates will be posted. Defaults to FALSE.

readRAW

Genotypes are stored in a LinkedMatrix::LinkedMatrix object where each node is an ff instance. Multiple ff files are used because the array size in ff is limited to the largest integer that can be represented on the system (.Machine$integer.max), and genetic data often exceeds this limit. The LinkedMatrix package makes it possible to link several ff files together by columns or by rows and treat them similarly to a single matrix. With the default linked.by = "rows" shown under Usage, a LinkedMatrix::RowLinkedMatrix is used for @geno; the user can change this with the linked.by argument. The number of nodes to generate is either specified by the user with the nNodes argument or determined internally so that each ff object has a number of cells that is smaller than .Machine$integer.max / 1.2. A folder (see folderOut) is created that contains the binary flat files (named geno_*.bin) and an external representation of the BGData object in BGData.RData.
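The node-count rule described above can be worked through as plain arithmetic. The following is a minimal sketch in base R, not the package's internal code: the dimensions n and p are hypothetical, and the use of ceiling() to split rows across nodes is an assumption made for illustration.

```r
# Hypothetical dimensions: 500,000 individuals and 1,000,000 markers.
n <- 500000
p <- 1000000

# Upper bound on the number of cells per ff node, as stated above.
max_cells <- .Machine$integer.max / 1.2

# Use doubles so that n * p does not overflow R's integer type.
total_cells <- as.numeric(n) * as.numeric(p)

# Smallest number of nodes whose average size stays within the bound
# (assumption: the real implementation may round differently).
n_nodes <- ceiling(total_cells / max_cells)

# With linked.by = "rows", each node would hold a block of individuals.
rows_per_node <- ceiling(n / n_nodes)
cells_per_node <- rows_per_node * p

print(n_nodes)
print(cells_per_node < max_cells)
```

If nNodes is supplied explicitly, the same check on cells_per_node is a quick way to see whether a chosen value keeps every node under the ff size limit.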

readRAW_matrix

Genotypes are stored in a regular matrix object. Therefore, this function will only work if the .raw file is small enough to fit into memory.

readRAW_big.matrix

Genotypes are stored in a filebacked bigmemory::big.matrix object. A folder (see folderOut) is created that contains the binary flat file (named BGData.bin), a descriptor file (named BGData.desc), and an external representation of the BGData object in BGData.RData.

Reloading a BGData object

To reload a BGData object, use the load.BGData() function rather than base::load(), because base::load() does not initialize ff objects or attach bigmemory::big.matrix objects.
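As an illustration of where the saved object ends up, the sketch below derives the default output folder from a hypothetical input file name using the folderOut expression from the Usage section, then reloads BGData.RData if it exists. The file name chr1.raw is a placeholder, and the envir argument of load.BGData() is assumed to behave like that of base::load().

```r
# Hypothetical input file name; replace with the path passed to readRAW().
fileIn <- "chr1.raw"

# Default folderOut expression, copied from the Usage section.
folderOut <- paste0("BGData_", sub("\\.[[:alnum:]]+$", "", basename(fileIn)))

# The external representation of the BGData object is stored in this file.
rdata <- file.path(folderOut, "BGData.RData")
print(rdata)

# Only attempt the reload when the package and the file are both available.
if (requireNamespace("BGData", quietly = TRUE) && file.exists(rdata)) {
  env <- new.env()
  # Assumption: envir mirrors the argument of the same name in base::load().
  BGData::load.BGData(rdata, envir = env)
  print(ls(env))  # name(s) of the restored object(s)
}
```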

Details

The data in the first nColSkip columns is used to populate the @pheno slot of a BGData object, and the remaining columns are used to fill the @geno slot. If the first row contains a header (header = TRUE), data in this row is used to determine the column names for @pheno and @geno.
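The column split can be pictured with base R alone. This is a minimal sketch under stated assumptions, not how the package reads files internally: it writes a tiny made-up .raw-style file, reads it with read.table(), and separates the columns using the same nColSkip and idCol values as the defaults. The variables pheno, geno, and ids stand in for the corresponding parts of a BGData object.

```r
# Write a tiny PLINK-style .raw file: six leading columns, then two markers.
# All values are made up for illustration.
raw_file <- tempfile(fileext = ".raw")
writeLines(c(
  "FID IID PAT MAT SEX PHENOTYPE snp1_A snp2_G",
  "fam1 ind1 0 0 1 -9 0 1",
  "fam1 ind2 0 0 2 -9 2 NA",
  "fam2 ind1 0 0 1 -9 1 0"
), raw_file)

nColSkip <- 6L
idCol <- c(1L, 2L)

# sep = "" splits on runs of white space.
records <- read.table(raw_file, header = TRUE, sep = "", na.strings = "NA",
                      stringsAsFactors = FALSE)

# The first nColSkip columns play the role of @pheno ...
pheno <- records[, seq_len(nColSkip)]

# ... and the remaining columns play the role of @geno.
geno <- as.matrix(records[, -seq_len(nColSkip)])
storage.mode(geno) <- "integer"

# IDs are the idCol columns concatenated with "_".
ids <- do.call(paste, c(unname(as.list(records[idCol])), sep = "_"))
rownames(geno) <- ids

print(pheno)
print(geno)
```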

@geno can take several forms, depending on the function that is called (readRAW, readRAW_matrix, or readRAW_big.matrix). The sections named after each function describe them in detail.

Examples

# NOT RUN {
# Path to example data
path <- system.file("extdata", package = "BGData")

# Convert the RAW file of chromosome 1 to a BGData object
bg <- readRAW(fileIn = file.path(path, "chr1.raw"))
# }