Creates a BGData object from a .raw file (generated with --recodeA
in PLINK). Other text-based file
formats are supported as well by tweaking some of the parameters as long as
the records of individuals are in rows, and phenotypes, covariates and
markers are in columns.
readRAW(fileIn, header = TRUE, dataType = integer(), n = NULL,
p = NULL, sep = "", na.strings = "NA", nColSkip = 6L,
idCol = c(1L, 2L), nNodes = NULL, linked.by = "rows",
folderOut = paste0("BGData_", sub("\\.[[:alnum:]]+$", "",
basename(fileIn))), outputType = "byte", dimorder = if (linked.by ==
"rows") 2L:1L else 1L:2L, verbose = FALSE)
readRAW_matrix(fileIn, header = TRUE, dataType = integer(), n = NULL,
p = NULL, sep = "", na.strings = "NA", nColSkip = 6L,
idCol = c(1L, 2L), verbose = FALSE)
readRAW_big.matrix(fileIn, header = TRUE, dataType = integer(),
n = NULL, p = NULL, sep = "", na.strings = "NA", nColSkip = 6L,
idCol = c(1L, 2L), folderOut = paste0("BGData_",
sub("\\.[[:alnum:]]+$", "", basename(fileIn))), outputType = "char",
verbose = FALSE)
fileIn
The path to the plaintext file.
header
Whether fileIn contains a header. Defaults to TRUE.
dataType
The coding type of genotypes in fileIn. Use integer() or double() for
numeric coding. Alpha-numeric coding is currently not supported for
readRAW() and readRAW_big.matrix(): use the --recodeA option of PLINK to
convert the .ped file into a .raw file. Defaults to integer().
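The zero-length vectors integer() and double() act as type prototypes. The sketch below shows the same convention with base R's scan(), whose `what` argument takes such a prototype. Whether readRAW() parses with scan() internally is an assumption here, and the sample strings are made up for illustration.

```r
# Hypothetical genotype fields as they might appear on one line of a .raw file
fields <- "0 1 2 NA 1"

# integer() as the prototype: allele counts are parsed as integers
asInteger <- scan(text = fields, what = integer(), na.strings = "NA",
                  quiet = TRUE)

# double() as the prototype: suitable for non-integer coding such as dosages
dosages <- scan(text = "0.0 1.2 1.9 NA 0.7", what = double(),
                na.strings = "NA", quiet = TRUE)

typeof(asInteger)
typeof(dosages)
```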
n
The number of individuals. Auto-detect if NULL. Defaults to NULL.
p
The number of markers. Auto-detect if NULL. Defaults to NULL.
sep
The field separator character. Values on each line of the file are
separated by this character. If sep = "" (the default for readRAW()), the
separator is "white space", that is, one or more spaces, tabs, newlines or
carriage returns.
na.strings
The character string used in the plaintext file to denote missing
values. Defaults to "NA".
nColSkip
The number of columns to be skipped to reach the genotype information in
the file. Defaults to 6.
idCol
The index of the ID column. If more than one index is given, the columns
will be concatenated with "_". Defaults to c(1, 2), i.e. a concatenation of
the first two columns.
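The ID concatenation can be reproduced in base R. This is a minimal sketch with a made-up data frame standing in for the first two columns of a .raw file; it is not taken from the package source.

```r
# Hypothetical first two columns of a PLINK .raw file:
# family ID (FID) and individual ID (IID)
pheno <- data.frame(
  FID = c("fam1", "fam1", "fam2"),
  IID = c("ind1", "ind2", "ind1"),
  stringsAsFactors = FALSE
)

idCol <- c(1L, 2L)

# Concatenate the selected columns with "_" to form one ID per individual
ids <- do.call(paste, c(as.list(pheno[idCol]), sep = "_"))
ids
```

Combining both columns keeps the IDs unique even though the individual ID "ind1" occurs in two families.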
nNodes
The number of nodes to create. Auto-detect if NULL. Defaults to NULL.
linked.by
If "columns", a column-linked matrix (LinkedMatrix::ColumnLinkedMatrix) is
created; if "rows", a row-linked matrix (LinkedMatrix::RowLinkedMatrix).
Defaults to "rows".
folderOut
The path to the folder where to save the binary files. Defaults to the
name of the input file (fileIn) without extension prefixed with "BGData_".
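The default can be checked by evaluating the expression from the usage section on its own. The input paths below are hypothetical.

```r
# Hypothetical input paths
fileIn <- c("/data/genotypes/chr1.raw", "cohort.v2.raw")

# Same expression as the default for folderOut: take the file name, drop its
# last extension, and prefix the result with "BGData_"
folderOut <- paste0("BGData_", sub("\\.[[:alnum:]]+$", "", basename(fileIn)))
folderOut
```

Only the final extension is removed, and the directory part of fileIn is discarded, so the folder is created relative to the current working directory.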
outputType
The vmode for ff objects and the type for bigmemory::big.matrix objects.
Defaults to byte for ff and char for bigmemory::big.matrix objects.
dimorder
The physical layout of the underlying ff object of each node.
verbose
Whether progress updates will be posted. Defaults to FALSE.
With readRAW(), genotypes are stored in a LinkedMatrix::LinkedMatrix object
where each node is an ff instance. Multiple ff files are used because the
array size in ff is limited to the largest integer that can be represented
on the system (.Machine$integer.max), and genetic data often exceed this
limit. The LinkedMatrix package makes it possible to link several ff files
together by columns or by rows and treat them similarly to a single matrix.
By default (linked.by = "rows") a LinkedMatrix::RowLinkedMatrix is used for
@geno, but the user can change this with the linked.by argument. The number
of nodes to generate is either specified by the user through the nNodes
argument or determined internally so that each ff object has fewer than
.Machine$integer.max / 1.2 cells. A folder (see folderOut) is created that
contains the binary flat files (named geno_*.bin) and an external
representation of the BGData object in BGData.RData.
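The node count implied by this bound can be worked out by hand. The sketch below uses hypothetical dimensions; the exact rounding used inside the package is an assumption.

```r
# Hypothetical data set dimensions
n <- 50000   # individuals
p <- 500000  # markers

# Upper bound on the number of cells per ff object, as stated in the text
maxCells <- .Machine$integer.max / 1.2

# as.numeric() guards against integer overflow when n and p are integers
totalCells <- as.numeric(n) * as.numeric(p)

# Number of nodes implied by the bound (assumption: the package may round
# differently)
nNodes <- ceiling(totalCells / maxCells)

# With linked.by = "rows", each node holds a block of rows and all p columns
rowsPerNode <- ceiling(n / nNodes)
cellsPerNode <- rowsPerNode * p

nNodes
cellsPerNode < maxCells
```

Passing nNodes explicitly overrides this automatic choice.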
With readRAW_matrix(), genotypes are stored in a regular matrix object.
This function will therefore only work if the .raw file is small enough to
fit into memory.
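A rough size estimate helps decide whether an in-memory matrix is feasible. The sketch below assumes integer genotypes at 4 bytes per cell (8 bytes for double()) and uses hypothetical dimensions. Reading the file may need additional temporary memory that is not counted here.

```r
# Hypothetical dimensions of the genotype matrix
n <- 2000    # individuals
p <- 50000   # markers

# R stores integers in 4 bytes and doubles in 8 bytes per cell
bytesInteger <- as.numeric(n) * p * 4
bytesDouble <- as.numeric(n) * p * 8

# Approximate size in MiB
mibInteger <- bytesInteger / 1024^2
mibDouble <- bytesDouble / 1024^2

# Sanity check on a small matrix: object.size() reports the cell storage
# plus a small fixed overhead
m <- matrix(0L, nrow = 100, ncol = 50)
actualBytes <- as.numeric(object.size(m))

mibInteger
mibDouble
actualBytes
```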
With readRAW_big.matrix(), genotypes are stored in a filebacked
bigmemory::big.matrix object. A folder (see folderOut) is created that
contains the binary flat file (named BGData.bin), a descriptor file (named
BGData.desc), and an external representation of the BGData object in
BGData.RData.
To reload a BGData object, it is recommended to use the load.BGData()
function instead of base::load(), as base::load() does not initialize ff
objects or attach bigmemory::big.matrix objects.
The data included in the first few columns (up to nColSkip) is used to
populate the @pheno slot of a BGData object, and the remaining columns are
used to fill the @geno slot. If the first row contains a header
(header = TRUE), data in this row is used to determine the column names for
@pheno and @geno.
@geno can take several forms, depending on the function that is called
(readRAW, readRAW_matrix, or readRAW_big.matrix). Each function is
described in detail in its own section.
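The split between phenotype and genotype columns can be illustrated with base R alone. This is a minimal sketch, assuming a tiny made-up .raw file held in a string. It mimics the column layout only and does not create a BGData object. Using the concatenated IDs as row names is an assumption made for illustration.

```r
# Hypothetical contents of a small PLINK .raw file: header plus three individuals
rawText <- "FID IID PAT MAT SEX PHENOTYPE snp1_A snp2_G snp3_T
fam1 ind1 0 0 1 -9 0 1 2
fam1 ind2 0 0 2 -9 1 NA 0
fam2 ind1 0 0 1 -9 2 2 1"

nColSkip <- 6L

# sep = "" splits on white space; header = TRUE takes column names from row 1
raw <- read.table(text = rawText, header = TRUE, sep = "", na.strings = "NA",
                  stringsAsFactors = FALSE)

# First nColSkip columns: the part that would populate @pheno
pheno <- raw[, seq_len(nColSkip)]

# Remaining columns: the part that would fill @geno, here a plain matrix
geno <- as.matrix(raw[, -seq_len(nColSkip)])

# Assumption for illustration: label rows with the concatenated IDs
rownames(geno) <- paste(raw[[1]], raw[[2]], sep = "_")

dim(pheno)
geno
```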
load.BGData() to load a previously saved BGData object,
as.BGData() to create BGData objects from non-text files (e.g. BED files).
# NOT RUN {
# Path to example data
path <- system.file("extdata", package = "BGData")
# Convert RAW files of chromosome 1 to a BGData object
bg <- readRAW(fileIn = paste0(path, "/chr1.raw"))
# }