Preprocess PLINK files using the bigsnpr package
process_plink(
data_dir,
data_prefix,
rds_dir = data_dir,
rds_prefix = NULL,
logfile = NULL,
impute = TRUE,
impute_method = "mode",
id_var = "IID",
parallel = TRUE,
quiet = FALSE,
overwrite = FALSE,
...
)
The filepath to the .rds object created; see details for explanation.
data_dir: The path to the directory containing the bed/bim/fam data files, without a trailing "/" (e.g., use data_dir = '~/my_dir', not data_dir = '~/my_dir/').
data_prefix: The prefix (as a character string) of the bed/bim/fam data files (e.g., data_prefix = 'mydata').
rds_dir: The path to the directory in which you want to create the new .rds and .bk files. Defaults to data_dir.
rds_prefix: String specifying the user's preferred filename for the to-be-created .rds file (will be created inside the rds_dir folder). If no rds_prefix is provided, the processed data files will be returned in memory. Note: rds_prefix cannot be the same as data_prefix.
logfile: Optional: the prefix (character string) of the logfile to be written in rds_dir. Defaults to NULL (no log file written). Note: do not append .log to the filename; this is done automatically.
impute: Logical: should data be imputed? Defaults to TRUE.
impute_method: If impute = TRUE, this argument specifies the kind of imputation desired. Options are:
  mode (default): Imputes the most frequent call. See bigsnpr::snp_fastImputeSimple() for details.
  random: Imputes by sampling according to allele frequencies.
  mean0: Imputes the rounded mean.
  mean2: Imputes the mean rounded to 2 decimal places.
  xgboost: Imputes using an algorithm based on local XGBoost models. See bigsnpr::snp_fastImpute() for details. Note: this can take several minutes, even for a relatively small data set.
id_var: String specifying which column of the PLINK .fam file has the unique sample identifiers. Options are "IID" (default) and "FID".
parallel: Logical: should the computations within this function be run in parallel? Defaults to TRUE. See count_cores() and ?bigparallelr::assert_cores for more details. In particular, the user should be aware that too much parallelization can make computations slower.
quiet: Logical: should console messages be silenced? Defaults to FALSE.
overwrite: Logical: if .bk/.rds files already exist for the specified directory/prefix, should they be overwritten? Defaults to FALSE. Set to TRUE if you want to change the imputation method you're using, etc.
...: Optional: additional arguments to bigsnpr::snp_fastImpute() (relevant only if impute_method = 'xgboost').
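The simple imputation options for impute_method can be illustrated on a toy genotype vector. The base R sketch below does not call bigsnpr and is not how the package implements these methods. It only mimics the described behavior of mode, mean0, mean2, and random for a single variant. The binomial draw used for random is an assumption about what "sampling according to allele frequencies" means, and the helper impute_with() is hypothetical.

```r
# Toy genotype vector for one variant (0/1/2 allele counts, NA = missing).
g <- c(0, 1, 2, NA, 1, 1, NA, 0)
obs <- g[!is.na(g)]
n_missing <- sum(is.na(g))

# mode: the most frequent observed call
mode_val <- as.numeric(names(which.max(table(obs))))

# mean0: the observed mean, rounded to the nearest integer
mean0_val <- round(mean(obs))

# mean2: the observed mean, rounded to 2 decimal places
mean2_val <- round(mean(obs), 2)

# random: one draw per missing call, using the estimated allele frequency
# (assumption: binomial with 2 trials and prob = mean / 2)
set.seed(1)
random_vals <- rbinom(n_missing, size = 2, prob = mean(obs) / 2)

# Hypothetical helper: fill the missing entries of x with value(s).
impute_with <- function(x, value) {
  x[is.na(x)] <- value
  x
}

g_mode   <- impute_with(g, mode_val)
g_mean2  <- impute_with(g, mean2_val)
g_random <- impute_with(g, random_vals)
```

Note that mean2 produces non-integer values, which is one reason a genotype matrix of type double is useful downstream.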
Three files are created in the location specified by rds_dir:
rds_prefix.rds: This is a list with three items:
(1) X: the filebacked bigmemory::big.matrix object pointing to the imputed genotype data.
This matrix has type double, which is important for downstream operations in create_design()
(2) map: a data.frame with the PLINK bim data (i.e., the variant information)
(3) fam: a data.frame with the PLINK fam data (i.e., the pedigree information)
rds_prefix.bk: This is the backing file that stores the numeric data of the genotype matrix.
rds_prefix.desc: This is the description file, needed to attach the genotype matrix to the R session.
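As a quick check after processing, the three expected file names can be built from rds_dir and rds_prefix and tested for existence. This is a minimal sketch in base R. The directory and prefix values are hypothetical, and the naming pattern is taken from the list above.

```r
# Hypothetical values; use the rds_dir and rds_prefix you passed to process_plink().
rds_dir    <- "~/my_dir"
rds_prefix <- "mydata_processed"

# Expected outputs: <rds_prefix>.rds, <rds_prefix>.bk, <rds_prefix>.desc
expected_files <- file.path(rds_dir, paste0(rds_prefix, c(".rds", ".bk", ".desc")))

# TRUE/FALSE for each file; path.expand() resolves the leading "~".
files_present <- file.exists(path.expand(expected_files))
names(files_present) <- basename(expected_files)
files_present
```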
Note that process_plink() need only be run once for a given set of PLINK
files; in subsequent data analysis/scripts, get_data() will access the .rds file.
For an example, see the vignette on processing PLINK files.
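A call might look like the following sketch, which uses only the arguments listed in the usage above. The directory, prefixes, and the variable names bed_file and rds_path are hypothetical. The guard is there so the snippet does nothing unless the package exporting process_plink() is attached and the PLINK files are present.

```r
# Hypothetical locations; replace with your own PLINK files.
data_dir    <- "~/my_dir"   # no trailing "/"
data_prefix <- "mydata"     # expects mydata.bed, mydata.bim, mydata.fam

bed_file <- path.expand(file.path(data_dir, paste0(data_prefix, ".bed")))

# Run only when process_plink() is available and the .bed file exists.
if (exists("process_plink") && file.exists(bed_file)) {
  rds_path <- process_plink(
    data_dir      = data_dir,
    data_prefix   = data_prefix,
    rds_prefix    = "mydata_processed",  # must differ from data_prefix
    impute        = TRUE,
    impute_method = "mode",
    id_var        = "IID",
    quiet         = TRUE
  )
  # The return value is described as the filepath to the .rds object.
  print(rds_path)
}
```

Later scripts can then point get_data() at the stored .rds file instead of rerunning process_plink().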