process_plink: Preprocess PLINK files using the `bigsnpr` package

Description

Preprocess PLINK files using the bigsnpr package

Usage

process_plink(
  data_dir,
  data_prefix,
  rds_dir = data_dir,
  rds_prefix = NULL,
  logfile = NULL,
  impute = TRUE,
  impute_method = "mode",
  id_var = "IID",
  parallel = TRUE,
  quiet = FALSE,
  overwrite = FALSE,
  ...
)
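
As a rough illustration of how data_dir and data_prefix combine, the sketch below builds the expected bed/bim/fam paths and calls process_plink() only if those files exist and the function is available in the session. The directory, the prefix, and the rds_prefix value "mydata_processed" are hypothetical placeholders, not values taken from this documentation.

```r
# Hypothetical inputs: replace with the location and prefix of your own PLINK files.
data_dir    <- "~/my_dir"   # note: no trailing "/"
data_prefix <- "mydata"

# The three PLINK files expected to sit side by side in data_dir.
plink_files <- file.path(data_dir, paste0(data_prefix, c(".bed", ".bim", ".fam")))

# Only attempt the call when the files exist and process_plink() is available
# (i.e., the package that provides it has been loaded).
if (all(file.exists(plink_files)) && exists("process_plink", mode = "function")) {
  rds_path <- process_plink(
    data_dir      = data_dir,
    data_prefix   = data_prefix,
    rds_prefix    = "mydata_processed",  # hypothetical; must differ from data_prefix
    impute_method = "mode"
  )
  print(rds_path)  # filepath of the .rds object that was created
}
```

Because rds_dir defaults to data_dir, the new files in this sketch would be written next to the original PLINK files.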

Value

The filepath to the .rds object created; see details for explanation.

Arguments

data_dir

The path to the bed/bim/fam data files, without a trailing "/" (e.g., use data_dir = '~/my_dir', not data_dir = '~/my_dir/')

data_prefix

The prefix (as a character string) of the bed/bim/fam data files (e.g., data_prefix = 'mydata')

rds_dir

The path to the directory in which you want to create the new .rds and .bk files. Defaults to data_dir

rds_prefix

String specifying the user's preferred filename for the to-be-created .rds file (will be created inside the rds_dir folder). If no rds_prefix is provided, the processed data files will be returned in memory. Note: rds_prefix cannot be the same as data_prefix

logfile

Optional: the prefix (character string) of the log file to be written in rds_dir. Defaults to NULL (no log file written). Note: do not append .log to the filename; this is done automatically.

impute

Logical: should data be imputed? Defaults to TRUE.

impute_method

If impute = TRUE, this argument will specify the kind of imputation desired. Options are:

mode (default): Imputes the most frequent call. See bigsnpr::snp_fastImputeSimple() for details.
random: Imputes by sampling according to allele frequencies.
mean0: Imputes the rounded mean.
mean2: Imputes the mean rounded to 2 decimal places.
xgboost: Imputes using an algorithm based on local XGBoost models. See bigsnpr::snp_fastImpute() for details. Note: this can take several minutes, even for a relatively small data set.
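
To make the three deterministic options concrete, here is a minimal base R sketch of what "mode", "mean0", and "mean2" do to a single variant's genotype vector. The helper impute_simple() is hypothetical and is not the bigsnpr implementation; the stochastic "random" option and the model-based "xgboost" option are deliberately left out.

```r
# Illustrative only: a hypothetical helper, NOT the bigsnpr implementation.
impute_simple <- function(x, method = c("mode", "mean0", "mean2")) {
  method <- match.arg(method)
  obs <- x[!is.na(x)]  # observed (non-missing) genotype calls
  fill <- switch(
    method,
    # most frequent observed call
    mode  = { tab <- table(obs); as.numeric(names(tab)[which.max(tab)]) },
    # mean of observed calls, rounded to the nearest integer
    mean0 = round(mean(obs)),
    # mean of observed calls, rounded to 2 decimal places
    mean2 = round(mean(obs), 2)
  )
  x[is.na(x)] <- fill
  x
}

# Toy genotype vector (0/1/2 allele counts) with two missing calls.
geno <- c(0, 1, NA, 2, 0, 0, NA, 1)

impute_simple(geno, "mode")   # missing calls become 0 (the most frequent call)
impute_simple(geno, "mean0")  # missing calls become 1 (mean 0.667, rounded)
impute_simple(geno, "mean2")  # missing calls become 0.67
```

Note that "mean2" produces non-integer values, which is one reason a genotype matrix of type double is convenient downstream.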

id_var

String specifying which column of the PLINK .fam file has the unique sample identifiers. Options are "IID" (default) and "FID"

parallel

Logical: should the computations within this function be run in parallel? Defaults to TRUE. See count_cores() and ?bigparallelr::assert_cores for more details. In particular, the user should be aware that too much parallelization can make computations slower.

quiet

Logical: should console messages be silenced? Defaults to FALSE

overwrite

Logical: if .bk/.rds files already exist for the specified directory/prefix, should they be overwritten? Defaults to FALSE. Set to TRUE if, for example, you want to change the imputation method.

...

Optional: additional arguments to bigsnpr::snp_fastImpute() (relevant only if impute_method = 'xgboost')

Details

Three files are created in the location specified by rds_dir:

rds_prefix.rds: This is a list with three items:
  (1) X: the filebacked bigmemory::big.matrix object pointing to the imputed genotype data. This matrix has type double, which is important for downstream operations in create_design().
  (2) map: a data.frame with the PLINK bim data (i.e., the variant information).
  (3) fam: a data.frame with the PLINK fam data (i.e., the pedigree information).
rds_prefix.bk: This is the backing file that stores the numeric data of the genotype matrix.
rds_prefix.desc: This is the description file, needed to attach the genotype matrix to the R session.

Note that process_plink() need only be run once for a given set of PLINK files; in subsequent data analysis/scripts, get_data() will access the .rds file.
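
The shape of the saved list can be sketched with a small stand-in object. Everything in the mock below is an assumption made for illustration: the real X is a filebacked matrix rather than an in-memory one, the map and fam column names shown are hypothetical, and the layout (rows as samples, columns as variants) is assumed rather than stated above.

```r
# Stand-in for the list stored in rds_prefix.rds. A small in-memory matrix
# replaces the filebacked genotype matrix purely for illustration.
mock <- list(
  X   = matrix(c(0, 1, 2, 0, 1, 1), nrow = 3, ncol = 2),              # 3 samples x 2 variants (assumed layout)
  map = data.frame(chromosome = c(1, 1), marker = c("rs1", "rs2")),    # hypothetical bim-style columns
  fam = data.frame(FID = c("f1", "f2", "f3"), IID = c("s1", "s2", "s3"))  # hypothetical fam-style columns
)

# Write and re-read the object, mimicking how a later script would load it.
rds_path <- tempfile(fileext = ".rds")
saveRDS(mock, rds_path)
obj <- readRDS(rds_path)

names(obj)  # "X" "map" "fam"

# Sanity checks one might run on a processed data set.
stopifnot(
  is.double(obj$X),               # genotype matrix stored as double
  nrow(obj$fam) == nrow(obj$X),   # one fam row per sample
  nrow(obj$map) == ncol(obj$X),   # one map row per variant
  !anyDuplicated(obj$fam$IID)     # the id_var column should be unique
)
```

The last check reflects the role of id_var: whichever column is chosen ("IID" or "FID") is expected to identify samples uniquely.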

For an example, see the vignette on processing PLINK files.