getSparseGRM: Make a `SparseGRMFile` for `GRAB.NullModel`.

Description

If the sample size in an analysis is greater than 100,000, we recommend using a sparse GRM (instead of a dense GRM) to adjust for sample relatedness. This function uses GCTA to make a SparseGRMFile to be passed to function GRAB.NullModel. As required by the GCTA software, this function supports only Linux and PLINK files. Making a SparseGRMFile takes two steps; see the Details section.

Usage

getSparseGRM(
  PlinkFile,
  nPartsGRM,
  SparseGRMFile,
  tempDir = NULL,
  relatednessCutoff = 0.05,
  minMafGRM = 0.01,
  maxMissingGRM = 0.1,
  rm.tempFiles = FALSE
)

Value

A character string containing a message with the path to the output file where the sparse Genetic Relationship Matrix (SparseGRM) has been stored.

Arguments

PlinkFile: a path to PLINK binary files (without file extension). Note that the current version (gcta_1.93.1beta) of GCTA software does not support different prefix names for BIM, BED, and FAM files.
nPartsGRM: a numeric value (e.g. 250). GCTA software can split subjects into multiple parts. For UK Biobank data analysis, it is recommended to set nPartsGRM=250.
SparseGRMFile: a path to the output file, to be passed to GRAB.NullModel.
tempDir: a path to the directory that stores the temp files from getTempFilesFullGRM. This should be consistent with the input of getTempFilesFullGRM. Default is system.file("SparseGRM", "temp", package = "GRAB").
relatednessCutoff: a cutoff for the sparse GRM; only kinship coefficients greater than this cutoff are retained in the sparse GRM. (default=0.05)
minMafGRM: Minimal value of MAF cutoff to select markers (from PLINK files) to make sparse GRM. (default=0.01)
maxMissingGRM: Maximal value of missing rate to select markers (from PLINK files) to make sparse GRM. (default=0.1)
rm.tempFiles: a logical value indicating whether the temp files generated by getTempFilesFullGRM are deleted. (default=FALSE)
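
To illustrate what relatednessCutoff does, here is a minimal base R sketch on a made-up 3 x 3 dense GRM. It is not GCTA's or GRAB's implementation, and the column names ID1, ID2, and Value are chosen only for illustration; they are not claimed to match the actual SparseGRMFile layout.

```r
# Toy dense GRM for three subjects (values are made up for illustration).
ids <- c("s1", "s2", "s3")
grm <- matrix(
  c(1.00, 0.26, 0.01,
    0.26, 1.00, 0.03,
    0.01, 0.03, 1.00),
  nrow = 3, byrow = TRUE, dimnames = list(ids, ids)
)

relatednessCutoff <- 0.05

# Keep each pair once (upper triangle plus diagonal) and only when its
# value is strictly greater than the cutoff.
keep <- upper.tri(grm, diag = TRUE) & grm > relatednessCutoff
idx <- which(keep, arr.ind = TRUE)

# Long (triplet) representation of the retained entries.
# Column names are hypothetical, not the documented file format.
sparse <- data.frame(
  ID1 = ids[idx[, "row"]],
  ID2 = ids[idx[, "col"]],
  Value = grm[idx],
  stringsAsFactors = FALSE
)
print(sparse)
```

In this toy example the pairs (s1, s3) and (s2, s3) fall below the cutoff and are dropped, so only the diagonal and the (s1, s2) pair remain.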

The following shows a typical workflow for creating a sparse GRM:

# Input data (We recommend setting nPartsGRM=250 for UKBB with N=500K):

GenoFile = system.file("extdata", "simuPLINK.bed", package = "GRAB")

PlinkFile = tools::file_path_sans_ext(GenoFile)

nPartsGRM = 2

# Step 1: We strongly recommend parallel computing in high performance clusters (HPC).

# For Linux, get the file path of gcta64 by which command:

gcta64File <- system("which gcta64", intern = TRUE)

# For Windows, set the file path directly instead (uncomment to override the line above):

# gcta64File <- "C:\\path\\to\\gcta64.exe"

# The temp outputs (may be large) will be in system.file("SparseGRM", "temp", package = "GRAB") by default:

for(partParallel in 1:nPartsGRM) getTempFilesFullGRM(PlinkFile, nPartsGRM, partParallel, gcta64File)

# Step 2: Combine files in Step 1 to make a SparseGRMFile

tempDir = system.file("SparseGRM", "temp", package = "GRAB")

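# Place the output next to the temp directory; unlike gsub(), this cannot alter other parts of the path that happen to contain "temp":

SparseGRMFile = file.path(dirname(tempDir), "SparseGRM.txt")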

getSparseGRM(PlinkFile, nPartsGRM, SparseGRMFile)

Details

Step 1: Run getTempFilesFullGRM to save temporary files to tempDir.
Step 2: Run getSparseGRM to combine the temporary files to make a SparseGRMFile to be passed to function GRAB.NullModel.

Users can customize parameters such as minMafGRM, maxMissingGRM, and nPartsGRM, but getTempFilesFullGRM and getSparseGRM must be called with the same values. Otherwise, package GRAB cannot correctly identify the temporary files.
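
One way to keep the two steps consistent is to define the customized values once and reuse them in both calls. The sketch below assumes that getTempFilesFullGRM accepts minMafGRM and maxMissingGRM as named arguments (its signature is not shown on this page), and that gcta64 is on the PATH. The grmSettings list and the canRun guard are hypothetical helpers introduced only for this illustration.

```r
# Shared settings, defined once so Step 1 and Step 2 cannot drift apart.
grmSettings <- list(
  nPartsGRM = 2,
  minMafGRM = 0.05,
  maxMissingGRM = 0.05
)

# Hypothetical guard: only run where GRAB and gcta64 are actually available.
canRun <- requireNamespace("GRAB", quietly = TRUE) && nzchar(Sys.which("gcta64"))

if (canRun) {
  PlinkFile <- tools::file_path_sans_ext(
    system.file("extdata", "simuPLINK.bed", package = "GRAB")
  )
  gcta64File <- unname(Sys.which("gcta64"))
  SparseGRMFile <- tempfile(fileext = ".txt")

  # Step 1: one call per part, using the shared settings.
  # Assumption: getTempFilesFullGRM takes minMafGRM and maxMissingGRM by name.
  for (partParallel in seq_len(grmSettings$nPartsGRM)) {
    GRAB::getTempFilesFullGRM(
      PlinkFile, grmSettings$nPartsGRM, partParallel, gcta64File,
      minMafGRM = grmSettings$minMafGRM,
      maxMissingGRM = grmSettings$maxMissingGRM
    )
  }

  # Step 2: combine the temp files with the same settings.
  GRAB::getSparseGRM(
    PlinkFile, grmSettings$nPartsGRM, SparseGRMFile,
    minMafGRM = grmSettings$minMafGRM,
    maxMissingGRM = grmSettings$maxMissingGRM
  )
}
```

Keeping the values in a single list means a later change to, say, minMafGRM is picked up by both steps automatically.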