Generate knockoff variables for genotype data using the multiple knockoff method with leveraging scores and clustering, optimized specifically for genetic variant data.
create_knockoffs(
  X,
  pos,
  chr_info = NULL,
  sample_ids = NULL,
  M = 5,
  save_gds = TRUE,
  output_dir = NULL,
  start = NULL,
  end = NULL,
  corr_max = 0.75,
  maxN_neighbor = Inf,
  maxBP_neighbor = 1e+05,
  n_AL = floor(10 * nrow(X)^(1/3) * log(nrow(X))),
  thres_ultrarare = 25,
  R2_thres = 1,
  prob_eps = 1e-12,
  irlba_maxit = 1500
)
If save_gds is TRUE, returns the path to the saved GDS file. Otherwise, returns a list of M matrices, each of the same dimensions as X, containing knockoff variables.
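Because the return value is either a file path or a list of matrices, downstream code may want to check which form it received. The following is a minimal sketch of such a check; `summarize_knockoff_result` is a hypothetical helper written for illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): describe what
# create_knockoffs() returned, based on the two documented return forms.
summarize_knockoff_result <- function(res) {
  if (is.character(res) && length(res) == 1) {
    # save_gds = TRUE: a single path to the saved GDS file
    list(type = "gds_path", path = res, exists = file.exists(res))
  } else if (is.list(res)) {
    # save_gds = FALSE: a list of M knockoff matrices
    list(type = "matrix_list", M = length(res), dims = lapply(res, dim))
  } else {
    stop("Unexpected return value: expected a file path or a list of matrices")
  }
}

# Usage with stand-in values that mimic the two return forms
as_path <- summarize_knockoff_result("knockoffs.gds")
as_list <- summarize_knockoff_result(list(matrix(0, 4, 3), matrix(0, 4, 3)))
```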
X: A sparse matrix (n x p) of genotype data, where n is the number of samples and p is the number of SNPs. Typically coded as 0, 1, 2 for genotype dosages.
pos: A numeric vector of SNP positions (in base pairs), used for linkage-disequilibrium-aware knockoff generation.
chr_info: Optional chromosome information. Either (1) a data frame of BIM-file information containing a column named "chr" or "CHR" with chromosome numbers, or (2) a vector of chromosome numbers. In both cases the chromosome is extracted automatically.
sample_ids: A character vector of sample IDs (default: NULL, in which case IDs are generated).
M: Number of knockoff copies to generate (default: 5). More copies can improve statistical power but increase computational cost.
save_gds: Whether to save the knockoffs in GDS format (default: TRUE).
output_dir: Directory in which to save GDS files (default: NULL, which uses tempdir()).
start: Start position used for file naming (default: NULL, which uses min(pos)).
end: End position used for file naming (default: NULL, which uses max(pos)).
corr_max: Maximum correlation threshold for clustering variants (default: 0.75). Higher values create fewer, larger clusters.
maxN_neighbor: Maximum number of neighboring variants to consider for each variant (default: Inf).
maxBP_neighbor: Maximum base-pair distance at which variants are considered neighbors (default: 1e+05, i.e. 100,000 bp).
n_AL: Number of samples to use for adaptive lasso fitting (default: floor(10 * nrow(X)^(1/3) * log(nrow(X))), determined from the sample size).
thres_ultrarare: Minimum minor allele count threshold for variant inclusion (default: 25).
R2_thres: R-squared threshold for model fitting (default: 1).
prob_eps: Minimum probability value, used to prevent numerical issues (default: 1e-12).
irlba_maxit: Maximum number of iterations for the truncated SVD (default: 1500).
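To make the inputs and a few of the defaults concrete, here is a small sketch that builds toy data in the documented shapes and evaluates quantities taken directly from the signature. The simulated genotypes, the variable names, and the exact comparison operators are assumptions made for illustration; they do not reproduce the function's internals. The final call is guarded so that it only runs if `create_knockoffs` is available in the session, and it has not been verified on data this small.

```r
library(Matrix)  # recommended package shipped with standard R installations

set.seed(1)
n <- 1000  # samples
p <- 30    # variants

# Toy genotype dosages (0/1/2), one column per variant, stored sparsely.
maf <- runif(p, min = 0.01, max = 0.3)
G <- sapply(maf, function(f) rbinom(n, size = 2, prob = f))
X <- Matrix(G, sparse = TRUE)

# Sorted base-pair positions, one per column of X.
pos <- sort(sample(1e6:2e6, p))

# Default n_AL exactly as written in the signature; 690 for n = 1000.
n_AL <- floor(10 * nrow(X)^(1/3) * log(nrow(X)))

# Minor allele count per variant, compared with the default thres_ultrarare.
# Whether the function uses a strict or non-strict comparison is an assumption.
allele_count <- colSums(X)
mac <- pmin(allele_count, 2 * nrow(X) - allele_count)
below_threshold <- mac < 25

# Variants within the default maxBP_neighbor window of the first variant
# (illustrative reading of "neighbor"; the variant itself is included here).
neighbors_of_first <- which(abs(pos - pos[1]) <= 1e5)

# Guarded call: runs only when the package providing create_knockoffs is loaded.
if (exists("create_knockoffs")) {
  knockoffs <- create_knockoffs(X, pos, M = 5, save_gds = FALSE)
}
```

With save_gds = FALSE the result, per the return-value description, would be a list of M matrices with the same dimensions as X.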