BigDataStatMeth (version 1.0.3)

bdRemoveMAF_hdf5: Remove SNPs Based on Minor Allele Frequency

Description

Filters SNPs (Single Nucleotide Polymorphisms) based on Minor Allele Frequency (MAF) in genomic data stored in HDF5 format.

Usage

bdRemoveMAF_hdf5(
  filename,
  group,
  dataset,
  outgroup,
  outdataset,
  maf,
  bycols,
  blocksize,
  overwrite = NULL
)

Value

A list with the following components. If an error occurs, all string values are returned as empty strings (""):

fn

Character string with the HDF5 filename

ds

Character string with the full dataset path to the filtered dataset (group/dataset)

nremoved

Integer with the number of SNPs removed by the Minor Allele Frequency (MAF) filter
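
Because errors are signalled through empty strings rather than an R error, it can help to check the result before using it. The following is a minimal sketch, assuming only the component names listed above; `maf_filter_succeeded` is a hypothetical helper that is not part of BigDataStatMeth, and the mock lists (including the `nremoved` value in the failure case) are illustrative assumptions.

```r
# Hypothetical helper (not part of BigDataStatMeth): returns TRUE when the
# result list carries non-empty file name and dataset path strings.
maf_filter_succeeded <- function(res) {
  is.list(res) &&
    is.character(res$fn) && length(res$fn) == 1 && nzchar(res$fn) &&
    is.character(res$ds) && length(res$ds) == 1 && nzchar(res$ds)
}

# Mock results shaped like the components documented above (assumed values)
ok  <- list(fn = "snp_data.hdf5",
            ds = "genotype_filtered/filtered_snps",
            nremoved = 3L)
bad <- list(fn = "", ds = "", nremoved = 0L)

maf_filter_succeeded(ok)   # TRUE
maf_filter_succeeded(bad)  # FALSE
```

In practice the same check could be applied to the value returned by bdRemoveMAF_hdf5 before reading the filtered dataset.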

Arguments

filename

Character string. Path to the HDF5 file.

group

Character string. Path to the group containing the input dataset.

dataset

Character string. Name of the dataset to filter.

outgroup

Character string. Output group path for filtered data.

outdataset

Character string. Output dataset name for filtered data.

maf

Numeric (optional). MAF threshold for filtering (0-1). Default is 0.05. SNPs with MAF above this threshold are removed.

bycols

Logical (optional). Whether to process by columns (TRUE) or rows (FALSE). Default is FALSE.

blocksize

Integer (optional). Block size for processing. Default is 100. Larger values use more memory but may be faster.

overwrite

Logical (optional). Whether to overwrite an existing output dataset. Default is FALSE.
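
To make the role of blocksize more concrete, the sketch below shows one way a dimension of length n can be split into consecutive blocks. It is an illustration of the general idea only; `block_ranges` is a hypothetical helper, and the package's internal blocking logic may differ.

```r
# Hypothetical helper: start and end indices of consecutive blocks that
# together cover 1..n, each holding at most `blocksize` elements.
block_ranges <- function(n, blocksize = 100L) {
  starts <- seq.int(1L, n, by = blocksize)
  ends   <- pmin(starts + blocksize - 1L, n)  # last block may be shorter
  data.frame(start = starts, end = ends)
}

# 250 SNPs processed in blocks of 100: two full blocks and one of 50
block_ranges(250, 100)
```

Only one block needs to be held in memory at a time, which is why a larger blocksize trades memory for fewer read and write operations.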

Details

This function provides efficient MAF-based filtering capabilities with:

  • Filtering options:

    • MAF threshold-based filtering

    • Row-wise or column-wise processing

    • Block-based processing

  • Implementation features:

    • Memory-efficient processing

    • Block-based operations

    • Safe file operations

    • Progress reporting

The function supports both in-place modification and creation of new datasets.
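
For readers who want to see what a MAF value looks like on a small matrix, here is a minimal in-memory sketch in base R. It assumes genotypes coded as 0, 1, or 2 copies of one allele and uses the common definition MAF = min(p, 1 - p), where p is the frequency of the counted allele. `compute_maf` and its `snps_in_cols` argument are hypothetical names introduced for illustration; this is not the package's HDF5-based implementation, and its handling of missing values is an assumption.

```r
# Illustrative only: MAF per SNP for genotypes coded 0/1/2.
# snps_in_cols = TRUE treats each column as one SNP, FALSE treats each row
# as one SNP. Missing genotypes (NA) are skipped (an assumption).
compute_maf <- function(geno, snps_in_cols = TRUE) {
  margin <- if (snps_in_cols) 2L else 1L
  apply(geno, margin, function(g) {
    g <- g[!is.na(g)]
    if (length(g) == 0L) return(NA_real_)
    p <- sum(g) / (2 * length(g))  # frequency of the counted allele
    min(p, 1 - p)                  # minor allele frequency
  })
}

geno <- cbind(
  snp1 = c(0, 0, 0, 0),  # monomorphic
  snp2 = c(0, 1, 0, 1),  # 2 of 8 alleles counted
  snp3 = c(2, 2, 1, 2)   # 7 of 8 alleles counted
)

maf_values <- compute_maf(geno)
maf_values          # snp1 = 0, snp2 = 0.25, snp3 = 0.125
maf_values > 0.1    # which SNPs lie above a threshold of 0.1
```

Comparing these values with the maf threshold shows which side of the cutoff each SNP falls on for a given dataset.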

References

  • The HDF Group. (2000-2010). HDF5 User's Guide.

  • Marees, A. T., et al. (2018). A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis. International Journal of Methods in Psychiatric Research, 27(2), e1608.

See Also

  • bdRemovelowdata_hdf5 for removing low-representation SNPs

  • bdImputeSNPs_hdf5 for imputing missing SNP values

Examples

if (FALSE) {
library(BigDataStatMeth)

# Create test SNP data
snps <- matrix(sample(c(0, 1, 2), 1000, replace = TRUE,
                     prob = c(0.7, 0.2, 0.1)), 100, 10)

# Save to HDF5
fn <- "snp_data.hdf5"
bdCreate_hdf5_matrix(fn, snps, "genotype", "raw_snps",
                     overwriteFile = TRUE)

# Remove SNPs with high MAF
bdRemoveMAF_hdf5(
  filename = fn,
  group = "genotype",
  dataset = "raw_snps",
  outgroup = "genotype_filtered",
  outdataset = "filtered_snps",
  maf = 0.1,
  bycols = TRUE,
  blocksize = 50
)

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}
}
