BigDataStatMeth (version 1.0.3)

bdRemovelowdata_hdf5: Remove Low-Representation SNPs from HDF5 Dataset

Description

Removes SNPs (Single Nucleotide Polymorphisms) with low representation from genomic data stored in HDF5 format.

Usage

bdRemovelowdata_hdf5(
  filename,
  group,
  dataset,
  outgroup,
  outdataset,
  pcent,
  bycols,
  overwrite = NULL
)

Value

A list with the following components. If an error occurs, the string components (fn and ds) are returned as empty strings (""):

fn

Character string with the HDF5 filename

ds

Character string with the full dataset path to the filtered dataset (group/dataset)

nremoved

Integer with the number of rows or columns removed because their representation fell below the threshold
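
Because errors are signalled through empty strings rather than an R error, callers may want to check the returned list before using it. The following is a minimal sketch under that assumption. `check_filter_result()` and `is_nonempty_string()` are hypothetical helpers, not part of BigDataStatMeth, and the `good` and `bad` lists are mock values shaped like the documented return value.

```r
# Hypothetical helpers: neither function is part of BigDataStatMeth.
is_nonempty_string <- function(x) {
  is.character(x) && length(x) == 1 && !is.na(x) && nzchar(x)
}

check_filter_result <- function(res) {
  # The Value section states that fn and ds are "" when an error occurs.
  ok <- is.list(res) && is_nonempty_string(res$fn) && is_nonempty_string(res$ds)
  if (!ok) {
    stop("bdRemovelowdata_hdf5 reported an error (empty fn or ds)")
  }
  invisible(res)
}

# Mock results with the documented shape (values are made up for illustration).
good <- list(fn = "snp_data.hdf5",
             ds = "genotype_filtered/filtered_snps",
             nremoved = 2L)
bad <- list(fn = "", ds = "", nremoved = 0L)

checked <- check_filter_result(good)  # passes and returns the list invisibly
failed <- inherits(try(check_filter_result(bad), silent = TRUE), "try-error")
```

In real use, `res <- bdRemovelowdata_hdf5(...)` would take the place of the mock lists.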

Arguments

filename

Character string. Path to the HDF5 file.

group

Character string. Path to the group containing the input dataset.

dataset

Character string. Name of the dataset to filter.

outgroup

Character string. Output group path for filtered data.

outdataset

Character string. Output dataset name for filtered data.

pcent

Numeric (optional). Removal threshold, given as a proportion between 0 and 1 rather than a percentage. Default is 0.5. SNPs with representation below this threshold are removed.

bycols

Logical (optional). Whether to filter by columns (TRUE) or rows (FALSE). Default is TRUE.

overwrite

Logical (optional). Whether to overwrite an existing dataset. Default is FALSE.

Details

This function provides efficient filtering capabilities for genomic data with support for:

  • Filtering options:

    • Row-wise or column-wise filtering

    • Configurable threshold percentage

    • Flexible output location

  • Implementation features:

    • Memory-efficient processing

    • Safe file operations

    • Comprehensive error handling

    • Progress reporting

The function supports both in-place modification and creation of new datasets.
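
To make the filtering rule concrete, here is a small in-memory sketch in base R. It assumes that "representation" means the share of non-missing (non-NA) values in a row or column, and that entries strictly below `pcent` are dropped. Both are assumptions drawn from the argument descriptions, so the package's exact rule may differ at the boundary. `filter_low_representation()` is a hypothetical function for illustration; the package itself operates on the HDF5 file rather than on an in-memory matrix.

```r
# In-memory illustration only; not the package's implementation.
filter_low_representation <- function(x, pcent = 0.5, bycols = TRUE) {
  # Share of non-missing entries per column (bycols = TRUE) or per row.
  representation <- if (bycols) colMeans(!is.na(x)) else rowMeans(!is.na(x))
  # Assumed rule: keep entries whose representation reaches the threshold.
  keep <- representation >= pcent
  filtered <- if (bycols) x[, keep, drop = FALSE] else x[keep, , drop = FALSE]
  list(data = filtered, nremoved = sum(!keep))
}

# Three SNPs (columns) observed on four samples (rows).
snps <- matrix(c(0, 1, 2, NA,    # column 1: 3 of 4 values present
                 NA, NA, NA, 1,  # column 2: 1 of 4 values present
                 2, 2, 0, 1),    # column 3: all values present
               nrow = 4)

res <- filter_low_representation(snps, pcent = 0.5, bycols = TRUE)
res$nremoved   # 1: only column 2 falls below the 0.5 threshold
dim(res$data)  # 4 rows, 2 remaining columns
```

Comparing `nremoved` from a sketch like this with the value returned by `bdRemovelowdata_hdf5` on the same data is one way to confirm how the threshold is applied.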

References

  • The HDF Group. (2000-2010). HDF5 User's Guide.

  • Marchini, J., & Howie, B. (2010). Genotype imputation for genome-wide association studies. Nature Reviews Genetics, 11(7), 499-511.

See Also

  • bdImputeSNPs_hdf5 for imputing missing SNP values

  • bdCreate_hdf5_matrix for creating HDF5 matrices

Examples

if (FALSE) {
library(BigDataStatMeth)

# Create test SNP data with missing values (seeded for reproducibility)
set.seed(123)
snps <- matrix(sample(c(0, 1, 2, NA), 100, replace = TRUE,
                     prob = c(0.3, 0.3, 0.3, 0.1)), 10, 10)

# Save to HDF5
fn <- "snp_data.hdf5"
bdCreate_hdf5_matrix(fn, snps, "genotype", "raw_snps",
                     overwriteFile = TRUE)

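# Remove SNPs (columns) with low representation and keep the returned list
res <- bdRemovelowdata_hdf5(
  filename = fn,
  group = "genotype",
  dataset = "raw_snps",
  outgroup = "genotype_filtered",
  outdataset = "filtered_snps",
  pcent = 0.3,
  bycols = TRUE
)

# Inspect the output location and the number of columns removed
res$ds
res$nremoved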

# Cleanup
if (file.exists(fn)) {
  file.remove(fn)
}
}