standardize: Standardize Allelic Ratio Data and Compute BAF and Z-Scores

Description

This function performs signal standardization of genotype data by aligning `theta` values (allelic ratios or normalized intensities) to expected genotype clusters. It outputs standardized BAF (B-allele frequency) and Z-scores per sample and marker.

Usage

standardize(
  data = NULL,
  genos = NULL,
  geno.pos = NULL,
  threshold.missing.geno = 0.9,
  threshold.geno.prob = 0.8,
  ploidy.standardization = NULL,
  threshold.n.clusters = NULL,
  n.cores = 1,
  out_filename = NULL,
  type = "intensities",
  multidog_obj = NULL,
  parallel.type = "PSOCK",
  verbose = TRUE,
  rm_outlier = TRUE,
  cluster_median = TRUE
)

Value

An object of class `"qploidy_standardization"` (list) with the following components:

info: Named vector of standardization parameters.
filters: Named vector summarizing how many markers were removed at each filtering step.
data: A data.frame containing merged BAF, Z-score, and genotype information by marker and sample.

Arguments

data

A `data.frame` containing the full dataset with the following columns:

MarkerName: Marker identifiers.
SampleName: Sample identifiers.
X: Reference allele intensity or count.
Y: Alternative allele intensity or count.
R: Total signal intensity or read depth (X + Y).
ratio: Allelic ratio, typically Y / (X + Y).

genos

A `data.frame` containing genotype dosage information for the reference panel. This should include samples of known ploidy and ideally euploid individuals. Required columns:

MarkerName: Marker identifiers.
SampleName: Sample identifiers.
geno: Estimated dosage (0, 1, 2, ...).
prob: Genotype call probability (used for filtering low-confidence genotypes).

geno.pos

A `data.frame` with marker position metadata. Required columns:

MarkerName: Marker identifiers.
Chromosome: Chromosome names.
Position: Base-pair positions on the genome.

threshold.missing.geno

Numeric (0–1). Maximum fraction of missing genotype data allowed per marker. Markers whose missing fraction exceeds this value are removed.

threshold.geno.prob

Numeric (0–1). Minimum genotype call probability. Genotypes called with a lower probability are treated as missing.

ploidy.standardization

Integer. The ploidy level of the reference panel used for standardization.

threshold.n.clusters

Integer. Minimum number of expected dosage clusters per marker. For diploid data, this is typically 3 (corresponding to genotypes 0, 1, and 2).

n.cores

Integer. Number of cores to use in parallel computations (e.g., for cluster center estimation and BAF generation).

out_filename

Optional. Path to save the final standardized dataset to disk as a CSV file (suitable for Qploidy).

type

Character. Type of data used for clustering:

"intensities": For array-based allele intensity data.
"counts": For sequencing data.
"updog": Automatically set when `multidog_obj` is provided.

multidog_obj

Optional. An object of class `multidog` from the `updog` package, containing model fits and estimated biases. If provided, it overrides the `type` parameter, and `updog`'s expected cluster positions are used.

parallel.type

Character. Parallel backend to use (`"FORK"` or `"PSOCK"`). `"FORK"` is faster but only works on Unix-like systems.

verbose

Logical. If `TRUE`, prints progress and filtering information to the console.

rm_outlier

Logical. If `TRUE`, uses Bonferroni-Holm corrected residuals to remove outliers before estimating cluster centers.

cluster_median

Logical. If `TRUE`, uses the median of theta values to estimate cluster centers. If `FALSE`, uses the mean.
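
As an illustration of the input layout described above, the three tables can be assembled as in the following minimal sketch. All values are made up, the object names (`toy_data`, `toy_genos`, `toy_pos`) are hypothetical, and the final call is left commented out because meaningful results need a real reference panel with many samples per marker.

```r
# Two markers x three samples; MarkerName varies fastest in expand.grid().
toy_data <- expand.grid(
  MarkerName = c("m1", "m2"),
  SampleName = c("s1", "s2", "s3"),
  stringsAsFactors = FALSE
)
toy_data$X <- c(900, 450, 480, 20, 30, 870)    # reference allele signal (made up)
toy_data$Y <- c(100, 550, 520, 980, 970, 130)  # alternative allele signal (made up)
toy_data$R <- toy_data$X + toy_data$Y          # total signal, X + Y
toy_data$ratio <- toy_data$Y / toy_data$R      # allelic ratio, Y / (X + Y)

# Genotype calls for the same marker/sample pairs (dosages and probabilities made up).
toy_genos <- data.frame(
  MarkerName = toy_data$MarkerName,
  SampleName = toy_data$SampleName,
  geno = c(0, 1, 1, 2, 2, 0),
  prob = c(0.99, 0.95, 0.97, 0.99, 0.98, 0.96)
)

# Marker positions (made up).
toy_pos <- data.frame(
  MarkerName = c("m1", "m2"),
  Chromosome = c("chr1", "chr1"),
  Position = c(1000, 2500)
)

# Hypothetical call using only arguments listed under Usage:
# result <- standardize(
#   data = toy_data, genos = toy_genos, geno.pos = toy_pos,
#   ploidy.standardization = 2, threshold.n.clusters = 3,
#   type = "intensities"
# )
```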

Details

Cluster centers are estimated either from reference genotype dosages (e.g., called with `fitpoly` or `updog`) or directly from an `updog` `multidog` object. The function supports both array-based (intensity) and sequencing-based (count) data.
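
To make the idea of cluster centers concrete, the sketch below computes per-dosage centers for a single marker with the median and with the mean (compare `cluster_median`), then maps theta values to a BAF by linear interpolation between neighboring centers. The interpolation is a common scheme for array data and is an assumption here, not a description of what `standardize` does internally; `interpolate_baf` is a hypothetical helper and all values are made up.

```r
# One marker in a diploid reference panel; the last value (0.70) is a stray point.
theta  <- c(0.02, 0.05, 0.04, 0.48, 0.52, 0.55, 0.95, 0.97, 0.70)
geno   <- c(0, 0, 0, 1, 1, 1, 2, 2, 2)
ploidy <- 2

# Centers per dosage class: the median resists the stray 0.70, the mean does not.
centers_median <- tapply(theta, geno, median)
centers_mean   <- tapply(theta, geno, mean)

# Hypothetical helper (assumption): interpolate linearly between the two
# centers that bracket each theta value. Expects ploidy + 1 sorted centers.
interpolate_baf <- function(theta, centers, ploidy) {
  centers <- as.numeric(centers)
  # Index of the lower bracketing center, clamped to 1 .. length(centers) - 1.
  k <- findInterval(theta, centers, all.inside = TRUE)
  frac <- (theta - centers[k]) / (centers[k + 1] - centers[k])
  baf <- (k - 1 + frac) / ploidy
  pmin(pmax(baf, 0), 1)  # values beyond the outer centers are clamped to [0, 1]
}

baf <- interpolate_baf(c(0.04, 0.28, 0.52, 0.95, 0.01), centers_median, ploidy)
```

With these toy values the dosage-2 center is 0.95 under the median but about 0.87 under the mean, which shows why a robust estimator can matter when a few reference genotypes are miscalled.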

It applies marker and genotype-level quality filters, uses parallel computing to estimate BAF, and generates a final annotated output suitable for CNV or dosage variation analyses.
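
The two genotype filters can be pictured with the following minimal base R sketch. It mirrors the descriptions of `threshold.geno.prob` and `threshold.missing.geno`, but it is an illustration under stated assumptions rather than the package's code: the data are made up, a stricter missing-data threshold than the default is used so the toy example removes a marker, and how values exactly equal to a threshold are handled is assumed.

```r
# Toy genotype table: two markers, four samples each (values made up).
genos <- data.frame(
  MarkerName = rep(c("m1", "m2"), each = 4),
  SampleName = rep(paste0("s", 1:4), times = 2),
  geno = c(0, 1, 2, 1, 0, 0, 1, 2),
  prob = c(0.99, 0.95, 0.90, 0.85, 0.99, 0.50, 0.60, 0.70)
)
threshold.geno.prob    <- 0.8
threshold.missing.geno <- 0.5  # stricter than the default of 0.9, for illustration

# Step 1: genotype-level filter. Low-confidence calls become missing.
genos$geno[genos$prob < threshold.geno.prob] <- NA

# Step 2: marker-level filter. Drop markers whose missing fraction is too high.
miss_frac <- tapply(is.na(genos$geno), genos$MarkerName, mean)
keep <- names(miss_frac)[miss_frac <= threshold.missing.geno]
genos_filtered <- genos[genos$MarkerName %in% keep, ]
```

Here marker `m2` loses three of its four calls in step 1, so its missing fraction (0.75) exceeds the threshold and the marker is dropped in step 2.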