n_eff: Calculates the effective sample size of the data

Description

The effective sample size (\(n_{eff}\)) is the equivalent number of independent haplotypes that gives the same variance as that observed under the given population. The variance in question is for the weighted sample mean ancestral allele frequency estimator. It follows that \(n_{eff}\) equals the inverse of the weighted mean kinship. If max=TRUE, a calculation is performed that implicitly uses optimal weights which maximize \(n_{eff}\): here \(n_{eff}\) equals the sum of the elements of the inverse kinship matrix. However, if nonneg=TRUE and if the above solution has negative weights (common), optimal non-negative weights are found instead (there are three algorithms available, see algo). If max=FALSE, then the input weights are used in this calculation, and if weights are NULL, uniform weights are used.

Usage

n_eff(kinship, max = TRUE, weights = NULL, nonneg = TRUE,
  algo = c("gradient", "newton", "heuristic"), tol = 1e-10)

Arguments

kinship

An \(n \times n\) kinship matrix.

max

If TRUE, returns the maximum \(n_{eff}\) value among those computed using all possible vectors of weights that sum to one (and which are additionally non-negative if nonneg=TRUE). If FALSE, \(n_{eff}\) is computed using the specific weight vector provided.

weights

Weights for individuals (optional). If NULL, uniform weights are used. This parameter is ignored if max=TRUE.

nonneg

If TRUE (default) and max=TRUE, non-negative weights that maximize \(n_{eff}\) are found. See algo. This has no effect if max=FALSE.

algo

Algorithm for finding optimal non-negative weights (applicable only if nonneg=TRUE and max=TRUE and the weights found by matrix inversion are non-negative). May be abbreviated. If "gradient" (default), an optimized gradient descent algorithm is used (fastest; recommended). If "newton", the exact multivariate newton's Method is used (slowest since \((n+1) \times (n+1)\) Hessian matrix needs to be inverted at every iteration; use if possible to confirm that "gradient" gives the best answer). If "heuristic", if the optimal solution by the inverse matrix method contains negative weights, the most negative weight in an iteration is forced to be zero in all subsequent iterations and the rest of the weights are solved for using the inverse matrix method, repeating until all resulting weights are non-negative (also slow, since inversion of large matrices is required; least likely to find optimal solution).

tol

Tolerance parameter for "gradient" and "newton" algorithms. The algorithms converge when the norm of the step vector is smaller than this tolerance value.

Value

A list containing n_eff and weights (optimal weights if max = TRUE, input weights otherwise).

Details

The maximum \(n_{eff}\) possible is \(2n\), where \(n\) is the number of individuals; this value is attained only when all haplotypes are independent (a completely unstructured population in Hardy-Weinberg equilibrium). The minimum \(n_{eff}\) possible is 1, which is attained in an extremely structured population with \(F_{ST}\) of 1, where every individual has exactly the same haplotype at every locus (no heterozygotes). Moreover, for \(K\) extremely-differentiated subpopulations (\(F_{ST}\)=1 per subpopulation) \(n_{eff}\) equals \(K\). In this way, \(n_{eff}\) is smaller than the ideal value of \(2n\) depending on the amount of kinship (covariance) in the data.

Examples

Run this code

# NOT RUN {
# Get n_eff from a genotype matrix

# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow=3, byrow=TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals

# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object

# estimate the kinship matrix "kinship" from the genotypes "X"!
kinship <- popkin(X, subpops) # calculate kinship from X and optional subpop labels
weights <- weights_subpops(subpops) # can weigh individuals so subpopulations are balanced

# use kinship matrix to calculate n_eff
# default mode returns maximum n_eff possible across all non-negative weights that sum to one
# also returns the weights that were optimal
obj <- n_eff(kinship)
n_eff_max <- obj$n_eff
w_max <- obj$weights

# version that uses weights provided
obj <- n_eff(kinship, max = FALSE, weights = weights)
n_eff_w <- obj$n_eff
w <- obj$weights # returns input weights renormalized for good measure

# no (or NULL) weights implies uniform weights
obj <- n_eff(kinship, max = FALSE)
n_eff_u <- obj$n_eff
w <- obj$weights # uniform weights

# }

Run the code above in your browser using DataLab