mixtools (version 1.0.4)

spEMsymlocN01: semiparametric EM-like algorithm for univariate mixture in False Discovery Rate (FDR) estimation

Description

Return semiparametric EM-like algorithm output for a 2-components mixture model with one component set to Normal(0,1), and the other component being a unspecified but symmetric density with a location parameter. This model is tailored to FDR estimation on probit transform (qnorm) of p-values arising from multiple testing.

Usage

spEMsymlocN01(x, mu0 = 2, bw = bw.nrd0(x), h=bw, eps = 1e-8, maxiter = 100, verbose = FALSE, plotf = FALSE)

Arguments

x
A vector of length n consisting of the data, probit transform of pvalues, preferably sorted.
mu0
Starting value of vector of component means. If not set then the initial value is randomly generated by a kmeans of the data in two bins. Since component 1 is theoretically normal(0,1), mu[1] must be 0 and mu[2] some negative value (see details).
bw
Bandwidth for weighted kernel density estimation.
h
Alternative way to specify the bandwidth, to provide backward compatibility.
eps
Tolerance limit for declaring algorithm convergence. Convergence is declared before maxiter iterations whenever the maximum change in any coordinate of the lambda (mixing proportion estimates) and mu (mean of the semiparametric component) vector does not exceed eps
maxiter
The maximum number of iterations allowed; convergence may be declared before maxiter iterations (see eps above).
verbose
If TRUE, print updates for every iteration of the algorithm as it runs.
plotf
If TRUE, plots successive updates of the nonparametric density estimate over iterations. Mostly for testing purpose.

Value

spEMsymlocN01 returns a list of class spEMN01 with the following items:
data
The raw data (an $n x r$ matrix).
posteriors
An $n x 2$ matrix of posterior probabilities for observations. This can be used in, e.g., plotFDR to plot False Discovery Rate estimates.
bandwidth
Same as the bw input argument, returned because this information is needed by any method that produces density estimates from the output.
lambda
The sequence of mixing proportions over iterations.
lambdahat
The final estimate for mixing proportions.
mu
the sequence of second component mean over iterations.
muhat
the final estimate of second component mean.
symmetric
Flag indicating that the kernel density estimate is using a symmetry assumption.

Details

This algorithm is a specific version of semiparametric EM-like algorithm similar in spirit to spEMsymloc, but specialized for FDR estimation on probit transform (qnorm) of p-values in multiple testing framework. In this model, component 1 corresponds to the individuals under the null hypothesis, i.e. theoretically normal(0,1) distributed, whereas component 2 corresponds to individuals in the alternative hypothesis, with typically very small p-values and consequently negative values for probit(p) data. This model only assumes that these individuals come from an unspecified but symmetric density with a location parameter, as in Bordes and Vandekerkhove (2010) and Chauveau et al. (2014).

References

  • Bordes, L. and Vandekerkhove, P. (2010). Semiparametric two-component mixture model with a known component: an asymptotically normal estimator. Mathematical Methods of Statistics, 19(1):22-41
  • Chauveau, D., Saby, N., Orton, T. G., Lemercier B., Walter, C. and Arrouys, D. (2014) Large-scale simultaneous hypothesis testing in monitoring carbon content from french soil database: A semi-parametric mixture approach. Geoderma 219-220 (2014): 117-124.

See Also

spEMsymloc, normalmixEM, npEM, plot.spEMN01, plotFDR

Examples

Run this code
## Probit transform of p-values
## from a Beta-Uniform mixture model
## comparion of parametric and semiparametric EM fit
## Note: in actual situations n=thousands 
set.seed(50)
n=300 # nb of multiple tests
m=2 # 2 mixture components
a=c(1,0.1); b=c(1,1); lambda=c(0.6,0.4) # parameters
z=sample(1:m, n, rep=TRUE, prob = lambda)
p <- rbeta(n, shape1 = a[z], shape2 = b[z]) # p-values
o <- order(p)
cpd <- cbind(z,p)[o,] # sorted complete data, z=1 if H0, 2 if H1
p <- cpd[,2] # sorted p-values

y <- qnorm(p) # probit transform of the pvalues
# gaussian EM fit with component 1 constrained to N(0,1)
s1 <- normalmixEM(y, mu=c(0,-4), 
				mean.constr = c(0,NA), sd.constr = c(1,NA)) 
s2 <- spEMsymlocN01(y, mu0 = c(0,-3)) # spEM with N(0,1) fit
hist(y, freq = FALSE, col = 8, main = "histogram of probit(pvalues)")
plot(s2, add.plot = TRUE, lwd = 2)

# Exemples of plot capabilities
# Note: posteriors must be ordered by p for plot.FDR
# plotFDR(s1$post) # when true complete data not observed
# plotFDR(s1$post, s2$post) # comparing 2 strategies
plotFDR(s1$post, s2$post, lg1 = "normalmixEM", lg2 = "spEMsymlocN01", 
		complete.data = cpd) # with true FDR computed from z

Run the code above in your browser using DataLab