kde.test: Kernel density based two-sample comparison test

Description

Kernel density based two-sample comparison test for 1- to 6-dimensional data.

Usage

kde.test(x1, x2, H1, H2, h1, h2, psi1, psi2, var.fhat1, var.fhat2, 
    binned=FALSE, bgridsize, verbose=FALSE, pilot="dscalar")
Hpi.kfe(x, nstage=2, pilot="dscalar", pre="sphere", Hstart, binned=FALSE, 
    bgridsize, amise=FALSE, deriv.order=0, verbose=FALSE, optim.fun="nlm")
hpi.kfe(x, nstage=2, binned=FALSE, bgridsize, amise=FALSE, deriv.order=0)

Arguments

x,x1,x2

vector/matrix of data values

H1,H2,h1,h2

bandwidth matrices/scalar bandwidths. If these are missing, Hpi.kfe or hpi.kfe is called by default.

psi1,psi2

zero-th order kernel functional estimates

var.fhat1,var.fhat2

sample variances of the KDE values evaluated at x1, x2

binned

flag for binned estimation. Default is FALSE.

bgridsize

vector of binning grid sizes

verbose

flag to print out progress information. Default is FALSE.

nstage

number of stages in the plug-in bandwidth selector (1 or 2)

pilot

"dscalar" = single pilot bandwidth (default); "dunconstr" = single unconstrained pilot bandwidth

pre

"scale" = pre.scale, "sphere" = pre.sphere

Hstart

initial bandwidth matrix, used in numerical optimisation

amise

flag to return the minimal scaled PI value

deriv.order

derivative order of kfe (kernel functional estimate). Only deriv.order=0 is currently implemented.

optim.fun

optimiser function: one of nlm or optim.

Value

A list with fields

Tstat

T statistic

zstat

z statistic, the normalised version of Tstat

pvalue

p-value of the two-sided test

mean,var

mean and variance of the null distribution

var.fhat1,var.fhat2

sample variances of KDE values evaluated at the data points

n1,n2

sample sizes

H1,H2

bandwidth matrices

psi1,psi12,psi21,psi2

kernel functional estimates

Details

The null hypothesis is $H_0: f_1 \equiv f_2$, where $f_1, f_2$ are the respective density functions. The measure of discrepancy is the integrated squared error (ISE), i.e. the squared $L_2$ distance, $T = \int [f_1(\mathbf{x}) - f_2(\mathbf{x})]^2 \, d\mathbf{x}$. Expanding the square gives $T = \psi_1 - \psi_{12} - \psi_{21} + \psi_2$, where $\psi_{uv} = \int f_u(\mathbf{x}) f_v(\mathbf{x}) \, d\mathbf{x}$ and $\psi_u = \psi_{uu}$, so each term can be estimated with a kernel functional estimator. Duong et al. (2012) show that this test statistic has an asymptotically normal null distribution, so no bootstrap resampling is required to compute an approximate p-value. As of ks 1.8.8, kde.test(, binned=TRUE) uses binned estimation only to compute the bandwidth selectors, not the test statistic or the p-value.
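To make the decomposition of $T$ concrete, here is a minimal base R sketch of a plug-in estimate for 1-d data. It is not the ks implementation: the helper names psi_hat and T_hat, the fixed bandwidth h, and the choice of which bandwidth goes with each cross term are all illustrative assumptions. Each $\psi_{uv}$ is estimated by averaging a Gaussian kernel over all pairs of points, diagonal terms included.

```r
# Gaussian-kernel estimate of psi_uv = integral of f_u * f_v:
# the average of K_h(a_i - b_j) over all pairs (a_i, b_j).
psi_hat <- function(a, b, h) {
  mean(dnorm(outer(a, b, "-"), mean = 0, sd = h))
}

# Plug-in estimate of T = psi_1 - psi_12 - psi_21 + psi_2.
# Pairing h1 with psi_12 and h2 with psi_21 is an assumption of this sketch.
T_hat <- function(x1, x2, h1, h2) {
  psi_hat(x1, x1, h1) - psi_hat(x1, x2, h1) -
    psi_hat(x2, x1, h2) + psi_hat(x2, x2, h2)
}

set.seed(1)
x_same  <- rnorm(500)            # sample from N(0, 1)
y_same  <- rnorm(500)            # second sample from the same density
y_shift <- rnorm(500, mean = 1)  # sample from a shifted density

h <- 0.3  # illustrative fixed bandwidth, not a data-driven selector

T_same  <- T_hat(x_same, y_same, h, h)
T_shift <- T_hat(x_same, y_shift, h, h)
print(c(same = T_same, shifted = T_shift))
```

With a common bandwidth this estimate is non-negative, and it should come out markedly larger for the shifted pair than for the pair drawn from the same density. Turning it into a z statistic additionally requires the null mean and variance described by Duong et al. (2012), which this sketch does not attempt.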

Hpi.kfe is the optimal plug-in bandwidth selector for the $r$-th order kernel functional estimator, based on the unconstrained pilot selectors of Chacon & Duong (2010). kde.test calls it automatically to estimate the $\psi$ functionals with $r=0$. hpi.kfe is the 1-d equivalent, using the formulas from Wand & Jones (1995, p.70).
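As a sketch of how the selector and the test fit together, the bandwidths can be computed once with hpi.kfe and then passed to kde.test through h1 and h2, following the signatures listed under Usage. The simulated data, the sample size, and the variable names are illustrative assumptions.

```r
library(ks)

set.seed(8192)
x <- rnorm(200)              # illustrative sample 1
y <- rnorm(200, mean = 0.5)  # illustrative sample 2, shifted mean

# 1-d plug-in bandwidths for the zero-th order kernel functional estimator
h1 <- hpi.kfe(x, nstage = 2, deriv.order = 0)
h2 <- hpi.kfe(y, nstage = 2, deriv.order = 0)

# Supply the precomputed bandwidths instead of letting kde.test select them
res <- kde.test(x1 = x, x2 = y, h1 = h1, h2 = h2)
res$Tstat
res$zstat
res$pvalue
```

Precomputing the bandwidths this way may be convenient when the same samples are tested repeatedly, since the selection step does not have to be rerun.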

References

Chacon, J.E. & Duong, T. (2010) Multivariate plug-in bandwidth selection with unconstrained pilot matrices. Test, 19, 375-398.

Duong, T., Goud, B. & Schauer, K. (2012) Closed-form density-based framework for automatic detection of cellular morphology changes. PNAS, 109, 8382-8387.

Wand, M.P. & Jones, M.C. (1995) Kernel Smoothing. Chapman & Hall/CRC, London.

Examples

library(ks)

## univariate example
set.seed(8192)
samp <- 1000
x <- rnorm.mixt(n=samp, mus=0, sigmas=1, props=1)
y <- rnorm.mixt(n=samp, mus=0.25, sigmas=1, props=1)
kde.test(x1=x, x2=y)$pvalue   ## reject H0: f1=f2


## bivariate example
mus1 <- rbind(c(1,-1), c(-1,1))
Sigmas1 <- rbind(invvech(c(4/9, 4/15, 4/9)), invvech(c(4/9, 4/15, 4/9)))
props1 <- c(1,1)/2
mus2 <- rbind(c(1,-1), c(-1,1))
Sigmas2 <- rbind(invvech(c(4/9, 14/45, 4/9)), 4/9*diag(2))
props2 <- c(1,1)/2

set.seed(8192)
samp <- 1000
x <- rmvnorm.mixt(n=samp, mus=mus1, Sigmas=Sigmas1, props=props1)
y <- rmvnorm.mixt(n=samp, mus=mus2, Sigmas=Sigmas2, props=props2)
kde.test(x1=x, x2=y, binned=TRUE)$pvalue    ## reject H0: f1=f2
