prim.box: PRIM for multivariate data

Description

PRIM for multivariate data. Result is an estimate of the highest density region (HDR).

Usage

prim.box(x, y, box.init=NULL, peel.alpha=0.05, paste.alpha=0.01,
     mass.min=0.05, threshold, pasting=TRUE, verbose=FALSE,
     threshold.type=0)
prim.hdr(prim, threshold, threshold.type)
prim.combine(prim1, prim2)

Arguments

matrix of data values

vector of response values

box.init

initial covering box

peel.alpha

peeling quantile tuning parameter

paste.alpha

pasting quantile tuning parameter

mass.min

minimum mass tuning parameter

threshold

threshold tuning parameter(s)

threshold.type

1 = positive HDR, -1 = negative HDR, 0 = both HDR

pasting

flag for pasting

verbose

flag for printing output during execution

prim,prim1,prim2

objects of type prim

Value

-- prim.box produces a PRIM estimate of HDRs, an object of type prim, which is a list with 8 fields:
xlist of data matrices
ylist of response variable vectors
y.meanlist of vectors of box mean for y
boxlist of matrices of box limits (first row = minima, second row = maxima)
massvector of box masses (proportion of points inside a box)
num.classtotal number of PRIM boxes
num.hdr.classtotal number of PRIM boxes which form the HDR
indHDR indicator: 1 = positive HDR, -1 = negative HDR
The above lists have num.class fields, one for each box.
-- prim.hdr takes a prim object and computes HDRs with different threshold values. Returns another prim object. This is much faster for experimenting with different threshold values than calling prim.box each time.
-- prim.combine combines two prim objects into a single prim object. Useful for combining positive and negative HDRs. Usually used in conjunction with prim.hdr. See examples below.

code

prim.hdr

Details

The data are $(\bold{X}_1, Y_1), \dots, (\bold{X}_n, Y_n)$ where $\bold{X}_i$ is d-dimensional and $Y_i$ is a scalar response. PRIM finds modal (and/or anti-modal) regions in the conditional expectation $m(\bold{x}) = \bold{E} (Y | \bold{x}).$ These regions are also called the highest density regions (HDR).

In general, $Y_i$ can be real-valued. See vignette("prim"). Here, we focus on the special case for binary $Y_i$. Let $Y_i$ = 1 when $\bold{X}_i \sim F^+$; and $Y_i$ = -1 when $\bold{X}_i \sim F^-$ where $F^+$ and $F^-$ are different distribution functions. In this set-up, PRIM finds the regions where $F^+$ and $F^-$ are most different.

The tuning parameters peel.alpha and paste.alpha control the `patience' of PRIM. Smaller values involve more patience. Larger values less patience. The peeling steps remove data from a box till either the box mean is smaller than threshold or the box mass is less than mass.min. Pasting is optional, and is used to correct any possible over-peeling. The default values for peel.alpha, paste.alpha and mass.min are taken from Friedman & Fisher (1999).

Specifying the type of HDR is controlled by threshold and threshold.type:

{For threshold.type=1, then we search for positive HDR {$m(\bold{x}) \geq$ threshold}.} {For threshold.type=-1, then we search for negative HDR {$m(\bold{x}) \leq$ threshold}.} {For threshold.type=0, then we search for both the positive and negative HDR. In this case make sure that threshold is (positive HDR threshold, negative HDR threshold).}

References

Friedman, J.H. & Fisher, N.I. (1999) Bump-hunting for high dimensional data, Statistics and Computing, 9, 123--143.

Examples

Run this code

n <- 1000
set.seed(88192)

mus.p <- rbind(c(0,0), c(2,0), c(1, 2), c(2.5, 2))
Sigmas.p <- 0.125*rbind(diag(2), diag(c(0.5, 0.5)),
   diag(c(0.125, 0.25)), diag(c(0.125, 0.25))) 
props.p <- c(0.5, 0.25, 0.125, 0.125)

mus.m <- rbind(c(0,0), c(2,0), c(2.5, 2))
Sigmas.m <- 0.125*rbind(invvech(c(1,-0.6,1)),
   diag(c(0.5, 0.5)),diag(c(0.125, 0.25))) 
props.m <- c(0.625, 0.25, 0.125)

x.p <- rmvnorm.mixt(n, mus.p, Sigmas.p, props.p)
x.m <- rmvnorm.mixt(n, mus.m, Sigmas.m, props.m)
x <- rbind(x.p, x.m)
y <- c(rep(1, nrow(x.p)), rep(-1, nrow(x.m)))
  ## 1 = positive sample, -1 = negative sample

y.thr <- c(1, -0.35)

## using only one command

x.prim1 <- prim.box(x=x, y=y, threshold=y.thr, threshold.type=0)

## alternative - requires more commands but allows more control
## in intermediate stages

x.prim.hdr.plus <- prim.box(x=x, y=y, threshold.type=1,
   threshold=1)

x.prim.minus <- prim.box(x=x, y=y, threshold.type=-1)
summary(x.prim.minus)
   ## threshold too high, try lower one

x.prim.hdr.minus <- prim.hdr(x.prim.minus, threshold=-0.35,
   threshold.type=-1)
x.prim2 <- prim.combine(x.prim.hdr.plus, x.prim.hdr.minus)
       
plot(x.prim2)

col <- x.prim2$ind
col[col==1] <- "orange"
col[col==-1] <- "blue" 
plot(x.prim2, col=col)

summary(x.prim1)
summary(x.prim2) ## should be exactly the same as command above

Run the code above in your browser using DataLab