fit.pfc: Fit a high-dimensional principal fitted components model using the method of Cook, Forzani, and Rothman (2012).

Description

Let $(x_1, y_1), \ldots, (x_n, y_n)$ denote the $n$ measurements of the predictor and response, where $x_i\in R^p$ and $y_i \in R$. The model assumes that these measurements are a realization of $n$ independent copies of the random vector $(X,Y)'$, where $$X = \mu_X + \Gamma \beta{f(Y) - \mu_f}+ \epsilon,$$ $\mu_X\in R^p$; $\Gamma\in R^{p\times d}$ with rank $d$; $\beta \in R^{d\times r}$ with rank $d$; $f: R \rightarrow R^r$ is a known vector valued function; $\mu_f = E{f(Y)}$; $\epsilon \sim N_p(0, \Delta)$; and $Y$ is independent of $\epsilon$. The central subspace is $\Delta^{-1} {\rm span}(\Gamma)$. This function computes estimates of these model parameters by imposing constraints for identifiability. The mean parameters $\mu_X$ and $\mu_f$ are estimated with $\bar x = n^{-1}\sum_{i=1}^n x_i$ and $\bar f = n^{-1} \sum_{i=1}^n f(y_i)$. Let $\widehat\Phi = n^{-1}\sum_{i=1}^{n} {f(y_i) - \bar f}{f(y_i) - \bar f}'$, which we require to be positive definite. Given a user-specified weight matrix $\widehat W$, let $$(\widehat\Gamma, \widehat\beta) = \arg\min_{G\in R^{p\times d}, B \in R^{d\times r}} \sum_{i=1}^n [x_i - \bar x - GB{f(y_i) - \bar f}]'\widehat W [x_i - \bar x - GB{f(y_i) - \bar f}],$$ subject to the constraints that $G'\widehat W G$ is diagonal and $B \widehat\Phi B' = I$. The sufficient reduction estimate $\widehat R: R^p \rightarrow R^d$ is defined by $$\widehat R(x) = (\widehat\Gamma'\widehat W \widehat\Gamma)^{-1} \widehat\Gamma' \widehat W(x - \bar x).$$

Usage

fit.pfc(X, y, r=4, d=NULL, F.user=NULL, weight.type=c("sample", "diag", "L1"), 
        lam.vec=NULL, kfold=5, silent=TRUE, qrtol=1e-10, cov.tol=1e-4, 
        cov.maxit=1e3, NPERM=1e3, level=0.01)

Arguments

The predictor matrix with $n$ rows and $p$ columns. The $i$th row is $x_i$ defined above.

The vector of measured responses with $n$ entries. The $i$th entry is $y_i$ defined above.

When polynomial basis functions are used (which is the case when F.user=NULL), r is the polynomial order, i.e, $f(y) = (y, y^2, \ldots, y^r)'$. The default is r=4. This argument is not used when F.user

The dimension of the central subspace defined above. This must be specified by the user when weight.type="L1". If unspecified by the user this function will use the sequential permutation testing procedure, described in Section 8.2 of Cook,

F.user

A matrix with $n$ rows and $r$ columns, where the $i$th row is $f(y_i)$ defined above. This argument is optional, and will typically be used when polynomial basis functions are not desired.

weight.type

The type of weight matrix estimate $\widehat W$ to use. Let $\widehat\Delta$ be the observed residual sample covariance matrix for the multivariate regression of X on $f$(Y) with $n-r-1$ scaling. There are three options for $\w

lam.vec

A vector of candidate tuning parameter values to use when weight.type="L1". If this vector has more than one entry, then kfold cross validation will be performed to select the optimal tuning parameter value.

kfold

The number of folds to use in cross-validation to select the optimal tuning parameter when weight.type="L1". Only used if lam.vec has more than one entry.

silent

Logical. When silent=FALSE, progress updates are printed.

qrtol

The tolerance for calls to qr.solve().

cov.tol

The convergence tolerance for the QUIC algorithm used when weight.type="L1".

cov.maxit

The maximum number of iterations allowed for the QUIC algorithm used when weight.type="L1".

NPERM

The number of permutations to used in the sequential permutation testing procedure to select $d$. Only used when d is unspecified.

level

The significance level to use to terminate the sequential permutation testing procedure to select $d$.

Value

A list with
Gamhatthis is $\widehat\Gamma$ described above.
bhatthis is $\widehat\beta$ described above.
Rmatthis is $\widehat W\widehat\Gamma(\widehat\Gamma'\widehat W \widehat\Gamma)^{-1}$.
Whatthis is $\widehat W$ described above.
dthis is $d$ described above.
rthis is $r$ described above.
GWGthis is $\widehat\Gamma'\widehat W \widehat\Gamma$
fca matrix with $n$ rows and $r$ columns where the $i$th row is $f(y_i) - \bar f$.
Xca matrix with $n$ rows and $p$ columns where the $i$th row is $x_i - \bar x$.
ythe vector of $n$ response measurements.
mxthis is $\bar x$ described above.
mfthis is $\bar f$ described above.
best.lamthis is selected tuning parameter value used when weight.type="L1", will be NULL otherwise.
lam.vecthis is the vector of candidate tuning parameter values used when weight.type="L1", will be NULL otherwise.
err.vecthis is the vector of validation errors from cross validation, one error for each entry in lam.vec. Will be NULL unless weight.type="L1" and lam.vec has more than one entry.
test.infoa dataframe that summarizes the results from the sequential testing procedure. Will be NULL unless d is unspecified.

Details

See Cook, Forzani, and Rothman (2012) more information.

References

Cook, R. D., Forzani, L., and Rothman, A. J. (2012). Estimating sufficient reductions of the predictors in abundant high-dimensional regressions. Annals of Statistics 40(1), 353-384.

Cho-Jui Hsieh, Matyas A. Sustik, Inderjit S. Dhillon, and Pradeep Ravikumar (2011). Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neural Information Processing Systems 24, 2011, p. 2330--2338.

Examples

Run this code

set.seed(1)
n=20
p=30
d=2
y=sqrt(12)*runif(n)
Gam=matrix(rnorm(p*d), nrow=p, ncol=d)
beta=diag(2)
E=matrix(0.5*rnorm(n*p), nrow=n, ncol=p)
V=matrix(c(1, sqrt(12), sqrt(12), 12.8), nrow=2, ncol=2)
tmp=eigen(V, symmetric=TRUE)
V.msqrt=tcrossprod(tmp$vec*rep(tmp$val^(-0.5), each=2), tmp$vec)
Fyc=cbind(y-sqrt(3),y^2-4)%*%V.msqrt
X=0+Fyc%*%t(beta)%*%t(Gam) + E

fit=fit.pfc(X=X, y=y, r=3, weight.type="sample")
## display hypothesis testing information for selecting d
fit$test.info
##  make a response versus fitted values plot
plot(pred.response(fit), y)

Run the code above in your browser using DataLab