PLRModels (version 1.1)

plrm.cv: Cross-validation bandwidth selection in PLR models

Description

From a sample ${(Y_i, X_{i1}, ..., X_{ip}, t_i): i=1,...,n}$, this routine computes, for each $l_n$ considered, an optimal pair of bandwidths for estimating the regression function of the model $$Y_i= X_{i1}*\beta_1 +...+ X_{ip}*\beta_p + m(t_i) + \epsilon_i,$$ where $$\beta = (\beta_1,...,\beta_p)$$ is an unknown vector parameter and $$m(.)$$ is a smooth but unknown function. The random errors, $\epsilon_i$, are allowed to be time series. The optimal pair of bandwidths, (b.opt, h.opt), is selected by means of the leave-($2l_n + 1$)-out cross-validation procedure. The bandwidth b.opt is used in the estimate of $\beta$, while the pair of bandwidths (b.opt, h.opt) is considered in the estimate of $m$. Kernel smoothing, combined with ordinary least squares estimation, is used.

Usage

plrm.cv(data = data, b.equal.h = TRUE, b.seq=NULL, h.seq=NULL, 
num.b = NULL, num.h = NULL, w = NULL, num.ln = 1, ln.0 = 0, 
step.ln = 2, estimator = "NW", kernel = "quadratic")

Arguments

data
data[,1] contains the values of the response variable, $Y$; data[, 2:(p+1)] contains the values of the "linear" explanatory variables, $X_1, ..., X_p$; data[, p+2] contains the values of the "nonparametric"
b.equal.h
if TRUE (the default), the same bandwidth is used for estimating both $\beta$ and $m$.
b.seq
sequence of considered bandwidths, b, in the CV function for estimating $\beta$. If NULL (the default), num.b equidistant values between zero and a quarter of the range of ${t_i}$ are considered.
h.seq
sequence of considered bandwidths, h, in the pair of bandwidths (b, h) used in the CV function for estimating $m$. If NULL (the default), num.h equidistant values between zero and a quarter of the range
num.b
number of values used to build the sequence of considered bandwidths for estimating $\beta$. If b.seq is not NULL, num.b=length(b.seq). Otherwise, if both num.b and num.h are NULL
num.h
pairs of bandwidths (b, h) are used for estimating $m$, num.h being the number of values considered for h. If h.seq is not NULL, num.h=length(h.seq). Otherwise, if both nu
w
support interval of the weigth function in the CV function. If NULL (the default), $(q_{0.1}, q_{0.9})$ is considered, where $q_p$ denotes the quantile of order $p$ of ${t_i}$.
num.ln
number of values for $l_n$: after estimating $\beta$, $2l_{n} + 1$ observations around each point $t_i$ are eliminated to estimate $m(t_i)$ in the CV function. The default is 1.
ln.0
minimum value for $l_n$. The default is 0.
step.ln
distance between two consecutives values of $l_n$. The default is 2.
estimator
allows us the choice between NW (Nadaraya-Watson) or LLP (Local Linear Polynomial). The default is NW.
kernel
allows us the choice between gaussian, quadratic (Epanechnikov kernel), triweight or uniform kernel. The default is quadratic.

Value

  • bh.optdataframe containing, for each ln considered, the selected value for (b,h).
  • CV.optCV.opt[k] is the minimum value of the CV function when de k-th value of ln is considered.
  • CVan array containing the values of the CV function for each pair of bandwidths and ln considered.
  • b.seqsequence of considered bandwidths, b, in the CV function for estimating $\beta$.
  • h.seqsequence of considered bandwidths, h, in the pair of bandwidths (b, h) used in the CV function for estimating $m$.
  • wsupport interval of the weigth function in the CV function.

Details

A weight function (specifically, the indicator function 1$_{[w[1] , w[2]]}$) is introduced in the CV function to allow elimination (or at least significant reduction) of boundary effects from the estimate of $m(t_i)$. As noted in the definition of num.ln, the estimate of $\beta$ in the CV function is obtained from all data while, once $\beta$ is estimated, $2l_{n} + 1$ observations around each $t_i$ are eliminated to estimate $m(t_i)$ in the CV function. Actually, the estimate of $\beta$ to be used in time $i$ in the CV function could be done eliminating such $2l_{n} + 1$ observations too; that possibility was not implemented because both their computational cost and the known fact that the estimate of $\beta$ is quite insensitive to the bandwidth selection. The implemented procedure generalizes that one in expression (8) in Aneiros-Perez and Quintela-del-Rio (2001) by including a weight function (see above) and allowing two smoothing parameters instead of only one (see Aneiros-Perez et al., 2004).

References

Aneiros-Perez, G., Gonzalez-Manteiga, W. and Vieu, P. (2004) Estimation and testing in a partial linear regression under long-memory dependence. Bernoulli 10, 49-78. Aneiros-Perez, G. and Quintela-del-Rio, A. (2001) Modified cross-validation in semiparametric regression models with dependent errors. Comm. Statist. Theory Methods 30, 289-307. Chu, C-K and Marron, J.S. (1991) Comparison of two bandwidth selectors with dependent errors. The Annals of Statistics 19, 1906-1918.

See Also

Other related functions are: plrm.beta, plrm.est, plrm.gcv, np.est, np.gcv and np.cv.

Examples

Run this code
# EXAMPLE 1: REAL DATA
data(barnacles1)
data <- as.matrix(barnacles1)
data <- diff(data, 12)
data <- cbind(data,1:nrow(data))

aux <- plrm.cv(data, step.ln=1, num.ln=2)
aux$bh.opt
plot.ts(aux$CV[,-2,])

par(mfrow=c(2,1))
plot(aux$b.seq,aux$CV[,-2,1], xlab="h", ylab="CV", type="l", main="ln=0")
plot(aux$b.seq,aux$CV[,-2,2], xlab="h", ylab="CV", type="l", main="ln=1")



# EXAMPLE 2: SIMULATED DATA
## Example 2a: independent data

set.seed(1234)
# We generate the data
n <- 100
t <- ((1:n)-0.5)/n
beta <- c(0.05, 0.01)
m <- function(t) {0.25*t*(1-t)}
f <- m(t)

x <- matrix(rnorm(200,0,1), nrow=n)
sum <- x%*%beta
epsilon <- rnorm(n, 0, 0.01)
y <-  sum + f + epsilon
data_ind <- matrix(c(y,x,t),nrow=100)

# We apply the function
a <-plrm.cv(data_ind)
a$CV.opt

CV <- a$CV
h <- a$h.seq
plot(h, CV,type="l")


## Example 2b: dependent data and ln.0 > 0

set.seed(1234)
# We generate the data
x <- matrix(rnorm(200,0,1), nrow=n)
sum <- x%*%beta
epsilon <- arima.sim(list(order = c(1,0,0), ar=0.7), sd = 0.01, n = n)
y <-  sum + f + epsilon
data_dep <- matrix(c(y,x,t),nrow=100)

# We apply the function
a <-plrm.cv(data_dep, ln.0=2)
a$CV.opt

CV <- a$CV
h <- a$h.seq
plot(h, CV,type="l")

Run the code above in your browser using DataLab