plrm.cv: Cross-validation bandwidth selection in PLR models

Description

From a sample ${(Y_i, X_{i1}, ..., X_{ip}, t_i): i=1,...,n}$, this routine computes, for each $l_n$ considered, an optimal pair of bandwidths for estimating the regression function of the model $$Y_i= X_{i1}*\beta_1 +...+ X_{ip}*\beta_p + m(t_i) + \epsilon_i,$$ where $$\beta = (\beta_1,...,\beta_p)$$ is an unknown vector parameter and $$m(.)$$ is a smooth but unknown function. The random errors, $\epsilon_i$, are allowed to be time series. The optimal pair of bandwidths, (b.opt, h.opt), is selected by means of the leave-($2l_n + 1$)-out cross-validation procedure. The bandwidth b.opt is used in the estimate of $\beta$, while the pair of bandwidths (b.opt, h.opt) is considered in the estimate of $m$. Kernel smoothing, combined with ordinary least squares estimation, is used.

Usage

plrm.cv(data = data, b.equal.h = TRUE, b.seq=NULL, h.seq=NULL, 
num.b = NULL, num.h = NULL, w = NULL, num.ln = 1, ln.0 = 0, 
step.ln = 2, estimator = "NW", kernel = "quadratic")

Value

bh.opt: dataframe containing, for each ln considered, the selected value for (b,h).
CV.opt: CV.opt[k] is the minimum value of the CV function when de k-th value of ln is considered.
CV: an array containing the values of the CV function for each pair of bandwidths and ln considered.
b.seq: sequence of considered bandwidths, b, in the CV function for estimating $\beta$.
h.seq: sequence of considered bandwidths, h, in the pair of bandwidths (b, h) used in the CV function for estimating $m$.
w: support interval of the weigth function in the CV function.

Arguments

data

data[,1] contains the values of the response variable, $Y$;

data[, 2:(p+1)] contains the values of the "linear" explanatory variables,

$X_1, ..., X_p$;

data[, p+2] contains the values of the "nonparametric" explanatory variable, $t$.

b.equal.h

if TRUE (the default), the same bandwidth is used for estimating both $\beta$ and $m$.

b.seq

sequence of considered bandwidths, b, in the CV function for estimating $\beta$. If NULL (the default), num.b equidistant values between zero and a quarter of the range of ${t_i}$ are considered.

h.seq

sequence of considered bandwidths, h, in the pair of bandwidths (b, h) used in the CV function for estimating $m$. If NULL (the default), num.h equidistant values between zero and a quarter of the range of $t_i$ are considered.

num.b

number of values used to build the sequence of considered bandwidths for estimating $\beta$. If b.seq is not NULL, num.b=length(b.seq). Otherwise, if both num.b and num.h are NULL (the default), num.b=50 is considered; if num.b is NULL (the default) but num.h is not NULL, then num.b=num.h is considered; if b.equal.h=TRUE (the default) and both num.b and num.h are not NULL and different, the maximum value of num.b and num.h is considered for both.

num.h

pairs of bandwidths (b, h) are used for estimating $m$, num.h being the number of values considered for h. If h.seq is not NULL, num.h=length(h.seq). Otherwise, if both num.b and num.h are NULL (the default), num.h=50 is considered; if num.h is NULL (the default) but num.b is not NULL, num.h=num.b is considered; if b.equal.h=TRUE (the default) and both num.b and num.h are not NULL and different, the maximum value of num.b and num.h is considered for both.

w

support interval of the weigth function in the CV function. If NULL (the default), $(q_{0.1}, q_{0.9})$ is considered, where $q_p$ denotes the quantile of order $p$ of ${t_i}$.

num.ln

number of values for $l_n$: after estimating $\beta$, $2l_{n} + 1$ observations around each point $t_i$ are eliminated to estimate $m(t_i)$ in the CV function. The default is 1.

ln.0

minimum value for $l_n$. The default is 0.

step.ln

distance between two consecutives values of $l_n$. The default is 2.

estimator

allows us the choice between “NW” (Nadaraya-Watson) or “LLP” (Local Linear Polynomial). The default is “NW”.

kernel

allows us the choice between “gaussian”, “quadratic” (Epanechnikov kernel), “triweight” or “uniform” kernel. The default is “quadratic”.

Author

German Aneiros Perez ganeiros@udc.es

Ana Lopez Cheda ana.lopez.cheda@udc.es

Details

A weight function (specifically, the indicator function 1$_{[w[1] , w[2]]}$) is introduced in the CV function to allow elimination (or at least significant reduction) of boundary effects from the estimate of $m(t_i)$.

As noted in the definition of num.ln, the estimate of $\beta$ in the CV function is obtained from all data while, once $\beta$ is estimated, $2l_{n} + 1$ observations around each $t_i$ are eliminated to estimate $m(t_i)$ in the CV function. Actually, the estimate of $\beta$ to be used in time $i$ in the CV function could be done eliminating such $2l_{n} + 1$ observations too; that possibility was not implemented because both their computational cost and the known fact that the estimate of $\beta$ is quite insensitive to the bandwidth selection.

The implemented procedure generalizes that one in expression (8) in Aneiros-Perez and Quintela-del-Rio (2001) by including a weight function (see above) and allowing two smoothing parameters instead of only one (see Aneiros-Perez et al., 2004).

References

Aneiros-Perez, G., Gonzalez-Manteiga, W. and Vieu, P. (2004) Estimation and testing in a partial linear regression under long-memory dependence. Bernoulli 10, 49-78.

Aneiros-Perez, G. and Quintela-del-Rio, A. (2001) Modified cross-validation in semiparametric regression models with dependent errors. Comm. Statist. Theory Methods 30, 289-307.

Chu, C-K and Marron, J.S. (1991) Comparison of two bandwidth selectors with dependent errors. The Annals of Statistics 19, 1906-1918.

Examples

Run this code

# EXAMPLE 1: REAL DATA
data(barnacles1)
data <- as.matrix(barnacles1)
data <- diff(data, 12)
data <- cbind(data,1:nrow(data))

aux <- plrm.cv(data, step.ln=1, num.ln=2)
aux$bh.opt
plot.ts(aux$CV[,-2,])

par(mfrow=c(2,1))
plot(aux$b.seq,aux$CV[,-2,1], xlab="h", ylab="CV", type="l", main="ln=0")
plot(aux$b.seq,aux$CV[,-2,2], xlab="h", ylab="CV", type="l", main="ln=1")



# EXAMPLE 2: SIMULATED DATA
## Example 2a: independent data

set.seed(1234)
# We generate the data
n <- 100
t <- ((1:n)-0.5)/n
beta <- c(0.05, 0.01)
m <- function(t) {0.25*t*(1-t)}
f <- m(t)

x <- matrix(rnorm(200,0,1), nrow=n)
sum <- x%*%beta
epsilon <- rnorm(n, 0, 0.01)
y <-  sum + f + epsilon
data_ind <- matrix(c(y,x,t),nrow=100)

# We apply the function
a <-plrm.cv(data_ind)
a$CV.opt

CV <- a$CV
h <- a$h.seq
plot(h, CV,type="l")


## Example 2b: dependent data and ln.0 > 0

set.seed(1234)
# We generate the data
x <- matrix(rnorm(200,0,1), nrow=n)
sum <- x%*%beta
epsilon <- arima.sim(list(order = c(1,0,0), ar=0.7), sd = 0.01, n = n)
y <-  sum + f + epsilon
data_dep <- matrix(c(y,x,t),nrow=100)

# We apply the function
a <-plrm.cv(data_dep, ln.0=2)
a$CV.opt

CV <- a$CV
h <- a$h.seq
plot(h, CV,type="l")

Run the code above in your browser using DataLab