RFGLS_estimate_timeseries: Function for estimation in time-series data with RF-GLS

Description

The function RFGLS_estimate_timeseries fits univariate non-linear regression models for time-series data using RF-GLS (Saha et al. 2020). RFGLS_estimate_timeseries uses the sparse Cholesky representation corresponding to an AR(q) process. The fitted Random Forest (RF) model can later be used for prediction via RFGLS_predict.
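
To illustrate what a sparse Cholesky representation of an AR(q) process means, the following is a minimal base-R sketch for the AR(1) case only. It is not the package's internal code; the names B, Sigma, max_err and nonzeros and the values of n, rho and sigma.sq are chosen here purely for illustration.

```r
# Illustrative AR(1) setting (values are assumptions for this sketch)
n <- 6
rho <- 0.5
sigma.sq <- 1

# Stationary AR(1) covariance: Sigma[i, j] = sigma.sq * rho^|i - j| / (1 - rho^2)
Sigma <- sigma.sq * rho^abs(outer(1:n, 1:n, "-")) / (1 - rho^2)

# Banded lower-triangular matrix B that whitens the AR(1) errors:
# row 1 rescales the first observation, row t (t > 1) forms e_t - rho * e_{t-1}
B <- diag(n)
B[1, 1] <- sqrt(1 - rho^2)
B[cbind(2:n, 1:(n - 1))] <- -rho
B <- B / sqrt(sigma.sq)

# t(B) %*% B should reproduce the inverse covariance matrix
max_err <- max(abs(crossprod(B) - solve(Sigma)))

# B stores only 2n - 1 non-zero entries instead of n^2
nonzeros <- sum(B != 0)
```

For a general AR(q) process the same idea gives a factor with at most q sub-diagonals, which is what keeps the computation cheap for long series.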

Some code blocks are borrowed from the R packages: spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes
https://CRAN.R-project.org/package=spNNGP and randomForest: Breiman and Cutler's Random Forests for Classification and Regression
https://CRAN.R-project.org/package=randomForest .

Usage

RFGLS_estimate_timeseries(y, X, Xtest = NULL, nrnodes = NULL,
                          nthsize = 20, mtry = 1,
                          pinv_choice = 1, n_omp = 1,
                          ntree = 50, h = 1, lag_params = 0.5,
                          variance = 1,
                          param_estimate = FALSE,
                          verbose = FALSE)

Value

A list comprising:

P_matrix: an \(n \times ntree\) matrix of zero-indexed resamples. The t-th column denotes the \(n\) resamples used in the t-th tree.
predicted_matrix: an \(ntest \times ntree\) matrix of predictions. The t-th column denotes the predictions at the \(ntest\) datapoints obtained from the t-th tree.
predicted: predicted values at the \(ntest\) prediction points, computed as the average (rowMeans) of the treewise predictions in predicted_matrix.
X: the matrix X.
y: the vector y.
RFGLS_Object: object required for prediction.
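
The relationship between these components can be sketched with stand-in objects. The matrices below are simulated placeholders with the documented shapes, not output from RFGLS_estimate_timeseries; the toy sizes and the names tree1_rows and y_tree1 are assumptions for this illustration.

```r
set.seed(1)
n <- 5; ntest <- 4; ntree <- 3   # toy sizes, chosen for illustration

# Stand-in for P_matrix: zero-indexed resamples, one column per tree
P_matrix <- matrix(sample(0:(n - 1), n * ntree, replace = TRUE),
                   nrow = n, ncol = ntree)

# R indexes from 1, so add 1 before using a column to subset the data
y <- rnorm(n)
tree1_rows <- P_matrix[, 1] + 1
y_tree1 <- y[tree1_rows]          # responses resampled for the first tree

# Stand-in for predicted_matrix: one column of predictions per tree
predicted_matrix <- matrix(rnorm(ntest * ntree), nrow = ntest, ncol = ntree)

# predicted is described as the row-wise average across trees
predicted <- rowMeans(predicted_matrix)
```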

Arguments

y: an \(n\) length vector of response at the observed time points.
X: an \(n \times p\) matrix of the covariates at the observed time points.
Xtest: an \(ntest \times p\) matrix of covariates for prediction. Its structure should be identical (including intercept) to that of the covariates provided for estimation in X. If NULL, X is used as Xtest. Default value is NULL.
nrnodes: the maximum number of nodes a tree can have. The default choice leads to the deepest tree contingent on nthsize. For significantly large \(n\), one needs to bound it to grow shallow trees, which trades off efficiency for computation time.
nthsize: minimum size of leaf nodes. We recommend not setting this value too small, as that will lead to very deep trees that take a lot of time to build and can produce unstable estimates. Default value is 20.
mtry: number of variables randomly sampled at each partition as a candidate split direction. We recommend using the value p/3 where p is the number of variables in X. Default value is 1.
pinv_choice: dictates the choice of method for obtaining the pseudoinverse involved in the cost function and node representative evaluation. If pinv_choice = 0, SVD is used (slower but more stable); if pinv_choice = 1, orthogonal decomposition is used (faster, but may produce unstable results if nthsize is too low). Default value is 1.
n_omp: number of threads to be used, value can be more than 1 if source code is compiled with OpenMP support. Default is 1.
ntree: number of trees to be grown. This value should not be too small. Default value is 50.
h: number of cores to be used in the parallel computing setup for bootstrap samples. If h = 1, there is no parallelization. Default value is 1.
lag_params: \(q\) length vector of AR coefficients. If the coefficients are to be estimated from an AR(q) process, this should be any numeric vector of length \(q\). For notation, see arima. Default value is 0.5.
variance: variance of the white noise in the temporal error. The function estimate is not affected by this value. Default value is 1.
param_estimate: if TRUE, the coefficients of the \(AR(q)\) process are estimated with arima (using the option include.mean = FALSE) from the residuals of a classical RF fitted with default options and nodesize = nthsize. Default value is FALSE.
verbose: if TRUE, model specifications along with information regarding OpenMP support and progress of the algorithm are printed to the screen. Otherwise, nothing is printed to the screen. Default value is FALSE.
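
The AR-coefficient step described under param_estimate can be sketched with base R alone. In this illustration the residuals are simulated AR(1) noise standing in for the residuals of a classical RF fit; the names res, ar_fit and lag_params_hat are assumptions, and the code is not taken from the package.

```r
set.seed(10)
q <- 1   # AR order; corresponds to the length of lag_params

# Stand-in residuals: simulated AR(1) noise with true coefficient 0.5
res <- arima.sim(list(order = c(q, 0, 0), ar = 0.5), n = 500)

# Estimate the AR(q) coefficients without a mean term,
# mirroring the include.mean = FALSE option mentioned above
ar_fit <- arima(res, order = c(q, 0, 0), include.mean = FALSE)

# Length-q vector of estimated AR coefficients
lag_params_hat <- as.numeric(coef(ar_fit))
```

A vector obtained this way has the same form as the lag_params argument: one coefficient per lag, in the notation used by arima.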

Author

Arkajyoti Saha arkajyotisaha93@gmail.com,
Sumanta Basu sumbose@cornell.edu,
Abhirup Datta abhidatta@jhu.edu

References

Saha, A., Basu, S., & Datta, A. (2020). Random Forests for dependent data. arXiv preprint arXiv:2007.15421.

Saha, A., & Datta, A. (2018). BRISC: bootstrap for rapid inference on spatial covariances. Stat, e184, DOI: 10.1002/sta4.184.

Andy Liaw and Matthew Wiener (2015). randomForest: Breiman and Cutler's Random Forests for Classification and Regression. R package version 4.6-14.
https://CRAN.R-project.org/package=randomForest

Andrew Finley, Abhirup Datta and Sudipto Banerjee (2017). spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes. R package version 0.1.1. https://CRAN.R-project.org/package=spNNGP

Examples


rmvn <- function(n, mu = 0, V = matrix(1)){
  p <- length(mu)
  if(any(is.na(match(dim(V),p))))
    stop("Dimension not right!")
  D <- chol(V)
  t(matrix(rnorm(n*p), ncol=p)%*%D + rep(mu,rep(n,p)))
}

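library(RandomForestsGLS)  # assumed to be the package that provides RFGLS_estimate_timeseries

# Simulate a single covariate as an n x 1 matrix
set.seed(2)
n <- 200
x <- matrix(rnorm(n), n, 1)

# Parameters of the AR(1) error process
sigma.sq <- 1
rho <- 0.5

# Simulate AR(1) errors with coefficient rho and innovation sd sqrt(sigma.sq)
set.seed(3)
b <- rho
s <- sqrt(sigma.sq)
eps <- arima.sim(list(order = c(1,0,0), ar = b),
                 n = n, rand.gen = rnorm, sd = s)

# Response: non-linear mean function plus temporally correlated noise
y <- eps + 10*sin(pi * x)

# Fit RF-GLS with 10 trees; other arguments keep their defaults
estimation_result <- RFGLS_estimate_timeseries(y, x, ntree = 10)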
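
As a follow-up, the fit can be compared against the known mean function used in the simulation. The sketch below repeats the data generation so that it is self-contained, and it only runs the fit if a package named RandomForestsGLS is installed (assumed here to be the package providing this function). The names true_mean, fit and mse are introduced for illustration. Because Xtest is left as NULL, predicted holds predictions at the rows of X, as described in the Value section.

```r
set.seed(2)
n <- 200
x <- matrix(rnorm(n), n, 1)

set.seed(3)
eps <- arima.sim(list(order = c(1, 0, 0), ar = 0.5),
                 n = n, rand.gen = rnorm, sd = 1)

true_mean <- 10 * sin(pi * x[, 1])   # mean function used in the simulation
y <- eps + 10 * sin(pi * x)

mse <- NA_real_   # stays NA if the package is not available
if (requireNamespace("RandomForestsGLS", quietly = TRUE)) {
  fit <- RandomForestsGLS::RFGLS_estimate_timeseries(y, x, ntree = 10)
  # Mean squared error of the fitted values against the true mean function
  mse <- mean((as.numeric(fit$predicted) - true_mean)^2)
}
```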
