`rf_pred()` builds a random forest model with hyperparameters previously optimized by out-of-bag evaluation and uses it to estimate the target time series, either to detect outliers or to fill gaps.
rf_pred(
df,
colname_label,
vctr_colname_feature = NULL,
min_nodesize,
m_try,
subsample,
do_outlier_detection = TRUE,
frac_train = 0.75,
n_tree = 500,
ran_seed = 12345,
coef_iqr = 1.5,
label_err = -9999
)
A list with two elements. The first element, `mse`, is the mean squared error between the predicted and original values in the test data set. The second element, `stats`, is a data frame whose contents depend on `do_outlier_detection`.
If `do_outlier_detection` is `TRUE`, the data frame has the following columns:
* The first column, `cleaned`, gives the cleaned time series after replacing the detected outliers with the value specified by `label_err`.
* The second column, `flag_out`, gives a flag variable time series indicating the status of the cleaned time series (0: the input data point is not originally missing and not detected as an outlier; 1: the input data point is not originally missing but detected as an outlier; 2: the input data point is originally missing).
* The third column, `med`, gives the ensemble median time series calculated from estimated values at each time point for each tree in the constructed random forest.
* The fourth column, `q1`, gives the ensemble Q1 (first quartile) time series calculated from estimated values at each time point for each tree in the constructed random forest.
* The fifth column, `q3`, gives the ensemble Q3 (third quartile) time series calculated from estimated values at each time point for each tree in the constructed random forest.
If `do_outlier_detection` is `FALSE`, the data frame has the following columns:
* The first column, `gapfilled`, gives the gap-filled time series, where missing values are replaced with the predicted values from the random forest model.
* The second column, `avg_predicted`, gives the ensemble mean time series calculated from estimated values at each time point for each tree in the constructed random forest.
* The third column, `sd_predicted`, gives the ensemble standard deviation time series calculated from estimated values at each time point for each tree in the constructed random forest.
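The per-tree summaries listed above can be illustrated with a minimal base R sketch. It assumes a hypothetical matrix `pred_tree` with one row per time point and one column per tree; this page does not show how `rf_pred()` obtains per-tree predictions internally, so the values here are synthetic. Using the ensemble mean as the gap-filling value is also an assumption made for illustration.

```r
# Hypothetical per-tree predictions: 5 time points (rows) x 4 trees (columns).
pred_tree <- matrix(
  c(1.0, 1.2, 0.8, 1.0,
    2.0, 2.4, 1.6, 2.0,
    3.0, 3.3, 2.7, 3.0,
    4.0, 4.4, 3.6, 4.0,
    5.0, 5.5, 4.5, 5.0),
  nrow = 5, byrow = TRUE
)

# Summaries corresponding to do_outlier_detection = TRUE.
med <- apply(pred_tree, 1, median)
q1  <- apply(pred_tree, 1, quantile, probs = 0.25, names = FALSE)
q3  <- apply(pred_tree, 1, quantile, probs = 0.75, names = FALSE)

# Summaries corresponding to do_outlier_detection = FALSE.
avg_predicted <- rowMeans(pred_tree)
sd_predicted  <- apply(pred_tree, 1, sd)

# Gap filling (assumption: missing points take the ensemble mean).
label_err <- -9999
observed  <- c(1.1, label_err, 2.9, label_err, 5.2)
gapfilled <- ifelse(observed == label_err, avg_predicted, observed)
```

Observed values are left untouched in `gapfilled`; only the points equal to `label_err` are replaced.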
`df`: A data frame containing the label (response variable) and feature (explanatory variable) time series used as model input. Missing values are allowed in every column.
`colname_label`: A character string giving the name of the column that holds the label time series.
`vctr_colname_feature`: A character vector giving the names of the feature time series columns used to construct the random forest model. If `NULL` (default), all columns of the input data frame except the label column specified by `colname_label` are used as features.
`min_nodesize`: A positive integer giving the minimal node size (the minimum number of data points in each leaf node). This hyperparameter should be optimized beforehand by out-of-bag evaluation.
`m_try`: A positive integer giving the number of features considered when splitting each node. This hyperparameter should be optimized beforehand by out-of-bag evaluation.
`subsample`: A numeric value between 0 and 1 giving the fraction of training data points sampled when constructing the random forest. This hyperparameter should be optimized beforehand by out-of-bag evaluation.
`do_outlier_detection`: A logical value. If `TRUE` (default), the function predicts the time series to detect outliers; otherwise, it estimates the time series to fill gaps.
`frac_train`: A numeric value between 0 and 1 giving the fraction of data points used as training data for constructing the random forest model. The remaining data points are used as test data. Default is 0.75.
`n_tree`: An integer giving the number of trees in the random forest. Default is 500.
`ran_seed`: An integer giving the random seed used when constructing the random forest model. Default is 12345.
`coef_iqr`: A positive value giving the multiplier of the interquartile range (IQR). A value is detected as a random forest outlier if it is less than Q1 (first quartile) - `coef_iqr` * IQR or greater than Q3 (third quartile) + `coef_iqr` * IQR. Default is 1.5.
`label_err`: A numeric value representing a missing value in the input vector(s). Default is -9999.
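The IQR rule controlled by `coef_iqr` and the resulting `flag_out` codes can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the vectors `observed`, `q1`, and `q3` are hypothetical, and it is assumed that the quartiles are the per-time-point ensemble quartiles of the forest.

```r
coef_iqr  <- 1.5
label_err <- -9999

# Hypothetical observed series; the second value is originally missing.
observed <- c(10.2, label_err, 25.0, 9.8, 3.0)

# Hypothetical ensemble quartiles at each time point.
q1 <- c(9.5, 9.6, 9.7, 9.4, 9.5)
q3 <- c(10.5, 10.6, 10.7, 10.4, 10.5)

# Acceptance bounds: [Q1 - coef_iqr * IQR, Q3 + coef_iqr * IQR].
iqr   <- q3 - q1
lower <- q1 - coef_iqr * iqr
upper <- q3 + coef_iqr * iqr

is_missing <- observed == label_err
is_outlier <- !is_missing & (observed < lower | observed > upper)

# 0: present and not an outlier; 1: present but an outlier; 2: originally missing.
flag_out <- ifelse(is_missing, 2L, ifelse(is_outlier, 1L, 0L))

# Detected outliers are replaced by label_err in the cleaned series.
cleaned <- ifelse(is_outlier, label_err, observed)
```

A larger `coef_iqr` widens the acceptance bounds, so fewer points are flagged as outliers.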
Yoshiaki Hata