Fits an Energy Tree for classification or regression.
etree(
response,
covariates,
weights = NULL,
minbucket = 5,
alpha = 0.05,
R = 1000,
split_type = "coeff",
coeff_split_type = "test",
p_adjust_method = "fdr",
random_covs = NULL
)
An object of classes "etree", "constparty", and "party". It stores all the information about the fitted tree. Its elements can be individually accessed using the $ operator. Their names and contents are the following:
node: a partynode object representing the basic structure of the tree;
data: a list containing the data used for the fitting process. Traditional covariates are included in their original form, while structured covariates are stored in the form of components if split_type = "coeff", or as a factor whose levels go from 1 to the total number of observations if split_type = "cluster";
fitted: a data.frame whose number of rows coincides with the sample size. It includes the fitted terminal node identifiers (in "(fitted)") and the response values of all observations (in "(response)");
terms: a terms object;
names (optional): names of the nodes in the tree. They can be set using a character vector: if its length is smaller than the number of nodes, the remaining nodes have missing names; if it is larger, the exceeding names are ignored.
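As a minimal sketch of how these elements can be accessed, assuming the etree package is loaded and using a single numeric covariate (the object names here are illustrative, not part of the package):

```r
## Minimal fit to illustrate element access via the $ operator
cov_num <- rnorm(100)
fit <- etree(response = cov_num ^ 2, covariates = list(cov_num))

fit$node                         ## partynode object with the tree structure
head(fit$fitted[["(fitted)"]])   ## terminal node identifier of each observation
head(fit$fitted[["(response)"]]) ## response value of each observation
```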
response: Response variable, an object of class either "factor" or "numeric" (for classification and regression, respectively).
covariates: Set of covariates. Must be provided as a list, where each element is a different variable. Currently available types, and the form they need to have to be correctly recognized, are the following:
Numeric: numeric or integer vectors;
Nominal: factors;
Functions: objects of class "fdata";
Graphs: (lists of) objects of class "igraph".
Each element (i.e., variable) in the covariates list must have the same length(), which corresponds to the sample size.
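The equal-length requirement can be checked before fitting; a small base-R sketch with hypothetical covariates:

```r
## Two hypothetical covariates of matching length (the sample size)
covs <- list(num = rnorm(30),
             nom = factor(sample(c("a", "b"), 30, replace = TRUE)))

## All elements of the covariates list must have the same length()
stopifnot(length(unique(vapply(covs, length, integer(1)))) == 1)
```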
weights: Optional vector of non-negative integer-valued weights to be used in the fitting process. If not provided, all observations are assumed to have weight equal to 1.
minbucket: Positive integer specifying the minimum number of observations that each terminal node must contain. Default is 5.
alpha: Nominal level controlling the probability of type I error in the Energy tests of independence used for variable selection. Default is 0.05.
R: Number of replicates employed to approximate the sampling distribution of the test statistic in every Energy test of independence. Default is 1000.
split_type: Splitting method used when the selected covariate is structured. It has two possible values: "coeff" for feature vector extraction, and "cluster" for clustering. See Details for further information.
coeff_split_type: Method to select the split point for the chosen component when the selected covariate is structured and split_type = "coeff". It has two possible values: "test", in which case Energy tests of independence are used, and "traditional", to employ traditional methods (Gini index for classification and RSS for regression). See Details for further information.
p_adjust_method: Multiple-testing adjustment method for P-values, which can be set to any of the values provided by p.adjust.methods. Default is "fdr", for False Discovery Rate.
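The admissible values are those of p.adjust.methods in base R's stats package; for reference:

```r
## Adjustment methods accepted by p_adjust_method (from base R's stats package)
stats::p.adjust.methods
## "holm" "hochberg" "hommel" "bonferroni" "BH" "BY" "fdr" "none"
```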
random_covs: Size of the random subset of covariates to choose from at each split. If set to NULL (default), all the covariates are considered each time.
etree() is the main function of the package of the same name. It fits Energy Trees given the response variable, the set of covariates, and possibly some other parameters. The function is called in the same way regardless of the task: the choice between classification and regression is made automatically, depending on the nature of the response variable.
Energy Trees (Giubilei et al., 2022) are a recursive partitioning model built upon Conditional Trees (Hothorn et al., 2006). At each step of the iterative procedure, an Energy test of independence (Szekely et al., 2007) is performed between the response variable and each of the J covariates. If the test of global independence (defined as the intersection of the J tests of partial independence) is not rejected at the significance level set by alpha, the recursion is stopped; otherwise, the covariate most strongly associated with the response in terms of P-value is selected for splitting. When the covariate is traditional (i.e., numeric or nominal), an Energy test of independence is performed for each possible split point, and the one yielding the strongest association with the response is chosen. When the selected covariate is structured, the split procedure is defined by the value of split_type, and possibly by that of coeff_split_type.
split_type specifies the splitting method for structured covariates. It has two possible values:
"coeff": feature vector extraction is used to transform the structured selected covariate into a set of numeric components, using a representation that is specific to its type. Available transformations of this kind are cubic B-spline expansions for functional data, and shell distributions (Carmi et al., 2007) for graphs, obtained through k-cores (Seidman, 1983), s-cores (Eidsaa and Almaas, 2013), and d-cores (Giatsidis et al., 2013) for binary, weighted, and directed graphs, respectively. Then, the component most associated with the response is selected using Energy tests of independence (Szekely et al., 2007), and the split point for that component is chosen using the method defined by coeff_split_type;
"cluster": the observed values of the structured selected covariate are used within a Partitioning Around Medoids (Kaufmann and Rousseeuw, 1987) step to split observations into the two kid nodes. Medoid calculation and unit assignment are performed using pam(). Distances are specific to each type of variable (see dist_comp() for details).
coeff_split_type defines the method to select the split point for the chosen component of the selected structured covariate if and only if split_type = "coeff". It has two possible values:
"test": an Energy test of independence (Szekely et al., 2007) is performed for each possible split point of the chosen component, and the one yielding the strongest association with the response is selected;
"traditional": the split point for the chosen component is selected as the one minimizing the Gini index (for classification) or the RSS (for regression) in the two kid nodes.
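The two strategies above correspond to different argument combinations in the call; a minimal sketch, assuming the etree package is loaded (data construction mirrors the examples at the end of this page):

```r
## Small simulated data: one numeric and one nominal covariate
nobs <- 100
cov_num <- rnorm(nobs)
cov_list <- list(cov_num, factor(rbinom(nobs, size = 1, prob = 0.5)))
resp_reg <- cov_num ^ 2

## Clustering-based splits for structured covariates (PAM step)
fit_cl <- etree(response = resp_reg, covariates = cov_list,
                split_type = "cluster")

## Feature vector extraction with traditional split-point selection
## (Gini index for classification, RSS for regression)
fit_tr <- etree(response = resp_reg, covariates = cov_list,
                split_type = "coeff", coeff_split_type = "traditional")
```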
R. Giubilei, T. Padellini, P. Brutti (2022). Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates. arXiv preprint. https://arxiv.org/pdf/2207.04430.pdf.
S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, and E. Shir (2007). A model of internet topology using k-shell decomposition. Proceedings of the National Academy of Sciences, 104(27):11150-11154.
M. Eidsaa and E. Almaas (2013). S-core network decomposition: A generalization of k-core analysis to weighted networks. Physical Review E, 88(6):062819.
C. Giatsidis, D. M. Thilikos, and M. Vazirgiannis (2013). D-cores: measuring collaboration of directed graphs based on degeneracy. Knowledge and information systems, 35(2):311-343.
T. Hothorn, K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674.
L. Kaufmann and P. Rousseeuw (1987). Clustering by means of medoids. Data Analysis based on the L1-Norm and Related Methods, pages 405-416.
S. B. Seidman (1983). Network structure and minimum degree. Social networks, 5(3):269-287.
G. J. Szekely, M. L. Rizzo, and N. K. Bakirov (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769-2794.
ctree() for the partykit implementation of Conditional Trees (Hothorn et al., 2006).
## Covariates
nobs <- 100
cov_num <- rnorm(nobs)
cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5))
cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2))
cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1)
cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun)
## Response variable(s)
resp_reg <- cov_num ^ 2
y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0)
resp_cls <- factor(y)
## Regression ##
etree_fit <- etree(response = resp_reg, covariates = cov_list)
print(etree_fit)
plot(etree_fit)
mean((resp_reg - predict(etree_fit)) ^ 2)
## Classification ##
etree_fit <- etree(response = resp_cls, covariates = cov_list)
print(etree_fit)
plot(etree_fit)
table(resp_cls, predict(etree_fit))