Fits an Energy Tree for classification or regression.
etree(
response,
covariates,
weights = NULL,
minbucket = 5,
alpha = 0.05,
R = 1000,
split_type = "coeff",
coeff_split_type = "test",
p_adjust_method = "fdr",
random_covs = NULL
)
An object of classes "etree", "constparty", and "party". It stores all the information about the fitted tree. Its elements can be individually accessed using the $ operator. Their names and contents are the following:
node: a partynode object representing the basic structure of the tree;
data: a list containing the data used for the fitting process. Traditional covariates are included in their original form, while structured covariates are stored in the form of components if split_type = "coeff", or as a factor whose levels go from 1 to the total number of observations if split_type = "cluster";
fitted: a data.frame whose number of rows coincides with the sample size. It includes the fitted terminal node identifiers (in "(fitted)") and the response values of all observations (in "(response)");
terms: a terms object;
names (optional): names of the nodes in the tree. They can be set using a character vector: if its length is smaller than the number of nodes, the remaining nodes have missing names; if it is larger, the exceeding names are ignored.
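As a minimal sketch of how these elements can be accessed, assuming the etree package is loaded and using a single numeric covariate (the object names here are illustrative, not part of the package):

```r
## Minimal fit to illustrate element access via the $ operator
cov_num <- rnorm(100)
fit <- etree(response = cov_num ^ 2, covariates = list(cov_num))

fit$node                         ## partynode object with the tree structure
head(fit$fitted[["(fitted)"]])   ## terminal node identifier of each observation
head(fit$fitted[["(response)"]]) ## response value of each observation
```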
response: Response variable, an object of class either "factor" or "numeric" (for classification and regression, respectively).
covariates: Set of covariates. Must be provided as a list, where each element is a different variable. Currently available types, and the form they need to have to be correctly recognized, are the following:
Numeric: numeric or integer vectors;
Nominal: factors;
Functions: objects of class "fdata";
Graphs: (lists of) objects of class "igraph".
Each element (i.e., variable) in the covariates list must have the same length(), which corresponds to the sample size.
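The equal-length requirement can be checked before fitting; a small base-R sketch with hypothetical covariates:

```r
## Two hypothetical covariates of matching length (the sample size)
covs <- list(num = rnorm(30),
             nom = factor(sample(c("a", "b"), 30, replace = TRUE)))

## All elements of the covariates list must have the same length()
stopifnot(length(unique(vapply(covs, length, integer(1)))) == 1)
```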
weights: Optional vector of non-negative integer-valued weights to be used in the fitting process. If not provided, all observations are assumed to have weight equal to 1.
minbucket: Positive integer specifying the minimum number of observations that each terminal node must contain. Default is 5.
alpha: Nominal level controlling the probability of type I error in the Energy tests of independence used for variable selection. Default is 0.05.
R: Number of replicates employed to approximate the sampling distribution of the test statistic in every Energy test of independence. Default is 1000.
split_type: Splitting method used when the selected covariate is structured. It has two possible values: "coeff" for feature vector extraction, and "cluster" for clustering. See Details for further information.
coeff_split_type: Method to select the split point for the chosen component when the selected covariate is structured and split_type = "coeff". It has two possible values: "test", in which case Energy tests of independence are used, and "traditional", to employ traditional methods (Gini index for classification and RSS for regression). See Details for further information.
p_adjust_method: Multiple-testing adjustment method for P-values, which can be set to any of the values provided by p.adjust.methods. Default is "fdr", for False Discovery Rate.
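The admissible values are those of p.adjust.methods in base R's stats package; for reference:

```r
## Adjustment methods accepted by p_adjust_method (from base R's stats package)
stats::p.adjust.methods
## "holm" "hochberg" "hommel" "bonferroni" "BH" "BY" "fdr" "none"
```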
random_covs: Size of the random subset of covariates to choose from at each split. If set to NULL (default), all the covariates are considered each time.
etree() is the main function of the package of the same name. It fits Energy Trees given the response variable, the set of covariates, and possibly some other parameters. The function is called in the same way regardless of the task: the choice between classification and regression is made automatically, depending on the nature of the response variable.
Energy Trees (Giubilei et al., 2022) are a recursive partitioning model built upon Conditional Trees (Hothorn et al., 2006). At each step of the iterative procedure, an Energy test of independence (Szekely et al., 2007) is performed between the response variable and each of the J covariates. If the test of global independence (defined as the intersection of the J tests of partial independence) is not rejected at the significance level set by alpha, the recursion is stopped; otherwise, the covariate most strongly associated with the response in terms of P-value is selected for splitting. When the covariate is traditional (i.e., numeric or nominal), an Energy test of independence is performed for each possible split point, and the one yielding the strongest association with the response is chosen. When the selected covariate is structured, the split procedure is defined by the value of split_type, and possibly by that of coeff_split_type.
split_type specifies the splitting method for structured covariates. It has two possible values:
"coeff": feature vector extraction is used to transform the structured selected covariate into a set of numeric components, using a representation that is specific to its type. Available transformations of this kind are cubic B-spline expansions for functional data, and shell distributions (Carmi et al., 2007) for graphs, obtained through k-cores (Seidman, 1983), s-cores (Eidsaa and Almaas, 2013), and d-cores (Giatsidis et al., 2013) for binary, weighted, and directed graphs, respectively. Then, the component most associated with the response is selected using Energy tests of independence (Szekely et al., 2007), and the split point for that component is chosen using the method defined by coeff_split_type;
"cluster": the observed values of the structured selected covariate are used within a Partitioning Around Medoids (Kaufmann and Rousseeuw, 1987) step to split observations into the two kid nodes. Medoid calculation and unit assignment are performed using pam(). Distances are specific to each type of variable (see dist_comp() for details).
coeff_split_type defines the method to select the split point for the chosen component of the selected structured covariate if and only if split_type = "coeff". It has two possible values:
"test": an Energy test of independence (Szekely et al., 2007) is performed for each possible split point of the chosen component, and the one yielding the strongest association with the response is selected;
"traditional": the split point for the chosen component is selected as the one minimizing the Gini index (for classification) or the RSS (for regression) in the two kid nodes.
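The two strategies above correspond to different argument combinations in the call; a minimal sketch, assuming the etree package is loaded (data construction mirrors the examples at the end of this page):

```r
## Small simulated data: one numeric and one nominal covariate
nobs <- 100
cov_num <- rnorm(nobs)
cov_list <- list(cov_num, factor(rbinom(nobs, size = 1, prob = 0.5)))
resp_reg <- cov_num ^ 2

## Clustering-based splits for structured covariates (PAM step)
fit_cl <- etree(response = resp_reg, covariates = cov_list,
                split_type = "cluster")

## Feature vector extraction with traditional split-point selection
## (Gini index for classification, RSS for regression)
fit_tr <- etree(response = resp_reg, covariates = cov_list,
                split_type = "coeff", coeff_split_type = "traditional")
```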
R. Giubilei, T. Padellini, P. Brutti (2022). Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates. arXiv preprint. https://arxiv.org/pdf/2207.04430.pdf.
S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, and E. Shir (2007). A model of internet topology using k-shell decomposition. Proceedings of the National Academy of Sciences, 104(27):11150-11154.
M. Eidsaa and E. Almaas (2013). S-core network decomposition: A generalization of k-core analysis to weighted networks. Physical Review E, 88(6):062819.
C. Giatsidis, D. M. Thilikos, and M. Vazirgiannis (2013). D-cores: measuring collaboration of directed graphs based on degeneracy. Knowledge and information systems, 35(2):311-343.
T. Hothorn, K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674.
L. Kaufmann and P. Rousseeuw (1987). Clustering by means of medoids. Data Analysis based on the L1-Norm and Related Methods, pages 405-416.
S. B. Seidman (1983). Network structure and minimum degree. Social networks, 5(3):269-287.
G. J. Szekely, M. L. Rizzo, and N. K. Bakirov (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769-2794.
ctree() for the partykit implementation of Conditional Trees (Hothorn et al., 2006).
## Covariates
nobs <- 100
cov_num <- rnorm(nobs)
cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5))
cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2))
cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1)
cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun)
## Response variable(s)
resp_reg <- cov_num ^ 2
y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0)
resp_cls <- factor(y)
## Regression ##
etree_fit <- etree(response = resp_reg, covariates = cov_list)
print(etree_fit)
plot(etree_fit)
mean((resp_reg - predict(etree_fit)) ^ 2)
## Classification ##
etree_fit <- etree(response = resp_cls, covariates = cov_list)
print(etree_fit)
plot(etree_fit)
table(resp_cls, predict(etree_fit))