High-level function that corrects the computed scores in a hierarchy according to the chosen ensemble algorithm.
Do.TPR.DAG(threshold = seq(from = 0.1, to = 0.9, by = 0.1),
weight = seq(from = 0.1, to = 0.9, by = 0.1), kk = 5, folds = 5,
seed = 23, norm = TRUE, norm.type = NULL, positive = "children",
bottomup = "threshold.free", rec.levels = seq(from = 0.1, to = 1, by =
0.1), n.round = 3, f.criterion = "F", metric = NULL,
flat.file = flat.file, ann.file = ann.file, dag.file = dag.file,
flat.dir = flat.dir, ann.dir = ann.dir, dag.dir = dag.dir,
hierScore.dir = hierScore.dir, perf.dir = perf.dir)
threshold: range of threshold values to be tested in order to find the best threshold (def: from:0.1, to:0.9, by:0.1). The denser the range, the more likely the best threshold is found, but the longer the execution time. Set the parameter threshold only for the variants that require a threshold for the selection of the positive nodes; otherwise set threshold to zero.
weight: range of weight values to be tested in order to find the best weight (def: from:0.1, to:0.9, by:0.1). The denser the range, the more likely the best weight is found, but the longer the execution time. Set the parameter weight only for the weighted variants; otherwise set weight to zero.
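The trade-off between grid density and execution time can be made concrete by counting the candidate values. The sketch below uses only base R; the denser grid and the variable names threshold.dense, n.default, n.dense and n.pairs are illustrative, and the assumption that a variant tuning both parameters evaluates every (threshold, weight) pair is not stated on this page.

```r
# Default grids taken from the usage line: 0.1, 0.2, ..., 0.9
threshold <- seq(from = 0.1, to = 0.9, by = 0.1)
weight    <- seq(from = 0.1, to = 0.9, by = 0.1)

# A denser, purely illustrative grid
threshold.dense <- seq(from = 0.05, to = 0.95, by = 0.05)

# Number of candidate values to be cross-validated
n.default <- length(threshold)
n.dense   <- length(threshold.dense)

# Assumption: if both parameters are tuned jointly, every pair is a candidate
n.pairs <- nrow(expand.grid(threshold = threshold, weight = weight))

print(c(default = n.default, dense = n.dense, pairs = n.pairs))
```

Halving the step roughly doubles the number of candidates for a single parameter, so the cost of the cross-validated search grows accordingly.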
kk: number of folds of the cross-validation (def: kk=5) on which the parameters threshold and weight are tuned.
folds: number of folds of the cross-validation on which the performance metrics, averaged across folds, are computed (def. 5). If folds=NULL, the performance metrics are computed one-shot; otherwise they are averaged across folds.
seed: initialization seed for the random generator used to create folds (def. 23). If NULL, folds are generated without seed initialization. The parameter seed controls both the parameter kk and the parameter folds.
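To illustrate why a fixed seed makes the folds reproducible, here is a minimal base R sketch. The helper make_folds is hypothetical and is not the routine the package uses internally (which may, for instance, stratify the folds); it only shows the effect of seeding the random generator before the examples are shuffled.

```r
# Hypothetical helper: assign n examples to k folds of near-equal size.
# If seed is NULL the random generator is not initialized, so the folds
# change from one call to the next.
make_folds <- function(n, k, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  idx <- sample(n)                     # random permutation of 1..n
  split(idx, rep_len(seq_len(k), n))   # deal the shuffled indices into k folds
}

f1 <- make_folds(n = 20, k = 5, seed = 23)
f2 <- make_folds(n = 20, k = 5, seed = 23)

# Same seed, same folds
print(identical(f1, f2))
```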
norm: boolean value indicating whether the flat scores matrix has already been normalized:
  TRUE (def.): the flat scores matrix has already been normalized according to a normalization method;
  FALSE: the flat scores matrix has not been normalized yet. See the parameter norm.type to set the on-the-fly normalization method to apply.
norm.type: can be one of the following three values:
  NULL (def.): set norm.type to NULL if and only if the parameter norm is set to TRUE;
  MaxNorm: each score is divided by the maximum score of its class;
  Qnorm: quantile normalization; the preprocessCore package is used.
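The MaxNorm option can be pictured with a few lines of base R. This is a sketch of the idea as described above, not the package's own code; it assumes that examples are on rows and classes on columns, and the helper name max.norm and the toy matrix are illustrative.

```r
# Toy flat scores matrix (assumed layout: examples on rows, classes on columns)
S <- cbind(c1 = c(0.2, 0.4, 0.8),
           c2 = c(1.0, 2.0, 0.5))
rownames(S) <- c("e1", "e2", "e3")

# Hypothetical helper: divide each score by the maximum score of its class
max.norm <- function(S) {
  col.max <- apply(S, 2, max)
  col.max[col.max == 0] <- 1   # leave all-zero classes unchanged
  sweep(S, 2, col.max, "/")
}

S.norm <- max.norm(S)
print(S.norm)
```

After this step the largest score of every class equals 1, which puts classes scored on different scales on a comparable footing.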
positive: choice of the positive nodes to be considered in the bottom-up strategy. Can be one of the following values:
  children (def.): for each node, its positive children are considered;
  descendants: for each node, its positive descendants are considered.
bottomup: strategy to enhance the flat predictions by propagating the positive predictions from leaves to root. It can be one of the following values:
  threshold.free (def.): positive nodes are selected on the basis of the threshold.free strategy;
  threshold: positive nodes are selected on the basis of the threshold strategy;
  weighted.threshold.free: positive nodes are selected on the basis of the weighted.threshold.free strategy;
  weighted.threshold: positive nodes are selected on the basis of the weighted.threshold strategy;
  tau: positive nodes are selected on the basis of the tau strategy. NOTE: tau is available only as a DESCENS variant. If you use the tau strategy, you must set the parameter positive=descendants.
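The following base R sketch illustrates how a threshold-free bottom-up pass with positive=children might work on a tiny DAG. The update rule is not stated on this page: the sketch assumes that a child is positive when its current score exceeds the flat score of its parent, and that the parent's score becomes the average of its own flat score and the scores of its positive children. Only the bottom-up step is shown; the toy graph, the scores and all variable names are illustrative.

```r
# Toy DAG as a list of children; node "c" has two parents ("a" and "b")
children <- list(r = c("a", "b"), a = "c", b = "c", c = character(0))

# Flat scores of one example for the four classes
flat <- c(r = 0.5, a = 0.4, b = 0.7, c = 0.9)

# Visit order from leaves to root: every child is visited before its parents
visit.order <- c("c", "a", "b", "r")

hier <- flat
for (node in visit.order) {
  ch <- children[[node]]
  # Assumed rule: positive children are those scoring above the parent's flat score
  pos <- ch[hier[ch] > flat[node]]
  # Assumed rule: average the flat score with the scores of the positive children
  hier[node] <- (flat[node] + sum(hier[pos])) / (1 + length(pos))
}

print(round(hier, 3))
```

With positive=descendants the candidate set for each node would contain all of its descendants rather than only its children; the thresholded and weighted variants would change how the positive set is chosen or how the terms are weighted.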
rec.levels: a vector with the desired recall levels (def: from:0.1, to:1, by:0.1) at which to compute the Precision at fixed Recall level (PXR).
n.round: number of rounding digits to be applied to the hierarchical scores matrix (def. 3). It is used for choosing the best threshold on the basis of the best F-measure.
f.criterion: character. Type of F-measure to be used to select the best F-measure. Two possibilities:
  F (def.): corresponds to the harmonic mean between the average precision and the average recall;
  avF: corresponds to the per-example F-score averaged across all the examples.
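The difference between the two criteria is easiest to see on a small example. The sketch below computes both quantities in base R from toy binary matrices; it follows the definitions given above and is not the package's implementation (the variable names are illustrative, and the toy data avoid examples with no predicted or no true classes, which would need a guard against division by zero).

```r
# Toy binary matrices: examples on rows, classes on columns
target <- rbind(e1 = c(1, 1, 0), e2 = c(1, 0, 0))
pred   <- rbind(e1 = c(1, 0, 0), e2 = c(1, 1, 1))

tp   <- rowSums(target * pred)   # true positives per example
prec <- tp / rowSums(pred)       # per-example precision
rec  <- tp / rowSums(target)     # per-example recall

# "F": harmonic mean of the average precision and the average recall
P <- mean(prec)
R <- mean(rec)
F.crit <- 2 * P * R / (P + R)

# "avF": per-example F-score, averaged across the examples
avF.crit <- mean(2 * prec * rec / (prec + rec))

print(c(F = F.crit, avF = avF.crit))
```

The two values generally differ, because averaging before or after taking the harmonic mean weights the examples differently.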
metric: a character string specifying the performance metric to maximize when tuning the parametric ensemble variants. It can be one of the following values:
  PRC: the parametric ensemble variant is tuned by maximizing AUPRC (AUPRC);
  FMAX: the parametric ensemble variant is tuned by maximizing Fmax (Multilabel.F.measure);
  NULL (def.): the threshold.free variant needs no parameter optimization, since it is non-parametric. So, if bottomup=threshold.free, set metric=NULL.
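Several of the arguments described so far constrain one another. As a convenience, those constraints can be collected in a small check run before the call. The helper check_tpr_args below is hypothetical and not part of the package; it encodes only the rules stated in the descriptions of norm.type, bottomup and metric.

```r
# Hypothetical helper: returns a character vector with one message per
# violated constraint (empty if the combination is consistent).
check_tpr_args <- function(norm, norm.type, positive, bottomup, metric) {
  problems <- character(0)
  # norm.type must be NULL if and only if norm is TRUE
  if (xor(isTRUE(norm), is.null(norm.type))) {
    problems <- c(problems, "norm.type must be NULL if and only if norm is TRUE")
  }
  # tau is only a DESCENS variant
  if (bottomup == "tau" && positive != "descendants") {
    problems <- c(problems, "bottomup = 'tau' requires positive = 'descendants'")
  }
  # the threshold.free variant is non-parametric
  if (bottomup == "threshold.free" && !is.null(metric)) {
    problems <- c(problems, "bottomup = 'threshold.free' requires metric = NULL")
  }
  if (!is.null(metric) && !(metric %in% c("PRC", "FMAX"))) {
    problems <- c(problems, "metric must be 'PRC', 'FMAX' or NULL")
  }
  problems
}

ok  <- check_tpr_args(norm = FALSE, norm.type = "MaxNorm", positive = "children",
                      bottomup = "threshold.free", metric = NULL)
bad <- check_tpr_args(norm = TRUE, norm.type = NULL, positive = "children",
                      bottomup = "tau", metric = "FMAX")
print(ok)
print(bad)
```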
flat.file: name of the file containing the flat scores matrix to be normalized or already normalized (without rda extension)
ann.file: name of the file containing the label matrix of the examples (without rda extension)
dag.file: name of the file containing the graph that represents the hierarchy of the classes (without rda extension)
flat.dir: relative path where the flat scores matrix is stored
ann.dir: relative path where the annotation matrix is stored
dag.dir: relative path where the graph is stored
hierScore.dir: relative path where the hierarchical scores matrix must be stored
perf.dir: relative path where the performance measures must be stored
Two rda files stored in the respective output directories:
  Hierarchical Scores Results: a matrix with examples on rows and classes on columns, representing the computed hierarchical scores for each example and for each considered class. It is stored in the hierScore.dir directory.
  Performance Measures: flat and hierarchical performance results. They are stored in the perf.dir directory.
The parametric hierarchical ensemble variants are cross-validated by maximizing the metric chosen in the parameter metric, that is, F-measure (Multilabel.F.measure) or AUPRC (AUPRC).
The function checks whether the number of classes differs between the flat scores matrix and the annotations matrix. If so, the number of terms of the annotations matrix is shrunk to the number of terms of the flat scores matrix, and the corresponding subgraph is computed as well. N.B.: all the nodes of the subgraph are assumed to be accessible from the root.
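The shrinking of the annotations matrix can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: classes are assumed to be the column names of both matrices, every class of the flat scores matrix is assumed to appear in the annotations matrix, and the subgraph step is not shown. The toy matrices S and ann are illustrative.

```r
# Toy flat scores matrix with three classes
S <- cbind(A = c(0.9, 0.2), B = c(0.4, 0.7), C = c(0.1, 0.3))
rownames(S) <- c("e1", "e2")

# Toy annotations matrix with one extra class ("D")
ann <- cbind(A = c(1, 0), B = c(1, 1), C = c(0, 0), D = c(0, 1))
rownames(ann) <- c("e1", "e2")

if (ncol(S) != ncol(ann)) {
  # Assumption: every scored class is annotated
  stopifnot(all(colnames(S) %in% colnames(ann)))
  # Keep only the annotated classes that also have flat scores
  ann <- ann[, colnames(S), drop = FALSE]
}

print(colnames(ann))
```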
# NOT RUN {
# Load the example graph (g), flat scores matrix (S) and labels matrix (L)
data(graph)
data(scores)
data(labels)

# Create the input and output directories if they do not exist yet
if (!dir.exists("data")) {
  dir.create("data")
}
if (!dir.exists("results")) {
  dir.create("results")
}

# Store the inputs as rda files, as expected by Do.TPR.DAG
save(g, file = "data/graph.rda")
save(L, file = "data/labels.rda")
save(S, file = "data/scores.rda")

# Input and output locations (file names without the rda extension)
dag.dir <- flat.dir <- ann.dir <- "data/"
hierScore.dir <- perf.dir <- "results/"
dag.file <- "graph"
flat.file <- "scores"
ann.file <- "labels"

# threshold.free is non-parametric: no threshold or weight to tune, metric = NULL
threshold <- weight <- 0
norm.type <- "MaxNorm"   # scores are normalized on the fly, hence norm = FALSE
positive <- "children"
bottomup <- "threshold.free"
rec.levels <- seq(from = 0.1, to = 1, by = 0.1)

Do.TPR.DAG(threshold = threshold, weight = weight, kk = 5, folds = 5, seed = 23,
    norm = FALSE, norm.type = norm.type, positive = positive, bottomup = bottomup,
    n.round = 3, f.criterion = "F", metric = NULL, rec.levels = rec.levels,
    flat.file = flat.file, ann.file = ann.file, dag.file = dag.file,
    flat.dir = flat.dir, ann.dir = ann.dir, dag.dir = dag.dir,
    hierScore.dir = hierScore.dir, perf.dir = perf.dir)
# }