Hidalgo model
The function fits the Heterogeneous intrinsic dimension algorithm (Hidalgo), developed in Allegra et al., 2020. The model is a Bayesian mixture of Pareto distributions with a modified likelihood that induces homogeneity across neighboring observations. It can segment the observations into multiple clusters characterized by different intrinsic dimensions (id), which makes it possible to capture hidden patterns in the data. For more details on the algorithm, refer to Allegra et al., 2020. For an example of application to basketball data, see Santos-Fernandez et al., 2021.
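The Pareto building block can be illustrated without the package. In the TWO-NN setting that Hidalgo extends, the ratio mu between each point's second and first nearest-neighbor distances is modeled as Pareto with scale 1 and shape equal to the intrinsic dimension d. The following is a minimal base-R sketch under the assumption of a single homogeneous dimension (no mixture and no neighbor homogeneity term); the names `nn_ratio` and `d_hat` are illustrative and not part of the package.

```r
set.seed(1)
# 500 points from a 3-dimensional standard Gaussian
X <- matrix(rnorm(500 * 3), ncol = 3)

# Ratio of second to first nearest-neighbor distance for every point
# (illustrative helper, not a package function)
nn_ratio <- function(X) {
  D <- as.matrix(dist(X))
  diag(D) <- Inf                      # exclude self-distances
  apply(D, 1, function(row) {
    two_smallest <- sort(row)[1:2]
    two_smallest[2] / two_smallest[1]
  })
}

mu <- nn_ratio(X)

# Under mu ~ Pareto(1, d), log(mu) ~ Exponential(d),
# so the maximum likelihood estimate of d is n / sum(log(mu))
d_hat <- length(mu) / sum(log(mu))
d_hat
```

Hidalgo replaces the single shape parameter d with K mixture components, each with its own d, which is what allows different regions of the data to have different intrinsic dimensions.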
Hidalgo(
X = NULL,
dist_mat = NULL,
K = 10,
nsim = 5000,
burn_in = 5000,
thinning = 1,
verbose = TRUE,
q = 3,
xi = 0.75,
alpha_Dirichlet = 0.05,
a0_d = 1,
b0_d = 1,
prior_type = c("Conjugate", "Truncated", "Truncated_PointMass"),
D = NULL,
pi_mass = 0.5
)
# S3 method for Hidalgo
print(x, ...)
# S3 method for Hidalgo
plot(x, type = c("A", "B", "C"), class = NULL, ...)
# S3 method for Hidalgo
summary(object, ...)
# S3 method for summary.Hidalgo
print(x, ...)
An object of class Hidalgo, which is a list containing:
cluster_prob: chains of the posterior mixture weights;
membership_labels: chains of the membership labels for all the observations;
id_raw: chains of the K intrinsic dimension parameters, one per mixture component;
id_postpr: a chain for each observation, corrected for label switching;
id_summary: a matrix containing, for each observation, the value of the posterior mean and the 5%, 25%, 50%, 75%, 95% quantiles;
recap: a list with the objects and specifications passed to the function used in the estimation.
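To make the content of id_summary concrete, the sketch below computes the same kind of table (posterior mean plus the five quantiles) from a matrix of chains. It is only an illustration: `id_chains` is a hypothetical simulated stand-in with iterations in rows and observations in columns, and the actual layout of id_postpr in the returned object may differ.

```r
set.seed(2)
# Hypothetical stand-in for post-processed chains:
# 200 retained iterations (rows) for 4 observations (columns)
id_chains <- cbind(
  rgamma(200, shape = 90,  rate = 30),   # centered near 3
  rgamma(200, shape = 90,  rate = 30),
  rgamma(200, shape = 250, rate = 50),   # centered near 5
  rgamma(200, shape = 250, rate = 50)
)

probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)

# One row per observation: posterior mean followed by the five quantiles
id_summary_sketch <- t(apply(id_chains, 2, function(chain) {
  c(mean = mean(chain), quantile(chain, probs = probs))
}))
id_summary_sketch
```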
X: data matrix with n observations and D variables.
dist_mat: distance matrix computed between the n observations.
K: integer, number of mixture components.
nsim: number of MCMC iterations to run.
burn_in: number of MCMC iterations to discard as burn-in period.
thinning: integer indicating the thinning interval.
verbose: logical, should the progress of the sampler be printed?
q: integer, first local homogeneity parameter. Default is 3.
xi: real number between 0 and 1, second local homogeneity parameter. Default is 0.75.
alpha_Dirichlet: parameter of the symmetric Dirichlet prior on the mixture weights. Default is 0.05, inducing a sparse mixture. Values that are too small (i.e., lower than 0.005) may cause underflow.
a0_d: shape parameter of the Gamma prior on d.
b0_d: rate parameter of the Gamma prior on d.
prior_type: character, type of Gamma prior on d, can be
  "Conjugate": a conjugate Gamma distribution is elicited;
  "Truncated": the conjugate Gamma prior is truncated over the interval (0,D);
  "Truncated_PointMass": same as "Truncated", but a point mass is placed on D, to allow the id to be identically equal to the nominal dimension.
D: integer, the maximal dimension of the dataset.
pi_mass: probability placed a priori on D when Truncated_PointMass is chosen.
x: object of class Hidalgo, the output of the Hidalgo() function.
...: other arguments passed to specific methods.
type: character that indicates the type of plot that is requested. It can be:
  "A": plot the MCMC and the ergodic means NOT corrected for label switching;
  "B": plot the posterior mean and median of the id for each observation, after the chains are processed for label switching;
  "C": plot the estimated id distributions stratified by the groups specified in the class vector;
class: factor variable used to stratify observations according to their id estimates.
object: object of class Hidalgo, the output of the Hidalgo() function.
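The three prior_type options can be compared by drawing from each prior directly. The sketch below uses base R only and does not call the package. The mixture form used for "Truncated_PointMass" (probability pi_mass on D, truncated Gamma otherwise) is an assumption based on the description above, and the object names are illustrative.

```r
set.seed(3)
a0_d <- 1; b0_d <- 1      # Gamma shape and rate, as in the defaults
D <- 5                    # nominal dimension of the data
pi_mass <- 0.5            # prior probability that d equals D
n <- 10000

# "Conjugate": plain Gamma draws, unbounded above
d_conj <- rgamma(n, shape = a0_d, rate = b0_d)

# "Truncated": Gamma restricted to (0, D), sampled by inverting the CDF
u <- runif(n, 0, pgamma(D, shape = a0_d, rate = b0_d))
d_trunc <- qgamma(u, shape = a0_d, rate = b0_d)

# "Truncated_PointMass": with probability pi_mass return exactly D,
# otherwise a truncated Gamma draw (assumed mixture form)
at_D <- runif(n) < pi_mass
d_pm <- ifelse(at_D, D, d_trunc)

c(max_conjugate = max(d_conj), max_truncated = max(d_trunc),
  share_at_D = mean(d_pm == D))
```

The truncated variants rule out intrinsic dimensions larger than the number of variables, which the plain Gamma prior cannot do.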
Allegra M, Facco E, Denti F, Laio A, Mira A (2020). “Data segmentation based on the local intrinsic dimension.” Scientific Reports, 10(1), 1–27. ISSN 20452322, doi:10.1038/s41598-020-72222-0.
Santos-Fernandez E, Denti F, Mengersen K, Mira A (2021). “The role of intrinsic dimension in high-resolution player tracking data – Insights in basketball.” Annals of Applied Statistics - Forthcoming. ISSN 2331-8422, 2002.04148, doi:10.1038/s41598-022-20991-1.
See id_by_class and clustering to understand how to further postprocess the results.
# \donttest{
set.seed(1234)
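# 500 observations in 5 dimensions
X <- replicate(5, rnorm(500))
# the first 250 observations vary along 3 coordinates only (lower id) ...
X[1:250, 1:2] <- 0
# ... and are shifted away from the remaining 250
X[1:250, ] <- X[1:250, ] + 4
# true group labels: 1 for the first 250 observations, 2 for the rest
oracle <- rep(1:2, rep(250, 2))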
# this is just a short example
# increase the number of iterations to improve mixing and convergence
h_out <- Hidalgo(X, nsim = 500, burn_in = 500)
plot(h_out, type = "B")
id_by_class(h_out, oracle)
# }
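The comment about increasing the number of iterations refers to standard MCMC diagnostics. As a generic illustration of burn-in, thinning, and the ergodic means shown by plot type "A", the sketch below uses a toy autocorrelated chain rather than Hidalgo output. How the function applies burn_in and thinning internally is not specified here, so the subsetting shown is an assumption.

```r
set.seed(4)
# Toy autocorrelated chain (AR(1)) standing in for one MCMC trace;
# it starts far from its stationary mean of 3
n_iter <- 2000
chain <- numeric(n_iter)
chain[1] <- 10
for (t in 2:n_iter) {
  chain[t] <- 3 + 0.9 * (chain[t - 1] - 3) + rnorm(1, sd = 0.2)
}

burn_in  <- 500
thinning <- 5

# Drop the burn-in, then keep every `thinning`-th draw
kept <- chain[-seq_len(burn_in)]
kept <- kept[seq(1, length(kept), by = thinning)]

# Ergodic (running) mean of the retained draws
ergodic_mean <- cumsum(kept) / seq_along(kept)

length(kept)
tail(ergodic_mean, 1)
```

An ergodic mean that is still drifting at the end of the run suggests that more iterations or a longer burn-in are needed.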