Hidalgo: Fit the `Hidalgo` model

Description

The function fits the Heterogeneous intrinsic dimension algorithm, developed in Allegra et al., 2020. The model is a Bayesian mixture of Pareto distributions with a modified likelihood that induces homogeneity across neighboring observations. It can segment the observations into multiple clusters characterized by different intrinsic dimensions, which helps uncover hidden patterns in the data. For more details on the algorithm, refer to Allegra et al., 2020. For an example of application to basketball data, see Santos-Fernandez et al., 2021.
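To give some intuition for the Pareto component of the model, the sketch below uses base R only and does not call the package. It assumes, following the construction in Allegra et al., 2020, that the modelled quantity is the ratio between the distances to the second and first nearest neighbour of each point, and that this ratio is Pareto(1, d) distributed. All object names are illustrative.

```r
# Minimal sketch in base R; this is NOT the package implementation.
set.seed(1)
n    <- 500
X    <- matrix(rnorm(n * 3), ncol = 3)   # data with 3 informative dimensions
dmat <- as.matrix(dist(X))               # Euclidean distance matrix
diag(dmat) <- Inf                        # a point is not its own neighbour

# distances to the first and second nearest neighbour of each point
r12 <- t(apply(dmat, 1, function(row) sort(row)[1:2]))
mu  <- r12[, 2] / r12[, 1]               # ratios, assumed Pareto(1, d)

# maximum-likelihood estimate of d for a single Pareto component
d_hat <- n / sum(log(mu))
d_hat
```

With a single homogeneous component the estimate should land near the number of informative dimensions; the mixture fitted by Hidalgo() lets d vary across clusters.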

Usage

Hidalgo(
  X = NULL,
  dist_mat = NULL,
  K = 10,
  nsim = 5000,
  burn_in = 5000,
  thinning = 1,
  verbose = TRUE,
  q = 3,
  xi = 0.75,
  alpha_Dirichlet = 0.05,
  a0_d = 1,
  b0_d = 1,
  prior_type = c("Conjugate", "Truncated", "Truncated_PointMass"),
  D = NULL,
  pi_mass = 0.5
)
# S3 method for Hidalgo
print(x, ...)
# S3 method for Hidalgo
plot(x, type = c("A", "B", "C"), class = NULL, ...)
# S3 method for Hidalgo
summary(object, ...)
# S3 method for summary.Hidalgo
print(x, ...)

Value

object of class Hidalgo, which is a list containing

cluster_prob: chains of the posterior mixture weights;
membership_labels: chains of the membership labels for all the observations;
id_raw: chains of the K intrinsic dimension parameters, one per mixture component;
id_postpr: a chain of the id for each observation, corrected for label switching;
id_summary: a matrix containing, for each observation, the posterior mean and the 5%, 25%, 50%, 75%, 95% quantiles;
recap: a list with the objects and specifications passed to the function used in the estimation.
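As an illustration of how a summary like id_summary relates to per-observation chains, the following base R sketch builds the same six statistics from a simulated matrix. The matrix `chains` is a hypothetical stand-in (iterations in rows, observations in columns); the orientation of the object actually returned by the function may differ.

```r
# Hypothetical chains, not output of Hidalgo()
set.seed(2)
n_iter <- 1000
n_obs  <- 4
chains <- matrix(rgamma(n_iter * n_obs, shape = 20, rate = 10), nrow = n_iter)

# posterior mean plus the 5%, 25%, 50%, 75%, 95% quantiles of one chain
summarise_chain <- function(x) {
  c(mean = mean(x), quantile(x, probs = c(0.05, 0.25, 0.50, 0.75, 0.95)))
}

# one row per observation, one column per statistic
id_summary_sketch <- t(apply(chains, 2, summarise_chain))
id_summary_sketch
```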

Arguments

X

data matrix with n observations and D variables.

dist_mat

distance matrix computed between the n observations.

K

integer, number of mixture components.

nsim

number of MCMC iterations to run.

burn_in

number of MCMC iterations to discard as burn-in period.

thinning

integer indicating the thinning interval.
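The interplay of nsim, burn_in and thinning can be sketched with simple index arithmetic. The snippet assumes that burn_in iterations are run first and discarded, and that every thinning-th of the following nsim iterations is stored; this is a common convention, not a confirmed description of the sampler's internal bookkeeping. The value thinning = 5 is chosen only for illustration (the default is 1).

```r
nsim     <- 5000
burn_in  <- 5000
thinning <- 5                      # illustrative value, not the default

total_iter <- burn_in + nsim       # assumed total number of iterations
kept       <- seq(from = burn_in + 1, to = total_iter, by = thinning)

length(kept)                       # 1000 retained draws under this assumption
head(kept)
```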

verbose

logical, should the progress of the sampler be printed?

q

integer, first local homogeneity parameter. Default is 3.

xi

real number between 0 and 1, second local homogeneity parameter. Default is 0.75.
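The role of q and xi can be pictured as follows: for each observation, look at its q nearest neighbours and count how many share its cluster label; configurations with many matching neighbours receive more weight when xi is above 0.5. The base R sketch below computes these counts for hypothetical labels `z`. It omits the normalising constant and is an assumption-laden illustration, not the likelihood code used by the package.

```r
set.seed(3)
n  <- 200
q  <- 3
xi <- 0.75
X  <- matrix(rnorm(n * 2), ncol = 2)
z  <- sample(1:2, n, replace = TRUE)     # hypothetical cluster labels

dmat <- as.matrix(dist(X))
diag(dmat) <- Inf                        # exclude self-matches

# row i holds the indices of the q nearest neighbours of observation i
nn_idx <- t(apply(dmat, 1, function(row) order(row)[1:q]))

# number of neighbours of observation i that share its label
n_in <- rowSums(matrix(z[nn_idx], nrow = n) == z)

# unnormalised log-weight rewarding matching neighbours
log_w <- n_in * log(xi) + (q - n_in) * log(1 - xi)
table(n_in)
```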

alpha_Dirichlet

parameter of the symmetric Dirichlet prior on the mixture weights. Default is 0.05, inducing a sparse mixture. Values that are too small (i.e., lower than 0.005) may cause underflow.
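To see why a small alpha_Dirichlet induces a sparse mixture, one can draw weights from a symmetric Dirichlet by normalising independent Gamma variables. The helper `rdirichlet_sym` below is written for this illustration only and is not a function of the package.

```r
set.seed(4)
K <- 10

# symmetric Dirichlet draw via normalised Gamma variables (illustrative helper)
rdirichlet_sym <- function(K, alpha) {
  g <- rgamma(K, shape = alpha, rate = 1)
  g / sum(g)
}

w_sparse <- rdirichlet_sym(K, alpha = 0.05)  # the default value
w_flat   <- rdirichlet_sym(K, alpha = 5)     # a larger value for comparison

round(sort(w_sparse, decreasing = TRUE), 3)
round(sort(w_flat,   decreasing = TRUE), 3)
```

With alpha = 0.05 most of the mass typically falls on one or two components, while alpha = 5 spreads it more evenly. The Gamma draws become extremely close to zero as alpha shrinks, which is consistent with the underflow warning above.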

a0_d

shape parameter of the Gamma prior on d.

b0_d

rate parameter of the Gamma prior on d.
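For a single Pareto(1, d) component, a Gamma(a0_d, b0_d) prior on d is conjugate: with n ratios mu, the posterior is Gamma(a0_d + n, b0_d + sum(log(mu))). The sketch below checks this on simulated ratios in base R. In the mixture model the same update would apply within each cluster, which is an inference from the model description and not code taken from the package.

```r
set.seed(5)
a0_d   <- 1
b0_d   <- 1
d_true <- 4
n      <- 300

# inverse-CDF draws from a Pareto(1, d_true) distribution
mu <- runif(n)^(-1 / d_true)

# conjugate Gamma update
a_post <- a0_d + n
b_post <- b0_d + sum(log(mu))

post_mean <- a_post / b_post
post_ci   <- qgamma(c(0.05, 0.95), shape = a_post, rate = b_post)
c(mean = post_mean, lower = post_ci[1], upper = post_ci[2])
```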

prior_type

character, type of Gamma prior on d, can be

"Conjugate": a conjugate Gamma distribution is elicited;

"Truncated": the conjugate Gamma prior is truncated over the interval (0,D);

"Truncated_PointMass": same as "Truncated", but a point mass is placed on D, to allow the id to be identically equal to the nominal dimension.

D

integer, the maximal dimension of the dataset.

pi_mass

probability placed a priori on D when Truncated_PointMass is chosen.
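The "Truncated_PointMass" prior can be read as a two-part mixture: with probability pi_mass the id equals D exactly, otherwise it follows the Gamma prior truncated to (0, D). The base R sketch below draws from such a prior by inverse-CDF sampling. The helper `r_trunc_pm` and the value D = 5 are hypothetical and serve only to illustrate the description above.

```r
set.seed(6)
a0_d    <- 1
b0_d    <- 1
D       <- 5      # illustrative nominal dimension
pi_mass <- 0.5

# illustrative sampler: point mass on D mixed with a Gamma truncated to (0, D)
r_trunc_pm <- function(n, a, b, D, pi_mass) {
  at_D <- runif(n) < pi_mass
  u    <- runif(n) * pgamma(D, shape = a, rate = b)  # restrict the CDF to (0, D)
  d    <- qgamma(u, shape = a, rate = b)
  d[at_D] <- D
  d
}

draws <- r_trunc_pm(10000, a0_d, b0_d, D, pi_mass)
mean(draws == D)   # share of draws sitting exactly on D
range(draws)
```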

x

object of class Hidalgo, the output of the Hidalgo() function.

...

other arguments passed to specific methods.

type

character that indicates the type of plot that is requested. It can be:

"A": plot the MCMC chains and the ergodic means NOT corrected for label switching;

"B": plot the posterior mean and median of the id for each observation, after the chains are processed for label switching;

"C": plot the estimated id distributions stratified by the groups specified in the class vector.
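For readers unfamiliar with ergodic means, the quantity overlaid in a type "A" style plot is the running average of a chain. The base R sketch below computes it for a simulated autocorrelated series; `chain` is a hypothetical stand-in and not output of the sampler.

```r
set.seed(7)
# hypothetical autocorrelated chain centred around 3
chain <- as.numeric(arima.sim(model = list(ar = 0.8), n = 2000)) + 3

# ergodic (running) mean: average of all draws up to each iteration
ergodic_mean <- cumsum(chain) / seq_along(chain)

plot(chain, type = "l", col = "grey", xlab = "iteration", ylab = "value")
lines(ergodic_mean, lwd = 2)
```

A running mean that keeps drifting at the end of the run is a hint that more iterations may be needed.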

class

factor variable used to stratify the id estimates of the observations.
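The kind of stratification requested through class can be mimicked in base R with tapply() and boxplot(). The vectors `id_median` and `class_vec` below are simulated placeholders for per-observation id estimates and an external grouping factor.

```r
set.seed(8)
# hypothetical per-observation id estimates for two groups
id_median <- c(rnorm(250, mean = 3, sd = 0.2), rnorm(250, mean = 5, sd = 0.2))
class_vec <- factor(rep(c("group1", "group2"), each = 250))

# numerical summary of the estimates within each group
by_group <- tapply(id_median, class_vec, summary)
by_group

# graphical comparison of the two distributions
boxplot(id_median ~ class_vec, xlab = "class", ylab = "id estimate")
```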

object

object of class Hidalgo, the output of the Hidalgo() function.

References

Allegra M, Facco E, Denti F, Laio A, Mira A (2020). “Data segmentation based on the local intrinsic dimension.” Scientific Reports, 10(1), 1–27. ISSN 20452322, doi:10.1038/s41598-020-72222-0.

Santos-Fernandez E, Denti F, Mengersen K, Mira A (2021). “The role of intrinsic dimension in high-resolution player tracking data – Insights in basketball.” Annals of Applied Statistics - Forthcoming, ISSN 2331-8422, 2002.04148, doi:10.1038/s41598-022-20991-1.

Examples

set.seed(1234)
# 500 observations in 5 dimensions
X             <- replicate(5, rnorm(500))
# the first 250 observations vary along 3 coordinates only
X[1:250, 1:2] <- 0
X[1:250, ]    <- X[1:250, ] + 4
# true group labels, used to check the estimated ids
oracle        <- rep(1:2, rep(250, 2))
# this is just a short example
# increase the number of iterations to improve mixing and convergence
h_out         <- Hidalgo(X, nsim = 500, burn_in = 500)
plot(h_out, type = "B")
id_by_class(h_out, oracle)