COR: Correlation-Based Clustering Tree

Description

Builds a binary tree that clusters time series using their covariates. Each split is chosen to minimize the average absolute Pearson correlation between time series assigned to different child nodes.

Usage

COR(X, Y, control = list())

Value

An object of class "FACT" containing:

frame: A data frame describing the tree structure, with one row per node containing split variable, split value, test statistic, and p-value. A smaller test statistic suggests more heterogeneity between child nodes.
membership: An integer vector of length \(N\) giving the terminal node assigned to each time series.
control: The control parameters used.
terms: Metadata including covariate names and data dimensions.

Arguments

X

A numeric matrix of covariates with dimension \(N \times p\), where \(N\) is the number of time series and \(p\) is the number of features. Each row corresponds to the covariates for one time series.

Y

A numeric matrix of time series data with dimension \(T \times N\), where \(T\) is the length of each series. Each column represents one time series.

control

A list of control parameters for tree construction:

minsplit

Minimum number of observations required to attempt a split. Default: 90.

minbucket

Minimum number of observations in any terminal node. Default: 30.

alpha

Significance level for the permutation test. Default: 0.01.

R

Number of permutations for the hypothesis test. Default: 199.

parallel

Logical; if TRUE, enables parallel computation for permutation tests. Default: FALSE.

n_cores

Number of cores for parallel processing. If NULL (default), uses detectCores() - 1.
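
The argument shapes described above can be sketched as follows. This is a minimal illustration using random data; the variable names and sizes are assumptions made for the example, and the control list simply writes out the documented defaults.

```r
set.seed(1)
N <- 120      # number of time series (illustrative size)
p <- 3        # number of covariates (illustrative size)
T_len <- 100  # series length; named T_len so that T (TRUE) is not masked

X <- matrix(rnorm(N * p), nrow = N, ncol = p)          # N x p: one row per series
Y <- matrix(rnorm(T_len * N), nrow = T_len, ncol = N)  # T x N: one column per series

# Row i of X is assumed to describe column i of Y
stopifnot(nrow(X) == ncol(Y))

# Control list with every documented default written out
ctrl <- list(
  minsplit = 90, minbucket = 30, alpha = 0.01,
  R = 199, parallel = FALSE, n_cores = NULL
)

# fit <- COR(X, Y, control = ctrl)  # call as given under Usage
```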

Details

The algorithm recursively partitions the data by finding splits that minimize the average absolute correlation between time series in different child nodes. Statistical significance of each split is assessed via a permutation test.
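
To make the splitting criterion concrete, here is a minimal sketch of the average absolute cross-node correlation for one candidate split. The helper `cross_abs_cor` and the simulated data are hypothetical and written only for illustration; the package's internal computation may differ.

```r
# Hypothetical helper (not part of the package): mean absolute Pearson
# correlation between series in the left child and series in the right child.
cross_abs_cor <- function(Y, left) {
  # Y: T x N matrix of series; left: logical vector of length N
  stopifnot(is.matrix(Y), length(left) == ncol(Y), any(left), any(!left))
  C <- cor(Y[, left, drop = FALSE], Y[, !left, drop = FALSE])
  mean(abs(C))
}

set.seed(1)
T_len <- 200
f1 <- rnorm(T_len)  # common factor driving series 1 to 5
f2 <- rnorm(T_len)  # common factor driving series 6 to 10
Y <- cbind(
  sapply(1:5, function(i) f1 + rnorm(T_len, sd = 0.3)),
  sapply(1:5, function(i) f2 + rnorm(T_len, sd = 0.3))
)
x <- c(1:5, 11:15)  # a single covariate that separates the two groups

good <- cross_abs_cor(Y, x <= 5)  # split that matches the true groups
bad  <- cross_abs_cor(Y, x <= 2)  # split that cuts through the first group
```

A split that separates the two groups leaves only weakly correlated pairs across the child nodes, so `good` comes out smaller than `bad`.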

At each node, the optimal split is found by exhaustively searching over all covariates and candidate split points. The permutation test shuffles the time series labels to generate a null distribution for the test statistic.
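
The search and the permutation test can be sketched together as below. This is an illustration under stated assumptions rather than the package's implementation: `best_split` and `perm_pvalue` are hypothetical names, candidate split points are taken as the observed covariate values, and the p-value uses the common (1 + count) / (R + 1) convention, none of which is confirmed by this page.

```r
# Hypothetical sketch: exhaustive search over covariates and split points,
# keeping the split with the smallest mean absolute cross-node correlation.
best_split <- function(X, Y, minbucket = 3) {
  best <- list(stat = Inf, var = NA_integer_, value = NA_real_)
  for (j in seq_len(ncol(X))) {
    vals <- sort(unique(X[, j]))
    for (v in head(vals, -1)) {  # the largest value would leave the right child empty
      left <- X[, j] <= v
      if (min(sum(left), sum(!left)) < minbucket) next
      stat <- mean(abs(cor(Y[, left, drop = FALSE], Y[, !left, drop = FALSE])))
      if (stat < best$stat) best <- list(stat = stat, var = j, value = v)
    }
  }
  best
}

# Hypothetical sketch: shuffle which series goes with which covariate row,
# redo the search, and count how often the null statistic is as small as observed.
perm_pvalue <- function(X, Y, R = 49, minbucket = 3) {
  obs <- best_split(X, Y, minbucket)$stat
  null <- replicate(
    R, best_split(X, Y[, sample(ncol(Y)), drop = FALSE], minbucket)$stat
  )
  (1 + sum(null <= obs)) / (R + 1)
}

set.seed(1)
T_len <- 200
f1 <- rnorm(T_len)
f2 <- rnorm(T_len)
Y <- cbind(
  sapply(1:10, function(i) f1 + rnorm(T_len, sd = 0.3)),
  sapply(1:10, function(i) f2 + rnorm(T_len, sd = 0.3))
)
X <- cbind(1:20, rnorm(20))  # column 1 separates the groups, column 2 is noise

fit  <- best_split(X, Y)
pval <- perm_pvalue(X, Y, R = 49)
```

Because smaller values of the statistic indicate a better separation, the p-value here counts permuted statistics that are less than or equal to the observed one.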

Examples

# Generate synthetic data
data <- gendata(seed = 42, T = 100, N = c(50, 50, 50, 50))

# Build correlation-based tree
result <- COR(data$X, data$Y, control = list(R = 99, alpha = 0.05))

# Examine results
print(result)
plot(result)
table(result$membership, data$group)