TopoCBN: Topological Clustering using Betti Numbers

Description

The function performs unsupervised clustering of multivariate data based on topological data analysis (TDA). The objective is to partition the data into non-overlapping clusters, where the definition of a cluster falls under the general framework of density-based clustering (e.g., DBSCAN and OPTICS). Intuitively, a cluster is a path-connected subset of points: any point in the subset can be reached from any other through a path of points that also belong to the subset, such that consecutive points on the path are close to each other and their local neighborhoods are similar in shape (Islambekov and Gel, 2019).

To compare shapes, TopoCBN builds a Vietoris-Rips (VR) filtration on the neighborhood of each point and uses persistent homology to compute topological summaries in the form of Betti sequences. The closer the Betti sequences of two nearby points, the more similar the shapes of their neighborhoods. Thus, when identifying clusters, TopoCBN uses both the distance function and the local geometric information around the points. Accounting for shape similarity can be viewed as an extension of the conventional clustering properties of the density-based framework.
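To make the idea of a Betti sequence concrete, the following is a minimal base-R sketch, not the package's implementation. It computes only the zeroth Betti numbers (connected-component counts) of a VR filtration over one k-nearest-neighbor neighborhood, using the fact that single-linkage merge heights mark the scales at which components join. The helper name betti0_sequence, the convention that an edge appears once the pairwise distance is at most the scale, and the L1 comparison at the end are all assumptions made for illustration.

```r
# Betti-0 sequence of a Vietoris-Rips filtration over the neighborhood of
# point i (hypothetical helper for illustration; not part of funtimes).
betti0_sequence <- function(X, i, k, scales) {
  D <- as.matrix(dist(X))
  nbrs <- order(D[i, ])[seq_len(k + 1)]  # point i plus its k nearest neighbors
  hc <- hclust(as.dist(D[nbrs, nbrs]), method = "single")
  # Components at scale eps = points minus merges that occurred at or below eps
  vapply(scales, function(eps) length(nbrs) - sum(hc$height <= eps), numeric(1))
}

set.seed(1)
# Two well-separated groups of 20 points in the plane
X <- rbind(matrix(rnorm(40, sd = 0.1), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.1), ncol = 2))
scales <- seq(0, max(dist(X)), length.out = 25)  # 25 filtration steps

b1 <- betti0_sequence(X, 1, 10, scales)
b2 <- betti0_sequence(X, 2, 10, scales)

# One possible way to compare two sequences (assumed here): L1 distance;
# smaller values indicate more similar neighborhood shapes.
sum(abs(b1 - b2))
```

Each sequence starts at k + 1 components (every point isolated) and decreases to 1 once the scale is large enough to connect the whole neighborhood.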

Usage

TopoCBN(data, nKNN, filt_len = 25, dist_matrix = FALSE)

Arguments

data

a point cloud given as an N by d matrix, where N is the number of points and d is the dimension of the Euclidean space, or an N by N matrix of pairwise distances.

nKNN

number of nearest neighbors (k) to use around each point.

filt_len

filtration length (also length of Betti sequences). Default is 25.

dist_matrix

logical; FALSE by default, which assumes data is a point cloud. Set dist_matrix = TRUE if data is a matrix of pairwise distances.

Value

A list with the following components:

assignments

cluster labels (vector of length N).

nClust

number of clusters.

cSize

cluster sizes (vector of length nClust).
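As a rough illustration of how the three components relate, the sketch below uses a hand-made stand-in for the returned list (hypothetical values, not real TopoCBN output). It assumes that labels are the integers 1 to nClust, which is consistent with how the examples index a color palette by assignments, and that cSize is ordered by label, which the documentation does not state.

```r
# Hand-made stand-in for a TopoCBN() result (hypothetical values, not real output)
result <- list(assignments = c(1, 1, 2, 3, 2, 1),
               nClust = 3,
               cSize = c(3, 2, 1))

# Row indices of the points that belong to each cluster
members <- split(seq_along(result$assignments), result$assignments)

# Recompute the summary fields directly from the labels
n_clusters <- length(unique(result$assignments))
sizes <- tabulate(result$assignments, nbins = n_clusters)

# Under the stated assumptions, these agree with nClust and cSize
n_clusters == result$nClust
all(sizes == result$cSize)
```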

References

Examples

# NOT RUN {
#Example 1:
#Import a dataset with the current Covid-19 figures for each US state:
data<-covid19us::get_states_current()
#For this example, keep only the columns with positive cases and deaths:
data<-data[c(3,9)]
#Replace NA values with 0:
data[is.na(data)] <- 0

#Now run TopoCBN:
result <- TopoCBN(data,nKNN=12) # can also try with filt_len=50,75,100

#We can obtain the same results using matrix of pairwise distances:
dMatrix <- as.matrix(dist(data))
result <- TopoCBN(dMatrix,nKNN=12,dist_matrix = TRUE)

#Let's plot the results:
set.seed(365)
distinct_clrs=randomcoloR::distinctColorPalette(result$nClust)
clrs<-distinct_clrs[result$assignments] # distinct colors for clusters
plot(data,col=clrs,pch=20,xlab='x',ylab='y',main = 'TopoCBN') 
print(result)

#In the run documented here, TopoCBN identified 6 clusters; the count depends on the
#data retrieved by get_states_current().

#Example 2:
#Import a dataset on air quality in Californian metropolitan areas. The three columns
#used here contain an indicator of air quality (the lower the better), the value added
#of companies (in thousands of dollars), and the amount of rain (in inches).

data<-as.matrix(Ecdat::Airq[1:3])

#Now apply TopoCBN function to the air quality data:
result <- TopoCBN(data,nKNN=12) # can also try with filt_len=50,75,100

#The same results can be obtained using matrix of pairwise distances:
dMatrix <- as.matrix(dist(data))
result <- TopoCBN(dMatrix,nKNN=12,dist_matrix = TRUE)

#Plot the results:
set.seed(365)
distinct_clrs=randomcoloR::distinctColorPalette(result$nClust)
clrs<-distinct_clrs[result$assignments] # distinct colors for clusters
plot(data,col=clrs,pch=20,xlab='x',ylab='y',main = 'TopoCBN') 
print(result)

#TopoCBN identified 4 clusters in this dataset, of sizes 1, 3, 5, and 21.
#These results suggest that companies with added values under $5,000 may
#have any level of air pollution, whereas companies with higher added values (>$5,000)
#are associated with markedly higher (worse) levels of air pollution.
# }