ssmf: Simplex-structured matrix factorisation algorithm (SSMF).

Description

This function implements SSMF on a data matrix or data frame.

Usage

ssmf(
  data,
  k,
  H = NULL,
  W = NULL,
  meth = c("kmeans", "uniform", "dirichlet", "nmf"),
  lr = 0.01,
  nruns = 50
)

Value

W The optimised $W$ matrix, containing the values of prototypes.

H The optimised $H$ matrix, containing the values of soft memberships.

SSE The residual sum of squares.
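
Because each row of $H$ holds soft memberships, a hard cluster label can be derived, if one is needed, by taking the prototype with the largest membership in each row. The following is a minimal sketch in base R; the matrix `H` here is a hypothetical stand-in for the `H` component of a fitted object, not output from `ssmf()`.

```r
# Hypothetical soft membership matrix (3 observations, 3 prototypes).
# Each row is non-negative and sums to 1, as SSMF requires.
H <- rbind(c(0.7, 0.2, 0.1),
           c(0.1, 0.1, 0.8),
           c(0.3, 0.4, 0.3))

# Hard label: index of the largest membership in each row.
# ties.method = "first" makes the result deterministic when rows have ties.
labels <- max.col(H, ties.method = "first")

# Largest membership per row, a rough indication of how "pure" each assignment is.
max_membership <- apply(H, 1, max)
```

Observations whose largest membership is well below 1 lie between prototypes, so a hard label discards information for them.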

Arguments

data: Data matrix or data frame.
k: The number of prototypes/clusters.
H: Matrix, user-supplied $H$ matrix to start the algorithm. If NULL (the default), the function initialises $H$ automatically.
W: Matrix, user-supplied $W$ matrix to start the algorithm. If NULL (the default), the function initialises $W$ automatically.
meth: Method used to initialise the $W$ and $H$ matrices, see 'method' in init().
lr: Optimisation learning rate.
nruns: The maximum number of times the algorithm is run.

Author

Wenxuan Liu

Details

Let $X \in \mathbb{R}^{n \times p}$ be the data set with $n$ observations and $p$ variables. Given an integer $k \ll \min(n,p)$, simplex-structured matrix factorisation (SSMF) performs soft clustering: it partitions the observations into $k$ fuzzy clusters such that the sum of squares from the observations to their assigned cluster prototypes is minimised. SSMF finds $H_{n \times k}$ and $W_{k \times p}$ such that $$X \approx HW.$$

A cluster prototype is a vector that represents the characteristics of a particular cluster, denoted by $w_r \in \mathbb{R}^{p}$, where $r$ indexes the $r^{th}$ cluster. A cluster membership vector $h_i \in \mathbb{R}^{k}$ describes the proportions of the cluster prototypes that make up the $i^{th}$ observation. $W$ is the prototype matrix, where each row is a cluster prototype, and $H$ is the soft membership matrix, where each row gives the soft cluster membership of one observation.

The approximate matrix factorisation is found by minimising the residual sum of squares (RSS, returned as SSE), that is $$\mathrm{RSS} = \| X-HW \|^2 = \sum_{i=1}^{n}\sum_{j=1}^{p} \left\{ X_{ij}-(HW)_{ij}\right\}^2,$$ subject to $\sum_{r=1}^k h_{ir}=1$ and $h_{ir}\geq 0$.
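
To make the objective and its constraints concrete, the sketch below builds a toy $H$ and $W$ in base R, evaluates the residual sum of squares, and checks that the rows of $H$ lie on the simplex. All values are made up for illustration. This only evaluates the criterion and is not the optimisation carried out by `ssmf()`.

```r
# Toy illustration of the SSMF objective (assumed values, base R only).
set.seed(1)
n <- 6  # observations
p <- 3  # variables
k <- 2  # prototypes

# Hypothetical prototype matrix W (k x p): one prototype per row.
W <- matrix(c(0, 0, 0,
              1, 2, 3), nrow = k, byrow = TRUE)

# Hypothetical membership matrix H (n x k): non-negative rows rescaled to sum to 1.
H <- matrix(runif(n * k), nrow = n)
H <- H / rowSums(H)

# Data generated as X = HW plus a small amount of noise.
X <- H %*% W + matrix(rnorm(n * p, sd = 0.01), nrow = n)

# Residual sum of squares, ||X - HW||^2.
rss <- sum((X - H %*% W)^2)

# Simplex constraints: h_ir >= 0 and each row of H sums to 1.
simplex_ok <- all(H >= 0) && isTRUE(all.equal(rowSums(H), rep(1, n)))
```

Each row of `H %*% W` is a convex combination of the rows of `W`, which is why every fitted observation lies inside the simplex spanned by the prototypes.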

References

Abdolali, Maryam & Gillis, Nicolas. (2020). Simplex-Structured Matrix Factorization: Sparsity-based Identifiability and Provably Correct Algorithms. <doi:10.1137/20M1354982>

Examples

# \donttest{
library(MetabolSSMF)

# Initialisation by user
data <- SimulatedDataset
k <- 4

## Initialised by kmeans
fit.km <- kmeans(data, centers = k)

H <- mclust::unmap(fit.km$cluster)
W <- fit.km$centers

fit1 <- ssmf(data, k = k, H = H) # start the algorithm from H
fit2 <- ssmf(data, k = k, W = W) # start the algorithm from W

# Initialisation inside the function
fit3 <- ssmf(data, k = 4, meth = 'dirichlet')
fit4 <- ssmf(data, k = 4)
# }
