Learn R Programming

RcppPlanc (version 2.0.13)

onlineINMF: Perform Integrative Non-negative Matrix Factorization Using Online Learning

Description

Performs integrative non-negative matrix factorization (iNMF) (J.D. Welch, 2019, C. Gao, 2021) using online learning approach to return factorized \(H\), \(W\), and \(V\) matrices. The objective function is stated as

$$\arg\min_{H\ge0,W\ge0,V\ge0}\sum_{i}^{d}||E_i-(W+V_i)Hi||^2_F+ \lambda\sum_{i}^{d}||V_iH_i||_F^2$$

where \(E_i\) is the input non-negative matrix of the \(i\)'th dataset, \(d\) is the total number of datasets. \(E_i\) is of size \(m \times n_i\) for \(m\) features and \(n_i\) sample points, \(H_i\) is of size \(k \times n_i\), \(V_i\) is of size \(m \times k\), and \(W\) is of size \(m \times k\).

Different from inmf which optimizes the objective with ANLS approach, onlineINMF optimizes the same objective with online learning strategy, where it updates mini-batches of \(H_i\) solving the NNLS problem, and updates \(V_i\) and \(W\) with HALS multiplicative method.

This function allows online learning in 3 scenarios:

  1. Fully observed datasets;

  2. Iterative refinement using continually arriving datasets;

  3. Projection of new datasets without updating the existing factorization

Usage

onlineINMF(
  objectList,
  newDatasets = NULL,
  project = FALSE,
  k = 20,
  lambda = 5,
  maxEpoch = 5,
  minibatchSize = 5000,
  maxHALSIter = 1,
  permuteChunkSize = 1000,
  nCores = 2,
  Hinit = NULL,
  Vinit = NULL,
  Winit = NULL,
  Ainit = NULL,
  Binit = NULL,
  verbose = FALSE
)

Value

A list of the following elements:

  • H - a list of result \(H_i\) matrices of size \(n_i \times k\)

  • V - a list of result \(V_i\) matrices

  • W - the result \(W\) matrix

  • A - a list of result \(A_i\) matrices, \(k \times k\)

  • B - a list of result \(B_i\) matrices, \(m \times k\)

  • objErr - the final objective error value.

Arguments

objectList

list of input datasets. List elements should all be of the same class. Viable classes include: matrix, dgCMatrix, H5Mat, H5SpMat.

newDatasets

Same requirements as for new arriving datasets. Default NULL for scenario 1, specify for scenario 2 or 3.

project

Logical scalar, whether to run scenario 3. See description. Default FALSE.

k

Integer. Inner dimensionality to factorize the datasets into. Default 20.

lambda

Regularization parameter. Larger values penalize dataset-specific effects more strongly (i.e. alignment should increase as lambda increases). Default 5.

maxEpoch

The number of epochs to iterate through. Default 5.

minibatchSize

Total number of cells in each mini-batch. Default 5000.

maxHALSIter

Maximum number of block coordinate descent (HALS algorithm) iterations to perform for each update of \(W\) and \(V\). Default 1. Changing this parameter is not recommended.

permuteChunkSize

Number of cells in a chunk being shuffled before subsetting to minibatches. Only appliable to in-memory data and for Scenario 1 and 2. Default 1000.

nCores

The number of parallel tasks that will be spawned. Default 2

Hinit, Vinit, Winit, Ainit, Binit

Pass the previous factorization result for datasets existing in objectList, in order to run scenario 2 or 3. All should have length(objectList) matrices inside. See description for dimensionality of \(H_i\), \(V_i\) and \(W_i\). \(A_i\) should be of size \(k \times k\) and \(B_i\) should be of size \(m \times k\)

verbose

Logical scalar. Whether to show information and progress. Default FALSE.

Author

Yichen Wang

References

Joshua D. Welch and et al., Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity, Cell, 2019

Chao Gao and et al., Iterative single-cell multi-omic integration using online learning, Nat Biotechnol., 2021

Examples

Run this code
library(Matrix)

# Scenario 1 with sparse matrices
set.seed(1)
res1 <- onlineINMF(list(ctrl.sparse, stim.sparse),
                   minibatchSize = 50, k = 10, verbose = FALSE)

# Scenario 2 with H5 dense matrices
h5dense1 <- H5Mat(filename = system.file("extdata", "ctrl_dense.h5",
                             package = "RcppPlanc", mustWork = TRUE),
                                          dataPath = "scaleData")
h5dense2 <- H5Mat(filename = system.file("extdata", "stim_dense.h5",
                             package = "RcppPlanc", mustWork = TRUE),
                                          dataPath = "scaleData")
res2 <- onlineINMF(list(ctrl = h5dense1), minibatchSize = 50, k = 10, verbose = FALSE)
res3 <- onlineINMF(list(ctrl = h5dense1),
                   newDatasets = list(stim = h5dense2),
                   Hinit = res2$H, Vinit = res2$V, Winit = res2$W,
                   Ainit = res2$A, Binit = res2$B,
                   minibatchSize = 50, k = 10, verbose = FALSE)

# Scenario 3 with H5 sparse matrices
h5sparse1 <- H5SpMat(filename = system.file("extdata", "ctrl_sparse.h5",
                                package = "RcppPlanc", mustWork = TRUE),
                                valuePath = "scaleDataSparse/data",
                                rowindPath = "scaleDataSparse/indices",
                                colptrPath = "scaleDataSparse/indptr",
                                nrow = nrow(ctrl.sparse),
                                ncol = ncol(ctrl.sparse))
h5sparse2 <- H5SpMat(filename = system.file("extdata", "stim_sparse.h5",
                                package = "RcppPlanc", mustWork = TRUE),
                                valuePath = "scaleDataSparse/data",
                                rowindPath = "scaleDataSparse/indices",
                                colptrPath = "scaleDataSparse/indptr",
                                nrow = nrow(stim.sparse),
                                ncol = nrow(stim.sparse))
res4 <- onlineINMF(list(ctrl = h5sparse1), minibatchSize = 50, k = 10, verbose = FALSE)
res5 <- onlineINMF(list(ctrl = h5sparse1),
                   newDatasets = list(stim = h5sparse2), project = TRUE,
                   Hinit = res4$H, Vinit = res4$V, Winit = res4$W,
                   Ainit = res4$A, Binit = res4$B,
                   minibatchSize = 50, k = 10, verbose = FALSE)

Run the code above in your browser using DataLab