nmf: Running NMF algorithms

Description

The function nmf is a S4 generic defines the main interface to run NMF algorithms within the framework defined in package NMF. It has many methods that facilitates applying, developing and testing NMF algorithms.

The package vignette vignette('NMF') contains an introduction to the interface, through a sample data analysis.

Usage

nmf(x, rank, method, ...)
  "nmf"(x, rank, method, seed = NULL, model = NULL, ...)
  "nmf"(x, rank, method, ..., .parameters = list())
  "nmf"(x, rank, method, seed, model = "NMFstd", ..., name, objective = "euclidean", mixed = FALSE)
  "nmf"(x, rank, method, seed, ...)
  "nmf"(x, rank, method, seed, ...)
  "nmf"(x, rank, method, seed, model = list(), ...)
  "nmf"(x, rank, method, ..., model = NULL)
  "nmf"(x, rank, method, seed = nmf.getOption("default.seed"), rng = NULL, nrun = if (length(rank) > 1) 30 else 1, model = NULL, .options = list(), .pbackend = nmf.getOption("pbackend"), .callback = NULL, ...)

Arguments

target data to fit, i.e. a matrix-like object

rank

specification of the factorization rank. It is usually a single numeric value, but other type of values are possible (e.g. matrix), for which specific methods are implemented. See for example methods nmf,matrix,matrix,ANY.

If rank is a numeric vector with more than one element, e.g. a range of ranks, then nmf performs the estimation procedure described in nmfEstimateRank.

method

specification of the NMF algorithm. The most common way of specifying the algorithm is to pass the access key (i.e. a character string) of an algorithm stored in the package's dedicated registry, but methods exists that handle other types of values, such as function or list object. See their descriptions in section Methods.

If method is missing the algorithm to use is obtained from the option nmf.getOption('default.algorithm'), unless it can be infer from the type of NMF model to fit, if this later is available from other arguments. Factory fresh default value is ‘brunet’, which corresponds to the standard NMF algorithm from Brunet2004 (see section Algorithms).

Cases where the algorithm is inferred from the call are when an NMF model is passed in arguments rank or seed (see description for nmf,matrix,numeric,NULL in section Methods).

...

extra arguments to allow extension of the generic. Arguments that are not used in the chain of internal calls to nmf methods are passed to the function that effectively implements the algorithm that fits an NMF model on x.

.parameters

list of method-specific parameters. Its elements must have names matching a single method listed in method, and be lists of named values that are passed to the corresponding method.

name

name associated with the NMF algorithm implemented by the function method [only used when method is a function].

objective

specification of the objective function associated with the algorithm implemented by the function method [only used when method is a function].

It may be either 'euclidean' or 'KL' for specifying the euclidean distance (Frobenius norm) or the Kullback-Leibler divergence respectively, or a function with signature (x="NMF", y="matrix", ...) that computes the objective value for an NMF model x on a target matrix y, i.e. the residuals between the target matrix and its NMF estimate. Any extra argument may be specified, e.g. function(x, y, alpha, beta=2, ...).

mixed

a logical that indicates if the algorithm implemented by the function method support mixed-sign target matrices, i.e. that may contain negative values [only used when method is a function].

seed

specification of the starting point or seeding method, which will compute a starting point, usually using data from the target matrix in order to provide a good guess.

The seeding method may be specified in the following way:

a character string:

giving the name of a registered seeding method. The corresponding method will be called to compute the starting point.

Available methods can be listed via nmfSeed(). See its dedicated documentation for details on each available registered methods (nmfSeed).

a list:

giving the name of a registered seeding method and, optionally, extra parameters to pass to it.

a single numeric:

that is used to seed the random number generator, before generating a random starting point.

Note that when performing multiple runs, the L'Ecuyer's RNG is used in order to produce a sequence of random streams, that is used in way that ensures that parallel computation are fully reproducible.

an object that inherits from NMF:

it should contain the data of an initialised NMF model, i.e. it must contain valid basis and mixture coefficient matrices, directly usable by the algorithm's workhorse function.

a function:

that computes the starting point. It must have signature

(object="NMF",
  target="matrix", ...)

and return an object that inherits from class NMF. It is recommended to use argument object as a template for the returned object, by only updating the basis and coefficient matrices, using basis<- and coef<- respectively.

rng

rng specification for the run(s). This argument should be used to set the the RNG seed, while still specifying the seeding method argument seed.

model

specification of the type of NMF model to use.

It is used to instantiate the object that inherits from class NMF, that will be passed to the seeding method. The following values are supported:

NULL, the default model associated to the NMF algorithm is instantiated and ... is looked-up for arguments with names that correspond to slots in the model class, which are passed to the function nmfModel to instantiate the model. Arguments in ... that do not correspond to slots are passed to the algorithm.

a single character string, that is the name of the NMF model class to be instantiate. In this case, arguments in ... are handled in the same way as when model is NULL.

a list that contains named values that are passed to the function nmfModel to instantiate the model. In this case, ... is not looked-up at all, and passed entirely to the algorithm. This means that all necessary model parameters must be specified in model.

Argument/slot conflicts: In the case a parameter of the algorithm has the same name as a model slot, then model MUST be a list -- possibly empty --, if one wants this parameter to be effectively passed to the algorithm.

If a variable appears in both arguments model and ..., the former will be used to initialise the NMF model, the latter will be passed to the NMF algorithm. See code examples for an illustration of this situation.

nrun

number of runs to perform. It specifies the number of runs to perform. By default only one run is performed, except if rank is a numeric vector with more than one element, in which case a default of 30 runs per value of the rank are performed, allowing the computation of a consensus matrix that is used in selecting the appropriate rank (see consensus).

When using a random seeding method, multiple runs are generally required to achieve stability and avoid bad local minima.

.options

this argument is used to set runtime options.

It can be a list containing named options with their values, or, in the case only boolean/integer options need to be set, a character string that specifies which options are turned on/off or their value, in a unix-like command line argument way.

The string must be composed of characters that correspond to a given option (see mapping below), and modifiers '+' and '-' that toggle options on and off respectively. E.g. .options='tv' will toggle on options track and verbose, while .options='t-v' will toggle on option track and toggle off option verbose.

Modifiers '+' and '-' apply to all option character found after them: t-vp+k means track=TRUE, verbose=parallel=FALSE, and keep.all=TRUE. The default behaviour is to assume that .options starts with a '+'.

for options that accept integer values, the value may be appended to the option's character e.g. 'p4' for asking for 4 processors or 'v3' for showing verbosity message up to level 3.

The following options are available (the characters after “-” are those to use to encode .options as a string):

debug - d

Toggle debug mode (default: FALSE). Like option verbose but with more information displayed.

keep.all - k

used when performing multiple runs (nrun>1): if TRUE, all factorizations are saved and returned (default: FALSE). Otherwise only the factorization achieving the minimum residuals is returned.

parallel - p

this option is useful on multicore *nix or Mac machine only, when performing multiple runs (nrun > 1) (default: TRUE). If TRUE, the runs are performed using the parallel foreach backend defined in argument .pbackend. If this is set to 'mc' or 'par' then nmf tries to perform the runs using multiple cores with package link[doParallel]{doParallel} -- which therefore needs to be installed.

If equal to an integer, then nmf tries to perform the computation on the specified number of processors. When passing options as a string the number is appended to the option's character e.g. 'p4' for asking for 4 processors.

If FALSE, then the computation is performed sequentially using the base function sapply.

Unlike option 'P' (capital 'P'), if the computation cannot be performed in parallel, then it will still be carried on sequentially.

IMPORTANT NOTE FOR MAC OS X USERS: The parallel computation is based on the doMC and multicore packages, so the same care should be taken as stated in the vignette of doMC: “it is not safe to use doMC from R.app on Mac OS X. Instead, you should use doMC from a terminal session, starting R from the command line.”

parallel.required - P

Same as p, but an error is thrown if the computation cannot be performed in parallel or with the specified number of processors.

shared.memory - m

toggle usage of shared memory (requires the synchronicity package). Default is as defined by nmf.getOption('shared.memory').

restore.seed - r

deprecated option since version 0.5.99. Will throw a warning if used.

simplifyCB - S

toggle simplification of the callback results. Default is TRUE

track - t

enables error tracking (default: FALSE). If TRUE, the returned object's slot residuals contains the trajectory of the objective values, which can be retrieved via

residuals(res,
  track=TRUE)

This tracking functionality is available for all built-in algorithms.

verbose - v

Toggle verbosity (default: FALSE). If TRUE, messages about the configuration and the state of the current run(s) are displayed. The level of verbosity may be specified with an integer value, the greater the level the more messages are displayed. Value FALSE means no messages are displayed, while value TRUE is equivalent to verbosity level 1.

.pbackend

specification of the foreach parallel backend to register and/or use when running in parallel mode. See options p and P in argument .options for how to enable this mode. Note that any backend that is internally registered is cleaned-up on exit, so that the calling foreach environment should not be affected by a call to nmf -- except when .pbackend=NULL.

Currently it accepts the following values:

‘par’

use the backend(s) defined by the package doParallel;

a numeric value

use the specified number of cores with doParallel backend;

‘seq’

use the foreach sequential backend doSEQ;

NULL

use currently registered backend;

NA

do not compute using a foreach loop -- and therefore not in parallel -- but rather use a call to standard sapply. This is useful for when developing/debugging NMF algorithms, as foreach loop handling may sometime get in the way.

Note that this is equivalent to using .options='-p' or .options='p0', but takes precedence over any option specified in .options: e.g. nmf(..., .options='P10', .pbackend=NA) performs all runs sequentially using sapply. Use nmf.options(pbackend=NA) to completely disable foreach/parallel computations for all subsequent nmf calls.

‘mc’

identical to ‘par’ and defined to ensure backward compatibility.

.callback

Used when option keep.all=FALSE (default). It allows to pass a callback function that is called after each run when performing multiple runs (i.e. with nrun>1). This is useful for example if one is also interested in saving summary measures or process the result of each NMF fit before it gets discarded. After each run, the callback function is called with two arguments, the NMFfit object that as just been fitted and the run number:

.callback(res,
  i)

. For convenience, a function that takes only one argument or has signature (x, ...) can still be passed in .callback. It is wrapped internally into a dummy function with two arguments, only the first of which is passed to the actual callback function (see example with summary).

The call is wrapped into a tryCatch so that callback errors do not stop the whole computation (see below).

The results of the different calls to the callback function are stored in a miscellaneous slot accessible using the method $ for NMFfit objects: res$.callback. By default nmf tries to simplify the list of callback result using sapply, unless option 'simplifyCB' is FASE.

If no error occurs res$.callback contains the list of values that resulted from the calling the callback function --, ordered as the fits. If any error occurs in one of the callback calls, then the whole computation is not stopped, but the error message is stored in res$.callback, in place of the result.

See the examples for sample code.

Value

Single run:: An object of class NMFfit.
Multiple runs, single method:: When nrun > 1 and method is not list, this method returns an object of class NMFfitX.
Multiple runs, multiple methods:: When nrun > 1 and method is a list, this method returns an object of class NMFList.

Methods

nmf: signature(x = "data.frame", rank = "ANY", method = "ANY"): Fits an NMF model on a data.frame. The target data.frame is coerced into a matrix with as.matrix.
nmf: signature(x = "matrix", rank = "numeric", method = "NULL"): Fits an NMF model using an appropriate algorithm when method is not supplied. This method tries to select an appropriate algorithm amongst the NMF algorithms stored in the internal algorithm registry, which contains the type of NMF models each algorithm can fit. This is possible when the type of NMF model to fit is available from argument seed, i.e. if it is an NMF model itself. Otherwise the algorithm to use is obtained from nmf.getOption('default.algorithm'). This method is provided for internal usage, when called from other nmf methods with argument method missing in the top call (e.g. nmf,matrix,numeric,missing).
nmf: signature(x = "matrix", rank = "numeric", method = "list"): Fits multiple NMF models on a common matrix using a list of algorithms. The models are fitted sequentially with nmf using the same options and parameters for all algorithms. In particular, irrespective of the way the computation is seeded, this method ensures that all fits are performed using the same initial RNG settings. This method returns an object of class NMFList, that is essentially a list containing each fit.
nmf: signature(x = "matrix", rank = "numeric", method = "character"): Fits an NMF model on x using an algorithm registered with access key method. Argument method is partially match against the access keys of all registered algorithms (case insensitive). Available algorithms are listed in section Algorithms below or the introduction vignette. A vector of their names may be retrieved via nmfAlgorithm().
nmf: signature(x = "matrix", rank = "numeric", method = "function"): Fits an NMF model on x using a custom algorithm defined the function method. The supplied function must have signature (x=matrix, start=NMF, ...) and return an object that inherits from class NMF. It will be called internally by the workhorse nmf method, with an NMF model to be used as a starting point passed in its argument start. Extra arguments in ... are passed to method from the top nmf call. Extra arguments that have no default value in the definition of the function method are required to run the algorithm (e.g. see argument alpha of myfun in the examples). If the algorithm requires a specific type of NMF model, this can be specified in argument model that is handled as in the workhorse nmf method (see description for this argument).
nmf: signature(x = "matrix", rank = "NMF", method = "ANY"): Fits an NMF model using the NMF model rank to seed the computation, i.e. as a starting point. This method is provided for convenience as a shortcut for nmf(x, nbasis(object), method, seed=object, ...) It discards any value passed in argument seed and uses the NMF model passed in rank instead. It throws a warning if argument seed not missing. If method is missing, this method will call the method nmf,matrix,numeric,NULL, which will infer an algorithm suitable for fitting an NMF model of the class of rank.
nmf: signature(x = "matrix", rank = "NULL", method = "ANY"): Fits an NMF model using the NMF model supplied in seed, to seed the computation, i.e. as a starting point. This method is provided for completeness and is equivalent to nmf(x, seed, method, ...).
nmf: signature(x = "matrix", rank = "missing", method = "ANY"): Method defined to ensure the correct dispatch to workhorse methods in case of argument rank is missing.
nmf: signature(x = "matrix", rank = "numeric", method = "missing"): Method defined to ensure the correct dispatch to workhorse methods in case of argument method is missing.
nmf: signature(x = "matrix", rank = "matrix", method = "ANY"): Fits an NMF model partially seeding the computation with a given matrix passed in rank. The matrix rank is used either as initial value for the basis or mixture coefficient matrix, depending on its dimension. Currently, such partial NMF model is directly used as a seed, meaning that the remaining part is left uninitialised, which is not accepted by all NMF algorithm. This should change in the future, where the missing part of the model will be drawn from some random distribution. Amongst built-in algorithms, only ‘snmf/l’ and ‘snmf/r’ support partial seeds, with only the coefficient or basis matrix initialised respectively.
nmf: signature(x = "matrix", rank = "data.frame", method = "ANY"): Shortcut for nmf(x, as.matrix(rank), method, ...).
nmf: signature(x = "formula", rank = "ANY", method = "ANY"): This method implements the interface for fitting formula-based NMF models. See nmfModel. Argument rank target matrix or formula environment. If not missing, model must be a list, a data.frame or an environment in which formula variables are searched for.

Optimized C++ vs. plain R

Lee and Seung's multiplicative updates are used by several NMF algorithms. To improve speed and memory usage, a C++ implementation of the specific matrix products is used whenever possible. It directly computes the updates for each entry in the updated matrix, instead of using multiple standard matrix multiplication. The algorithms that benefit from this optimization are: 'brunet', 'lee', 'nsNMF' and 'offset'. However there still exists plain R versions for these methods, which implement the updates as standard matrix products. These are accessible by adding the prefix '.R#' to their name: '.R#brunet', '.R#lee', '.R#nsNMF' and '.R#offset'.

Algorithms

All algorithms are accessible by their respective access key as listed below. The following algorithms are available:

‘brunet’: Standard NMF, based on the Kullback-Leibler divergence, from Brunet et al. (2004). It uses simple multiplicative updates from Lee et al. (2001), enhanced to avoid numerical underflow. Default stopping criterion: invariance of the connectivity matrix (see nmf.stop.connectivity).
‘lee’: Standard NMF based on the Euclidean distance from Lee et al. (2001). It uses simple multiplicative updates. Default stopping criterion: invariance of the connectivity matrix (see nmf.stop.connectivity).
ls-nmf: Least-Square NMF from Wang et al. (2006). It uses modified versions of Lee and Seung's multiplicative updates for the Euclidean distance, which incorporates weights on each entry of the target matrix, e.g. to reflect measurement uncertainty. Default stopping criterion: stationarity of the objective function (see nmf.stop.stationary).
‘nsNMF’: Nonsmooth NMF from Pascual-Montano et al. (2006). It uses a modified version of Lee and Seung's multiplicative updates for the Kullback-Leibler divergence Lee et al. (2001), to fit a extension of the standard NMF model, that includes an intermediate smoothing matrix, meant meant to produce sparser factors. Default stopping criterion: invariance of the connectivity matrix (see nmf.stop.connectivity).
‘offset’: NMF with offset from Badea (2008). It uses a modified version of Lee and Seung's multiplicative updates for Euclidean distance Lee et al. (2001), to fit an NMF model that includes an intercept, meant to capture a common baseline and shared patterns, in order to produce cleaner basis components. Default stopping criterion: invariance of the connectivity matrix (see nmf.stop.connectivity).
‘pe-nmf’: Pattern-Expression NMF from Zhang2008. It uses multiplicative updates to minimize an objective function based on the Euclidean distance, that is regularized for effective expression of patterns with basis vectors. Default stopping criterion: stationarity of the objective function (see nmf.stop.stationary).
‘snmf/r’, ‘snmf/l’: Alternating Least Square (ALS) approach from Kim et al. (2007). It applies the nonnegative least-squares algorithm from Van Benthem et al. (2004) (i.e. fast combinatorial nonnegative least-squares for multiple right-hand), to estimate the basis and coefficient matrices alternatively (see fcnnls). It minimises an Euclidean-based objective function, that is regularized to favour sparse basis matrices (for ‘snmf/l’) or sparse coefficient matrices (for ‘snmf/r’). Stopping criterion: built-in within the internal workhorse function nmf_snmf, based on the KKT optimality conditions.

Seeding methods

The purpose of seeding methods is to compute initial values for the factor matrices in a given NMF model. This initial guess will be used as a starting point by the chosen NMF algorithm. The seeding method to use in combination with the algorithm can be passed to interface nmf through argument seed. The seeding seeding methods available in registry are listed by the function nmfSeed (see list therein). Detailed examples of how to specify the seeding method and its parameters can be found in the Examples section of this man page and in the package's vignette.

Details

The nmf function has multiple methods that compose a very flexible interface allowing to:

combine NMF algorithms with seeding methods and/or stopping/convergence criterion at runtime;
perform multiple NMF runs, which are computed in parallel whenever the host machine allows it;
run multiple algorithms with a common set of parameters, ensuring a consistent environment (notably the RNG settings).

The workhorse method is nmf,matrix,numeric,NMFStrategy, which is eventually called by all other methods. The other methods provides convenient ways of specifying the NMF algorithm(s), the factorization rank, or the seed to be used. Some allow to directly run NMF algorithms on different types of objects, such as data.frame or ExpressionSet objects.

References

Brunet J, Tamayo P, Golub TR and Mesirov JP (2004). "Metagenes and molecular pattern discovery using matrix factorization." _Proceedings of the National Academy of Sciences of the United States of America_, *101*(12), pp. 4164-9. ISSN 0027-8424, , .

Lee DD and Seung H (2001). "Algorithms for non-negative matrix factorization." _Advances in neural information processing systems_. .

Wang G, Kossenkov AV and Ochs MF (2006). "LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates." _BMC bioinformatics_, *7*, pp. 175. ISSN 1471-2105, , .

Pascual-Montano A, Carazo JM, Kochi K, Lehmann D and Pascual-marqui RD (2006). "Nonsmooth nonnegative matrix factorization (nsNMF)." _IEEE Trans. Pattern Anal. Mach. Intell_, *28*, pp. 403-415.

Badea L (2008). "Extracting gene expression profiles common to colon and pancreatic adenocarcinoma using simultaneous nonnegative matrix factorization." _Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing_, *290*, pp. 267-78. ISSN 1793-5091, .

Kim H and Park H (2007). "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis." _Bioinformatics (Oxford, England)_, *23*(12), pp. 1495-502. ISSN 1460-2059, , .

Van Benthem M and Keenan MR (2004). "Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems." _Journal of Chemometrics_, *18*(10), pp. 441-450. ISSN 0886-9383, , .

Examples

Run this code



# Only basic calls are presented in this manpage.
# Many more examples are provided in the demo file nmf.R
## Not run: 
# demo('nmf')
# ## End(Not run)

# random data
x <- rmatrix(20,10)

# run default algorithm with rank 2
res <- nmf(x, 2)

# specify the algorithm
res <- nmf(x, 2, 'lee')

# get verbose message on what is going on
res <- nmf(x, 2, .options='v')
## Not run: 
# # more messages
# res <- nmf(x, 2, .options='v2')
# # even more
# res <- nmf(x, 2, .options='v3')
# # and so on ...
# ## End(Not run)

Run the code above in your browser using DataLab