cmfrec (version 3.5.1-3)

imputeX: Impute missing entries in `X` data

Description

Replace `NA`/`NaN` values in new `X` data according to the model predictions, given that same `X` data and optionally `U` data.

Note: this function does not perform any internal re-indexing of the data. If the `X` to which the model was fit was a `data.frame`, the numbering of the items is stored under `model$info$item_mapping`. The function predict_new can be used instead to let the model do the appropriate re-indexing.
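Since no re-indexing is done, the columns of the new `X` have to follow the model's own item numbering. The following is a minimal sketch of how that could be arranged with base R alone. It assumes that `model$info$item_mapping` is a vector of item IDs ordered by the model's internal column index; the `item_mapping` vector, the `new_ratings` table and its column names are hypothetical stand-ins used for illustration, not taken from a fitted model.

```r
# Hypothetical stand-in for model$info$item_mapping (assumption: a vector of
# item IDs ordered by the model's internal column index).
item_mapping <- c("item_b", "item_a", "item_c")

# Hypothetical new ratings in long format (user, item, value).
new_ratings <- data.frame(
    user  = c("u1", "u1", "u2", "u2"),
    item  = c("item_a", "item_c", "item_b", "item_unknown"),
    value = c(4, 2, 5, 1),
    stringsAsFactors = FALSE
)

# Dense matrix filled with NA, one row per user and one column per model item.
users <- unique(new_ratings$user)
X_new <- matrix(NA_real_, nrow = length(users), ncol = length(item_mapping),
                dimnames = list(users, item_mapping))

# Look up row/column positions; items the model has never seen are dropped.
row_idx <- match(new_ratings$user, users)
col_idx <- match(new_ratings$item, item_mapping)
keep <- !is.na(col_idx)
X_new[cbind(row_idx[keep], col_idx[keep])] <- new_ratings$value[keep]

print(X_new)
```

A matrix built this way has one column per item known to the model, in the assumed model order, with `NA` wherever no value was observed.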

Usage

imputeX(
  model,
  X,
  weight = NULL,
  U = NULL,
  U_bin = NULL,
  nthreads = model$info$nthreads
)

Value

The `X` matrix with its missing values imputed according to the model predictions.

Arguments

model

A collective matrix factorization model as output by the function CMF. This functionality is not available for the other model classes.

X

New `X` data with missing values which will be imputed. Must be passed as a dense matrix from base R (class `matrix`).

weight

Associated observation weights for entries in `X`. If passed, must have the same shape as `X`.

U

New `U` data, with rows matching those of `X`. Can be passed in the following formats:

  • A sparse COO/triplets matrix, either from package `Matrix` (class `dgTMatrix`), or from package `SparseM` (class `matrix.coo`).

  • A sparse matrix in CSR format, either from package `Matrix` (class `dgRMatrix`), or from package `SparseM` (class `matrix.csr`). Passing the input as CSR is faster than passing it as COO, since COO input will be converted internally.

  • A sparse row vector from package `Matrix` (class `dsparseVector`).

  • A dense matrix from base R (class `matrix`), with missing entries set as `NA`/`NaN`.

  • A dense row vector from base R (class `numeric`).

  • A `data.frame`.
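As an illustration of the sparse formats listed above, the following sketch builds toy side information with package `Matrix` in CSR form, in COO form, and as a sparse row vector. The dimensions and values are made up for the example, and the objects are only constructed here, not passed to a model.

```r
library(Matrix)

# Toy side information: 3 rows, 4 columns, three stored entries.
i <- c(1L, 1L, 3L)
j <- c(2L, 4L, 1L)
x <- c(0.5, 1.0, 2.0)

# CSR format (class dgRMatrix).
U_csr <- sparseMatrix(i = i, j = j, x = x, dims = c(3L, 4L), repr = "R")

# COO/triplets format (class dgTMatrix).
U_coo <- sparseMatrix(i = i, j = j, x = x, dims = c(3L, 4L), repr = "T")

# A single sparse row (class dsparseVector) holding the first row's entries.
u_row <- sparseVector(x = c(0.5, 1.0), i = c(2L, 4L), length = 4L)

print(class(U_csr))
print(class(U_coo))
print(class(u_row))
```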

U_bin

New binary columns of `U` (rows matching those of `X`). Must be passed as a dense matrix from base R or as a `data.frame`.

nthreads

Number of parallel threads to use.

Details

If using the matrix factorization model as a general missing-value imputer, it's recommended to:

  • Fit a model without user biases.

  • Set a lower regularization for the item biases than for the matrices.

  • Tune the regularization parameter(s) carefully.

In general, matrix factorization works better for imputing selected entries of sparse-and-wide matrices. For dense matrices, the method is unlikely to give better results than mean/median imputation, but it is nevertheless provided for experimentation purposes.
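For reference when judging results on dense data, a column-median baseline can be written in a few lines of base R. This is only a sketch: the helper name `impute_col_median` and the toy matrix are hypothetical and not part of the package.

```r
# Hypothetical helper: replace NA/NaN entries with the median of their column.
impute_col_median <- function(X) {
    for (cl in seq_len(ncol(X))) {
        missing <- is.na(X[, cl])
        # Columns that are entirely missing are left untouched.
        if (any(missing) && !all(missing))
            X[missing, cl] <- median(X[, cl], na.rm = TRUE)
    }
    X
}

# Toy matrix with one missing entry per column.
X_toy <- matrix(c(1, NaN, 3,
                  10, 20, NA), nrow = 3)
X_med <- impute_col_median(X_toy)
print(X_med)
```

The imputed matrix from a baseline like this can be compared against the output of `imputeX` on held-out entries, in the same way the mean-imputation comparison in the examples is done.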

Examples

library(cmfrec)

### Simplest example
SeqMat <- matrix(1:50, nrow=10)
SeqMat[2,1] <- NaN
SeqMat[8,3] <- NaN
m <- CMF(SeqMat, k=1, lambda=1e-10, nthreads=1L, verbose=FALSE)
imputeX(m, SeqMat)


### Better example with multivariate normal data
if (require("MASS")) {
    ### Generate random data, set some values as NA
    set.seed(1)
    n_rows <- 1000
    n_cols <- 5
    mu <- rnorm(n_cols)
    S <- matrix(rnorm(n_cols^2), nrow = n_cols)
    S <- t(S) %*% S
    X <- MASS::mvrnorm(n_rows, mu, S)
    X_na <- X
    values_NA <- matrix(runif(n_rows*n_cols) < .15, nrow=n_rows)
    X_na[values_NA] <- NaN
    
    ### In the event that any column is fully missing, drop it from all objects
    cols_remove <- colSums(is.na(X_na)) == n_rows
    if (any(cols_remove)) {
        X <- X[, !cols_remove, drop=FALSE]
        X_na <- X_na[, !cols_remove, drop=FALSE]
        values_NA <- values_NA[, !cols_remove, drop=FALSE]
        n_cols <- ncol(X_na)
    }
    
    ### Impute missing values with model
    model <- CMF(X_na, k=3, lambda=c(0,0,1,1,1,1),
                 user_bias=FALSE,
                 verbose=FALSE, nthreads=1L)
    X_imputed <- imputeX(model, X_na)
    cat(sprintf("RMSE for imputed values w/model: %f\n",
                sqrt(mean((X[values_NA] - X_imputed[values_NA])^2))))
    
    ### Compare against simple mean imputation
    X_means <- apply(X_na, 2, mean, na.rm=TRUE)
    X_imp_mean <- X_na
    for (cl in 1:n_cols)
        X_imp_mean[values_NA[,cl], cl] <- X_means[cl]
    cat(sprintf("RMSE for imputed values w/means: %f\n",
                sqrt(mean((X[values_NA] - X_imp_mean[values_NA])^2))))
}
