view.contribution: Evaluate the contribution of data views in making prediction

Description

Evaluate the contribution of each data view in making predictions. The function has two modes. If force is set to NULL, each view's contribution is benchmarked against the null model. If force is set to a list of data views, the contribution is benchmarked against the model fit on that list, and the function evaluates the marginal contribution of each additional data view on top of the benchmarking views. The function returns a table showing, for each data view, the percentage improvement in error reduction relative to the benchmarking model.
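The percentage improvement described above can be sketched in base R. This page does not state the exact formula the package uses, so the relative error reduction below is an assumption, and pct_improvement, null_mse, and view_mse are hypothetical names introduced only for illustration.

```r
# Assumed definition: relative reduction in error versus the benchmark, in percent.
pct_improvement <- function(err_benchmark, err_model) {
  100 * (err_benchmark - err_model) / err_benchmark
}

# For a gaussian response, a null model predicts the mean of y for every observation.
y_demo <- c(1, 2, 3, 4, 5)
null_mse <- mean((y_demo - mean(y_demo))^2)

# Hypothetical error of a model that uses one data view.
view_mse <- 0.5

improvement <- pct_improvement(null_mse, view_mse)
print(improvement)
```

With force set, the same comparison would presumably use the error of the model fit on the benchmarking views in place of null_mse.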

Usage

view.contribution(
  x_list,
  y,
  family = gaussian(),
  rho,
  s = c("lambda.min", "lambda.1se"),
  eval_data = c("train", "test"),
  weights = NULL,
  type.measure = c("default", "mse", "deviance", "class", "auc", "mae", "C"),
  x_list_test = NULL,
  test_y = NULL,
  nfolds = 10,
  foldid = NULL,
  force = NULL,
  ...
)

Value

a data frame consisting of the view, error metric, and percentage improvement.

Arguments

x_list: a list of x matrices, each with the same number of rows, nobs
y: the quantitative response with length equal to nobs, the (same) number of rows in each x matrix
family: A description of the error distribution and link function to be used in the model. This is the result of a call to a family function. Default is stats::gaussian. (See stats::family for details on family functions.)
rho: the weight on the agreement penalty, default 0. rho=0 is a form of early fusion, and rho=1 is a form of late fusion. We recommend trying a few values of rho including 0, 0.1, 0.25, 0.5, and 1 first; sometimes rho larger than 1 can also be helpful.
s: Value(s) of the penalty parameter lambda at which predictions are required. Default is the value s="lambda.1se" stored on the CV object. Alternatively s="lambda.min" can be used. If s is numeric, it is taken as the value(s) of lambda to be used. (For historical reasons we use the symbol 's' rather than 'lambda' to reference this parameter)
eval_data: If "train", the contribution of data views is evaluated on the training data using cross-validation error; if "test", it is evaluated on test data. Default is "train". If set to "test", users need to provide the test data, i.e. x_list_test and test_y.
weights: Observation weights; defaults to 1 per observation
type.measure: loss to use for cross-validation. Several options are available, not all of them for all models. The default is type.measure="deviance", which uses squared-error for gaussian models (a.k.a. type.measure="mse" there), deviance for logistic and poisson regression, and partial-likelihood for the Cox model. type.measure="class" applies to binomial and multinomial logistic regression only, and gives misclassification error. type.measure="auc" is for two-class logistic regression only, and gives area under the ROC curve. type.measure="mse" or type.measure="mae" (mean absolute error) can be used by all models except the "cox"; they measure the deviation from the fitted mean to the response. type.measure="C" is Harrell's concordance measure, only available for cox models.
x_list_test: A list of x matrices in the test data for evaluation.
test_y: The quantitative response in the test data with length equal to the number of rows in each x matrix of the test data.
nfolds: number of folds - default is 10. Although nfolds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets. Smallest value allowable is nfolds=3
foldid: an optional vector of values between 1 and nfolds identifying what fold each observation is in. If supplied, nfolds can be missing.
force: If NULL, the data view contribution is benchmarked by the null model. If users want to benchmark by the model fit on a specified list of data views, force needs to be set to this list of benchmarking data views, i.e. a list of x matrices. The function then evaluates the marginal contribution of each additional data view, i.e. the data views in x_list but not in force, on top of the benchmarking views.
...: Other arguments that can be passed to multiview
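As a rough illustration of what the type.measure options "mse", "mae", and "class" compute, here is a base R sketch on made-up numbers. The vectors and the 0.5 classification threshold are assumptions chosen for the example, not values taken from the package.

```r
# Made-up observed responses and fitted means for a gaussian-style model.
y_obs <- c(3.0, -1.0, 2.0, 0.5)
y_hat <- c(2.5, -0.5, 2.0, 1.5)

mse <- mean((y_obs - y_hat)^2)   # mean squared error
mae <- mean(abs(y_obs - y_hat))  # mean absolute error

# Made-up binary labels and predicted probabilities for a two-class model.
labels <- c(1, 0, 1, 1)
prob   <- c(0.8, 0.4, 0.3, 0.6)

# Misclassification error, assuming a 0.5 cutoff on the predicted probability.
class_err <- mean((prob > 0.5) != (labels == 1))

print(c(mse = mse, mae = mae, class = class_err))
```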

Examples


set.seed(3)
# Simulate data based on the factor model
x = matrix(rnorm(200*20), 200, 20)
z = matrix(rnorm(200*20), 200, 20)
w = matrix(rnorm(200*20), 200, 20)
U = matrix(rep(0, 200*10), 200, 10) # latent factors
for (m in seq(10)) {
    u = rnorm(200)          # shared latent factor added to all three views
    x[, m] = x[, m] + u
    z[, m] = z[, m] + u
    w[, m] = w[, m] + u
    U[, m] = U[, m] + u
}
beta_U = c(rep(2, 5),rep(-2, 5))
y = U %*% beta_U + 3 * rnorm(200) # one noise draw per observation (rnorm(100) would be silently recycled)

# Split training and test sets
smp_size_train = floor(0.9 * nrow(x))
train_ind = sort(sample(seq_len(nrow(x)), size = smp_size_train))
test_ind = setdiff(seq_len(nrow(x)), train_ind)
train_X = scale(x[train_ind, ])
test_X = scale(x[test_ind, ])
train_Z <- scale(z[train_ind, ])
test_Z <- scale(z[test_ind, ])
train_W <- scale(w[train_ind, ])
test_W <- scale(w[test_ind, ])
train_y <- y[train_ind, ]
test_y <- y[test_ind, ]
foldid = sample(rep_len(1:10, dim(train_X)[1]))

# Benchmarked by the null model:
rho = 0.3
view.contribution(x_list=list(x=train_X,z=train_Z), train_y, rho = rho,
                  eval_data = 'train', family = gaussian())
view.contribution(x_list=list(x=train_X,z=train_Z), train_y, rho = rho,
                  eval_data = 'test', family = gaussian(),
                  x_list_test=list(x=test_X,z=test_Z), test_y=test_y)

# Force option -- benchmarked by the model trained on a specified list of data views:
view.contribution(x_list=list(x=train_X,z=train_Z,w=train_W), train_y, rho = rho,
                  eval_data = 'train', family = gaussian(), force=list(x=train_X))
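
The example above creates foldid but never passes it to view.contribution. A minimal, self-contained sketch of supplying fixed folds follows, so that repeated calls use the same cross-validation split. The simulated data and the names view_a, view_b, resp, and folds are illustrative assumptions, and the call is skipped when the multiview package is not installed.

```r
set.seed(1)
n <- 120
u <- rnorm(n)  # shared latent signal (illustrative)

# Two small views that both carry the latent signal in every column.
view_a <- scale(matrix(rnorm(n * 10), n, 10) + u)
view_b <- scale(matrix(rnorm(n * 10), n, 10) + u)
resp <- 2 * u + rnorm(n)

# One fold label per observation, balanced across 10 folds.
folds <- sample(rep_len(1:10, n))

contrib <- NULL
if (requireNamespace("multiview", quietly = TRUE)) {
  # Benchmarked by the null model, with user-supplied folds.
  contrib <- multiview::view.contribution(
    x_list = list(a = view_a, b = view_b), y = resp, rho = 0.3,
    family = gaussian(), eval_data = "train", foldid = folds
  )
  print(contrib)
}
```

Reusing one folds vector across calls (for example with different rho values) keeps the cross-validation error estimates comparable between runs.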
