LPS: Linear Predictor Score fitting

Description

This function trains a Linear Predictor Score model from pre-computed coefficients, using data with known classes. It can be called in several ways, and not all arguments are mandatory. See the 'Examples' section.

Usage

LPS(data, coeff, response, k, threshold, formula, method = "fdr", ...)

Arguments

data

Continuous data used to retrieve classes, as a data.frame or matrix, with samples in rows and features (genes) in columns. Rows and columns should be named. Some precautions must be taken concerning data normalization, see the 'Normalization' section.

coeff

Pre-computed coefficients for the model, as returned by LPS.coeff (see there for format details).

response

Already known classes for the samples provided in data, preferably as a two-level factor. Can be missing if a formula with a response element is provided, but this argument takes precedence when both are given.

k

Single integer value, number of features to include in the model, in decreasing order of coefficient. Can be missing if threshold or formula is provided, but this argument takes precedence over both of them.

threshold

Single numeric value, p-value threshold to apply for feature selection. Can be missing if k or formula is provided; k takes precedence over it, and it takes precedence over formula.

formula

A formula object, describing the model to fit (several templates are handled, see 'Examples'). The formula response element (before the "~" sign) can replace the response argument if it is not provided. The variables (after the "

method

Single character value, to be passed to p.adjust when threshold is provided.

...

Further arguments are passed to model.frame if response is missing (thus defined via formula). subset and na.action may be particularly useful for cross-validation schemes.
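
To make the interplay of threshold and method more concrete, here is a minimal base R sketch of p-value based feature selection. It only illustrates the general idea (adjust the p-values with p.adjust, then keep the features passing the threshold) and is not taken from the package source; the pval vector and its feature names are made up for the illustration.

```r
# Hypothetical raw p-values, one per feature (names are made-up feature IDs)
pval <- c(f1 = 0.0001, f2 = 0.002, f3 = 0.03, f4 = 0.2, f5 = 0.8)

# Multiple testing correction, using the same kind of value as the
# 'method' argument (any method accepted by p.adjust)
adj <- p.adjust(pval, method = "fdr")

# Keep the features whose adjusted p-value passes the threshold
threshold <- 0.01
selected <- names(adj)[adj <= threshold]
print(selected)
```

Any other correction listed in p.adjust.methods could be substituted for "fdr" in the same way.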

Value

An object of (S3) class "LPS":

coeff

Named numeric vector, the coefficients used in the model.

classes

Character vector, the labels of the two groups to be predicted.

scores

List of two numeric vectors, training dataset scores sorted by group.

means

Numeric vector, score means of each group in the training dataset.

sds

Numeric vector, score standard deviations of each group in the training dataset.

ovl

Numeric value, overlapping coefficient as returned by OVL.

k

Integer value, number of features selected in the model (if relevant).

p.threshold

Numeric value, threshold used for feature selection (if relevant).

p.method

Character value, p-value correction used for feature selection (if relevant).
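
The following base R sketch shows how elements such as scores, means and sds could be derived, assuming the score of a sample is the coefficient-weighted sum of its feature values (the linear predictor score described by Wright et al.). The data, class labels and coefficients below are invented for the illustration, and the code is not the package's own implementation.

```r
set.seed(42)

# Hypothetical training data: 8 samples (rows) x 3 features (columns)
expr <- matrix(rnorm(24), nrow = 8,
               dimnames = list(paste0("s", 1:8), c("a", "b", "c")))

# Hypothetical known classes and pre-computed coefficients
group <- factor(rep(c("A", "B"), each = 4))
coeff <- c(a = 2, b = -1, c = 0.5)

# Score of each sample: coefficient-weighted sum of its feature values
score <- drop(expr[, names(coeff)] %*% coeff)

# Per-group summaries of the training scores
scores <- split(score, group)
means <- sapply(scores, mean)
sds <- sapply(scores, sd)
```

Indexing the columns by names(coeff) keeps features and coefficients aligned, which is one reason rows and columns should be named.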

Normalization

As expression values are used directly in the score, gene centering and scaling are strongly recommended. For Affymetrix raw expression values (strictly positive, linear and absolute), Wright et al. suggest a multiplicative centering on a median of 1000 followed by a log2 transformation. For log-ratios, gene centering and scaling should not be necessary, as they are naturally 0-centered.
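
As a rough illustration of these recommendations, the base R sketch below applies a multiplicative centering, a log2 transformation, then gene-wise centering and scaling to simulated intensities. It assumes the median of 1000 is enforced per sample (per row), which is an interpretation rather than something stated here; the raw matrix is simulated for the example.

```r
set.seed(1)

# Hypothetical raw intensities (strictly positive): 6 samples x 4 genes
raw <- matrix(rexp(24, rate = 1 / 500) + 1, nrow = 6,
              dimnames = list(paste0("s", 1:6), paste0("g", 1:4)))

# Multiplicative centering: rescale each sample so that its median is 1000
# (assumption: the median is computed per sample, i.e. per row)
centered <- raw * (1000 / apply(raw, 1, median))

# log2 transformation
logged <- log2(centered)

# Gene centering and scaling (genes are in columns)
scaled <- scale(logged, center = TRUE, scale = TRUE)
```

Multiplying the matrix by a vector of length nrow(raw) recycles it down the columns, so each row is rescaled by its own factor.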

Time efficiency

Using a numeric matrix as data and a factor as response is the fastest way to compute coefficients, if time consumption matters (as in cross-validation schemes). formula is there only for consistency with R modeling functions, and to provide response, k or threshold through a single argument.
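
A small sketch of how inputs might be prepared once, before a resampling loop, so that each call receives a numeric matrix and a factor. The df and labels objects are invented for the illustration.

```r
# Hypothetical expression table stored as a data.frame (samples in rows)
df <- data.frame(
  g1 = c(0.8, -1.1, 0.3, -0.2),
  g2 = c(-0.5, 0.9, 1.4, -1.0),
  row.names = c("s1", "s2", "s3", "s4")
)

# Hypothetical class labels, in the same order as the rows of 'df'
labels <- c("ABC", "GCB", "ABC", "GCB")

# Convert once, outside of any cross-validation loop
mat <- as.matrix(df)        # numeric matrix, row and column names preserved
response <- factor(labels)  # two-level factor
```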

References

Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol. 2002;9(3):505-11.

Wright G, Tan B, Rosenwald A, Hurt EH, Wiestner A, Staudt LM. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9991-6.

Bohers E, Mareschal S, Bouzelfen A, Marchand V, Ruminy P, Maingonnat C, Menard AL, Etancelin P, Bertrand P, Dubois S, Alcantara M, Bastard C, Tilly H, Jardin F. Targetable activating mutations are very frequent in GCB and ABC diffuse large B-cell lymphoma. Genes Chromosomes Cancer. 2014 Feb;53(2):144-53.

Examples


# Data with features in columns
  data(rosenwald)
  group <- rosenwald.cli$group
  expr <- t(rosenwald.expr)
  
  # NA imputation (feature's mean to minimize impact)
  f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x }
  expr <- apply(expr, 2, f)
  
  # Coefficients
  coeff <- LPS.coeff(data=expr, response=group)
  
  
  # 10 best features (straightforward)
  m <- LPS(data=expr, coeff=coeff, response=group, k=10)
  
  # 10 best features (formula)
  ### 'k' MUST be an integer, or will be understood as a 'threshold'
  ### Numbers are "numeric", enforce integer with "L" or "as.integer"
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~10L)
  k <- as.integer(10)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~k)
  
  # FDR threshold
  thr <- 0.01
  m <- LPS(data=expr, coeff=coeff, response=group, threshold=thr)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~0.01)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~thr)
  
  # Custom model
  m <- LPS(data=expr, coeff=coeff[ c("27481","17013") ,], response=group, k=2)
  m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~`27481`+`17013`)
  ### Notice backticks in formula for syntactically invalid names
  
  # Complete model
  m <- LPS(data=expr, coeff=coeff, response=group, k=ncol(expr))
  m <- LPS(data=expr, coeff=coeff, response=group, threshold=1)
  ### m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~.)
  ### The last is correct but (really) slow on large datasets
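
The two remarks in the examples about integer values and backticks can be checked with plain base R, independently of the package. This is only a side illustration of R semantics; the formula below is never fitted, it is merely parsed.

```r
# A bare number is a double ("numeric"), not an integer
is.integer(10)              # FALSE
is.integer(10L)             # TRUE
is.integer(as.integer(10))  # TRUE

# Names starting with a digit are not syntactically valid in R,
# so make.names() would alter them ...
make.names("27481")

# ... and backticks are needed to use them unchanged in a formula
vars <- all.vars(group ~ `27481` + `17013`)
print(vars)
```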
