SSI: Sparse Selection Index

Description

Computes the entire Elastic-Net solution for the regression coefficients of a Selection Index for a grid of values of the penalization parameter.

An optimal penalization can be chosen using cross-validation (CV) within a specific training set.

Usage

SSI(y, X = NULL, b = NULL, Z = NULL, K, D = NULL, 
    theta = NULL, h2 = NULL, trn = seq_along(y), 
    tst = seq_along(y), subset = NULL, alpha = 1, lambda = NULL,
    nlambda = 100, lambda.min = .Machine$double.eps^0.5,
    common.lambda = TRUE, tol = 1E-4, maxiter = 500,
    method = c("REML","ML"), return.beta = FALSE, save.beta = TRUE, 
    save.at = NULL, name = NULL, mc.cores = 1, verbose = TRUE)
    
SSI.CV(y, X = NULL, b = NULL, Z = NULL, K, D = NULL, 
      theta = NULL, h2 = NULL, trn = seq_along(y), alpha = 1, 
      lambda = NULL, nlambda = 100, lambda.min = .Machine$double.eps^0.5,
      nCV = 1, nfolds = 5, seed = NULL, common.lambda = TRUE,
      tol = 1E-4, maxiter = 500, method = c("REML","ML"), 
      name = NULL, mc.cores = 1, verbose = TRUE)

Value

Function 'SSI' returns a list object of the class 'SSI' for which methods coef, fitted, plot, and summary exist. Functions 'net.plot' and 'path.plot' can be also used. It contains the elements:

b: (vector) fixed effects solutions (including the intercept).
Xb: (vector) product of the design matrix 'X' times the fixed effects solutions.
u: (matrix) genetic values for testing individuals (in rows) associated to each value of lambda (in columns).
varU, varE, h2: variance components solutions.
alpha: value for the elastic-net weights used.
lambda: (matrix) sequence of values of lambda used (in columns) for each testing individual (in rows).
df: (matrix) degrees of freedom (number of non-zero predictors) at each solution given by lambda for each testing individual (in rows).
file_beta: path where regression coefficients are saved.

Function 'SSI.CV' returns a list object of length nCV of the class 'SSI.CV' for which methods plot and summary exist. Each element is also a list containing the elements:

b: (vector) solutions for the fixed effects (including the intercept) for each fold.
varU, varE, h2: variance components estimated within each fold.
folds: (matrix) assignation of training individuals to folds used for the cross-validation.
accuracy: (matrix) correlation between observed and predicted values (in testing set) within each fold (in rows).
MSE: (matrix) mean squared error of prediction (in testing set) within each fold (in rows).
lambda: (matrix) with the sequence of values of lambda used (averaged across individuals) within each fold (in rows).
df: (matrix) with the degrees of freedom (averaged across individuals) within each fold (in rows).

Arguments

y

(numeric vector) Response variable

X

(numeric matrix) Design matrix for fixed effects. When X=NULL, a vector of ones is constructed only for the intercept (default)

b

(numeric vector) Fixed effects. When b=NULL, only the intercept is estimated from training data using generalized least squares (default)

Z

(numeric matrix) Design matrix for random effects. When Z=NULL an identity matrix is considered (default) thus G = K; otherwise G = Z K Z' is used

K

(numeric matrix) Kinship relationships. This can be of the "float32" type as per the 'float' R-package, or a (character) name of a binary file where the matrix is stored

D

(numeric matrix) Relationships among residuals (usually an identity matrix). When D=NULL an identity matrix is considered (default)

theta

(numeric) Residual/genetic variances ratio. When theta=NULL (default), it is calculated from training data using the function 'fitBLUP' (see help(fitBLUP))

h2

(numeric) Heritability of the response variable. When h2=NULL (default), it is calculated from training data using the function 'fitBLUP' (see help(fitBLUP)). It is used to set the residual/genetic variances ratio theta

trn

(integer vector) Which elements from vector y are in training set. Default trn=seq_along(y) will consider all individuals as training

tst

(integer vector) Which elements from vector y are in testing set (prediction set). Default tst=seq_along(y) will consider all individuals as testing

subset

(integer vector) \(c(m,M)\) to fit the model only for the m^th subset out of \(M\) subsets that the testing set will be divided into. Results can be automatically saved when saveAt argument is provided and can be retrieved later using function 'collect' (see help(collect)). Default is subset=NULL for no subsetting, then the model is fitted for all testing data

alpha

(numeric) Value between 0 and 1 for the weights given to the L1 and L2-penalties

lambda

(numeric vector) Penalization parameter sequence. Default is lambda=NULL, in this case a decreasing grid of nlambda lambdas will be generated starting from a maximum equal to

max(abs(G[trn,tst])/alpha)

to a minimum equal to zero. If alpha=0 the grid is generated starting from a maximum equal to 5

nlambda

(integer) Number of lambdas generated when lambda=NULL

lambda.min

(numeric) Minimum value of lambda in the generated grid when lambda=NULL

nfolds

(integer/character) Either 2,3,5,10 or 'n' indicating the number of non-overlaping folds in which the data is split into to do cross-validation. When nfolds='n' leave-one-out CV is performed

seed

(numeric vector) Seed to fix randomization when creating folds for cross-validation. If it is a vector, a number equal to its length of CV repetitions are performed

nCV

(integer) Number of CV repetitions to be performed. Default is nCV=1

common.lambda

TRUE or FALSE to whether computing the coefficients for a grid of lambdas common to all individuals in testing set or for a grid of lambdas specific to each individual in testing set. Default is common.lambda=TRUE

mc.cores

(integer) Number of cores used. The analysis is run in parallel when mc.cores is greater than 1. Default is mc.cores=1

tol

(numeric) Maximum error between two consecutive solutions of the CD algorithm to declare convergence

maxiter

(integer) Maximum number of iterations to run the CD algorithm at each lambda step before convergence is reached

return.beta

TRUE or FALSE to whether return regression coefficients in the output object

save.beta

TRUE or FALSE to whether save regression coefficients. When TRUE coefficients are not returned in the output object and instead the path where coefficients were saved is returned. They can be retrieved using coef method if at least one return.beta or save.beta is TRUE

save.at

(character) Prefix name that will be added to the files (coefficients and output object) name to be saved, this may include a path. Regression coefficients are saved as a binary file (single-precision: 32 bits, 7 significant digits). Default save.at=NULL will no save any output

method

(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood) to calculate variance components as per the function 'fitBLUP'

name

(character) Name given to the output for tagging purposes. Default name=NULL will give the name of the method used

verbose

TRUE or FALSE to whether printing each step

Author

Marco Lopez-Cruz (maraloc@gmail.com) and Gustavo de los Campos

Details

The basic linear mixed model that relates phenotypes with genetic values is of the form

y = X b + Z g + e

where y is a vector with the response, b is the vector of fixed effects, g is the vector of the genetic values of the genotypes, e is the vector of environmental residuals, and X and Z are design matrices conecting the fixed and genetic effects with replicates. Genetic values are assumed to follow a Normal distribution as g ~ N(0,σ²_uK), and environmental terms are assumed e ~ N(0,σ²_eD), usually D=I.

The resulting vector of genetic values u = Z g will therefore follow u ~ N(0,σ²_uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.

The values u_tst = (u_i), i = 1,2,...,n_tst, for a testing set are estimated individual-wise using (as predictors) all available observations in a training set as

u_i = β'_i (y_trn - X_trnb)

where β_i is a vector of weights that are found separately for each individual in the testing set, by minimizing the penalized mean squared error function

-G'_trn,tst(i) β_i + 1/2 β'_i(G_trn + θD)β_i + λ J(β_i)

where G_trn,tst(i) is the i^th column of the sub-matrix of G whose rows correspond to the training set and columns to the testing set; G_trn is the sub-matrix corresponding to the training set; θ = σ²_e/σ²_u is the residual to genetic variance ratio that can be expressed in terms of the heritability, h² = σ²_u/(σ²_u + σ²_e), as θ = (1 - h²)/h²; λ is the penalization parameter; and J(β_i) is a penalty function given by

1/2(1-α)||β_i||₂² + α||β_i||₁

where 0 ≤ α ≤ 1, and ||β_i||₁ = ∑_j=1|β_ij| and ||β_i||₂² = ∑_j=1β_ij² are the L1 and (squared) L2-norms, respectively.

Function 'SSI' calculates each individual solution using the function 'solveEN' (via the Coordinate Descent algorithm, see help(solveEN)) by setting the argument Sigma equal to G_trn + θD and Gamma equal to G_trn,tst(i).

Function 'SSI.CV' performs cross-validation within the training data specified in argument trn. Training data is divided into \(k\) folds and the SSI is sequentially calculated for (all individuals in) one fold (as testing set) using information from the remaining folds (as training set).

References

Efron B, Hastie T, Johnstone I, Tibshirani R (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.

Friedman J, Hastie T, Höfling H, Tibshirani R (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.

Hoerl AE, Kennard RW (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.

Lush JL (1947). Family merit an individual merit as bases for selection. Part I. The American Naturalist, 81(799), 241–261.

Tibshirani R (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B, 58(1), 267–288.

VanRaden PM (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.

Zou H, Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2), 301–320

Examples

Run this code

  require(SFSI)
  data(wheatHTP)
  
  index = which(Y$trial %in% 1:6)     # Use only a subset of data
  Y = Y[index,]
  M = scale(M[index,])/sqrt(ncol(M))  # Subset and scale markers
  G = tcrossprod(M)                   # Genomic relationship matrix
  y = as.vector(scale(Y[,"E1"]))      # Scale response variable
  
  # Training and testing sets
  tst = which(Y$trial == 2)
  trn = which(Y$trial != 2)

  # Calculate variance components ratio using training data
  yNA = y
  yNA[tst] = NA
  fm0 = fitBLUP(yNA,K=G)
  theta = fm0$varE/fm0$varU
  h2 = fm0$varU/(fm0$varU + fm0$varE)
  b = fm0$b       # intercept
  
  # Sparse selection index
  fm1 = SSI(y,K=G,theta=theta,b=b,trn=trn,tst=tst)
  summary(fm1)$optCOR
  # \donttest{
  if(requireNamespace("float")){
  # Using a 'float' type variable for K
  G2 = float::fl(G)
  fm2 = SSI(y,K=G2,theta=theta,b=b,trn=trn,tst=tst)
  summary(fm2)$optCOR  # compare with above results
  }
  # }
  #---------------------------------------------------
  # Predicting a testing set using a value of lambda
  # obtained from cross-validation in a traning set
  #---------------------------------------------------
  # \donttest{
  # Run a cross validation in training set
  fm2 = SSI.CV(y,K=G,theta=theta,b=b,trn=trn,nfolds=5,name="1 5CV")
  lambda = summary(fm2)$optCOR["lambda"]

  # Fit the index with the obtained lambda
  fm3 = SSI(y,K=G,theta=theta,b=b,trn=trn,tst=tst,lambda=lambda)
  summary(fm3)$accuracy        # Testing set accuracy

  # Compare the accuracy with that of the non-sparse index (G-BLUP)
  cor(fm0$u[tst],y[tst])
  
  # Obtain an 'optimal' lambda by repeating the CV several times
  fm22 = SSI.CV(y,K=G,theta=theta,b=b,trn=trn,nCV=5,name="5 5CV")
  plot(fm22,fm2)
  # }

Run the code above in your browser using DataLab