
SCGLR (version 1.1)

scglrCrossVal: Function that fits and selects the number of components by cross-validation

Description

Function that fits and selects the number of components by cross-validation

Usage

scglrCrossVal(formula, data, family, K = 1, nfolds = 5,
    type = "mspe", size = NULL, offset = NULL,
    subset = NULL, na.action = na.omit, crit = list(),
    mc.cores = 1)

Arguments

formula
an object of class "Formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted
data
the data frame to be modeled
family
a character vector of length q specifying the distribution of each response. Bernoulli, binomial, Poisson and Gaussian distributions are allowed.
K
number of components, default is one
nfolds
number of folds, default is 5. Although nfolds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets.
type
loss function to use for cross-validation. Currently six options are available depending on whether the responses are of the same distribution family. If the responses are all bernoulli distributed, then the prediction performance may be measured
size
specifies the number of trials of the binomial variables included in the model. A (n*qb) matrix is expected for qb binomial variables.
offset
used for the Poisson dependent variables. A vector, or a matrix with one row per observation and one column per Poisson dependent variable, is expected.
subset
an optional vector specifying a subset of observations to be used in the fitting process
na.action
a function which indicates what should happen when the data contain NAs. The default is na.omit.
crit
a list of maxit and tol; the defaults are 50 and 10e-6. If the responses are Bernoulli variables only, tol should generally be increased.
mc.cores
maximum number of cores to use for parallelization (not yet available on Windows)
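
As a rough illustration of what the nfolds and type = "mspe" arguments refer to, the following sketch runs a 5-fold cross-validation with a mean squared prediction error loss. It is only an assumption-laden toy: it uses base R's lm() on simulated data, and the names dat, fold, mspe_by_fold and cv_mspe are made up for this example. It does not reproduce how SCGLR partitions the data or computes its criteria internally.

```r
# Minimal K-fold cross-validation with a mean squared prediction error loss.
# Illustration only: plain lm() on simulated data, not SCGLR internals.
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

nfolds <- 5
# randomly assign each observation to one of nfolds folds of equal size
fold <- sample(rep(seq_len(nfolds), length.out = n))

# for each fold: fit on the other folds, predict on the held-out fold
mspe_by_fold <- sapply(seq_len(nfolds), function(k) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  fit   <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

# cross-validated criterion: average of the per-fold errors
cv_mspe <- mean(mspe_by_fold)
```

Setting nfolds equal to n in this sketch would give leave-one-out cross-validation, which requires n model fits and is the reason large values of nfolds become expensive on large datasets.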

Value

  • a matrix containing the criterion values for each response (rows) and each number of components (columns)
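
To make the layout of such a matrix concrete, here is a small sketch that selects a number of components from a hand-written criterion matrix. The values in toy_cv are invented, and the assumption that the first column corresponds to zero components is taken from the "-1" used in the package example; check the column names of the actual result before relying on it.

```r
# Hypothetical criterion matrix: 3 responses (rows), 4 columns.
# Assumption: columns correspond to 0, 1, 2 and 3 components.
toy_cv <- matrix(c( 4.0, 3.0, 2.0, 2.5,
                   10.0, 6.0, 5.0, 7.0,
                    1.0, 0.8, 0.6, 0.9),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("y1", "y2", "y3"), paste0("K", 0:3)))

# scale each row by its mean so that responses are comparable
scaled <- toy_cv / rowMeans(toy_cv)

# average the scaled criterion over responses, one value per column
mean_crit <- colMeans(scaled)

# index of the smallest averaged criterion, shifted to count from zero
best_K <- which.min(mean_crit) - 1
```

With these made-up values every response reaches its minimum in the third column, so best_K is 2.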

Examples

library(SCGLR)

# load sample data
data(genus)

# get variable names from dataset
n <- names(genus)
ny <- n[grep("^gen",n)]    # Y <- names that begin with "gen"
nx <- n[-grep("^gen",n)]   # X <- remaining names

# remove "geology" and "surface" from nx,
# as surface is the offset and geology is used as an additional covariate
nx <- nx[!nx %in% c("geology","surface")]

# build multivariate formula
# we also add "lat*lon" as computed covariate
form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),c("geology"))

# define family
fam <- rep("poisson",length(ny))

# cross validation
genus.cv <- scglrCrossVal(formula=form, data=genus, family=fam, K=12,
 offset=genus$surface)

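# find best K
# scale each response's criterion values by their mean across K
mean.crit <- t(apply(genus.cv,1,function(x) x/mean(x)))
# average the scaled criterion over responses, for each K
mean.crit <- apply(mean.crit,2,mean)
# take the index of the minimum and subtract 1 to obtain K
K.cv <- which.min(mean.crit)-1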

#plot(mean.crit, type="l")
