gamlssVGD: A Set of Functions for selecting Models using Validation or Test Data Sets and Cross Validation

Description

This is a set of functions useful for selecting appropriate models.

The functions gamlssVGD, VGD, getTGD, TGD can be used when a subset of the data is used for validation or testing.

The function stepTGD() is a stepwise procedure for selecting an appropriate model for any one of the distribution parameters by minimising the test global deviance. The function stepTGDAll.A() can select a model for all the parameters using strategy A.

The functions gamlssCV, CV can be used for a k-fold cross validation.

Usage

gamlssVGD(formula = NULL, sigma.formula = ~1, nu.formula = ~1, 
          tau.formula = ~1, data = NULL, family = NO, 
          control = gamlss.control(trace = FALSE), 
          rand = NULL, newdata = NULL, ...)
          
VGD(object, ...)          
getTGD(object, newdata = NULL, ...)
TGD(object, ...)  
gamlssCV(formula = NULL, sigma.formula = ~1, nu.formula = ~1, 
         tau.formula = ~1, data = NULL, family = NO, 
         control = gamlss.control(trace = FALSE), 
         K.fold = 10, set.seed = 123, rand = NULL, 
         parallel = c("no", "multicore", "snow"), 
         ncpus = 1L, cl = NULL, ...)
CV(object, ...)
drop1TGD(object, scope, newdata, parameter = c("mu", "sigma", "nu", "tau"), 
         sorted = FALSE, trace = FALSE, 
         parallel = c("no", "multicore", "snow"), 
         ncpus = 1L, cl = NULL, ...)
         
add1TGD(object, scope, newdata, parameter = c("mu", "sigma", "nu", "tau"), 
        sorted = FALSE, trace = FALSE, 
        parallel = c("no", "multicore", "snow"), 
        ncpus = 1L, cl = NULL, ...)
stepTGD(object, scope, newdata, 
        direction = c("both", "backward", "forward"),
        trace = TRUE, keep = NULL, steps = 1000, 
        parameter = c("mu", "sigma", "nu", "tau"), 
        parallel = c("no", "multicore", "snow"), 
        ncpus = 1L, cl = NULL, ...)
        
stepTGDAll.A(object, scope = NULL, newdata = NULL, 
        steps = 1000, sigma.scope = NULL, nu.scope = NULL, 
        tau.scope = NULL, mu.try = TRUE, sigma.try = TRUE, 
        nu.try = TRUE, tau.try = TRUE,
        parallel = c("no", "multicore", "snow"), 
        ncpus = 1L, cl = NULL, ...)

Arguments

formula

A gamlss mu formula.

sigma.formula

Formula for sigma.

nu.formula

Formula for nu.

tau.formula

Formula for tau.

data

The data frame required for the fit.

family

The gamlss.family distribution.

control

The control for fitting the gamlss model.

rand

For gamlssVGD a variable with values 1 (for fitting) and 2 (for predicting). For gamlssCV a variable with k values indicating the cross validation sets.

newdata

The new data set (validation or test) for prediction.

object

A relevant R object.

scope

Defines the range of models examined in the stepwise selection, in the same way as in stepGAIC(); see the help page of stepGAIC() for examples.

sigma.scope

defines the range of models examined in the stepwise selection for sigma

nu.scope

defines the range of models examined in the stepwise selection for nu

tau.scope

defines the range of models examined in the stepwise selection for tau

mu.try

Whether to try fitting models for mu.

sigma.try

Whether to try fitting models for sigma.

nu.try

Whether to try fitting models for nu.

tau.try

Whether to try fitting models for tau.

parameter

Which distribution parameter the selection is applied to; the default is parameter = "mu".

sorted

Whether the results should be sorted by the value of TGD.

trace

If TRUE, additional information may be given on the fits as they are tried.

direction

The mode of stepwise search; one of "both", "backward" or "forward", with a default of "both". If the scope argument is missing, the default for direction is "backward".

keep

see stepGAIC() for explanation

steps

the maximum number of steps to be considered. The default is 1000.

K.fold

the number of subsets of the data used

set.seed

the seed to be used in creating rand

parallel

The type of parallel operation to be used (if any). If missing, the default is "no".

ncpus

Integer: the number of processes to be used in parallel operation. Typically one would choose this to be the number of available CPUs.

cl

An optional parallel or snow cluster for use if parallel = "snow". If not supplied, a cluster on the local machine is created for the duration of the call.

…

Further arguments to be passed to the gamlss fit.

Value

A fitted model or a set of global deviances.

Details

The function gamlssVGD() fits a gamlss model to the training data set determined by the arguments rand or newdata. The result is a gamlssVGD object, which contains the gamlss fit to the training data plus three extra components: i) VGD, the global deviance evaluated on the validation data set; ii) predictError, which is VGD divided by the number of observations in the validation data set; and iii) residVal, the residuals for the validation data set.
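The relationship between VGD and predictError described above can be illustrated with a minimal base R sketch. This is not the gamlss implementation: it assumes a normal model with constant variance fitted by lm(), simulated data, and the usual definition of the global deviance as minus twice the log-likelihood. The names dat, vgd and predict_error are hypothetical.

```r
# Base R sketch (not the gamlss code): validation global deviance for a
# normal model with constant variance, using simulated data.
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.5)

# rand: 1 = training (fitting), 2 = validation (predicting)
rand <- sample(2, n, replace = TRUE, prob = c(0.6, 0.4))
train <- dat[rand == 1, ]
valid <- dat[rand == 2, ]

# Fit on the training data only
fit <- lm(y ~ x, data = train)
mu_hat <- predict(fit, newdata = valid)
sigma_hat <- sqrt(mean(residuals(fit)^2))  # ML estimate of sigma

# Global deviance = -2 * log-likelihood, evaluated on the validation set
vgd <- -2 * sum(dnorm(valid$y, mean = mu_hat, sd = sigma_hat, log = TRUE))

# Deviance per validation observation
predict_error <- vgd / nrow(valid)
```

Dividing by the number of validation observations puts the deviance on a per-observation scale, so it does not grow with the size of the validation set.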

The function VGD() extracts the validation global deviance from one or more fitted gamlssVGD objects and can be used for model comparison.

The function getTGD() operates differently from the function gamlssVGD(). It assumes that the user has already fitted models using gamlss() and now wants to evaluate the global deviance on a new (validation or test) data set.

The function TGD() extracts the validation/test global deviance from one or more fitted gamlssTGD objects and can be used to compare models.

The function gamlssCV() performs a k-fold cross validation on a gamlss model.
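The idea of k-fold cross validation with a rand variable taking K values can be sketched in base R as follows. This is an illustration only, not the gamlss code: it assumes a normal model fitted by lm() to simulated data, and that the held-out deviances of the folds are simply added together. The helper fold_deviance() is hypothetical.

```r
# Base R sketch (not the gamlss code): k-fold cross-validated global
# deviance for a normal model with constant variance.
set.seed(123)
n <- 200
K <- 10
dat <- data.frame(x = runif(n))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.5)

# rand assigns each observation to one of the K cross validation sets
rand <- sample(K, n, replace = TRUE)

# Hypothetical helper: fit without fold k, evaluate the deviance on fold k
fold_deviance <- function(k) {
  train <- dat[rand != k, ]
  test  <- dat[rand == k, ]
  fit <- lm(y ~ x, data = train)
  sigma_hat <- sqrt(mean(residuals(fit)^2))
  -2 * sum(dnorm(test$y, mean = predict(fit, newdata = test),
                 sd = sigma_hat, log = TRUE))
}

cv_by_fold <- vapply(seq_len(K), fold_deviance, numeric(1))
cv_total <- sum(cv_by_fold)  # assumed way of combining the folds
```

Because every observation is held out exactly once, cv_total is evaluated on all n observations, each predicted by a model that did not see it.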

The function CV() extracts the cross validated global deviance from one or more fitted gamlssCV objects and can be used to compare models.

The functions add1TGD(), drop1TGD() and stepTGD() behave similarly to the functions add1(), drop1() and stepGAIC() respectively, but they use the validation or test global deviance as the selection criterion rather than the GAIC.
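A single drop step that uses the test deviance as its criterion can be sketched in base R. This is a conceptual illustration under stated assumptions, not the code behind drop1TGD(): it uses lm() with a normal likelihood, simulated data in which x3 is pure noise, and a hypothetical helper test_deviance().

```r
# Base R sketch (not the gamlss code): one "drop" step where each candidate
# model is scored by its global deviance on a separate test set.
set.seed(42)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)  # x3 has no effect on y

rand <- sample(2, n, replace = TRUE, prob = c(0.6, 0.4))
train <- dat[rand == 1, ]
test  <- dat[rand == 2, ]

# Hypothetical helper: -2 * log-likelihood of a fitted lm on new data
test_deviance <- function(fit, newdata) {
  sigma_hat <- sqrt(mean(residuals(fit)^2))
  -2 * sum(dnorm(newdata$y, mean = predict(fit, newdata = newdata),
                 sd = sigma_hat, log = TRUE))
}

full <- lm(y ~ x1 + x2 + x3, data = train)
term_labels <- attr(terms(full), "term.labels")

# Test deviance of the full model and of each model with one term removed
tgd <- c("<none>" = test_deviance(full, test),
         vapply(term_labels, function(term) {
           reduced <- update(full, as.formula(paste(". ~ . -", term)))
           test_deviance(reduced, test)
         }, numeric(1)))

tgd
```

The candidate with the smallest entry in tgd would be retained at this step. Unlike a GAIC-based search, no penalty for model complexity is added, because the deviance is computed on data that were not used for fitting.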

References

Chambers, J. M. and Hastie, T. J. (1991). Statistical Models in S, Chapman and Hall, London.

Rigby, R. A. and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape (with discussion), Appl. Statist., 54, part 3, pp 507-554.

Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., and De Bastiani, F. (2019) Distributions for modeling location, scale, and shape: Using GAMLSS in R, Chapman and Hall/CRC. An older version can be found in https://www.gamlss.com/.

Stasinopoulos, D. M. and Rigby, R. A. (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, Issue 7, Dec 2007, https://www.jstatsoft.org/v23/i07/.

Stasinopoulos D. M., Rigby R.A., Heller G., Voudouris V., and De Bastiani F., (2017) Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC.

(see also https://www.gamlss.com/).

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.

Examples

# NOT RUN {
library(gamlss)
data(abdom)
# generate the random split of the data
rand <- sample(2, 610, replace=TRUE, prob=c(0.6,0.4))
# the proportions in the sample
table(rand)/610
olddata<-abdom[rand==1,] # training data
newdata<-abdom[rand==2,] # validation data
#------------------------------------------------------------------------------
#  gamlssVGD
#-------------------------------------------------------------------------------
# Using rand
v1 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=NO, 
              rand=rand)
v2 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=LO, 
              rand=rand)
v3 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=TF, 
              rand=rand)
VGD(v1,v2,v3)
#-------------------------------------------------------------------------------
# }
# NOT RUN {
#-------------------------------------------------------------------------------
# using two data sets
v11 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, 
                  family=NO, newdata=newdata)
v12 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, 
                 family=LO, newdata=newdata)
v13 <- gamlssVGD(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, 
                 family=TF, newdata=newdata)
VGD(v11,v12,v13)
#-------------------------------------------------------------------------------
# function getTGD
#-------------------------------------------------------------------------------
# fit gamlss models first
g1 <- gamlss(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, family=NO)
g2 <- gamlss(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, family=LO)
g3 <- gamlss(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=olddata, family=TF)
# and then use 
gg1 <-getTGD(g1, newdata=newdata)
gg2 <-getTGD(g2, newdata=newdata)
gg3 <-getTGD(g3, newdata=newdata)

TGD(gg1,gg2,gg3)
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
# function gamlssCV
#-------------------------------------------------------------------------------
set.seed(123)
rand1 <- sample (10 , 610, replace=TRUE)
g1 <- gamlssCV(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=NO, 
               rand=rand1)
g2 <- gamlssCV(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=LO, 
               rand=rand1)
g3 <- gamlssCV(y~pb(x,df=2),sigma.formula=~pb(x,df=1), data=abdom, family=TF, 
               rand=rand1)
CV(g1,g2,g3)
CV(g1)
# using parallel 
set.seed(123)
rand1 <- sample (10 , 610, replace=TRUE)
nC <- detectCores()

system.time(g21 <- gamlssCV(y~pb(x,df=2), sigma.formula=~pb(x,df=1), data=abdom,
             family=NO, rand=rand1,parallel = "no", ncpus = nC ))

system.time(g22 <- gamlssCV(y~pb(x,df=2), sigma.formula=~pb(x,df=1), data=abdom,
             family=LO, rand=rand1,parallel = "multicore", ncpus = nC ))

system.time(g23 <- gamlssCV(y~pb(x,df=2), sigma.formula=~pb(x,df=1), data=abdom,
             family=TF, rand=rand1,parallel = "snow", ncpus = nC ))


CV(g21,g22,g23)
#-------------------------------------------------------------------------------
# functions add1TGD() drop1TGD() and stepTGD()
#-------------------------------------------------------------------------------
# the data
data(rent)
rand <- sample(2, dim(rent)[1], replace=TRUE, prob=c(0.6,0.4))
# the proportions in the sample
table(rand)/dim(rent)[1]
oldrent<-rent[rand==1,] # training set
newrent<-rent[rand==2,] # validation set

# null model
v0 <- gamlss(R~1, data=oldrent, family=GA)
# complete model
v1 <- gamlss(R~pb(Fl)+pb(A)+H+loc, sigma.fo=~pb(Fl)+pb(A)+H+loc, 
             data=oldrent, family=GA)

# drop1TGD
system.time(v3<- drop1TGD(v1, newdata=newrent,  parallel="no"))
system.time(v4<- drop1TGD(v1, newdata=newrent,  parallel="multicore", 
                          ncpus=nC) )
system.time(v5<- drop1TGD(v1, newdata=newrent,  parallel="snow", ncpus=nC))
cbind(v3,v4,v5)

# add1TGD
system.time(d3<- add1TGD(v0,scope=~pb(Fl)+pb(A)+H+loc, newdata=newrent,  
                       parallel="no"))
system.time(d4<- add1TGD(v0,scope=~pb(Fl)+pb(A)+H+loc, newdata=newrent,  
                        parallel="multicore", ncpus=nC) )
system.time(d5<- add1TGD(v0, scope=~pb(Fl)+pb(A)+H+loc,newdata=newrent,  
                        parallel="snow", ncpus=nC))

# stepTGD
system.time(d6<- stepTGD(v0, scope=~pb(Fl)+pb(A)+H+loc,newdata=newrent))
system.time(d7<- stepTGD(v0, scope=~pb(Fl)+pb(A)+H+loc,newdata=newrent,
                         parallel="multicore", ncpus=nC))
system.time(d8<- stepTGD(v0, scope=~pb(Fl)+pb(A)+H+loc,newdata=newrent,
                         parallel="snow", ncpus=nC))
# }
