rminer (version 1.4.6)

mining: Powerful function that trains and tests a particular fit model under several runs and a given validation method

Description

Powerful function that trains and tests a particular fit model under several runs and a given validation method. Since there can be a huge number of models, the fitted models are not stored. Yet, several useful statistics (e.g. predictions) are returned.

Usage

mining(x, data = NULL, Runs = 1, method = NULL, model = "default", 
       task = "default", search = "heuristic", mpar = NULL,
       feature="none", scale = "default", transform = "none", 
       debug = FALSE, ...)

Arguments

x

a symbolic description (formula) of the model to be fit. If x already contains the data, then data=NULL can be used (similar to the x argument of ksvm, kernlab package).

data

an optional data frame (columns denote attributes, rows show examples) containing the training data, when using a formula.

Runs

number of runs used (e.g. 1, 5, 10, 20, 30)

method

a vector with c(vmethod,vpar,seed) or c(vmethod,vpar,window,increment) (a short usage sketch is shown after this list), where vmethod is one of:

  • all -- all NROW examples are used as both training and test sets (no vpar or seed is needed).

  • holdout -- standard holdout method. If vpar<1 then NROW*vpar random samples are used for training and the remaining rows are used for testing. Otherwise, vpar random samples are used for testing and the remaining rows are used for training. For classification tasks (prob or class) a stratified sampling is assumed (equal to mode="stratified" in holdout).

  • holdoutrandom -- similar to holdout except that it always assumes random (not stratified) sampling.

  • holdoutorder -- similar to holdout except that, instead of random sampling, the first rows (until the split) are used for training and the remaining ones for testing (equal to mode="order" in holdout).

  • holdoutinc -- incremental holdout retraining (e.g. used for stock market data). Here, vpar is the test size, window is the initial window size and increment is the number of samples added at each iteration. Note: argument Runs is automatically set when this option is used. See also holdout.

  • holdoutrol -- rolling holdout retraining (e.g. used for stock market data). Here, vpar is the test size, window is the window size and increment is the number of samples added at each iteration. Note: argument Runs is automatically set when this option is used. See also holdout.

  • kfold -- K-fold cross-validation method, where vpar is the number of folds. For classification tasks (prob or class) a stratified split is assumed (equal to mode="stratified" in crossvaldata).

  • kfoldrandom -- similar to kfold except that it always assumes random (not stratified) sampling.

  • kfoldorder -- similar to kfold except that, instead of random sampling, the order of the rows is used to build the folds.

vpar -- a number used by vmethod (optional; if not defined, 2/3 is assumed for holdout and 10 for kfold); and seed (optional; if not defined, NA is assumed) is one of:

  • NA -- random seed is adopted (default R method for generating random numbers);

  • a vector of size Runs with fixed seed numbers for each Run;

  • a number -- set.seed(number) is applied and then a vector of seeds (of size Runs) is generated.
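
A minimal usage sketch of some method values (not part of the original help page; it reuses the sin1reg data and the "mlpe" model from the Examples section below):

data(sin1reg)
# 3 runs of a 2/3 random holdout with a fixed seed:
M1=mining(y~.,data=sin1reg,Runs=3,method=c("holdout",2/3,12345),model="mlpe")
# ordered holdout: the first 2/3 of the rows are used for training (no seed needed):
M2=mining(y~.,data=sin1reg,Runs=1,method=c("holdoutorder",2/3),model="mlpe")
# 2 runs of a 5-fold cross-validation with a fixed seed:
M3=mining(y~.,data=sin1reg,Runs=2,method=c("kfold",5,123),model="mlpe")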

model

See fit for details.

task

See fit for details.

search

See fit for details.

mpar

Kept only for compatibility with previous rminer versions; use search instead of mpar. See fit for details.
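
For instance (a minimal sketch, adapted from the Examples section below), the newer search list built with mparheuristic replaces the old mpar vector:

data(sin1reg)
# tune the "mlpe" hidden layer size over 5 candidate values, using an internal
# 2-fold cross-validation (fixed seed) and the MAE metric:
M=mining(y~.,data=sin1reg,Runs=2,method=c("holdout",2/3,12345),model="mlpe",
         search=list(search=mparheuristic("mlpe",n=5),method=c("kfold",2,123),metric="MAE"))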

feature

See fit for more details about the feature="none", "sabs" or "sbs" options. For the mining function, additional options are feature=fmethod (a short usage sketch is shown after this list), where fmethod can be one of:

  • sens or sensg -- compute the 1-D sensitivity analysis input importances ($sen), gradient measure.

  • sensv -- compute the 1-D sensitivity analysis input importances ($sen), variance measure.

  • sensr -- compute the 1-D sensitivity analysis input importances ($sen), range measure.

  • simp, simpg or s -- equal to sensg but also computes the 1-D sensitivity responses ($sresponses, useful for graph="VEC").

  • simpv -- equal to sensv but also computes the 1-D sensitivity responses (useful for graph="VEC").

  • simpr -- equal to sensr but also computes the 1-D sensitivity responses (useful for graph="VEC").
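
A minimal usage sketch (adapted from the Examples section below) showing how a "simp" related fmethod fills $sen and $sresponses:

data(sin1reg)
M=mining(y~.,data=sin1reg,Runs=2,method=c("holdout",2/3,12345),model="mlpe",feature="simp")
print(round(M$sen,digits=2)) # 1-D sensitivity input importances (one row per run)
mgraph(M,graph="IMP",Grid=10,main="input importances",xval=0.1,leg=names(sin1reg))
mgraph(M,graph="VEC",Grid=10,main="x1 VEC curve",xval=1,leg=names(sin1reg)[1]) # uses $sresponses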

scale

See fit for details.

transform

See fit for details.

debug

If TRUE, shows some information about each run.

...

See fit for details.

Value

A list with the components (a short access sketch is shown after this list):

  • $object -- fitted object values of the last run (used by multiple model fitting: "auto" mode). For "holdout", it is equal to a fit object, while for "kfold" it is a list.

  • $time -- vector with time elapsed for each run.

  • $test -- vector list, where each element contains the test (target) results for each run.

  • $pred -- vector list, where each element contains the predicted results for each test set and each run.

  • $error -- vector with a (validation) measure (often an error value) according to search$metric for each run (valid options are explained in mmetric).

  • $mpar -- vector list, where each element contains the fit model mpar parameters (for each run).

  • $model -- the model.

  • $task -- the task.

  • $method -- the external validation method.

  • $sen -- a matrix with the 1-D sensitivity analysis input importances. If kfold is used, the number of rows is Runs times vpar; otherwise it is Runs.

  • $sresponses -- a vector list with a size equal to the number of attributes (useful for graph="VEC"). Each element contains a list with the 1-D sensitivity analysis input responses (n -- name of the attribute; l -- number of levels; x -- attribute values; y -- 1-D sensitivity responses). Important note: sresponses (and "VEC" graphs) are only available if feature="sabs" or a "simp" related option is used (see feature).

  • $runs -- the Runs.

  • $attributes -- vector list with all attributes (features) selected in each run (and fold if kfold) if a feature selection algorithm is used.

  • $feature -- the feature.
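
A minimal access sketch (not part of the original help page; it assumes the iris classification example from the Examples section below):

data(iris)
M=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="rpart")
print(M$runs)   # number of runs
print(M$time)   # time elapsed for each run
print(M$method) # external validation method
print(M$model)  # the model ("rpart")
# $test and $pred store the targets and predictions of each run and feed mmetric:
print(mmetric(M,metric="ACC"))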

Details

Powerful function that trains and tests a particular fit model under several runs and a given validation method (see [Cortez, 2010] for more details). Several Runs are performed. In each run, the same validation method is adopted (e.g. holdout) and several relevant statistics are stored. Note: this function can require some computational effort, especially if a large dataset and/or a high number of Runs is adopted.

References

  • To check for more details about rminer and for citation purposes: P. Cortez. Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool. In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1. @Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44 http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf

  • This tutorial shows additional code examples: P. Cortez. A tutorial on using the rminer R package for data mining tasks. Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015. http://hdl.handle.net/1822/36210

  • For the grid search and other optimization methods: P. Cortez. Modern Optimization with R. Use R! series, Springer, September 2014, ISBN 978-3-319-08262-2. https://www.springer.com/gp/book/9783319082622

See Also

fit, predict.fit, mparheuristic, mgraph, mmetric, savemining, holdout and Importance.

Examples

# NOT RUN {
### dontrun is used when the execution of the example requires some computational effort.

### simple regression example
set.seed(123); x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
# mining with an ensemble of neural networks, each fixed with size=2 hidden nodes
# assumes a default holdout (random split) with 2/3 for training and 1/3 for testing:
M=mining(y~x1+x2,Runs=2,model="mlpe",search=2)
print(M)
print(mmetric(M,metric="MAE"))

### more regression examples:
# }
# NOT RUN {
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# 5 runs of an external holdout with 2/3 for training and 1/3 for testing, fixed seed 12345
# feature selection: sabs method
# model selection: 5 searches for size, internal 2-fold cross validation fixed seed 123
#                  with optimization for minimum MAE metric 
M=mining(y~.,data=sin1reg,Runs=5,method=c("holdout",2/3,12345),model="mlpe",
         search=list(search=mparheuristic("mlpe",n=5),method=c("kfold",2,123),metric="MAE"),
         feature="sabs")
print(mmetric(M,metric="MAE"))
print(M$mpar)
print("median hidden nodes (size) and number of MLPs (nr):")
print(centralpar(M$mpar))
print("attributes used by the model in each run:")
print(M$attributes)
mgraph(M,graph="RSC",Grid=10,main="sin1 MLPE scatter plot")
mgraph(M,graph="REP",Grid=10,main="sin1 MLPE scatter plot",sort=FALSE)
mgraph(M,graph="REC",Grid=10,main="sin1 MLPE REC")
mgraph(M,graph="IMP",Grid=10,main="input importances",xval=0.1,leg=names(sin1reg))
# average influence of x1 on the model:
mgraph(M,graph="VEC",Grid=10,main="x1 VEC curve",xval=1,leg=names(sin1reg)[1])
# }
# NOT RUN {
### regression example with holdout rolling windows:
# }
# NOT RUN {
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# rolling with 20 test samples, training window size of 300 and increment of 50 in each run:
# note that Runs argument is automatically set to 14 in this example:
M=mining(y~.,data=sin1reg,method=c("holdoutrol",20,300,50),
         model="mlpe",debug=TRUE)
# }
# NOT RUN {
### regression example with all rminer models:
# }
# NOT RUN {
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","mr","mars",
         "cubist","pcr","plsr","cppls","rvm")
for(model in models)
{ 
 M=mining(y~.,data=sin1reg,method=c("holdout",2/3,12345),model=model)
 cat("model:",model,"MAE:",round(mmetric(M,metric="MAE")$MAE,digits=3),"\n")
}
# }
# NOT RUN {
### classification example (task="prob")
# }
# NOT RUN {
data(iris)
# 10 runs of a 3-fold cross validation with fixed seed 123 for generating the 3-fold runs
M=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="rpart")
print(mmetric(M,metric="CONF"))
print(mmetric(M,metric="AUC"))
print(meanint(mmetric(M,metric="AUC")))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
       main="versicolor ROC")
mgraph(M,graph="LIFT",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
       main="Versicolor ROC")
M2=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="ksvm")
L=vector("list",2)
L[[1]]=M;L[[2]]=M2
mgraph(L,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg=c("DT","SVM"),main="ROC")
# }
# NOT RUN {
### other classification examples
# }
# NOT RUN {
### 1st example:
data(iris)
# 2 runs of an external 2-fold validation, random seed
# model selection: SVM model with rbfdot kernel, automatic search for sigma,
#                  internal 3-fold validation, random seed, minimum "AUC" is assumed
# feature selection: none, "s" is used only to store input importance values
M=mining(Species~.,data=iris,Runs=2,method=c("kfold",2,NA),model="ksvm",
         search=list(search=mparheuristic("ksvm"),method=c("kfold",3)),feature="s")

print(mmetric(M,metric="AUC",TC=2))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="SVM",main="ROC",intbar=FALSE)
mgraph(M,graph="IMP",TC=2,Grid=10,main="input importances",xval=0.1,
leg=names(iris),axis=1)
mgraph(M,graph="VEC",TC=2,Grid=10,main="Petal.Width VEC curve",
data=iris,xval=4)
### 2nd example, ordered kfold, k-nearest neighbor:
M=mining(Species~.,iris,Runs=1,method=c("kfoldorder",3),model="knn")
# confusion matrix:
print(mmetric(M,metric="CONF"))

### 3rd example, use of all rminer models: 
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","bagging",
         "boosting","lda","multinom","naiveBayes","qda")
models="naiveBayes"
for(model in models)
{ 
 M=mining(Species~.,iris,Runs=1,method=c("kfold",3,123),model=model)
 cat("model:",model,"ACC:",round(mmetric(M,metric="ACC")$ACC,digits=1),"\n")
}
# }
# NOT RUN {
### multiple models: automl or ensembles 
# }
# NOT RUN {
data(iris)
d=iris
names(d)[ncol(d)]="y" # change output name
inputs=ncol(d)-1
metric="AUC"

# simple automl (1 search per individual model),
# internal holdout and external holdout:
sm=mparheuristic(model="automl",n=NA,task="prob",inputs=inputs)
mode="auto"

imethod=c("holdout",4/5,123) # internal validation method
emethod=c("holdout",2/3,567) # external validation method

search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# 1 single model was selected:
cat("best",emethod[1],"selected model:",M$object@model,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")

# simple automl (1 search per individual model),
# internal kfold and external kfold: 
imethod=c("kfold",3,123) # internal validation method
emethod=c("kfold",5,567) # external validation method
search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# kfold models were selected:
kfolds=as.numeric(emethod[2])
models=vector(length=kfolds)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")

# example with weighted ensemble:
M=mining(y~.,data=d,model="WE",search=search,method=emethod,fdebug=TRUE)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")

# }
# NOT RUN {

### for more fitting examples check the help of function fit: help(fit,package="rminer")
# }
