
rminer (version 1.1)

fit: Fit a supervised data mining model (classification or regression)

Description

Fits a supervised data mining model (classification or regression). It is a wrapper function that allows distinct data mining methods to be fit under the same coherent function structure. It also tunes the hyperparameters of some models (e.g. knn, mlp, mlpe and svm) and performs some feature selection methods.
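A minimal usage sketch, following the same fit/predict/mmetric pattern as the Examples section below:

library(rminer)
data(iris)
M=fit(Species~.,iris,model="dt") # fit a decision tree
P=predict(M,iris)                # predict on the training data
print(mmetric(iris$Species,P,"ACC")) # classification accuracy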

Usage

fit(x, data = NULL, model = "default", task = "default", 
    search = "heuristic", mpar = NULL, feature = "none", 
    scale = "default", transform = "none", 
    created = NULL, ...)

Arguments

x
a symbolic description (formula) of the model to be fit. If x contains the data, then data=NULL (similar to x in ksvm, kernlab package).
data
an optional data frame (columns denote attributes, rows show examples) containing the training data, when using a formula.
model
a character with the model type name (data mining method). Valid options are:
  • naive -- most common class (classification) or mean output value (regression);
  • lr (or logistic) -- logistic regression;
  • dt -- decision tree (rpart);
  • knn -- k-nearest neighbor;
  • mlp -- multilayer perceptron;
  • mlpe -- ensemble of multilayer perceptrons;
  • svm -- support vector machine;
  • randomforest -- random forest.
task
data mining task. Valid options are:
  • prob (or p) -- classification with output probabilities (i.e. the sum of all outputs equals 1);
  • class (or c) -- classification with discrete outputs;
  • reg (or r) -- regression;
  • default -- automatically set according to the output type (prob for a factor output, reg otherwise).
search
used to tune the hyperparameter(s) of the model (only for: knn -- number of neighbors (k); mlp or mlpe -- number of hidden nodes (H) or decay; svm -- gaussian kernel parameter (sigma)).
mpar
vector with extra model parameters (used for modeling, search and feature selection) with:
  • c(vmethod,vpar,metric) -- if model=knn or randomforest;
  • c(C,epsilon,vmethod,vpar,metric) -- if model=svm (e.g. mpar=c(NA,NA,"holdout",2/3,"AUC"), as used in the Examples);
  • c(Nr,Me,vmethod,vpar,metric) -- if model=mlp or mlpe (Nr networks, Me epochs; see Details).
Here vmethod is the validation method (e.g. "holdout" or "kfold"), vpar its parameter (holdout ratio or number of folds) and metric the error metric (e.g. "AUC", "MAE").
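For illustration, a hedged sketch of mpar for knn (the search range 1:10 and the "ACC" metric are assumptions, following the svm and mlpe patterns shown in the Examples):

# tune k over 1:10 with an inner 3-fold validation, selecting by accuracy
# (search=1:10 is an illustrative assumption)
M=fit(Species~.,iris,model="knn",search=1:10,mpar=c("kfold",3,"ACC"))
print(M@mpar) # best k and its validation estimate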
feature
feature selection and sensitivity analysis control. Valid fit function options are:
  • none -- no feature selection;
  • a-vector -- a vector with c(fmethod, ...) options, where fmethod is the feature selection method name (e.g. feature=c("sabs",-1,1,"kfold",3), as used in the Examples).
scale
if data needs to be scaled (i.e. for mlp or mlpe). Valid options are:
  • default -- uses scaling when needed (i.e. for mlp or mlpe);
  • none -- no scaling;
transform
if the output data needs to be transformed (e.g. log transform). Valid options are:
  • none -- no transform;
  • log -- y=log(y+1) (the inverse function is applied in the predict function);
  • positive -- forces predictions to be positive (see the last Example).
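A short sketch of the log transform, reusing the sin1reg data from the Examples:

data(sin1reg)
M=fit(y~.,data=sin1reg,model="svm",transform="log") # fit on log(y+1)
P=predict(M,sin1reg) # predictions are back-transformed automatically
print(mmetric(sin1reg$y,P,"MAE"))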
created
time stamp for the model. By default, the system time is used; otherwise, you can specify another time.
...
additional and specific parameters sent to each model's fit function (e.g. dt, randomforest). For example, the rpart function is used for dt, thus you can add control=rpart.control(...) settings.
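For example, a hedged sketch passing rpart control settings to a dt model (the cp value is an illustrative assumption):

library(rpart)
# the control argument is forwarded to the underlying rpart call
M=fit(Species~.,iris,model="dt",control=rpart.control(cp=0.05))
print(M@object) # the fitted rpart object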

Value

  • Returns a model object. You can check all model elements with str(M), where M is the fitted model. The slots are:
    • @formula -- the x;
    • @model -- the model;
    • @task -- the task;
    • @mpar -- data.frame with the best model parameters (interpretation depends on model);
    • @attributes -- the attributes used by the model;
    • @scale -- the scale;
    • @transform -- the transform;
    • @created -- the date when the model was created;
    • @time -- computation effort to fit the model;
    • @object -- the R object model (e.g. rpart, nnet, ...);
    • @outindex -- the output index (of @attributes);
    • @levels -- if task=="prob" || task=="class", stores the output levels;
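For instance, inspecting a fitted model:

M=fit(Species~.,iris,model="dt")
str(M)         # all slots of the model object
print(M@model) # "dt"
print(M@time)  # computation effort to fit the model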

Details

Fits a classification or regression model given a data.frame (see [Cortez, 2010] for more details):
  • Neural Network: mlp trains Nr multilayer perceptrons (with Me epochs, H hidden nodes and decay value according to the nnet function) and selects the best network according to minimum penalized error ($value). mlpe uses an ensemble of Nr networks and the final prediction is given by the average of all outputs. To tune mlp or mlpe you can use the search parameter, which performs a grid search for H or decay.
  • Support Vector Machine: svm adopts the gaussian kernel. For classification tasks, you can use search to tune sigma (gaussian kernel parameter) and C (complexity parameter). For regression, the epsilon insensitive function is adopted and there is an additional hyperparameter epsilon.
  • Other methods: Random Forest -- if needed, you can tune the mtry parameter using search; k-nearest neighbor -- use search to tune k.
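To make the tuning concrete, a hedged sketch (the grid ranges below are illustrative assumptions; the Examples section shows further variants):

# grid search for the number of hidden nodes (H) of mlpe
M=fit(Species~.,iris,model="mlpe",search=0:5)
print(M@mpar)
# grid search for the svm gaussian kernel parameter sigma
M=fit(Species~.,iris,model="svm",search=2^seq(-7,1,2))
print(M@mpar)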

References

  • To check for more details about rminer and for citation purposes: P. Cortez. Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool. In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects, 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1. Springer: http://www.springerlink.com/content/e7u36014r04h0334 PDF: http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf
  • For the sabs feature selection: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. http://dx.doi.org/10.1016/j.dss.2009.05.016
  • For the uniform design details: C.M. Huang, Y.J. Lee, D.K.J. Lin and S.Y. Huang. Model selection for support vector machines via uniform design, In Computational Statistics & Data Analysis, 52(1):335-346, 2007.

See Also

mining, predict.fit, mgraph, mmetric, savemining, CasesSeries, lforecast, holdout and Importance. Check all rminer functions using: help(package=rminer).

Examples

### simple regression (with a formula) example.
x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
M=fit(y~x1+x2,model="mlpe",search=2)
new1=rnorm(100,100,20); new2=rnorm(100,100,20)
ynew=0.7*sin(new1/(25*pi))+0.3*sin(new2/(25*pi))
P=predict(M,data.frame(x1=new1,x2=new2,y=rep(NA,100)))
print(mmetric(ynew,P,"MAE"))

### simple classification example.
data(iris)
M=fit(Species~.,iris,model="dt")
P=predict(M,iris)
print(mmetric(iris$Species,P,"CONF"))
print(mmetric(iris$Species,P,"ACC"))
print(mmetric(iris$Species,P,"AUC"))
print(metrics(iris$Species,P))
mgraph(iris$Species,P,graph="ROC",TC=2,main="versicolor ROC",
baseline=TRUE,leg="Versicolor",Grid=10)

### classification example with discrete classes, probabilities and holdout
H=holdout(iris$Species,ratio=2/3)
M=fit(Species~.,iris[H$tr,],model="svm",task="class")
M2=fit(Species~.,iris[H$tr,],model="svm",task="prob")
P=predict(M,iris[H$ts,])
P2=predict(M2,iris[H$ts,])
print(mmetric(iris$Species[H$ts],P,"CONF"))
print(mmetric(iris$Species[H$ts],P2,"CONF"))
print(mmetric(iris$Species[H$ts],P,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"AUC"))

### classification example with hyperparameter selection
# SVM 
M=fit(Species~.,iris,model="svm",search=2^-3,mpar=c(3)) # C=3, gamma=2^-3
print(M@mpar) # gamma, C, epsilon (not used here)
M=fit(Species~.,iris,model="svm",search="heuristic10") # 10 grid search for gamma
print(M@mpar) # gamma, C, epsilon (not used here)
M=fit(Species~.,iris,model="svm",search="heuristic10") # 10 grid search for gamma
print(M@mpar) # gamma, C, epsilon (not used here)
M=fit(Species~.,iris,model="svm",search=2^seq(-15,3,2),
      mpar=c(NA,NA,"holdout",2/3,"AUC")) # same 10-point grid search for gamma
print(M@mpar) # gamma, C, epsilon (not used here)
search=svmgrid(task="prob") # grid search as suggested by the libsvm authors
M=fit(Species~.,iris,model="svm",search=search)
print(M@mpar) # gamma, C, epsilon (not used here)
M=fit(Species~.,iris,model="svm",search="UD") # 2 level 13 point uniform-design
print(M@mpar) # gamma, C, epsilon (not used here)
# MLPE
M=fit(Species~.,iris,model="mlpe",search="heuristic5") # 5 grid search for H
print(M@mpar)
M=fit(Species~.,iris,model="mlpe",search="heuristic5",
      mpar=c(3,100,"kfold",3,"AUC",2)) # 5 grid search for decay, inner 3-fold
print(M@mpar)
# faster grid search 
M=fit(Species~.,iris,model="mlpe",search=list(smethod="normal",convex=1,search=0:9)) 
print(M@mpar)
# 2 level grid with total of 5 searches
M=fit(Species~.,iris,model="mlpe",search=list(smethod="2L",search=c(4,8,12))) 
print(M@mpar)
# 2 level grid for decay
search=list(smethod="2L",search=c(0,0.1,0.2));mpar=c(3,100,"holdout",3,"AUC",2) 
M=fit(Species~.,iris,model="mlpe",search=search,mpar=mpar)
print(M@mpar)
### regression example
data(sin1reg)
M=fit(y~.,data=sin1reg,model="svm",search="heuristic")
P=predict(M,sin1reg)
print(mmetric(sin1reg$y,P,"MAE"))
mgraph(sin1reg$y,P,graph="REC",Grid=10)
# uniform design
M=fit(y~.,data=sin1reg,model="svm",search="UD")
print(M@mpar)
# sensitivity analysis feature selection
M=fit(y~.,data=sin1reg,model="svm",search="heuristic5",feature="sabs") 
print(M@mpar)
print(M@attributes) # selected attributes (1 and 2 are the relevant inputs)
P=predict(M,sin1reg); print(mmetric(sin1reg$y,P,"MAE"))
# sensitivity analysis feature selection
M=fit(y~.,data=sin1reg,model="mlp",search=2,feature=c("sabs",-1,1,"kfold",3)) 
print(M@mpar)
print(M@attributes)

M=fit(y~.,data=sin1reg,model="svm",search="heuristic")
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,y=NA)) # P should be negative...
print(P)
M=fit(y~.,data=sin1reg,model="svm",search="heuristic",transform="positive")
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,y=NA)) # P is not negative...
print(P)
