CoreModel: Build a classification or regression model

Description

Builds a classification or regression model from the data and formula with given parameters. Classification models available are

random forests, possibly with local weighting of basic models (parallel execution on several cores),
decision trees with constructive induction in the inner nodes and/or models in the leaves,
kNN and weighted kNN with Gaussian kernel,
naive Bayesian classifier.

Regression models:

regression trees with constructive induction in the inner nodes and/or models in the leaves,
linear models with pruning techniques,
locally weighted regression,
kNN and weighted kNN with Gaussian kernel.

Usage

CoreModel(formula, data, model=c("rf","rfNear","tree","knn","knnKernel","bayes","regTree"), ..., costMatrix=NULL)

Arguments

formula

Either a formula specifying the attributes to be evaluated and the target variable, or a name of target variable, or an index of target variable.

data

Data frame with training data.

model

The type of model to be learned.

...

Options for building the model. See helpCore.

costMatrix

Optional cost matrix used with certain models.

Value

The created model is not returned as a structure. It is stored internally in the package memory space and only its pointer (index) is returned. The maximum number of models that can be stored simultaneously is a parameter of the initialization function initCore and defaults to 16384. Models that are no longer needed can be deleted with the function destroyModels to free memory. By referencing the returned model, any of the stored models may be used for prediction with predict.CoreModel. What the function actually returns is a list whose components identify and describe the stored model.

Details

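The parameter formula can be interpreted in three ways. The formula interface is the most elegant one, but it is inefficient and inappropriate for large data sets. See also the examples below. As formula one can specify:

a formula specifying the attributes to be evaluated and the target variable,
the name of the target variable,
the index of the target variable.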
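A minimal sketch of the three forms on the iris data, where Species is the target. The choice of model="bayes" is only for illustration, and the CoreModel calls are guarded so that the snippet still runs when CORElearn is not installed.

```r
# Locate the target column; Species is the fifth column of iris.
targetName  <- "Species"
targetIndex <- match(targetName, names(iris))

if (requireNamespace("CORElearn", quietly = TRUE)) {
  library(CORElearn)
  mFormula <- CoreModel(Species ~ ., iris, model = "bayes")   # formula object
  mName    <- CoreModel(targetName, iris, model = "bayes")    # name of the target
  mIndex   <- CoreModel(targetIndex, iris, model = "bayes")   # index of the target
  # free the internally stored models once they are no longer needed
  destroyModels(mFormula)
  destroyModels(mName)
  destroyModels(mIndex)
}
```

The second and third forms avoid processing a formula, which is why they are preferable for large data sets.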

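Parameter model controls the type of the constructed model. The possible values, as listed in Usage, are:

"rf": random forests,
"rfNear": random forests with local weighting of the basic models,
"tree": decision tree with constructive induction in the inner nodes and/or models in the leaves,
"knn": kNN,
"knnKernel": weighted kNN with Gaussian kernel,
"bayes": naive Bayesian classifier,
"regTree": regression tree with constructive induction in the inner nodes and/or models in the leaves.

The other regression models (linear models, locally weighted regression, kNN and weighted kNN) are used as models in the leaves of a regression tree. As the modelKernel example below shows, preventing tree splitting with minNodeWeightTree=Inf turns such a leaf model into a stand-alone regressor.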
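To compare model types on the same data, one might loop over the classification values of model. The sketch below uses a hold-out split of iris; the variable names (classifiers, accuracy) and the fixed seed are illustrative assumptions, all models are built with default options, and the CORElearn calls are guarded so the snippet runs even when the package is missing (the accuracies then stay NA).

```r
# Classification model types taken from the Usage section.
classifiers <- c("rf", "rfNear", "tree", "knn", "knnKernel", "bayes")

set.seed(1)  # illustrative seed for a reproducible split
trainIdxs <- sample(nrow(iris), size = 0.7 * nrow(iris))
accuracy <- setNames(rep(NA_real_, length(classifiers)), classifiers)

if (requireNamespace("CORElearn", quietly = TRUE)) {
  library(CORElearn)
  for (m in classifiers) {
    fit <- CoreModel(Species ~ ., iris[trainIdxs, ], model = m)
    predicted <- predict(fit, iris[-trainIdxs, ], type = "class")
    # compare as character so the comparison does not depend on factor levels
    accuracy[m] <- mean(as.character(predicted) ==
                        as.character(iris$Species[-trainIdxs]))
    destroyModels(fit)  # free the stored model before building the next one
  }
}
print(accuracy)
```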

Many additional parameters (...) are available and are used by different models. Their list and descriptions are available by calling helpCore. Evaluation of attributes is covered in function attrEval.

The optional parameter costMatrix can provide a nonuniform cost matrix for classification problems. For regression problems this parameter is ignored. The format of the matrix is costMatrix(true class, predicted class). By default uniform costs are assumed, i.e., costMatrix(i, i) = 0, and costMatrix(i, j) = 1, for i not equal to j.
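The default and a nonuniform cost matrix can be written down directly in base R. In the sketch below rows index the true class and columns the predicted class, following the format above; the cost value 5 and the variable names are hypothetical choices for illustration, and the CoreModel call is guarded in case CORElearn is not installed.

```r
classes <- levels(iris$Species)   # "setosa" "versicolor" "virginica"
k <- length(classes)

# Uniform costs: 0 on the diagonal, 1 everywhere else.
uniformCost <- 1 - diag(k)

# Hypothetical nonuniform costs: predicting versicolor for a true virginica
# is five times as costly as any other error.
nonuniformCost <- uniformCost
trueIdx <- match("virginica", classes)
predIdx <- match("versicolor", classes)
nonuniformCost[trueIdx, predIdx] <- 5

if (requireNamespace("CORElearn", quietly = TRUE)) {
  library(CORElearn)
  modelCost <- CoreModel(Species ~ ., iris, model = "rf", rfNoTrees = 20,
                         maxThreads = 1, costMatrix = nonuniformCost)
  destroyModels(modelCost)  # clean up
}
```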

References

Marko Robnik-Sikonja, Igor Kononenko: Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning Journal, 53:23-69, 2003

Leo Breiman: Random Forests. Machine Learning Journal, 45:5-32, 2001

Marko Robnik-Sikonja: Improving Random Forests. In J.-F. Boulicaut et al. (Eds): ECML 2004, LNAI 3210, Springer, Berlin, 2004, pp. 359-370

Marko Robnik-Sikonja: CORE - a system that predicts continuous variables. Proceedings of ERK'97, Portoroz, Slovenia, 1997

Marko Robnik-Sikonja, Igor Kononenko: Discretization of continuous attributes using ReliefF. Proceedings of ERK'95, B149-152, Ljubljana, 1995

Most of these references are available from http://lkm.fri.uni-lj.si/rmarko/papers/

Examples

# use iris data set
trainIdxs <- sample(x=nrow(iris), size=0.7*nrow(iris), replace=FALSE)
testIdxs <- c(1:nrow(iris))[-trainIdxs]

# build random forests model with certain parameters
# setting maxThreads to 0 or more than 1 forces utilization of several processor cores 
modelRF <- CoreModel(Species ~ ., iris[trainIdxs,], model="rf",
              selectionEstimator="MDL",minNodeWeightRF=5,
              rfNoTrees=100, maxThreads=1)
print(modelRF) # simple visualization, test also others with function plot
pred <- predict(modelRF, iris[testIdxs,], type="both") # prediction on testing set
mEval <- modelEval(modelRF, iris[["Species"]][testIdxs], pred$class, pred$prob)
print(mEval) # evaluation of the model
# visualization of individual predictions and the model
## Not run: 
# require(ExplainPrediction)
# explainVis(modelRF, iris[trainIdxs,], iris[testIdxs,], method="EXPLAIN",visLevel="model",
#            problemName="iris", fileType="none", classValue=1, displayColor="color") 
# # turn on the history in visualization window to see all instances
# explainVis(modelRF, iris[trainIdxs,], iris[testIdxs,], method="EXPLAIN",visLevel="instance",
#            problemName="iris", fileType="none", classValue=1, displayColor="color") 
# ## End(Not run)
destroyModels(modelRF) # clean up


# build decision tree with naive Bayes in the leaves
# for large data sets it is more appropriate to specify just the target variable

modelDT <- CoreModel("Species", iris, model="tree", modelType=4)
print(modelDT)
destroyModels(modelDT) # clean up


# build regression tree similar to CART
instReg <- regDataGen(200)
modelRT <- CoreModel(response~., instReg, model="regTree", modelTypeReg=1)
print(modelRT)
destroyModels(modelRT) # clean up

# build kNN kernel regressor by preventing tree splitting
modelKernel <- CoreModel(response~., instReg, model="regTree",
                    modelTypeReg=7, minNodeWeightTree=Inf)
print(modelKernel)
destroyModels(modelKernel) # clean up

## Not run: 
# # A more complex example 
# # Test accuracy of random forest predictor with 20 trees on iris data
# # using 10-fold cross-validation.
# ncases <- nrow(iris)
# ind <- ceiling(10*(1:ncases)/ncases)
# ind <- sample(ind,length(ind))
# pred <- rep(NA,ncases)
# fit <- NULL
# for (i in unique(ind)) {
#     # Train on all folds except fold i, predict fold i, then dispose of the model.
#     fit <- CoreModel(Species ~ ., iris[ind!=i,], model="rf", rfNoTrees=20, maxThreads=1)
#     pred[ind==i] <- predict(fit, iris[ind==i,], type="class")
#     if (!is.null(fit)) destroyModels(fit) # dispose model no longer needed
# }
# table(pred,iris$Species)
# ## End(Not run)
