model.diagnostics: Model Predictions and Diagnostics

Description

Takes a model object and makes predictions, runs model diagnostics, and creates graphs and tables of the results.

Usage

model.diagnostics(model.obj = NULL, qdata.trainfn = NULL, qdata.testfn = NULL,
  folder = NULL, MODELfn = NULL, response.name = NULL, unique.rowname = NULL,
  seed = NULL, prediction.type = NULL, MODELpredfn = NULL, na.action = NULL,
  v.fold = 10, device.type = NULL, DIAGNOSTICfn = NULL, jpeg.res = 72,
  device.width = 7, device.height = 7, cex = par()$cex, req.sens, req.spec,
  FPC, FNC, n.trees = NULL)

Arguments

model.obj

R model object. The model object to use for prediction. The model object must be of type RF or SGB. (Eventually planned to include "GAM".)

qdata.trainfn

String. The name (full path or base name with path specified by folder) of the training data file used for building the model (file should include columns for both response and predictor variables). The file must be a comma-delimited file <

qdata.testfn

String. The name (full path or base name with path specified by folder) of the independent data set for testing (validating) the model's predictions. The file must be a comma-delimited file ".csv" with column headings and the c

folder

String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If folder = NULL (default), a GUI interface prompts user to browse to a folder. To use the working directory, specify folde

MODELfn

String. The file name to use to save the generated model object. If MODELfn = NULL (the default), a default name is generated by pasting model.type_response.type_response.name. If the other output filenames are left unspecified

response.name

String. The name of the response variable used to build the model. The response.name must be column name from the training/test data files. If the model.obj was constructed in ModelMap with the model.build()

unique.rowname

String.  The name of the unique identifier used to identify each row in the training data.  If unique.rowname = NULL, a GUI interface prompts user to select a variable from the list of column names from the training data file.  If uniqu

seed

Integer.  The number used to initialize randomization to build RF or SGB models.  If you want to produce the same model later, use the same seed.  If seed = NULL (the default), a new seed is created each run.
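The effect of reusing a seed can be illustrated in base R. This is only a sketch of the general mechanism (set.seed() followed by random draws); it does not call ModelMap, and the names draw1 and draw2 are illustrative.

```r
# Illustration only: the same seed gives the same random draws,
# which is what makes a randomized model fit reproducible.
seed <- 38

set.seed(seed)
draw1 <- sample(1:100, 5)

set.seed(seed)
draw2 <- sample(1:100, 5)

identical(draw1, draw2)  # TRUE: both draws started from the same seed
```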

prediction.type

String. Prediction type: "TEST", "CV", "OOB" or "TRAIN". If prediction.type = "TEST", validation predictions will be made on the test set provided by qdata.testfn. If prediction.type =

MODELpredfn

String.  Model validation.  A character string used to construct the output file names for the validation diagnostics, for example the prediction *.csv file, and the graphics *.jpg, *.pdf and *.ps files.

na.action

String.  Model validation.  Specifies the action to take if there are NA values in the prediction or response data or if there is a level or class of a categorical predictor variable in the validation test set, but not in the training data se

v.fold

Integer (or logical FALSE).  Model validation.  The number of cross validation folds to use when making validation predictions on the training data.  Only used if  prediction.type = "CV".
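A minimal sketch of how rows might be split into v.fold cross-validation folds is shown below. It assumes a simple balanced random assignment and is not taken from ModelMap's internal code; n, fold, test.rows and train.rows are hypothetical names.

```r
# Hypothetical sketch of v-fold assignment; not ModelMap's internal code.
set.seed(38)
n      <- 25   # number of training rows (made up)
v.fold <- 10

# Balanced fold labels, shuffled so that fold membership is random
fold <- sample(rep(seq_len(v.fold), length.out = n))

# For fold k, a model would be refit on the other folds and used
# to predict the rows held out in fold k (fitting omitted here).
for (k in seq_len(v.fold)) {
  test.rows  <- which(fold == k)
  train.rows <- which(fold != k)
}

table(fold)  # fold sizes differ by at most one row
```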

device.type

String or vector of strings.  Model validation.  One or more device types for graphical output from model validation diagnostics. 

Current choices:

"default"	default graphics device
"jpeg"

DIAGNOSTICfn

String.  Model validation.  Name used as base to create names for output files from model validation diagnostics.  The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by folde

jpeg.res

Integer. Model validation. Pixels per inch for jpeg plots. The default is 72 dpi, good for on-screen viewing. For printing, the suggested setting is 300 dpi.

device.width

Integer. Model validation. The device width for diagnostic plots in inches.

device.height

Integer. Model validation. The device height for diagnostic plots in inches.
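Because device.width and device.height are given in inches and jpeg.res in pixels per inch, the pixel size of a jpeg plot is their product. The sketch below works this out for the default 7 x 7 inch device at the two resolutions mentioned above; the data frame pixels is only an illustration.

```r
# Pixel dimensions implied by device size (inches) and resolution (dpi)
device.width  <- 7
device.height <- 7
jpeg.res      <- c(screen = 72, print = 300)

pixels <- data.frame(
  res       = jpeg.res,
  width.px  = device.width  * jpeg.res,
  height.px = device.height * jpeg.res
)
pixels
```

At 72 dpi the default device is 504 pixels on each side; at 300 dpi it is 2100 pixels.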

cex

Numeric. Model validation. The cex (character expansion factor) for diagnostic plots.

req.sens

Numeric. Model validation. The required sensitivity for threshold optimization for binary response model evaluation.

req.spec

Numeric. Model validation. The required specificity for threshold optimization for binary response model evaluation.

FPC

Numeric. Model validation. The False Positive Cost for threshold optimization for binary response model evaluation.

FNC

Numeric. Model validation. The False Negative Cost for threshold optimization for binary response model evaluation.
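As a rough sketch of cost-based threshold selection, the code below assumes the threshold is chosen to minimize FPC x (number of false positives) + FNC x (number of false negatives) over a grid of candidate thresholds. This is an assumption made for illustration and may differ from the calculation the package actually performs; obs and pred are made-up data.

```r
# Made-up observed presence/absence and predicted probabilities
obs  <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
pred <- c(0.12, 0.23, 0.37, 0.62, 0.41, 0.56, 0.83, 0.91, 0.28, 0.72)

FPC <- 1  # cost of a false positive
FNC <- 2  # cost of a false negative

thresholds <- seq(0, 1, by = 0.05)

# Total misclassification cost at each candidate threshold
cost <- sapply(thresholds, function(th) {
  called.present <- pred >= th
  FP <- sum(called.present & obs == 0)
  FN <- sum(!called.present & obs == 1)
  FPC * FP + FNC * FN
})

best <- thresholds[which.min(cost)]
best
```

Raising FNC relative to FPC pushes the chosen threshold lower, since missed presences become more expensive than false alarms.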

n.trees

Integer.  SGB models.  The number of stochastic gradient boosting trees for an SGB model. If n.trees=NULL (the default) the model creation code will increase the number of trees 100 at a time until OOB error rate stops improving. The gb

Value

The function returns a data frame containing the row ID and the observed and predicted values.

For Binary response models the predicted probability of presence is returned. 

For Categorical Response models the predicted category (by majority vote) is returned as well as a column for each category giving the probability of that category. If necessary, make.names is applied to the categories to create valid column names.

For Continuous response models the predicted value is returned. 

If prediction.type = "CV" the dataframe also includes a column indicating which cross-validation fold each datapoint was in.
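As a rough illustration of working with the returned data frame, the sketch below computes a per-fold RMSE for a continuous response. The data are made up, and the column names ID, obs, pred and VFold are assumptions for this example; check names() on the actual return value before relying on them.

```r
# Made-up stand-in for the data frame returned with prediction.type = "CV".
# Column names here are assumptions, not guaranteed by the package.
PRED <- data.frame(
  ID    = 1:6,
  obs   = c(10, 12,  9, 15, 11, 14),
  pred  = c(11, 12,  8, 13, 11, 15),
  VFold = c( 1,  2,  3,  1,  2,  3)
)

# Root mean squared error within each cross-validation fold
rmse.by.fold <- sapply(split(PRED, PRED$VFold), function(d) {
  sqrt(mean((d$obs - d$pred)^2))
})
rmse.by.fold
```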

Details

model.diagnostics() takes a model object and makes predictions, runs model diagnostics, and creates graphs and tables of the results.

model.diagnostics() can be run in a traditional R command mode, where all arguments are specified in the function call. It can also be used in a full push-button mode, where you type the simple command model.diagnostics() and GUI pop-up windows ask questions about the type of model, the file locations of the data, and so on.

When running model.diagnostics() on non-Windows platforms, file names and folders need to be specified in the argument list, but other push-button selections are handled by the select.list() function, which is platform independent.

Diagnostic predictions are made by one of four methods (prediction.type = "TEST", "CV", "OOB" or "TRAIN"), and a text file is generated consisting of three columns: observation ID, observed values, and predicted values. If prediction.type = "CV", an additional column indicates which cross-validation fold each observation fell into. If the model's response type is categorical, then in addition to a column giving the category predicted by majority vote, there are columns for each possible response category giving the proportion of trees that predicted that category.

A variable importance graph is made. If response.type = "categorical", category-specific graphs are generated for variable importance. These show how much the model accuracy for each category is affected when the values of each predictor variable are randomly permuted.
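The idea behind permutation importance can be sketched in base R. In the example below a linear model stands in for the forest and the whole simulated data set is used for scoring, so this shows the principle only and is not how the package computes its importance values; x1, x2 and the helper mse() are hypothetical.

```r
# Simulated data: x1 matters a lot, x2 hardly at all
set.seed(38)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 * x1 + 0.1 * x2 + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2)

# Stand-in model (a linear model instead of a random forest)
fit <- lm(y ~ x1 + x2, data = dat)

# Mean squared prediction error on a data set
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)
baseline <- mse(dat)

# Importance = increase in error after shuffling one predictor
importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- dat
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(shuffled) - baseline
})
importance
```

Shuffling x1 breaks its link to the response and inflates the error far more than shuffling x2, which is the pattern a variable importance graph displays.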

If model.type = "RF", the OOB error is plotted as a function of the number of trees in the model. If response.type = "binary" or response.type = "categorical", category-specific graphs are generated for OOB error as a function of the number of trees.

If response.type = "binary", a summary graph is made using the PresenceAbsence package, and *.csv spreadsheets are created of thresholds optimized by several methods, with their associated error statistics and predicted prevalence.
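To make the error statistics concrete, the sketch below computes sensitivity, specificity and predicted prevalence by hand for a single threshold. It uses made-up data and plain base R rather than the PresenceAbsence package, so it illustrates the definitions only.

```r
# Made-up observed presence/absence and predicted probabilities
obs  <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
pred <- c(0.12, 0.23, 0.37, 0.62, 0.41, 0.56, 0.83, 0.91, 0.28, 0.72)

threshold      <- 0.5
called.present <- pred >= threshold

# Proportion of true presences called present
sensitivity <- sum(called.present & obs == 1) / sum(obs == 1)
# Proportion of true absences called absent
specificity <- sum(!called.present & obs == 0) / sum(obs == 0)
# Proportion of all points called present at this threshold
predicted.prevalence <- mean(called.present)

c(sensitivity = sensitivity,
  specificity = specificity,
  predicted.prevalence = predicted.prevalence)
```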

If response.type = "continuous", a scatterplot of observed vs. predicted values is created with a simple linear regression line. The graph is labeled with the slope and intercept of this line as well as Pearson's and Spearman's correlation coefficients.
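The statistics shown on the scatterplot can be reproduced with base R functions. The sketch below uses made-up values and assumes the line is fit with the observed values as the response (obs ~ pred); the direction used by the package is not stated here, so treat that choice as an assumption.

```r
# Made-up observed and predicted values for a continuous response
obs  <- c(10, 12,  9, 15, 11, 14, 20, 18)
pred <- c(11, 12,  8, 13, 11, 15, 17, 18)

# Simple linear regression line (direction obs ~ pred is an assumption)
fit       <- lm(obs ~ pred)
intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])

# Correlation coefficients
pearson  <- cor(obs, pred, method = "pearson")
spearman <- cor(obs, pred, method = "spearman")

round(c(intercept = intercept, slope = slope,
        pearson = pearson, spearman = spearman), 3)
```

A slope near 1 and an intercept near 0 suggest the predictions are not systematically biased, while the correlations summarize how tightly the points follow the line.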

References

Breiman, L. (2001). Random Forests. Machine Learning, 45:5-32.

Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77:802-813.

Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Ann. Stat., 29(5):1189-1232.

Friedman, J.H. (2002). Stochastic gradient boosting. Comput. Stat. Data An., 38(4):367-378.

Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News, 2(3):18-22.

Ridgeway, G. (1999). The state of boosting. Comp. Sci. Stat., 31:172-181.

See Also

get.test, model.build, model.mapmake

Examples

###########################################################################
############################# Run this set up code: #######################
###########################################################################

# set seed:
seed=38

# Define training and test files:

qdata.trainfn = system.file("external", "helpexamples","DATATRAIN.csv", package = "ModelMap")
qdata.testfn = system.file("external", "helpexamples","DATATEST.csv", package = "ModelMap")

# Define folder for all output:
folder=getwd()	

#identifier for individual training and test data points

unique.rowname="ID"


###########################################################################
############## Pick one of the following sets of definitions: #############
###########################################################################


########## Continuous Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_Bio_TC"				

#predictors:
predList=c("TCB","TCG","TCW")	

#define which predictors are categorical:
predFactor=FALSE	

# Response name and type:
response.name="BIO"
response.type="continuous"


########## binary Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_CONIFTYP_TC"				

#predictors:
predList=c("TCB","TCG","TCW")		

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="CONIFTYP"

# This variable is 1 if a conifer or mixed conifer type is present, 
# otherwise 0.

response.type="binary"


########## Continuous Response, Categorical Predictors ############

# In this example, NLCD is a categorical predictor.
#
# You must decide what you want to happen if there are categories
# present in the data to be predicted (either the validation/test set
# or in the image file) that were not present in the original training data.
# Choices:
#       na.action =  "na.omit"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will be
#                    returned as NA.
#       na.action =  "na.roughfix"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will have
#                    the most common category for that predictor substituted,
#                    and then a prediction will be made.

# You must also let R know which of the predictors are categorical, in other
# words, which ones R needs to treat as factors.
# This vector must be a subset of the predictors given in predList

#file name to store model:
MODELfn="RF_BIO_TCandNLCD"			

#predictors:
predList=c("TCB","TCG","TCW","NLCD")

#define which predictors are categorical:
predFactor=c("NLCD")

# Response name and type:
response.name="BIO"
response.type="continuous"



###########################################################################
########################### build model: ##################################
###########################################################################


### create model ###

model.obj = model.build( model.type="RF",
                       qdata.trainfn=qdata.trainfn,
                       folder=folder,		
                       unique.rowname=unique.rowname,	
                       MODELfn=MODELfn,
                       predList=predList,
                       predFactor=predFactor,
                       response.name=response.name,
                       response.type=response.type,
                       seed=seed,
                       na.action="na.roughfix"
)

###########################################################################
#### Then run this code to make validation predictions and diagnostics: ####
###########################################################################


### for Out-of-Bag predictions ###

MODELpredfn<-paste(MODELfn,"_OOB",sep="")
PRED.OOB<-model.diagnostics( model.obj=model.obj,
                       qdata.trainfn=qdata.trainfn,
                       folder=folder,
                       unique.rowname=unique.rowname,
                       # Model Validation Arguments
                       prediction.type="OOB",
                       MODELpredfn=MODELpredfn,
                       device.type=c("default","jpeg","pdf"),
                       na.action="na.roughfix"
)
PRED.OOB

### for Cross-Validation predictions ###

MODELpredfn<-paste(MODELfn,"_CV",sep="")
PRED.CV<-model.diagnostics( model.obj=model.obj,
                       qdata.trainfn=qdata.trainfn,
                       folder=folder,
                       unique.rowname=unique.rowname,
                       seed=seed,
                       # Model Validation Arguments
                       prediction.type="CV",
                       MODELpredfn=MODELpredfn,
                       device.type=c("default","jpeg","pdf"),
                       v.fold=10,
                       na.action="na.roughfix"
)
PRED.CV

### for Independent Test Set predictions ###

MODELpredfn<-paste(MODELfn,"_TEST",sep="")
PRED.TEST<-model.diagnostics( model.obj=model.obj,
                       qdata.testfn=qdata.testfn,
                       folder=folder,
                       unique.rowname=unique.rowname,
                       # Model Validation Arguments
                       prediction.type="TEST",
                       MODELpredfn=MODELpredfn,
                       device.type=c("default","jpeg","pdf"),
                       na.action="na.roughfix"
)
PRED.TEST