model.map: Model Building and Map Making

Description

Create sophisticated models of training data and validate the models with an independent test set, cross validation, or, in the case of Random Forest models, with Out Of Bag (OOB) predictions on the training data. It will create graphs and tables of the model validation results. It will apply these models to GIS .img files of predictors to create detailed prediction surfaces. It will handle large predictor files for map making by reading in the .img files in chunks and writing the predictions for each data chunk to the output .txt file before reading the next chunk of data.

Usage

model.map(model.obj = NULL, model.type = NULL, qdata.trainfn = NULL, qdata.testfn = NULL, folder = NULL, MODELfn = NULL, rastLUT = NULL, rastLUTfn = NULL, rastnmVector = NULL, predList = NULL, predFactor = FALSE, response.name = NULL, response.type = NULL, unique.rowname = NULL, seed = NULL, predict = NULL, MODELpredfn = NULL, na.action = "na.omit", v.fold = NULL, diagnostics = predict, device.type = NULL, DIAGNOSTICfn = NULL, jpeg.res = 72, device.width = 7,  device.height = 7, cex=par()$cex, req.sens, req.spec, FPC, FNC, ntree = 500, mtry = NULL, n.trees = NULL, shrinkage = 0.001, interaction.depth = 10, bag.fraction = 0.5, train.fraction = 1, n.minobsinnode = 10, map = NULL, numrows = 500, map.sd = FALSE, asciifn = NULL, asciifn.mean = NULL, asciifn.stdev = NULL, asciifn.coefv = NULL)

Arguments

model.obj

R model object. The model object to use for prediction, if the model has been previously created. The model object must be of type RF or SGB. (Eventually planned to include "GAM".) If NULL (the default), a model is generated

model.type

String. Model type. "RF" or "SGB". (Eventually planned to include "GAM".) If model.obj is specified, the model.type will be extracted from model.obj, and the argument m

qdata.trainfn

String. The name (full path or base name with path specified by folder) of the training data file used for building the model (file should include columns for both response and predictor variables). The file must be a comma-delimited file <

qdata.testfn

String. The name (full path or base name with path specified by folder) of the independent data set for testing (validating) the model's predictions. The file must be a comma-delimited file ".csv" with column headings and the c

folder

String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If folder = NULL (default), a GUI interface prompts user to browse to a folder. To use the working directory, specify folde

MODELfn

String. The file name to use to save the generated model object. If MODELfn = NULL (the default), a default name is generated by pasting model.type_response.type_response.name. If the other output filenames are left unspecified

rastLUT

Dataframe. A data frame of raster information used to make a map. This data frame can be an R object or read in from a comma-delimited file using the example code below. The rastLUT must be given if a map is desired (map

rastLUTfn

String. The file name (full path or base name with path specified by folder) of a .csv file for a rastLUT. This file must follow the format described above for rastLUT. It is not necessary to specify bo

rastnmVector

Vector of character Strings. The file names and paths (for Imagine Image files) or folder names and paths (for ArcInfo Grids) of rasters to be used to generate a selection list of possible predictors. Only needed if you are creating a model, and pr

predList

String. A character vector of the predictor short names used to build the model. These names must match the column names in the training/test data files and the names in column two of the rastLUT. If predList = NULL (the defau

predFactor

String. A character vector of predictor short names of the predictors from predList that are factors (i.e. categorical predictors). These must be a subset of the predictor names given in predList. Categorical predictors may have

response.name

String. The name of the response variable used to build the model. If response.name = NULL, a GUI interface prompts user to select a variable from the list of column names from training data file. response.name must be column

response.type

String. Response type: "binary" or "continuous". binary response must be binary 0/1 variable with only 2 categories. All zeros will be treated as one category, and everything else will be treated as the second category.
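
The recoding rule for a binary response can be illustrated with a minimal base R sketch. The vector raw and its values are hypothetical, and this only mirrors the rule as described above; it is not taken from the model.map() source.

```r
# Hypothetical response values; only the zero / non-zero distinction matters
raw <- c(0, 1, 3, 0, 12)

# Zeros form one category, everything else forms the second category
binary <- as.integer(raw != 0)
print(binary)
```

Recoding the response yourself before calling model.map() makes the two categories explicit in the training file.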

unique.rowname

String. The name of the unique identifier used to identify each row in the training data. If unique.rowname = NULL, a GUI interface prompts user to select a variable from the list of column names from the training data file. If uniqu

seed

Integer. The number used to initialize randomization to build RF or SGB models. If you want to produce the same model later, use the same seed. If seed = NULL (the default), a new seed is created each run.

predict

Logical. Model validation. If predict = TRUE, validation predictions will be made. If predict = FALSE, no validation predictions will be made. If predict = TRUE, a *.csv file of the unique id, observ

MODELpredfn

String. Model validation. The name of the validation prediction *.csv file. Only used if predict = TRUE. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder speci

na.action

String. Model validation. Specifies the action to take if there are NA values in the prediction data or if there is a level or class of a categorical predictor variable in the validation test set or the production (mapping) data set, but not

v.fold

Integer (or logical FALSE). Model validation. The number of cross validation folds to use when making validation predictions on the training data. If set to v.fold = FALSE and no test data is supplied, validation predictions w

diagnostics

Logical. Model validation. If diagnostics = TRUE, the following diagnostic statistics and graphs will be generated for validation predictions: A variable importance graph is made. If response.type = "binary", a summary

device.type

String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics. Current choices: "default" default graphics device "jpeg"

DIAGNOSTICfn

String. Model validation. Name used as base to create names for output files from model validation diagnostics. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by folde

jpeg.res

Integer. Model validation. Pixels per inch for jpeg plots. The default is 72dpi, good for on screen viewing. For printing, suggested setting is 300dpi.

device.width

Integer. Model validation. The device width for diagnostic plots in inches.

device.height

Integer. Model validation. The device height for diagnostic plots in inches.

cex

Integer. Model validation. The cex for diagnostic plots.

req.sens

Numeric. Model validation. The required sensitivity for threshold optimization for binary response model evaluation.

req.spec

Numeric. Model validation. The required specificity for threshold optimization for binary response model evaluation.

FPC

Numeric. Model validation. The False Positive Cost for threshold optimization for binary response model evaluation.

FNC

Numeric. Model validation. The False Negative Cost for threshold optimization for binary response model evaluation.

ntree

Integer. RF models. The number of random forest trees for a RF model. The default is 500 trees.

mtry

Integer. RF models. Number of variables to try at each node of Random Forest trees. By default, will use the "tuneRF()" function to optimize mtry.

n.trees

Integer. SGB models. The number of stochastic gradient boosting trees for an SGB model. If n.trees=NULL (the default) the model creation code will increase the number of trees 100 at a time until OOB error rate stops improving. The gb

shrinkage

Numeric. SGB models. A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction.

interaction.depth

Integer. SGB models. The maximum depth of variable interactions. interaction.depth = 1 implies an additive model, interaction.depth = 2 implies a model with up to 2-way interactions, etc...

bag.fraction

Numeric. SGB models. bag.fraction must be a number between 0 and 1, giving the fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness in

train.fraction

Numeric. SGB models. The first train.fraction * nrows(data) observations are used to fit the model and the remainder are used for computing out-of-sample estimates of the loss function.
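
Because the split is positional, the order of rows in the training file matters when train.fraction is less than 1. The following base R sketch shows the idea on a hypothetical data frame dat; rounding the row count down with floor() is an assumption made for illustration.

```r
set.seed(38)
train.fraction <- 0.5

# Hypothetical training data with 10 rows
dat <- data.frame(id = 1:10, y = rnorm(10))

# The first block of rows fits the model, the rest estimate out-of-sample loss
n_train <- floor(train.fraction * nrow(dat))   # rounding rule is an assumption
fit_rows <- dat[seq_len(n_train), ]
holdout_rows <- dat[-seq_len(n_train), ]

print(fit_rows$id)
print(holdout_rows$id)
```

If the rows are sorted (for example by location or date), shuffling them first, e.g. dat[sample(nrow(dat)), ], keeps the held-out block representative.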

n.minobsinnode

Integer. SGB models. Minimum number of observations in the trees' terminal nodes. Note that this is the actual number of observations, not the total weight.

map

Logical. Map Production. If map = TRUE, predictions will be made across the extent of the raster layers. If map = FALSE, no predictions will be made. If map = NULL (the default), a GUI window will prompt you to se

numrows

Integer. Map Production. The number of rows to be predicted at a time.

map.sd

Logical. Map Production. If map.sd = TRUE, maps of mean, standard deviation, and coefficient of variation of the predictions from all the trees are generated for each pixel. If map.sd = FALSE (the default), only the predicted

asciifn

String. Map Production. Filename of output file for map production. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by folder. If asciifn = NULL (th

asciifn.mean

String. Map Production. Used if map.sd = TRUE and response.type = "continuous". Filename of output file for mean of trees. The filename can be the full path, or it can be the simple basename, in which case the output will be t

asciifn.stdev

String. Map Production. Used if map.sd = TRUE and response.type = "continuous". Filename of output file for standard deviation of trees. The filename can be the full path, or it can be the simple basename, in which case the ou

asciifn.coefv

String. Map Production. Used if map.sd = TRUE and response.type = "continuous". Filename of output file for coefficient of variation of trees. The filename can be the full path, or it can be the simple basename, in which case

Value

The function will return the model object. Additionally, depending on the options selected, it may also write several different things to disk, in the folder specified by folder. These include:
model.obj = NULL: the R model object
predict = TRUE: .csv file of observed and predicted values
diagnostics = TRUE: variable importance and summary graphs of the file type specified by device.type (i.e. .jpg, .ps, or .emf files)
diagnostics = TRUE & response.type = "binary": .csv file of thresholds optimized by multiple different criteria
map = TRUE: .txt files of map information suitable to be imported into GIS
In all cases: .txt file giving the values of each argument, as chosen from the GUI prompts, used for the function call

Details

This package provides a push-button approach to complex model building and production mapping. It contains two functions: a simple function, get.test(), that can be used to randomly divide a training dataset into training and test/validation sets; and the workhorse "do everything" function, nicknamed "The Button" and called with model.map().

model.map() can be run in a traditional R command mode, where all arguments are specified in the function call. It can also be used in a full push-button mode, where you type the simple command model.map() and GUI pop-up windows ask questions about the type of model, the file locations of the data, etc. When running model.map() on non-Windows platforms, file names and folders need to be specified in the argument list, but other push-button selections are handled by the select.list() function, which is platform independent.

Random Forest is implemented through the randomForest package within R. Random Forest is more user friendly than Stochastic Gradient Boosting, as it has fewer parameters to be set by the user and is less sensitive to the tuning of these parameters. A Random Forest model consists of multiple trees that vote on predictions. For each tree, a random subset of the training data is used to construct the tree, with the remaining data points used to construct out-of-bag (OOB) error estimates. At each node of the tree, a random selection of predictors is chosen to determine the split. The number of predictors used to select the splits (argument mtry) is the primary user-specified parameter that can affect model performance. By default, this parameter is automatically optimized using the tuneRF() function. Random Forest will not overfit the data as more trees are added, so the only penalty of increasing the number of trees is computation time.
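
The two sources of randomness in a Random Forest tree can be sketched in base R without fitting a model. This is an illustration only: it assumes the tree's training subset is drawn by bootstrap sampling with replacement, and the sample size n and mtry = 2 are hypothetical values (predictor names are borrowed from the examples on this page).

```r
set.seed(38)
n <- 1000                                   # hypothetical number of training rows
predictors <- c("TCB", "TCG", "TCW", "NLCD")
mtry <- 2                                   # hypothetical value, not a tuned one

# Rows drawn for one tree (bootstrap sample, assumed here)
in_bag <- sample(n, size = n, replace = TRUE)

# Rows never drawn are "out of bag" and can be used to estimate error
oob <- setdiff(seq_len(n), in_bag)
oob_fraction <- length(oob) / n             # typically a bit over one third

# At a single node, only a random subset of predictors is considered
node_candidates <- sample(predictors, mtry)

print(oob_fraction)
print(node_candidates)
```

Because every tree leaves out a different set of rows, each training point gets OOB predictions from the trees that did not see it, which is why no separate test set is strictly required for Random Forest validation.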

Random Forest can compute variable importance, an advantage over some "black box" modeling techniques if it is important to understand the ecological relationships underlying a model (Breiman, 2001).

Stochastic gradient boosting (Friedman 2001, 2002) is related to both boosting and bagging. Many small classification or regression trees are built sequentially from "pseudo"-residuals (the gradient of the loss function of the previous tree). At each iteration, a tree is built from a random sub-sample of the dataset (selected without replacement), producing an incremental improvement in the model. Using only a fraction of the training data increases both the computation speed and the prediction accuracy, while also helping to avoid over-fitting the data. An advantage of stochastic gradient boosting is that it is not necessary to pre-select or transform predictor variables. It is also resistant to outliers, as the steepest gradient algorithm emphasizes points that are close to their correct classification. Stochastic gradient boosting is implemented through the gbm package within R.

One disadvantage of Stochastic Gradient Boosting, compared to Random Forest, is the increased number of user-specified parameters, and SGB models tend to be more sensitive to these parameters. Model fitting parameter options include distribution, interaction depth, bagging fraction, shrinkage rate, and training fraction. Values for these parameters other than the defaults cannot be set by point and click in the GUI pop-up windows; they must be set in the argument list when calling model.map(). Friedman (2001, 2002) and Ridgeway (1999) provide guidelines on appropriate settings for model fitting options. Also, unlike Random Forest models, in Stochastic Gradient Boosting there is a penalty for using too many trees.

The default behavior in model.map() is to increase the number of trees 100 at a time until the model stops improving, then call the gbm function gbm.perf(method="OOB") to select the best number of iterations. Alternatively, the model.map() argument n.trees can be used to set some large number of trees to be calculated all at once and, again, the gbm.perf(method="OOB") function will be used to select the best number of trees. Note that the gbm package warns that

"OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive."

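The "add 100 trees at a time until the model stops improving" rule can be sketched as a simple loop. This is not the model.map() implementation: error_at() is a hypothetical stand-in for the error of a fitted SGB model at a given number of trees, chosen so that the error is lowest at 1000 trees.

```r
# Hypothetical error curve: falls until 1000 trees, then rises again
error_at <- function(n_trees) (n_trees / 1000 - 1)^2 + 0.1

step <- 100
n_trees <- step
best_error <- error_at(n_trees)

repeat {
  candidate <- n_trees + step
  candidate_error <- error_at(candidate)
  # Stop as soon as another batch of trees no longer lowers the error
  if (candidate_error >= best_error) break
  n_trees <- candidate
  best_error <- candidate_error
}

print(n_trees)
```

In the real workflow the final choice among the fitted iterations is delegated to gbm.perf(method="OOB"), so the loop only decides when to stop growing the model.
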
The gbm package offers two alternative techniques for calculating the best number of trees, but these are not yet implemented in the ModelMap package, as they require the use of a formula interface for model building.

For Presence-Absence data, the package PresenceAbsence is used for model validation.

For map making, the package rgdal is used to read spatial rasters, such as .img files, into R. The data for production mapping should be in the form of pixel-based raster layers representing the predictors in the model. If there is more than one predictor or raster layer, the layers must all have the same number of columns and rows. The layers must also have the same extent, projection, and pixel size, for effective model development and accuracy. The layers must also be in either ESRI Grid or ERDAS Imagine image (single or multi-band) raster data formats, having continuous or categorical data values.

When creating maps of non-rectangular study regions, there may be large portions of the rectangle where you have no predictors and are uninterested in making predictions. The suggested value for the pixels outside the study area is -9999. These pixels will be ignored in the predictions, thus saving computing time, and will be exported as -9999. Any value other than -9999 will be treated as a legal data value, and a prediction will be generated for each such pixel. Note: in Imagine image files, if the specified NODATA is set as -9999, any -9999 pixels will be read into R as NA, and if na.action = "na.roughfix", predictions will be attempted for these pixels. This will cause the computation time to increase, and these predictions will need to be masked out when the final map is imported back into a GIS system.

The function model.map() outputs an ASCII grid file of map information suitable to be imported into a GIS. Small maps can also be imported back into R using the function read.asciigrid() from the sp package.
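
The combination of chunked processing (argument numrows) and skipping of -9999 pixels can be sketched in base R. This is a simplified illustration under stated assumptions: the raster is a small hypothetical matrix rather than an .img file, and predict_fn() is a stand-in for the model's prediction step.

```r
nodata <- -9999
numrows <- 2          # rows per chunk, kept small for illustration

# Hypothetical single-band predictor raster (5 rows x 4 columns)
rast <- matrix(c(    1,     2, -9999,     4,
                     5, -9999,     7,     8,
                     9,    10,    11, -9999,
                 -9999,    14,    15,    16,
                    17,    18,    19,    20),
               nrow = 5, byrow = TRUE)

# Stand-in for the model prediction; any vectorised function works here
predict_fn <- function(x) x * 2

out <- matrix(nodata, nrow = nrow(rast), ncol = ncol(rast))

for (start in seq(1, nrow(rast), by = numrows)) {
  rows <- start:min(start + numrows - 1, nrow(rast))
  chunk <- rast[rows, , drop = FALSE]

  # Predict only for pixels inside the study area
  valid <- chunk != nodata
  pred <- matrix(nodata, nrow = nrow(chunk), ncol = ncol(chunk))
  pred[valid] <- predict_fn(chunk[valid])

  out[rows, ] <- pred   # in model.map() each chunk is written to the output file instead
}

print(out)
```

Only one chunk of predictor values is held at a time, which is why lowering numrows reduces peak memory use for large rasters.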

References

Breiman, L. (2001). Random Forests. Machine Learning, 45:5-32.
Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Ann. Stat., 29(5):1189-1232.
Friedman, J.H. (2002). Stochastic gradient boosting. Comput. Stat. Data An., 38(4):367-378.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News, 2(3):18-22.
Ridgeway, G. (1999). The state of boosting. Comp. Sci. Stat., 31:172-181.

Examples

###########################################################################
############################# Run this set up code: #######################
###########################################################################

# set seed:
seed=38

# Define training and test files:

qdata.trainfn = paste(system.file(package="ModelMap"),"/external/DATATRAIN.csv",sep="")
qdata.testfn  = paste(system.file(package="ModelMap"),"/external/DATATEST.csv",sep="")

# Define folder for all output:
folder=getwd()	

# Create a list of the filenames (including paths) for the rast Look up Tables:
rastLUTfn=list( paste(system.file(package="ModelMap"),"/external/LUT_2001.csv",sep=""),
                paste(system.file(package="ModelMap"),"/external/LUT_2004.csv",sep=""))


# Load rast LUT tables, and add path to the filenames in column 1:
rastLUT<-lapply(rastLUTfn, function(x){	y <- read.table(x,header=FALSE,sep=",",stringsAsFactors=FALSE)
                                        y[,1] <- paste(system.file(package="ModelMap"),"external",y[,1],sep="/")
                                        return(y)})

# Define identifier for individual training and test data points:
unique.rowname="ID"

# Define Number of rows of raster to read in at one time
# if crashes with warning: "unable to assign..." lower this number

numrows=500


###########################################################################
############## Pick one of the following sets of definitions: #############
###########################################################################


########## Continuous Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_Bio_TC"

#file name for validation predictions:						
MODELpredfn="RF_Bio_TC_PRED.csv"				

#names from column 2 of rastLUT:
predList=c("TCB","TCG","TCW")	

#define which predictors are categorical:
predFactor=FALSE	

# Response name and type:
response.name="BIO"
response.type="continuous"


# Map name:
asciifn<-c("RF_Bio_TC_01.txt","RF_Bio_TC_04.txt")
asciifn<-paste(folder,asciifn,sep="/")

########## binary Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_CONIFTYP_TC"

#file name for validation predictions:						
MODELpredfn="RF_CONIFTYP_TC.csv"				

#names from column 2 of rastLUT:
predList=c("TCB","TCG","TCW")		

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="CONIFTYP"

# This variable is 1 if a conifer or mixed conifer type is present, 
# otherwise 0.

response.type="binary"


# Map name:
asciifn<-c("RF_CONIFTYP_TC_01.txt","RF_CONIFTYP_TC_04.txt")
asciifn<-paste(folder,asciifn,sep="/")

########## Continuous Response, Categorical Predictors ############

# In this example, NLCD is a categorical predictor.
#
# You must decide what you want to happen if there are categories
# present in the data to be predicted (either the validation/test set
# or in the image file) that were not present in the original training data.
# Choices:
#       na.action = "na.omit"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will be
#                    returned as NA.
#       na.action = "na.roughfix"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will have
#                    the most common category for that predictor substituted,
#                    and then a prediction will be made.

# You must also let R know which of the predictors are categorical, in other
# words, which ones R needs to treat as factors.
# This vector must be a subset of the predictors given in predList

#file name to store model:
MODELfn="RF_BIO_TCandNLCD"

#file name for validation predictions:						
MODELpredfn="RF_BIO_TCandNLCD_PRED.csv"				

#names from column 2 of rastLUT:
predList=c("TCB","TCG","TCW","NLCD")

#define which predictors are categorical:
predFactor=c("NLCD")

# Response name and type:
response.name="BIO"
response.type="continuous"


# Map name:
asciifn<-c(	"RF_BIO_TCandNLCD_01.txt","RF_BIO_TCandNLCD_04.txt")
asciifn<-paste(folder,asciifn,sep="/")

###########################################################################
############### Then run this code to build the model: ####################
###########################################################################


### create model before batching (only run this code once ever!) ###

model.obj = model.map( model.obj=NULL,
                       model.type="RF",
                       qdata.trainfn=qdata.trainfn,
                       qdata.testfn=qdata.testfn,
                       folder=folder,		
                       MODELfn=MODELfn,
                       rastLUT=rastLUT[[1]],
                       predList=predList,
                       predFactor=predFactor,
                       response.name=response.name,
                       response.type=response.type,
                       unique.rowname=unique.rowname,
                       seed=seed,
                # Model Validation Arguments
                       predict=FALSE,
                # Mapping arguments
                       map=FALSE
)

###########################################################################
### Then run this code to make validation predictions and diagnostics: ####
###########################################################################

model.obj = model.map( model.obj=model.obj,
                       qdata.trainfn=qdata.trainfn,
                       qdata.testfn=qdata.testfn,   #set qdata.testfn=FALSE to use OOB on training data
                       folder=folder,		
                       MODELfn=MODELfn,
                       rastLUT=rastLUT[[1]],
                       predList=predList,
                       predFactor=predFactor,
                       response.name=response.name,
                       response.type=response.type,
                       unique.rowname=unique.rowname,
                       seed=seed,
                # Model Validation Arguments
                       predict=TRUE,
                       diagnostics=TRUE,
                       DIAGNOSTICfn=MODELfn,
                       device.type=c("jpeg","pdf"),	
                       MODELpredfn=MODELpredfn,
                       v.fold=FALSE,
                       na.action="na.roughfix",
                # Mapping arguments
                       map=FALSE
)


###########################################################################
################# Then Run this code to create maps: ######################
###########################################################################

### button - Batch (must have already created model) ###

load(paste(folder,"/",MODELfn,sep=""))

for(i in 1:length(rastLUTfn)){

        print("##########################################################")
        print(paste("Starting",asciifn[i]))
        print("##########################################################")

        model.obj = model.map( model.obj=model.obj,
                               folder=folder,		
                               rastLUT=rastLUT[[i]],
                               seed=seed,
                         # Model Validation Arguments
                               predict=FALSE,	
                               na.action="na.roughfix",
                         # Mapping arguments
                               map=TRUE,
                               numrows = numrows,						
                               asciifn=asciifn[i]
                               )
}
