model.explore: Exploratory data analysis

Description

Graphically explores the relationships between the training data and the predictor rasters.

Usage

model.explore(qdata.trainfn = NULL, folder = NULL, predList = NULL, 
predFactor = FALSE, response.name = NULL, response.type = NULL, 
response.colors = NULL, unique.rowname = NULL, OUTPUTfn = NULL, 
device.type = NULL, allow.default.graphics=FALSE, res=NULL, jpeg.res = 72, 
MAXCELL=100000, device.width = NULL, device.height = NULL, units="in", 
pointsize=12, cex=1, rastLUTfn = NULL, create.extrapolation.masks = FALSE, 
na.value = -9999, col.ramp = rainbow(101, start = 0, end = 0.5), 
col.cat = palette()[-1])

Arguments

qdata.trainfn

String. The name (full path or base name with path specified by folder) of the training data file used for building the model (file should include columns for both response and predictor variables). The file must be a comma-delimited file <

folder

String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If folder = NULL (default), a GUI interface prompts user to browse to a folder. To use the working directory, specify folde

predList

String. A character vector of the predictor short names used to build the model. These names must match the column names in the training/test data files and the names in column two of the rastLUT. If predList = NULL (the defau

predFactor

String. A character vector of predictor short names of the predictors from predList that are factors (i.e., categorical predictors). These must be a subset of the predictor names given in predList. Categorical predictors may have

response.name

String. The name of the response variable used to build the model. If response.name = NULL, a GUI interface prompts user to select a variable from the list of column names from training data file. response.name must be column

response.type

String. Response type: "binary", "categorical" or "continuous". A binary response must be a binary 0/1 variable with only 2 categories. All zeros will be treated as one category, and everything else will be treated as

response.colors

Data frame. A two-column data frame. Column names must be: category, the response categories; and color, the colors associated with each category.
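
As a minimal sketch of the structure described above, a response.colors data frame for a categorical response might be built like this. The category labels and color choices are hypothetical and are not taken from the package's example data.

```r
# Hypothetical response categories and colors, for illustration only
response.colors <- data.frame(
  category = c("CONIFER", "DECIDUOUS", "SHRUB"),
  color    = c("darkgreen", "orange", "tan"),
  stringsAsFactors = FALSE
)

# The two column names must be exactly "category" and "color"
names(response.colors)
```

Any color specification that R accepts (color names or hex strings) should work in the color column, although this sketch only uses named colors.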

unique.rowname

String. The name of the unique identifier used to identify each row in the training data. If unique.rowname = NULL, a GUI interface prompts user to select a variable from the list of column names from the training data file. If unique

OUTPUTfn

String. Filename that output file names will be based on.

device.type

String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics. Current choices: "default" default graphics device "jpeg"

allow.default.graphics

Logical. Should the default on-screen graphics device be allowed? USE WITH CAUTION! These graphics are complicated and slow to produce. If the on-screen default graphics device is moved or closed before the plot is completed it can crash the entire R ses

res

Integer. Model validation. Pixels per inch for jpeg, png, and tiff plots. The default is 72 dpi, good for on-screen viewing. For printing, the suggested setting is 300 dpi.

jpeg.res

Integer. Graphical output. Deprecated. Ignored unless res is not provided.

MAXCELL

Integer. Graphical output. The maximum number of raster cells used to create the graphical output. Rasters larger than this value will be subsampled for the graphical maps and figures. The default value of MAXCELL=100000 is generally a good

device.width

Integer. Model validation. The device width for diagnostic plots in inches.

device.height

Integer. Model validation. The device height for diagnostic plots in inches.

units

Model validation. The units in which device.height and device.width are given. Can be "px" (pixels), "in" (inches, the default), "cm" or "mm".

pointsize

Integer. Model validation. The default pointsize of plotted text, interpreted as big points (1/72 inch) at res ppi

cex

Integer. Model validation. The cex for diagnostic plots.

rastLUTfn

String. The file name (full path or base name with path specified by folder) of a .csv file for a rastLUT. Alternatively, a dataframe containing the same information. The rastLUT must include 3 columns:

create.extrapolation.masks

Logical. If TRUE, then the raster brick containing the masks for all predictors from predList is saved as an image file. The layers in this file will be in the same order as the predictors in predList

na.value

Value used in rasters to indicate NA. Note this value is only used for NA values in the predictor rasters. Note: all predictor rasters must use the same value for NA. NA values in the training data sho

col.ramp

Color ramp to use for continuous predictors

col.cat

Vector. Vector of colors to use for categorical predictors.

Value

The function does not return a value, but it does create files. Graphical files are created for each predictor variable, with the file type determined by device.type. In addition, if create.extrapolation.masks = TRUE, an extrapolation mask raster is created for each predictor, as well as an overall extrapolation mask. Each mask has the value 1 for pixels with predictor values within the range of the training data (or, for categorical predictors, categories found in the training data), and the value 0 for pixels outside the range of the training data, categories not found in the training data, or NA values. The overall extrapolation mask has the value 0 for a pixel if any of the predictors is extrapolated at that pixel. Note that this option is much slower to run.
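
The masking rule can be sketched in base R on small matrices standing in for rasters. This is only an illustration of the 1/0 logic described here, not the package's internal implementation; the helper names mask_continuous and mask_categorical and all data values are hypothetical.

```r
# Hypothetical training values for two continuous and one categorical predictor
train <- data.frame(
  TCB  = c(10, 20, 30),
  TCG  = c(1, 5, 9),
  NLCD = c(41, 42, 41)
)

# Tiny 2 x 3 matrices standing in for predictor rasters (filled column by column)
rast <- list(
  TCB  = matrix(c(12, 35, 25, NA, 10, 30), nrow = 2),
  TCG  = matrix(c(2, 3, 11, 4, 9, 0), nrow = 2),
  NLCD = matrix(c(41, 42, 41, 52, 41, 42), nrow = 2)
)

# 1 where a cell lies within the training range, 0 otherwise (NA cells become 0)
mask_continuous <- function(x, train_values) {
  rng <- range(train_values, na.rm = TRUE)
  m <- !is.na(x) & x >= rng[1] & x <= rng[2]
  storage.mode(m) <- "integer"
  m
}

# 1 where a cell's category occurs in the training data, 0 otherwise
mask_categorical <- function(x, train_values) {
  m <- as.integer(!is.na(x) & x %in% train_values)
  dim(m) <- dim(x)
  m
}

masks <- list(
  TCB  = mask_continuous(rast$TCB, train$TCB),
  TCG  = mask_continuous(rast$TCG, train$TCG),
  NLCD = mask_categorical(rast$NLCD, train$NLCD)
)

# Overall mask: 0 if any predictor is extrapolated at that cell
overall <- Reduce(`*`, masks)
overall
```

Multiplying the per-predictor masks is one simple way to express "0 if any predictor is extrapolated", since a single 0 forces the product to 0.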

Details

The model.explore function is intended to aid with preliminary data exploration before model building. It includes graphical tools to explore the relationships between the training data (both predictors and responses) as well as the predictor rasters.

It uses the corrplot package to create a correlation plot of the continuous predictors. This can aid in interpreting the model.importance.plot output from the models, as Random Forest models divide importance between correlated predictors, while Stochastic Gradient Boosting models assign the majority of the importance to the correlated predictor that is used earliest in the model.

The model.explore function can also aid in identifying whether additional training data is needed. For example, the maps of the extrapolation masks for the predictor rasters help spot areas of the map where the pixels lie outside the range of the training data, so that any model predictions there will be extrapolations and possibly unreliable. The user can decide either to collect additional training data or to mask out these areas of the final prediction output of model.mapmake.

To increase speed, the default behavior for large predictor rasters is to create the graphics from subsampled rasters. (Note: for categorical predictors, the full raster is always used to identify all categories found in the map area.) If create.extrapolation.masks=TRUE, then the full rasters are used for the extrapolation masks, regardless of the size of the rasters. This option runs much slower, as large rasters need to be read into R a block at a time.
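
To make the point about correlated predictors concrete, the following sketch computes a correlation matrix for the continuous predictors of a simulated training set using base R's cor(). The simulated data are hypothetical, and the optional corrplot call only illustrates the kind of plot involved; it is not a claim about how model.explore calls corrplot internally.

```r
set.seed(1)

# Simulated training data: TCG is deliberately correlated with TCB
train <- data.frame(TCB = rnorm(50))
train$TCG  <- 0.8 * train$TCB + rnorm(50, sd = 0.3)
train$TCW  <- rnorm(50)
train$NLCD <- factor(sample(c(41, 42, 52), 50, replace = TRUE))

predList   <- c("TCB", "TCG", "TCW", "NLCD")
predFactor <- c("NLCD")

# Only continuous predictors enter the correlation matrix
cont    <- setdiff(predList, predFactor)
cor_mat <- cor(train[, cont], use = "pairwise.complete.obs")
round(cor_mat, 2)

# Optional: draw the matrix if the corrplot package happens to be installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat)
}
```

A strong off-diagonal entry, such as the one between TCB and TCG here, flags a pair of predictors whose importance scores should be interpreted together rather than individually.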

Examples


###########################################################################
############################# Run this set up code: #######################
###########################################################################

###Define training data file:
qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")

###Define folder for all output:
folder=getwd()	

###identifier for individual training and test data points
unique.rowname="ID"			

###predictors:
predList=c("TCB","TCG","TCW","NLCD")

###define which predictors are categorical:
predFactor=c("NLCD")

###Create the filename (including path) for the raster look up table ###
rastLUTfn.2001 <- system.file(	"extdata",
				"helpexamples",
				"LUT_2001.csv",
				package="ModelMap")

###Load rast LUT table, and add path to the predictor raster filenames in column 1 ###
rastLUT.2001 <- read.table(rastLUTfn.2001,header=FALSE,sep=",",stringsAsFactors=FALSE)

for(i in 1:nrow(rastLUT.2001)){
	rastLUT.2001[i,1] <- system.file("extdata",
					"helpexamples",
					rastLUT.2001[i,1],
					package="ModelMap")
}                                 

#################Continuous Response###################

###Response name and type:
response.name="BIO"
response.type="continuous"

###base file name for output files:
OUTPUTfn="BIO_TCandNLCD.img"

###run model.explore

model.explore(	qdata.trainfn=qdata.trainfn,
		folder=folder,		
		predList=predList,
		predFactor=predFactor,

		response.name=response.name,
		response.type=response.type,
	
		unique.rowname=unique.rowname,

		OUTPUTfn=OUTPUTfn,
		device.type="jpeg",
		res=144,


		# Raster arguments
		rastLUTfn=rastLUT.2001,
		na.value=-9999,

		# colors for continuous predictors
		col.ramp=rainbow(101,start=0,end=.5), 
		# colors for categorical predictors
		col.cat=c("wheat1","springgreen2","darkolivegreen4",
			  "darkolivegreen2","yellow","thistle2",
			  "brown2","brown4")
)
