## S3 method for class 'forestFloor':
plot(x,
     plot_seq = NULL,
     limitY = TRUE,
     order_by_importance = TRUE,
     cropXaxes = NULL,
     crop_limit = 4,
     compute_GOF = FALSE,
     plot_GOF = NULL,
     GOF_col = "#33333399",
     ...)
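For illustration, a call using several of these arguments could look as follows - a sketch only, assuming a forestFloor object ff as constructed in the examples below:

plot(ff,
     plot_seq = 1:4,         #plot only the first four feature panels
     cropXaxes = c(1,2),     #crop the x-axes of panels 1 and 2...
     crop_limit = 3,         #...at 3 standard deviations
     compute_GOF = TRUE,     #compute a goodness-of-fit per panel
     GOF_col = "#33333399")  #colour of the plotted GOF fit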
Why not just plot the raw data as-is, in various ways? Because even when the noise component is low, the plotted 2D shadow of multiple non-linear interactions can look merely random. randomForest provides an interpretation of what the signal is, and of the individual contributions/components related to each variable, which would otherwise remain a messy, indistinguishable cloud of interactions, additive effects and noise.
system = signal + noise = additive component + interacting components + noise
This pseudo-equation explains how a high-dimensional system can be segregated into smaller parts which in turn can be visualized. The random forest provides an interpretation of what is signal, and the feature contributions of the individual feature-related components. Additive components can be plotted as-is by this function. Interacting components must be plotted in the context of two or more features, see show3d_new.
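As a minimal sketch of this pseudo-equation (illustrative names, not part of the package), a system with one additive and one interacting component can be simulated as:

n = 500
X = data.frame(replicate(3,rnorm(n)))
additive = X$X1^2                   #additive component, a function of one feature
interacting = X$X2 * X$X3           #interacting component, a function of two features
noise = rnorm(n,sd=0.5)             #noise component
y = additive + interacting + noise  #the full system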
The root of the computation of these partial interpretations is that tree-based models recursively process the data univariately - one decision by one variable at a time. This leads to a multitude of interpretations, where randomForest is the average of many trees and the feature contributions are the sampled topology of this fit to the data. Secondly, random forest is robust and thereby removes a good deal of the noise component: random trends will average each other out across many trees, whereas reproducible trends will stay. Furthermore, the default parameter set of the randomForest algorithm is stable and does not need much optimising. Other methods may later supersede it in prediction performance; if the underlying system is near linear, GLM will always perform better. But before blindly choosing between e.g. RF, SVM and GLM solely by prediction performance, an idea of the expected system topology should be highly appreciated. For an archetypal 'conservative classical statistician', this package would be a fun-park trip, only inspiring which models should be tested for significance. Could any of the variables be transformed to become linear? What interaction terms should be included?
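A useful property of this decomposition is that the feature contributions, stored column-wise per feature in the FCmatrix component of a forestFloor object, should sum row-wise to roughly the out-of-bag prediction minus the grand mean. A hedged sketch, assuming the ff, rfo and Y objects from the examples below:

FCsum = rowSums(ff$FCmatrix) + mean(Y)  #contributions plus grand mean
plot(FCsum, rfo$predicted)              #points should fall near the identity line
abline(0,1,col="red")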
Interpreting partial feature/variable contributions can be an inspiring or alluring tool. Plausible causality links come to mind when interactions are mapped. Remember to also consider reverse and external causality, or that sampling was not truly independent, e.g. Simpson's paradox. Furthermore, observed interactions can be due to general statistical issues such as high collinearity, low sample size, non-independent sampling and more. Often some usual suspects can be ruled out as implausible, while others remain similarly plausible and only further testing could possibly tell. Remember these topology maps are created by just another opportunistic algorithm, only trying to please its loss function.

When data is sparse, randomForest tends to yield topologies with few interactions and with steep sigmoid partial functions. This does not mean the underlying system is exactly as such, just that no more complex interpretation was stable and reproducible across the many trees. Furthermore, tree models do not extrapolate outside the feature space of the training set: any data point outside will be predicted as the most resembling data points inside the feature space. The amplitude of the suggested partial functions is lowered near the border of feature space until it becomes a flat line. This makes linear effects look more or less sigmoid. The effect is very well displayed with simulated sine functions, which have large amplitudes in the center of feature space and gradually diminishing amplitude close to its border. Normally distributed variables also increase this soft-border effect. In high dimensions, the feature space of uncorrelated variables will have a low density, and if the variables are also normally distributed, the density of data points at the borders of feature space will be even more sparse.

Personally, I think that any data-driven model should return to conservative estimates in low-density areas of feature space, which is what randomForest generally does. Nonetheless, as randomForest is not completely robust, some complex topologies can occasionally emerge by chance in a low-density area of feature space. Here it is important to look at the number of data points driving this specific topology: given a prior belief, is this a plausible and/or probable topology? On the other hand, when firm topologies are found in areas with many data points, the law of large numbers must apply, and these should be emphasised more strongly.
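The no-extrapolation behaviour described above can be seen directly in a small sketch (illustrative code, not part of the package): a random forest fitted to a purely linear signal flattens out beyond the training range:

x = data.frame(x=runif(300,-1,1))
y = 2*x$x + rnorm(300,sd=0.1)
rf = randomForest::randomForest(x,y)
xnew = data.frame(x=seq(-2,2,0.05))        #extends beyond the training range
plot(xnew$x, predict(rf,xnew), type="l")   #prediction flattens outside [-1,1]
lines(xnew$x, 2*xnew$x, col="red")         #the true linear signal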
#simulate data
obs=1000
vars = 6
X = data.frame(replicate(vars,rnorm(obs)))
Y = with(X, X1^2 + sin(X2*pi) + 2 * X3 * X4 + 0.5 * rnorm(obs))
#grow a forest, remember to include the inbag samples (keep.inbag=TRUE)
rfo=randomForest::randomForest(X,Y,keep.inbag=TRUE)
#compute topology
ff = forestFloor(rfo,X)
#print forestFloor
print(ff)
#plot partial functions of most important variables first
plot(ff,order_by_importance=TRUE)
#non-interacting functions are well displayed, whereas X3 and X4 are not
#by applying a different colour gradient, interactions reveal themselves
Col=fcol(ff,3,orderByImportance=FALSE)
plot(ff,col=Col)
#in 3D the interaction between X3 and X4 reveals itself completely
show3d_new(ff,3:4,col=Col)
#although there is no interaction, the joint additive effect of X1 and X2
#can also be informative to display in 3D
Col = fcol(ff,2,X.matrix=FALSE,orderByImportance=FALSE)
plot(ff,col=Col) #use plot first to define colours
show3d_new(ff,1:2,
col=Col,
plot.rgl=list(size=4),
surf.rgl=list(col=c("red","green")))