plotMS: Model series plot

Description

This function produces a plot for examining how different model specifications (the design) affect the predictive performance of the models in a model series. It is generally used to assess the results of functions buildMS and statsMS, but can easily be adapted to work with any model structure and performance measure.

Usage

plotMS(obj, grid, line, ind, type = c("b", "g"), pch = c(20, 2),
  size = 0.5, arrange = "desc", color = NULL, xlim = NULL,
  ylab = NULL, xlab = NULL, at = NULL, ...)

Arguments

obj

Object of class data.frame, generally returned by statsMS, containing 1) a series of performance statistics of several models, and 2) the design information of each model. See ‘Details’ for more information.

grid

Vector of integer values or character strings indicating the columns of the data.frame containing the design data which will be gridded using the function levelplot. See ‘Details’ for more information.

line

Character string or integer value indicating which of the performance statistics (usually calculated by statsMS) should be plotted using the function xyplot. See ‘Details’ for more information.

ind

Integer value indicating for which group of models the mean rank is to be calculated. See ‘Details’ for more information.

type

Vector of character strings indicating some of the effects to be used when plotting the performance statistics using xyplot. Defaults to type = c("b", "g"). See panel.xyplot for more information on how to set this argument.

pch

Vector with two integer values specifying the symbols to be used to plot points. The first sets the symbol used to plot the performance statistic, while the second sets the symbol used to plot the mean rank of the indicator set using argument ind. Defaults to pch = c(20, 2). See points for possible values and their interpretation.

size

Numeric value specifying the size of the symbols used for plotting the mean rank of the indicator set using argument ind. Defaults to size = 0.5. See grid.points for more information.

arrange

Character string indicating how the model series should be arranged, which can be in ascending (asc) or descending (desc) order. Defaults to arrange = "desc". See arrange for more information.

color

Vector defining the colors to be used in the grid produced by function levelplot. If NULL, defaults to color = cm.colors(n), where n is the number of unique values in the columns defined by argument grid. See cm.colors to see how to use other color palettes.

xlim

Numeric vector of length 2, giving the x coordinates range. If NULL (which is the recommended value), defaults to xlim = c(0.5, dim(obj)[1] + 0.5). This is, so far, the optimum range for adequate plotting.

ylab

Character vector of length 2, giving the y-axis labels. When obj is a data.frame returned by statsMS, and the performance statistic passed to argument line is one of those calculated by statsMS ("candidates", "df", "aic", "rmse", "nrmse", "r2", "adj_r2" or "ADJ_r2"), the function tries to automatically identify the correct ylab.

xlab

Character vector of length 1, giving the x-axis label. Defaults to xlab = "Model ranking".

at

Numeric vector indicating the location of tick marks along the x axis (in native coordinates).

...

Other arguments for plotting, although most of these have not been tested. Argument asp, for example, is not effective since the function automatically identifies the best aspect for plotting based on the dimensions of the design data.

Value

An object of class "trellis" consisting of a model series plot.

Warning

Use the original functions xyplot and levelplot for higher customization.

Details

This section gives more details about arguments obj, grid, line, arrange, and ind.

obj

The argument obj is usually a data.frame returned by statsMS. However, the user can use any data.frame object as long as it contains the two basic units of information needed:

design data passed with argument grid
performance statistic passed with argument line

grid

The argument grid indicates the design data which is used to produce the grid output in the top of the model series plot. By design we mean the data that specify the structure of each model and how they differ from each other. Suppose that eight linear models were fit using three types of predictor variables (a, b, and c). Each of these predictor variables is available in two versions that differ by their accuracy, where 0 means a less accurate predictor variable, while 1 means a more accurate predictor variable. This yields 2^3 = 8 total possible combinations. The design data would be of the following form:

> design
  a b c
1 0 0 0
2 0 0 1
3 0 1 0
4 1 0 0
5 0 1 1
6 1 0 1
7 1 1 0
8 1 1 1
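As a minimal sketch of how such a design could be enumerated in R, the block below uses base R's expand.grid to list every combination of two versions of three predictors. The object name design is only illustrative, and the row order produced by expand.grid differs from the listing above; only the set of combinations is the same.

```r
# Enumerate all 2^3 combinations of the less accurate (0) and
# more accurate (1) version of each predictor variable.
# Note: expand.grid varies the first column fastest, so the row
# order is not the same as in the listing above.
design <- expand.grid(a = 0:1, b = 0:1, c = 0:1)

# One row per model, one column per predictor variable
nrow(design)
design
```

Any data.frame with one row per model and one column per design variable could play the same role.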

line

The argument line corresponds to the performance statistic that is used to arrange the models in ascending or descending order, and to produce the line output in the bottom of the model series plot. For example, it can be a series of values of adjusted coefficient of determination, one for each model:

adj_r2 <- c(0.87, 0.74, 0.81, 0.85, 0.54, 0.86, 0.90, 0.89)

arrange

The argument arrange automatically arranges the model series according to the performance statistic selected with argument line. If obj is a data.frame returned by statsMS(), then the function uses standard arranging approaches. For most performance statistics, the models are arranged in descending order. The exception is when "r2", "adj_r2" or "ADJ_r2" are used, in which case the models are arranged in ascending order. In ascending order, the model with the lowest value appears on the leftmost side of the model series plot, while the model with the highest value appears on the rightmost side of the plot.

> arrange(obj, adj_r2)
  id a b c adj_r2
1  5 1 0 1   0.54
2  2 0 0 1   0.74
3  3 1 0 0   0.81
4  4 0 1 0   0.85
5  6 0 1 1   0.86
6  1 0 0 0   0.87
7  8 1 1 1   0.89
8  7 1 1 0   0.90

These results suggest that the best performing model is that of id = 7, while the model of id = 5 is the poorest one.
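The ascending arrangement can be reproduced with a short sketch. It uses the data from the ‘Examples’ section of this page and base R's order() rather than the arrange function mentioned above, so it illustrates the idea and is not the function's internal code. The name arranged is illustrative.

```r
# Data taken from the 'Examples' section of this page
id <- 1:8
design <- data.frame(a = c(0, 0, 1, 0, 1, 0, 1, 1),
                     b = c(0, 0, 0, 1, 0, 1, 1, 1),
                     c = c(0, 1, 0, 0, 1, 1, 0, 1))
adj_r2 <- c(0.87, 0.74, 0.81, 0.85, 0.54, 0.86, 0.90, 0.89)
obj <- cbind(id, design, adj_r2)

# Sort the rows by adj_r2 in ascending order (base R stand-in for arrange)
arranged <- obj[order(obj$adj_r2), ]
rownames(arranged) <- NULL

# First row: poorest model; last row: best performing model
arranged
```

Using order(obj$adj_r2, decreasing = TRUE) instead would give the descending arrangement.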

ind

The model series plot shows how the design influences model performance. This is achieved mainly through the use of different colours in the grid output, where each unique value in the design data is represented by a different colour. For the example given above, one could check whether the models built with the more accurate versions of the predictor variables perform better by looking at their relative distribution in the model series plot. The models placed at the rightmost side of the plot are those with the best performance.

The argument ind provides another tool to help identify how the design, and more specifically each variable in the design data, influences model performance. This is done by simply calculating the mean ranking of the models that were built using the updated (more accurate) version of each predictor variable. This very same mean ranking is also used to rank the predictor variables and thus identify which of them is the most important.

After arranging the design data described above using the adjusted coefficient of determination, the following mean rank is obtained for each predictor variable:

> rank_center
     a    b    c
1 5.75 6.25 5.25

This result suggests that the best model performance is obtained when using the updated version of the predictor variable b. In the model series plot, the predictor variable b appears in the top row, while the predictor variable c appears in the bottom row.
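The idea of a mean rank per predictor variable can be sketched as follows, using the data from the ‘Examples’ section. The definition used here (the average position in the arranged series of the models whose design value equals ind) is an assumption made for illustration and is not taken from the function's source, so the values it returns need not reproduce the figures quoted above. The names position and mean_rank are illustrative.

```r
# Data taken from the 'Examples' section of this page
design <- data.frame(a = c(0, 0, 1, 0, 1, 0, 1, 1),
                     b = c(0, 0, 0, 1, 0, 1, 1, 1),
                     c = c(0, 1, 0, 0, 1, 1, 0, 1))
adj_r2 <- c(0.87, 0.74, 0.81, 0.85, 0.54, 0.86, 0.90, 0.89)

# Position of each model in the ascending series
# (1 = leftmost = lowest adj_r2, 8 = rightmost = highest adj_r2)
position <- rank(adj_r2)

# Assumed definition: mean position of the models whose design
# value equals 'ind', computed separately for each predictor variable
ind <- 1
mean_rank <- sapply(design, function(column) mean(position[column == ind]))

# Predictor variables ordered from highest to lowest mean rank
names(sort(mean_rank, decreasing = TRUE))
```

Under this assumed definition, a higher mean rank means that the models using that version of the predictor tend to sit further to the right of the plot, that is, among the better performing models.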

References

Deepayan Sarkar (2008). Lattice: Multivariate Data Visualization with R. Springer, New York. ISBN 978-0-387-75968-5.

Roger D. Peng (2008). A method for visualizing multivariate time series data. Journal of Statistical Software. v. 25 (Code Snippet), p. 1-17.

Roger D. Peng (2012). mvtsplot: Multivariate Time Series Plot. R package version 1.0-1. http://CRAN.R-project.org/package=mvtsplot.

Examples

# NOT RUN {
# This example follows the discussion in section "Details"
# Note that the data.frame is created manually
id <- c(1:8)
design <- data.frame(a = c(0, 0, 1, 0, 1, 0, 1, 1),
                     b = c(0, 0, 0, 1, 0, 1, 1, 1),
                     c = c(0, 1, 0, 0, 1, 1, 0, 1))
adj_r2 <- c(0.87, 0.74, 0.81, 0.85, 0.54, 0.86, 0.90, 0.89)
obj <- cbind(id, design, adj_r2)
p <- plotMS(obj, grid = c(2:4), line = "adj_r2", ind = 1, 
            color = c("lightyellow", "palegreen"),
            main = "Model Series Plot")
print(p)

# }
