
toaster (version 0.5.5)

showData: Plot table level statistics, histograms, correlations and scatterplots in one go.

Description

showData is the basic plotting function in the toaster package, designed to produce a set of standard visualizations (see parameter format) in a single call. Depending on the format, it is either a wrapper around other functions or a simple plotting function. It combines the database round-trip (if necessary) and the plotting functionality in that one call.

Usage

showData(channel = NULL, tableName = NULL, tableInfo = NULL, include = NULL, except = NULL, type = "numeric", format = "histogram", measures = NULL, title = paste("Table", toupper(tableName), format, "of", type, "columns"), numBins = 30, useIQR = FALSE, extraPoints = NULL, extraPointShape = 15, sampleFraction = NULL, sampleSize = NULL, pointColour = NULL, facetName = NULL, regressionLine = FALSE, corrLabel = "none", digits = 2, shape = 21, shapeSizeRange = c(1, 10), facet = ifelse(format == "overview", TRUE, FALSE), scales = ifelse(facet & format == "boxplot", "free", ifelse(facet & format == "overview", "free_y", "fixed")), ncol = 4, coordFlip = FALSE, paletteName = "Set1", baseSize = 12, baseFamily = "sans", legendPosition = "none", defaultTheme = theme_tufte(base_size = baseSize, base_family = baseFamily), themeExtra = NULL, where = NULL, test = FALSE)

Arguments

channel
connection object as returned by odbcConnect
tableName
Aster table name
tableInfo
pre-built summary of data to use (parameters channel, tableName, where may not apply depending on format). See getTableSummary.
include
a vector of column names to include. Output never contains attributes other than in the list.
except
a vector of column names to exclude. Output never contains attributes from the list.
type
what type of data to visualize: numerical ("numeric"), character ("character") or date/time ("temporal")
format
type of plot to use: 'overview', 'histogram', 'boxplot', 'corr' for correlation matrix or 'scatterplot'
measures
applies to format 'overview' only. Use one or more of the following with 'numeric' type: maximum,minimum,average,deviation,0 type: distinct_count,not_null_count. By default all measures above are used per respective type.
title
plot title
numBins
number of bins to use in histogram(s)
useIQR
logical; if TRUE, use the IQR interval to compute the lower and upper cutoff bounds for values to be included in boxplot or histogram: [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], where IQR = Q3 - Q1. If FALSE, the minimum and maximum are the bounds (all values are included).
extraPoints
vector containing names of extra points to add to boxplot lines.
extraPointShape
extra point shape (see 'Shape examples' in aes_linetype_size_shape).
sampleFraction
sample fraction to use in the sampling of data for 'scatterplot'
sampleSize
if sampleFraction is not specified then size of sample must be specified for 'scatterplot'.
pointColour
name of column with values to colour points in 'scatterplot'.
facetName
name(s) of the column(s) to use for faceting when format is 'scatterplot'. When single name then facet wrap kind of faceting is used. When two names then facet grid kind of faceting is used. It overrides facet value in case of 'scatterplot'. Must be part of column list (e.g. include).
regressionLine
logical if TRUE then adds regression line to scatterplot.
corrLabel
column name to use to label correlation table: 'value', 'pair', or 'none' (default)
digits
number of digits to use in correlation table text (when displaying correlation coefficient value)
shape
shape of correlation figure (default is 21)
shapeSizeRange
correlation figure size range
facet
Logical - if TRUE then divide plot into facets for each COLUMN (default is FALSE - no facets). When set to TRUE and format is 'boxplot', the scales default changes from 'fixed' to 'free'. Has no effect when format is 'corr'.
scales
Are scales shared across all facets: "fixed" - all are the same, "free_x" - vary across rows (x axis), "free_y" - vary across columns (y axis), "free" - both rows and columns (see parameter scales in facet_wrap). The default depends on facet and format; see parameter facet for details on default values.
ncol
Number of columns in facet wrap.
coordFlip
logical; if TRUE, flip cartesian coordinates so that horizontal becomes vertical, and vertical, horizontal (see coord_flip).
paletteName
palette name to use (run display.brewer.all to see available palettes).
baseSize
base font size.
baseFamily
base font family.
legendPosition
legend position.
defaultTheme
plot theme to use, default is theme_tufte (built with base_size = baseSize and base_family = baseFamily).
themeExtra
any additional ggplot2 theme attributes to add.
where
SQL WHERE clause limiting data from the table (use SQL as if in WHERE clause but omit keyword WHERE).
test
logical: when applicable, if TRUE only show what would be done (similar to parameter test in RODBC functions like sqlQuery and sqlSave). Doesn't apply when no SQL is expected to run, e.g. format is 'boxplot'.
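
The cutoff rule described for useIQR can be sketched in base R. This is only an illustration of the formula [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]: the helper name iqr_bounds is hypothetical, and it computes quartiles locally with quantile(), whereas showData works from table statistics whose quartile method may differ.

```r
# Hypothetical helper, not part of toaster: illustrates the bounds
# described for the useIQR parameter on a local numeric vector.
iqr_bounds <- function(x, useIQR = FALSE) {
  x <- x[!is.na(x)]
  if (!useIQR) {
    # all values are kept: bounds are simply minimum and maximum
    return(c(lower = min(x), upper = max(x)))
  }
  # Q1 and Q3 via base R's default quantile method (an assumption here)
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)
iqr_bounds(x, useIQR = FALSE)  # bounds cover the outlier 100
iqr_bounds(x, useIQR = TRUE)   # bounds exclude the outlier 100
```

With useIQR = TRUE the extreme value falls outside the bounds, which is why the option tends to give more readable boxplots and histograms for skewed columns.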

Value

a ggplot object

Details

All formats support parameters include and except to include and exclude table columns respectively. The include list guarantees that no columns outside of the list will be included in the results. The except list guarantees that its columns will not be included in the results.
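
The two guarantees can be pictured as set operations on column names. The sketch below is an assumption about the semantics rather than toaster's actual code; select_columns and the column names are hypothetical, and applying both filters together is simply the reading that honours both guarantees at once.

```r
# Hypothetical helper, not part of toaster: mimics the documented
# include/except guarantees on a plain character vector of column names.
select_columns <- function(all_columns, include = NULL, except = NULL) {
  selected <- all_columns
  # include: nothing outside the list survives
  if (!is.null(include)) selected <- intersect(selected, include)
  # except: nothing from the list survives
  if (!is.null(except)) selected <- setdiff(selected, except)
  selected
}

cols <- c("yearid", "era", "whip", "so", "er")
select_columns(cols, include = c("era", "so", "er"))
select_columns(cols, except = "yearid")
select_columns(cols, include = c("era", "so"), except = "so")
```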

Format overview: produce a set of histograms - one for each statistic measure - across table columns. This makes it possible to compare averages, IQR, etc. across all or selected columns.

Format boxplot: produce boxplots for table columns. Boxplots can share a single plot or each be placed in its own facet (see logical parameter facet).

Format histogram: produce histograms - one for each column - in a single plot or in facets (see logical parameter facet).

Format corr: produce correlation matrix of numeric columns.

Format scatterplot: produce scatterplots of sampled data.
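
How the facet and scales defaults vary by format follows from the default expressions in the Usage signature. The sketch below restates those expressions as a standalone function for illustration; resolve_defaults is a hypothetical name, and NULL is used here only to mean "argument not supplied".

```r
# Hypothetical helper, not part of toaster: restates the default
# expressions for facet and scales from the showData signature.
resolve_defaults <- function(format = "histogram", facet = NULL, scales = NULL) {
  # facet defaults to TRUE only for format 'overview'
  if (is.null(facet)) facet <- format == "overview"
  if (is.null(scales)) {
    if (facet && format == "boxplot") {
      scales <- "free"
    } else if (facet && format == "overview") {
      scales <- "free_y"
    } else {
      scales <- "fixed"
    }
  }
  list(facet = facet, scales = scales)
}

resolve_defaults("overview")               # facets on, scales "free_y"
resolve_defaults("boxplot")                # no facets, scales "fixed"
resolve_defaults("boxplot", facet = TRUE)  # facets on, scales "free"
```

Passing scales explicitly, as some of the examples do, bypasses this logic entirely.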

Examples

if(interactive()){
# initialize connection to Lahman baseball database in Aster 
# paste0 keeps the connection string free of embedded newlines and padding
conn = odbcDriverConnect(connection=paste0("driver={Aster ODBC Driver};",
                         "server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>"))

# get summaries to save time
pitchingInfo = getTableSummary(conn, 'pitching_enh')
battingInfo = getTableSummary(conn, 'batting_enh')

# Boxplots
# all numerical attributes
showData(conn, tableInfo=pitchingInfo, format='boxplot', 
         title='Boxplots of numeric columns')
# select certain attributes only
showData(conn, tableInfo=pitchingInfo, format='boxplot', 
         include=c('wp','whip', 'w', 'sv', 'sho', 'l', 'ktobb', 'ibb', 'hbp', 'fip', 
                   'era', 'cg', 'bk', 'baopp'), 
         useIQR=TRUE, title='Boxplots of Pitching Stats')
# exclude certain attributes
showData(conn, tableInfo=pitchingInfo, format='boxplot', 
         except=c('item_id','ingredient_item_id','facility_id','rownum','decadeid','yearid',
                  'bfp','ipouts'),
         useIQR=TRUE, title='Boxplots of Pitching Stats')
# flip coordinates
showData(conn, tableInfo=pitchingInfo, format='boxplot', 
         except=c('item_id','ingredient_item_id','facility_id','rownum','decadeid','yearid',
                  'bfp','ipouts'),
         useIQR=TRUE, coordFlip=TRUE, title='Boxplots of Pitching Stats')

# boxplot with facet (facet_wrap)
showData(conn, tableInfo=pitchingInfo, format='boxplot',
         include=c('bfp','er','h','ipouts','r','so'), facet=TRUE, scales='free',
         useIQR=TRUE, title='Boxplots Pitching Stats: bfp, er, h, ipouts, r, so')

# Correlation matrix
# on all numerical attributes
showData(conn, tableName='pitching_enh', tableInfo=pitchingInfo, 
         format='corr')

# correlation matrix on selected attributes
# with labeling by attribute pair name and
# controlling size of correlation bubbles
showData(conn, tableName='pitching', tableInfo=pitchingInfo, 
         include=c('era','h','hr','gs','g','sv'), 
         format='corr', corrLabel='pair', shapeSizeRange=c(5,25))

# Histogram of a single numeric attribute (hr)
showData(conn, tableName='pitching', tableInfo=pitchingInfo, include=c('hr'), 
         format='histogram')

# Overview is a histogram of statistical measures across attributes
showData(conn, tableName='pitching', tableInfo=pitchingInfo, 
         format='overview', type='numeric', scales="free_y")

# Scatterplots
# Scatterplot on pair of numerical attributes
# sample by size with 1d facet (see facet_wrap)
showData(conn, 'pitching_enh', format='scatterplot', 
         include=c('so', 'er'), facetName="lgid", pointColour="lgid", 
         sampleSize=10000, regressionLine=TRUE,
         title="SO vs ER by League 1980-2000",
         where='yearid between 1980 and 2000')

# sample by fraction with 2d facet (see facet_grid)
showData(conn, 'pitching_enh', format='scatterplot', 
         include=c('so','er'), facetName=c('lgid','decadeid'), pointColour="lgid",
         sampleFraction=0.1, regressionLine=TRUE,
         title="SO vs ER by League by Decade 1980 - 2012",
         where='yearid between 1980 and 2012')
}
