Learn R Programming

toaster (version 0.5.5)

computeCorrelations: Compute correlation between pairs of columns.

Description

Compute global correlation between all pairs of numeric columns in table. Result includes all pairwise combinations of numeric columns in the table, with optionally limiting columns to those in the parameter include or/and excluding columns defined by parameter except. Limit computation on the table subset defined with where. Use output='matrix' to produce results in matrix format (compatible with function cor).

Usage

computeCorrelations(channel, tableName, tableInfo, include = NULL, except = NULL, where = NULL, by = NULL, output = c("data.frame", "matrix"), test = FALSE)

Arguments

channel
connection object as returned by odbcConnect
tableName
database table name
tableInfo
pre-built summary of data to use (must have with test=TRUE)
include
a vector of column names to include. Output never contains attributes other than in the list. When missing all columns from tableInfo included.
except
a vector of column names to exclude. Output never contains attributes from the list.
where
specifies criteria to satisfy by the table rows before applying computation. The creteria are expressed in the form of SQL predicates (inside WHERE clause).
by
vector of column names to group results by one or more table columns for faceting or alike (optional).
output
Default output is a data frame of column pairs with correlation coefficient (melt format). To return correlation matrix compatible with function cor use 'matrix' .
test
logical: if TRUE show what would be done, only (similar to parameter test in RODBC functions like sqlQuery and sqlSave).

Value

data frame with columns:
  • corr pair of 1st and 2d columns "column1:column2"
  • value computed correlation value
  • metric1 name of 1st column
  • metric2 name of 2d column
  • sign correlation value sign sign(value) (-1, 0, or 1)
Note that while number of correlations function computes is choose(N, 2), where N is number of table columns specified, resulting data frame contains twice as many rows by duplicating each correlation value with swaped column names (1st column to 2d and 2d to 1st positions). This makes resulting data frame symmetrical with respect to column order in pairs and is necessary to correctly visualize correlation matrix with createBubblechart.

See Also

createBubblechart and showData.

Examples

Run this code
if(interactive()){
# initialize connection to Lahman baseball database in Aster 
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                         server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

cormat = computeCorrelations(channel=conn, "pitching_enh", sqlColumns(conn, "pitching_enh"), 
                             include = c('w','l','cg','sho','sv','ipouts','h','er','hr','bb',
                                         'so','baopp','era','whip','ktobb','fip'),
                             where = "decadeid = 2000", test=FALSE)
# remove duplicate correlation values (no symmetry)
cormat = cormat[cormat$metric1 < cormat$metric2, ]
createBubblechart(cormat, "metric1", "metric2", "value", label=NULL, fill="sign")

# Grouped by columns
cormatByLg = computeCorrelations(channel=conn, "pitching_enh", 
                                 include=c('w','sv','h','er','hr','bb','so'),
                                 by=c('lgid','decadeid'), 
                                 where = "decadeid >= 1990")
                                 
createBubblechart(cormatByLg, "metric1", "metric2", "value", 
                  label=NULL, fill="sign", facet=c('decadeid','lgid'), 
                  title="Correlations by Leagues and Decades",
                  defaultTheme = theme_wsj(), legendPosition = 'none')
}

Run the code above in your browser using DataLab