Learn R Programming

toaster (version 0.5.1)

getTableSummary: Compute columnwise statistics on Aster table.

Description

For table compute column statistics in Aster and augment data frame structure obtained with sqlColumns with columns containing computed statistics.

Usage

getTableSummary(channel, tableName, include = NULL, except = NULL, modeValue = FALSE, percentiles = c(5, 10, 25, 50, 75, 90, 95, 100), where = NULL, mock = FALSE, parallel = FALSE)

Arguments

channel
object as returned by odbcConnect.
tableName
name of the table in Aster.
include
a vector of column names to include. Output never contains attributes other than in the list.
except
a vector of column names to exclude. Output never contains attributes from the list.
modeValue
logical indicates if mode values should be computed. Default is FALSE.
percentiles
list of percentiles (integers between 0 and 100) to collect (always collects 25th and 75th for IQR calculation). There is no penalty in specifying more percentiles as they get calculated in a single call for each column - no matter how many different values are requested. When FALSE then percentiles calculations are skipped and result wil have no percentile and IQR columns.
where
SQL WHERE clause limiting data from the table (use SQL as if in WHERE clause but omit keyword WHERE).
mock
logical: if TRUE returns pre-computed table statistics for tables pitching or batting, only.
parallel
logical: enable parallel calls to Aster database. This option requires parallel backend enabled and registered (see in examples). Parallel execution requires ODBC channel obtained without explicit password: either with odbcConnect(dsn) or odbcDriverConnect calls, but not with odbcConnect(dsn, user, password).

Value

data frame returned by sqlColumns with additional columns:
total_count
total row count - the same for each table column
distinct_count
distinct values count
not_null_count
not null count
minimum
minimum value (numerical data types only)
maximum
maximum value (numerical data types only)
average
mean (numerical data types only)
deviation
standard deviation (numerical data types only)
percentiles
defaults: 0,5,10,25,50,75,90,95,100. Always adds percentiles 25, 50 (median), 75
IQR
interquartile range is the 1st Quartile subtracted from the 3rd Quartile
minimum_str
minimum string value (character data types only)
maximum_str
maximum string value (character data types only)
mode
mode value (optional)
mode_count
mode count (optional)

Details

Computes columns statistics for all or specified table columns and adds them to the data frame with basic ODBC table metadata obtained with sqlColumns. Computed statistics include counts of all, non-null, distinct values; statistical summaries of maximum, minimum, mean, standard deviation, median (50th percentile), mode (optional), interquartile range, and desired percentiles. Each computed statistic adds a column to ODBC metadata data frame.

See Also

sqlColumns

Examples

Run this code
if(interactive()){
# initialize connection to Lahman baseball database in Aster 
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                         server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

pitchingInfo = getTableSummary(channel=conn, 'pitching_enh')
# list all table columns
pitchingInfo$COLUMN_NAME

# compute statistics on subset of baseball data after 1999
battingInfo = getTableSummary(channel=conn, 'batting_enh', 
                              where='yearid between 2000 and 2013')
                              
# compute statistics for certain columns including each percentile from 1 to 99
pitchingInfo = getTableSummary(channel=conn, 'pitching_enh',
                              include=c('h', 'er', 'hr', 'bb', 'so'),
                              percentiles=seq(1,99))
# list data frame column names to see all computed statistics
names(pitchingInfo)
                             
# compute statitics on all numeric columns except certain columns
teamInfo = getTableSummary(channel=conn, 'teams_enh', 
                   include=getNumericColumns(sqlColumns(conn, 'teams_enh')),
                   except=c('lgid', 'teamid', 'playerid', 'yearid', 'decadeid'))
}

Run the code above in your browser using DataLab