Learn R Programming

toaster (version 0.5.5)

computeSample: Randomly sample data from the table.

Description

Draws a sample of rows from the table randomly. The function offers two sampling approaches and three stratum strategies. Sampling approaches by
  • sample fraction: a simple binomial (Bernoulli) sampling on a row-by-row basis with given sample rate(s) (see sampleFraction)
  • sample size: sampling a given number of rows without replacement (see sampleSize)

Stratum strategies:

  • single stratum: the whole table or its subset (defined using where).
  • by column values: using conditionColumn and conditionValues arguments define stratum per value in the table column.
  • by SQL expression: using conditionStratum and conditionValues arguments define stratum using SQL expression (with SQL CASE function but not necessarily) per value.

Usage

computeSample(channel, tableName, sampleFraction, sampleSize, conditionColumn = NULL, conditionStratum = NULL, conditionValues = NULL, include = NULL, except = NULL, where = NULL, ..., test = FALSE)

Arguments

channel
connection object as returned by odbcConnect
tableName
table name
sampleFraction
one or more sample fractions to use in the sampling of data. Multipe sampling fractions are applicable only in combination with the arguments conditionColumn and conditionValues when present. In this case number of fractions in sampleFraction and number of values in conditionValues must be the same.
sampleSize
total sample size (applies only when sampleFraction is missing). This may too be a vector of total values when used in combination with the arguments conditionColumn and conditionValues. In this case number of sizes in sampleSize and number of values in conditionValues must be the same.
conditionColumn
if you use this argument, you must also use the conditionValues argument. Either both are used, or neither is used. Values in a particular column conditionColumn are used as sampling conditions directly and its data type must be of a group-able type. Only those values listed in conditionValues are used for sampling with the rest ignored. Also, see conditionStratum.
conditionStratum
if you use this argument, you must also use the conditionValues argument. When defined it is used in place of conditionColumn. conditionStratum should define a SQL expression (usually using CASE function but not necessarily). Resulting sample data frame will contain a column named stratum just as if conditionColumn = 'stratum' was used. Arguments conditionColumn and conditionStratum are mutually exclusive: the former is ignored if both are defined.
conditionValues
see argument conditionColumn and conditionStratum.
include
a vector of column names to include. Output never contains attributes other than in the list.
except
a vector of column names to exclude. Output never contains attributes from the list.
where
specifies criteria to satisfy by the table rows before applying computation. The creteria are expressed in the form of SQL predicates (inside WHERE clause).
...
additional arguments to be passed to sqlQuery for more control over performance and data type conversion. By default, stringsAsFactors is set to FALSE.
test
logical: if TRUE show what would be done, only (similar to parameter test in RODBC functions like sqlQuery and sqlSave).

Details

The sampling can be applied to the entire table or can be refined with either conditionColumn or conditionStratum. In each case a subset of the table defined with where argument may apply too. The resulting stratum models are:
  • Single Sample Fraction: provide only one value in sampleFraction, this single fraction is targeted throughout the whole population or across all the strata defined by the sample conditions conditionColumn or conditionStrata in combination with conditionValues.
  • Variable Sample Fractions: provide multiple values in sampleFraction, each of them is used for sampling a particular stratum defined by the conditionColumn or conditionStratum arguments in combination with conditionValues. Number of values in sampleFraction and conditionValues must be the same.
  • Total Sample Size: provide only one value in sampleSize for the total sample size for the entire population. If in addition you specify the conditionColumn or conditionStratum arguments, the function proportionally generates sample units for each stratum defined by the conditionColumn or conditionStratum arguments in combination with conditionValues.
  • Variable Sample Sizes: provide multiple sizes in sampleSize so that each size corresponds to a stratum defined by conditionColumn or conditionStratum arguments in combination with conditionValues. The sample function generates sample units directly for each stratum based on the supplied sizes. Number of values in sampleSize and conditionValues must be the same.

No columns returned as character and not excluded by as.is are converted to factors by default, i.e. stringsAsFactors = FALSE when calling sqlQuery if not specified when calling this function.

Examples

Run this code
if(interactive()){
# initialize connection to Lahman baseball database in Aster 
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                         server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

batters = computeSample(conn, "batting", sampleFraction=0.01)
dim(batters)

pitchersAL = computeSample(conn, "pitching", sampleSize=1000, where="lgid = 'AL'")
dim(ptichersAL)

battersByDecadesSingleSize = computeSample(conn, tableName = 'batting_enh', 
                                           sampleSize=1000, 
                                           conditionColumn = 'decadeid', 
                                           conditionValues = c(1990,2000,2010))
dim(battersByDecadesSingleSize)

battersByDecades = computeSample(conn, tableName = 'batting_enh',
                                 sampleFraction = c(0.01,0.01,0.02), 
                                 conditionColumn = 'decadeid', conditionValues = c(1990,2000,2010))
dim(battersByDecades)

battersByOddEvenYears = computeSample(channel=NULL, tableName = 'batting_enh',
                                      sampleFraction = c(0.01,0.02),
                                      include = c('decadeid','g','ab','r','h'),
                                      conditionStratum = "yearid % 2", 
                                      conditionValues = c('0','1'),
                                      where = "lgid = 'NL'")
dim(battersByOddEvenYears)

battersBeforeAfter1960 = computeSample(channel=NULL, tableName = 'batting_enh',
                                       sampleSize = c(200, 200), 
                                       conditionStratum = "CASE WHEN yearid <- 1960 THEN 'before'
                                                                ELSE 'after'
                                                          END", 
                                       conditionValues = c('before','after'))
dim(battersBeforeAfter1960)
}

Run the code above in your browser using DataLab