toaster (version 0.5.1)

computeKmeans: Perform k-means clustering on the table.

Description

Runs the k-means clustering algorithm in-database and returns an object compatible with kmeans that also includes arbitrary aggregate metrics computed on the resulting clusters.

Usage

computeKmeans(channel, tableName, centers, threshold = 0.0395, iterMax = 10,
  tableInfo, id, include = NULL, except = NULL, aggregates = "COUNT(*) cnt",
  scale = TRUE, idAlias = gsub("[^0-9a-zA-Z]+", "_", id), where = NULL,
  scaledTableName = NULL, centroidTableName = NULL, schema = NULL,
  test = FALSE)

Arguments

channel
connection object as returned by odbcConnect.
tableName
Aster table name.
centers
either the number of clusters, say k, or a matrix of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in the table is chosen as the initial centers. If a matrix, the number of rows determines the number of clusters, as each row defines one initial center.
threshold
the convergence threshold. When the centroids move by less than this amount, the algorithm has converged.
iterMax
the maximum number of iterations the algorithm will run before quitting if the convergence threshold has not been met.
tableInfo
pre-built summary of data to use (required when test=TRUE). See getTableSummary.
id
column name or SQL expression containing unique table key.
include
a vector of column names to use as variables (must be numeric). The model never contains variables other than those in this list.
except
a vector of column names to exclude from the variables. The model never contains variables from this list.
aggregates
vector of SQL aggregates that define arbitrary aggregate metrics to be computed on each cluster after running k-means. Aggregates may have optional aliases, as in "AVG(era) avg_era". They are subsequently used in createClusterPlot as cluster properties.
scale
logical: if TRUE, scale each variable in-database before clustering. Scaling results in zero mean and unit standard deviation for each input variable.
idAlias
SQL alias for table id. This is required when SQL expression is given for id.
where
specifies criteria that table rows must satisfy before the computation is applied. The criteria are expressed as SQL predicates (as inside a WHERE clause).
scaledTableName
name of Aster table with results of scaling
centroidTableName
name of Aster table with centroids found by kmeans
schema
name of the Aster schema to which the tables scaledTableName and centroidTableName belong.
test
logical: if TRUE, only show what would be done (similar to the parameter test in the RODBC functions sqlQuery and sqlSave).
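
The default for idAlias can be checked locally without a database connection. The sketch below applies the same gsub call shown in Usage to the composite key expression from the Examples section; the variable names id_expr and id_alias are illustrative only and are not part of toaster.

```r
# Composite key expression taken from the Examples section.
id_expr <- "playerid || '-' || stint || '-' || teamid || '-' || yearid"

# Same substitution as the default value of idAlias in Usage:
# every run of non-alphanumeric characters collapses to a single underscore.
id_alias <- gsub("[^0-9a-zA-Z]+", "_", id_expr)

print(id_alias)
```

For this key the derived alias is playerid_stint_teamid_yearid. Passing idAlias explicitly may still be preferable when a shorter or more descriptive column name is wanted.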

Value

computeKmeans returns an object of class "toakmeans" (compatible with class "kmeans"). It is a list with at least the following components:
cluster
A vector of integers (from 0:K-1) indicating the cluster to which each point is allocated. computeKmeans leaves this component empty. Use the function computeClusterSample to set this component.
centers
A matrix of cluster centres.
totss
The total sum of squares.
withinss
Vector of within-cluster sum of squares, one component per cluster.
tot.withinss
Total within-cluster sum of squares, i.e. sum(withinss).
betweenss
The between-cluster sum of squares, i.e. totss-tot.withinss.
size
The number of points in each cluster. This includes all points in the specified Aster table that satisfy the optional where condition.
iter
The number of (outer) iterations.
ifault
integer: indicator of a possible algorithm problem (always 0).
scale
logical: indicates if variable scaling was performed before clustering.
aggregates
Vectors (dataframe) of aggregates computed on each cluster.
tableName
Aster table name containing data for clustering.
columns
Vector of column names with variables used for clustering.
scaledTableName
Aster table name containing scaled data for clustering.
centroidTableName
Aster table name containing cluster centroids.
id
Column name or SQL expression containing unique table key.
idAlias
SQL alias for table id.
whereClause
SQL WHERE clause expression used (if any).
time
An object of class proc_time with user, system, and total elapsed times for the computeKmeans function call.
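
Because the returned object is compatible with class "kmeans", the relationships among the sum-of-squares components can be illustrated with base R's stats::kmeans on local data. This is only a sketch: it uses the built-in iris data set instead of an Aster table, and the names x, fit and zero_based are assumptions made for the illustration.

```r
# Local stand-in for an Aster table: four numeric columns of iris, scaled.
set.seed(42)
x <- scale(as.matrix(iris[, 1:4]))

# Base R k-means; computeKmeans returns an object with the same core components.
fit <- kmeans(x, centers = 3, iter.max = 25)

# tot.withinss is the sum of the per-cluster within sums of squares.
stopifnot(isTRUE(all.equal(sum(fit$withinss), fit$tot.withinss)))

# betweenss is what remains of totss after removing tot.withinss.
stopifnot(isTRUE(all.equal(fit$betweenss, fit$totss - fit$tot.withinss)))

# size holds one count per cluster and accounts for every clustered row.
stopifnot(sum(fit$size) == nrow(x))

# stats::kmeans labels clusters 1..K, while the cluster component documented
# above uses 0:K-1; subtracting 1 puts local labels on the same footing.
zero_based <- fit$cluster - 1L
print(range(zero_based))
```

The same identities should hold for the components returned by computeKmeans, with the difference that cluster stays empty until computeClusterSample is called.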

Details

The function first scales the non-null data (if scale=TRUE) or simply eliminates nulls without scaling. The resulting data (table tableName, optionally filtered with where) are then clustered by k-means in Aster. Finally, all standard k-means cluster metrics, plus any additional aggregates supplied via aggregates, are calculated, again in-database.
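
The preprocessing step described above can be mimicked locally to see its effect. The following is a minimal sketch in base R, not the SQL that toaster runs in Aster: the toy data frame batting and the names complete and scaled are hypothetical, and base R's scale divides by the sample standard deviation, which may or may not match the in-database computation.

```r
# Toy data with one missing value; column names echo the Examples section.
batting <- data.frame(g = c(10, 20, NA, 40),
                      r = c(1, 3, 5, 7),
                      h = c(2, 4, 6, 8))

# Step 1: drop rows that contain a null (NA) in any clustering variable.
complete <- batting[complete.cases(batting), ]

# Step 2 (the scale = TRUE case): center each variable to mean 0 and
# rescale it to unit standard deviation.
scaled <- scale(complete)

print(colMeans(scaled))       # approximately 0 for every column
print(apply(scaled, 2, sd))   # 1 for every column
```

With scale = FALSE only the first step would apply, so variables measured on larger ranges would dominate the distance computation.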

See Also

computeClusterSample, computeSilhouette

Examples

if(interactive()){
library(RODBC)    # provides odbcDriverConnect
library(toaster)

# initialize connection to Lahman baseball database in Aster
# (the connection string is built with paste0 so it contains no line breaks)
conn = odbcDriverConnect(connection=paste0("driver={Aster ODBC Driver};",
                         "server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>"))

# cluster batting records after 2000 on games, runs and hits into 5 clusters
km = computeKmeans(conn, "batting", centers=5, iterMax = 25,
                   aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                   id="playerid || '-' || stint || '-' || teamid || '-' || yearid",
                   include=c('g','r','h'), scaledTableName='kmeans_test_scaled',
                   centroidTableName='kmeans_test_centroids',
                   where="yearid > 2000")
km
createCentroidPlot(km)
createClusterPlot(km)

# release the ODBC connection
odbcClose(conn)
}
