splitTable: Construct sparse matrices from a nominal matrix/dataframe

Description

This function splits a matrix or dataframe into two sparse matrices: an incidence and an index matrix. The incidence matrix links the observations (rows) to all possible values that occur in the original matrix. The index matrix links the values to the attributes (columns). This encoding allows for highly efficient calculations on nominal data.

Usage

splitTable(data, attributes = colnames(data), observations = rownames(data),
		name.binder = ":")

Arguments

Value

A list containing the various row and column names, and the two sparse pattern matrices of format ngCMatrix:attributesvector of attribute namesvaluesvector of unique value namesobservationsvector of observation namesOVsparse pattern matrix with observations as rows (O) and values as columns (V)AVsparse pattern matrix with attributes as rows (A) and values as columns (V)

Examples

Run this code

# start with a simple example from the MASS library
# compare the original data with the encoding as sparse matrices
library(MASS)
farms
splitTable(farms)

# As a more involved example, consider the WALS data included in this package
# Transforming the reasonably large WALS data.frame \code{wals$data} is fast
# (2566 observations, 131 attributes, 630 unique values)
# The function `str' gives a useful summary of the result of the splitting
data(wals)
system.time(W <- splitTable(wals$data))
str(W) 

# Some basic use examples on the complete WALS data.
# The OV-matrix can be used to quickly count the number of similarities 
# between all pairs of observations. Note that with the large amount of missing values
# the resulting numbers are not really meaningfull. Some normalisation is necessary.
system.time( O <- tcrossprod(W$OV*1) )
O[1:10,1:10]

# The number of comparisons available for each pair of attributes
system.time( N <- crossprod(tcrossprod(W$OV*1, W$AV*1)) )
N[1:10,1:10]

# compute the number of available datapoints per observation (language) in WALS
# once the sparse matrices W are computed, such calculations are much quicker than 'apply'
system.time( avail1 <- rowSums(W$OV) )
system.time( avail2 <- apply(wals$data,1,function(x){sum(!is.na(x))}))
names(avail2) <- NULL
all.equal(avail1, avail2)

# Very unequal availability of data over languages in WALS
hist(avail1)

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

See Also

Examples