Matrix.utils (version 0.9.1)

dMcast: Casts or pivots a long data frame into a wide sparse matrix

Description

Similar in function to dcast, but produces a sparse Matrix as an output. Sparse matrices are beneficial for this application because such outputs are often very wide and sparse. Conceptually similar to a pivot operation.

Usage

dMcast(data, formula, fun.aggregate = "sum", value.var = NULL, as.factors = FALSE)

Arguments

data
a data frame
formula
casting formula, see details for specifics.
fun.aggregate
name of aggregation function. Defaults to 'sum'
value.var
name of the column that stores the values to be aggregated
as.factors
if TRUE, treat all columns as factors, including numerics

Value

a sparse Matrix
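
To make the shape of the return value concrete, here is a minimal sketch of a pivot into a sparse matrix built directly with Matrix::sparseMatrix. It is not dMcast's implementation, only an illustration of the idea; the `long` data frame and its column names are made up for this example.

```r
library(Matrix)

# Hypothetical long-format data: one row per (customer, sku) purchase
long <- data.frame(
  customer = factor(c("a", "a", "b", "c", "c")),
  sku      = factor(c("x", "x", "y", "x", "z")),
  amount   = c(1, 2, 5, 3, 4)
)

# Rows index customers, columns index SKUs.
# sparseMatrix() sums the x values of repeated (i, j) pairs,
# which here plays the role of a "sum" aggregation.
wide <- sparseMatrix(
  i = as.integer(long$customer),
  j = as.integer(long$sku),
  x = long$amount,
  dims = c(nlevels(long$customer), nlevels(long$sku)),
  dimnames = list(levels(long$customer), levels(long$sku))
)
wide
```

Only the non-zero cells are stored, which is why this representation suits wide outputs where most row and column combinations never occur.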

Details

Casting formulas differ slightly from those in dcast and follow the conventions of model.matrix. See formula for details. Briefly, the left-hand side of the ~ defines the grouping criteria. This can be either a single variable or a group of variables linked using :. The right-hand side specifies what the columns will be. Unlike in dcast, the + operator appends the values for each variable as additional columns, which is useful for tasks such as one-hot encoding. Using : instead combines the variables as interactions.
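
The difference between + and : on the right-hand side can be seen with base R's model.matrix, whose conventions these formulas follow. This is a rough sketch on a made-up data frame `df`, not a call to dMcast itself. The contrasts.arg setting is an assumption added here to force a full set of indicator columns for every factor, and dMcast may name or order its columns differently.

```r
# Hypothetical data frame, for illustration only
df <- data.frame(
  colour = factor(c("red", "blue", "red")),
  size   = factor(c("S", "S", "L"))
)

# Assumption: request full indicator (one-hot) coding for each factor
full <- lapply(df, contrasts, contrasts = FALSE)

# `+` appends the indicator columns of each variable side by side
plus <- model.matrix(~ colour + size - 1, data = df, contrasts.arg = full)
colnames(plus)

# `:` creates one column per combination of levels (an interaction)
inter <- model.matrix(~ colour:size - 1, data = df, contrasts.arg = full)
colnames(inter)
```

With +, each row has one non-zero entry per variable; with :, each row has a single non-zero entry marking its combination of levels.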

See Also

cast

dcast

Examples

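#Classic air quality example
#Minimal stand-in for a melt function: stacks every non-id column into
#long format with 'variable' and 'value' columns
melt<-function(data,idColumns)
{
  cols<-setdiff(colnames(data),idColumns)
  results<-lapply(cols,function (x) cbind(data[,idColumns],variable=x,value=as.numeric(data[,x])))
  #Return the stacked data frame visibly
  Reduce(rbind,results)
}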
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, idColumns=c("month", "day"))
dMcast(aqm, month:day ~variable,fun.aggregate = 'mean',value.var='value')
dMcast(aqm, month ~ variable, fun.aggregate = 'mean',value.var='value') 

#One hot encoding
#Preserving numerics
dMcast(warpbreaks,~.)
#Pivoting numerics as well
dMcast(warpbreaks,~.,as.factors=TRUE)

## Not run: 
# orders<-data.frame(orderNum=as.factor(sample(1e6, 1e7, TRUE)), 
#    sku=as.factor(sample(1e3, 1e7, TRUE)), 
#    customer=as.factor(sample(1e4,1e7,TRUE)), 
#    state = sample(letters, 1e7, TRUE),
#    amount=runif(1e7)) 
# # For simple aggregations resulting in small tables, dcast.data.table (and
#    reshape2) will be faster
# system.time(a<-dcast.data.table(as.data.table(orders),sku~state,sum,
#    value.var = 'amount')) # .5 seconds 
# system.time(b<-reshape2::dcast(orders,sku~state,sum,
#    value.var = 'amount')) # 2.61 seconds 
# system.time(c<-dMcast(orders,sku~state,
#    value.var = 'amount')) # 28 seconds 
#    
# # However, this situation changes as the result set becomes larger 
# system.time(a<-dcast.data.table(as.data.table(orders),customer~sku,sum,
#    value.var = 'amount')) # 4.4 seconds 
# system.time(b<-reshape2::dcast(orders,customer~sku,sum,
#    value.var = 'amount')) # 34.7 seconds 
# system.time(c<-dMcast(orders,customer~sku,
#    value.var = 'amount')) # 27 seconds 
#    
# # More complicated: 
# system.time(a<-dcast.data.table(as.data.table(orders),customer~sku+state,sum,
#    value.var = 'amount')) # 18.1 seconds, object size = 2084 Mb 
# system.time(b<-reshape2::dcast(orders,customer~sku+state,sum,
#    value.var = 'amount')) # Does not return 
# system.time(c<-dMcast(orders,customer~sku:state,
#    value.var = 'amount')) # 30.69 seconds, object size = 115.4 Mb
# 
# system.time(a<-dcast.data.table(as.data.table(orders),orderNum~sku,sum,
#    value.var = 'amount')) # Does not return 
# system.time(c<-dMcast(orders,orderNum~sku,
#    value.var = 'amount')) # 36.33 seconds, object size = 175Mb
# ## End(Not run)