cdata package - RDocumentation

cdata is a general data re-shaper that has the great virtue of adhering to the so-called "Rule of Representation":

Fold knowledge into data, so program logic can be stupid and robust.
The Art of Unix Programming, Erick S. Raymond, Addison-Wesley , 2003

The point being: it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

Briefly: cdata supplies data transform operators that:

Work on local data or with any DBI data source.
Are powerful generalizations of the operations commonly called pivot and un-pivot.

A quick example: plot iris petal and sepal dimensions in a faceted graph.

iris <- data.frame(iris)

library("ggplot2")
library("cdata")

#
# build a control table with a "key column" flower_part
# and "value columns" Length and Width
#
controlTable <- wrapr::qchar_frame(
   flower_part, Length      , Width       |
   Petal    , Petal.Length, Petal.Width |
   Sepal    , Sepal.Length, Sepal.Width )

# do the unpivot to convert the row records to block records
iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = c("Species"))


ggplot(iris_aug, aes(x=Length, y=Width)) +
  geom_point(aes(color=Species, shape=Species)) + 
  facet_wrap(~flower_part, labeller = label_both, scale = "free") +
  ggtitle("Iris dimensions") +  scale_color_brewer(palette = "Dark2")

More details on the above example can be found here. A tutorial on how to design a controlTable can be found here. And some discussion of the nature of records in cdata can be found here.

We can also exhibit a larger example of using cdata to create a scatter-plot matrix, or pair plot:


iris <- data.frame(iris)

library("ggplot2")
library("cdata")

# declare our columns of interest
meas_vars <- qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
category_variable <- "Species"

# build a control with all pairs of variables as value columns
# and pair_key as the key column
controlTable <- data.frame(expand.grid(meas_vars, meas_vars, 
                                       stringsAsFactors = FALSE))
# name the value columns value1 and value2
colnames(controlTable) <- qc(value1, value2)
# insert first, or key column
controlTable <- cbind(
  data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]),
             stringsAsFactors = FALSE),
  controlTable)


# do the unpivot to convert the row records to multiple block records
iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = category_variable)

# unpack the key column into two variable keys for the facet_grid
splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE)
iris_aug$v1 <- vapply(splt, function(si) si[[1]], character(1))
iris_aug$v2 <- vapply(splt, function(si) si[[2]], character(1))


ggplot(iris_aug, aes(x=value1, y=value2)) +
  geom_point(aes_string(color=category_variable, shape=category_variable)) + 
  facet_grid(v2~v1, labeller = label_both, scale = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2") +
  ylab(NULL) + 
  xlab(NULL)

The above is now wrapped into a one-line command in WVPlots.

And a quick database example:

library("cdata")
library("rquery")

use_spark <- FALSE

if(use_spark) {
  my_db <- sparklyr::spark_connect(version='2.2.0', 
                                   master = "local")
} else {
  my_db <- DBI::dbConnect(RSQLite::SQLite(),
                          ":memory:")
}



# pivot example
d <- wrapr::build_frame(
   "meas", "val" |
   "AUC" , 0.6   |
   "R2"  , 0.2   )
DBI::dbWriteTable(my_db,
                  'd',
                  d,
                  temporary = TRUE)
rstr(my_db, 'd')
 #  table `d` SQLiteConnection 
 #   nrow: 2 
 #  'data.frame':   2 obs. of  2 variables:
 #   $ meas: chr  "AUC" "R2"
 #   $ val : num  0.6 0.2
td <- db_td(my_db, "d")
td
 #  [1] "table(`d`; meas, val)"

cT <- td %.>%
  build_pivot_control(.,
                      columnToTakeKeysFrom= 'meas',
                      columnToTakeValuesFrom= 'val') %.>%
  execute(my_db, .)
print(cT)
 #    meas val
 #  1  AUC AUC
 #  2   R2  R2

tab <- td %.>%
  blocks_to_rowrecs(.,
                    keyColumns = NULL,
                    controlTable = cT,
                    temporary = FALSE) %.>%
  materialize(my_db, .)

print(tab)
 #  [1] "table(`rquery_mat_84169225764052913511_0000000000`; AUC, R2)"
  
rstr(my_db, tab)
 #  table `rquery_mat_84169225764052913511_0000000000` SQLiteConnection 
 #   nrow: 1 
 #  'data.frame':   1 obs. of  2 variables:
 #   $ AUC: num 0.6
 #   $ R2 : num 0.2

if(use_spark) {
  sparklyr::spark_disconnect(my_db)
} else {
  DBI::dbDisconnect(my_db)
}

The cdata package is a demonstration of the "coordinatized data" theory and includes an implementation of the "fluid data" methodology. The recommended tutorial is: Fluid data reshaping with cdata. We also have a short free cdata screencast (and another example can be found here).

Install via CRAN:

install.packages("cdata")

Note: cdata is targeted at data with "tame column names" (column names that are valid both in databases, and as R unquoted variable names) and basic types (column values that are simple R types such as character, numeric, logical, and so on).

Copy Link

Version

Install

Monthly Downloads

Version

License

Issues

Pull Requests

Stars

Forks

Repository

Homepage

Maintainer

Last Published

Functions in cdata (1.0.7)