dm v0.0.3.9003

Relational data models

Some actions and transformations that are frequently required when dealing with data models are covered by the functions of this package.

Readme

Lifecycle:
experimental Travis build
status Codecov test
coverage CRAN
status Launch
rstudio.cloud

dm

The goal of {dm} is to provide tools for working with multiple tables.

Skip to the Features section if you are familiar with relational data models.

Why?

The motivation for the {dm} package is a more sophisticated data management. {dm} uses the relational data model and its core concept of splitting one table into multiple tables.

This has a hugh advantage: The code becomes simpler.

Example

As an example, we consider the nycflights13 dataset. This dataset contains five tables: the main flights table with links into the airlines, planes and airports tables, and a weather table without an explicit link.

Assume your task is to merge all tables (except the weather table). This cross-referencing is a common first step when modelling or plotting data.

In {dm} the basic element is a dm object. You can create it with cdm_nycflights13() for the example data. After that you can use the links between the tables as often as you wish - without explicitly referring to the relations ever again. The task of joining four tables (flights, airlines, planes and airports) boils down to:

cdm_nycflights13() %>%
  cdm_flatten_to_tbl(start = flights)
#> Renamed columns:
#> * name -> airlines$airlines.name, airports$airports.name
#> * year -> flights$flights.year, planes$planes.year
#> # A tibble: 336,776 x 35
#>    flights.year month   day dep_time sched_dep_time dep_delay arr_time
#>           <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1         2013     1     1      517            515         2      830
#>  2         2013     1     1      533            529         4      850
#>  3         2013     1     1      542            540         2      923
#>  4         2013     1     1      544            545        -1     1004
#>  5         2013     1     1      554            600        -6      812
#>  6         2013     1     1      554            558        -4      740
#>  7         2013     1     1      555            600        -5      913
#>  8         2013     1     1      557            600        -3      709
#>  9         2013     1     1      557            600        -3      838
#> 10         2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 28 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, airlines.name <chr>,
#> #   airports.name <chr>, lat <dbl>, lon <dbl>, alt <int>, tz <dbl>,
#> #   dst <chr>, tzone <chr>, planes.year <int>, type <chr>,
#> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>,
#> #   speed <int>, engine <chr>

In contrast, using the classical {dplyr} notation you need three left_join() calls to merge the flights table gradually to airlines, planes and airports tables to create one wide data frame.

library(tidyverse)
library(nycflights13)

flights %>%
  left_join(airlines, by = "carrier") %>%
  left_join(planes, by = "tailnum") %>%
  left_join(airports, by = c("origin" = "faa"))
#> # A tibble: 336,776 x 35
#>    year.x month   day dep_time sched_dep_time dep_delay arr_time
#>     <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1   2013     1     1      517            515         2      830
#>  2   2013     1     1      533            529         4      850
#>  3   2013     1     1      542            540         2      923
#>  4   2013     1     1      544            545        -1     1004
#>  5   2013     1     1      554            600        -6      812
#>  6   2013     1     1      554            558        -4      740
#>  7   2013     1     1      555            600        -5      913
#>  8   2013     1     1      557            600        -3      709
#>  9   2013     1     1      557            600        -3      838
#> 10   2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 28 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, name.x <chr>, year.y <int>,
#> #   type <chr>, manufacturer <chr>, model <chr>, engines <int>,
#> #   seats <int>, speed <int>, engine <chr>, name.y <chr>, lat <dbl>,
#> #   lon <dbl>, alt <int>, tz <dbl>, dst <chr>, tzone <chr>

You can find more information and important terms to jump-start working with {dm} in the article “Introduction to Relational Data Models”.

The Advantages in Brief

The separation into multiple tables achieves several goals:

  • Avoid repetition, conserve memory: the information related to each airline, airport, and airplane are stored only once
    • name of each airline
    • name, location and altitude of each airport
    • manufacturer and number of seats for each airplane
  • Improve consistency: for updating any information (e.g. the name of an airport), it is sufficient to update in only one place
  • Segmentation: information is organized by topic, individual tables are smaller and easier to handle

Good to Know

Multiple, linked tables are a common concept in database management. Since many R users have a background in other disciplines, we present six important terms in relational data modeling to jump-start working with {dm}.

1) Data frames and tables

A data frame is a fundamental data structure in R. If you imagine it visually, the result is a typical table structure. That’s why working with data from spreadsheets is so convenient and users of the the popular {dplyr} package for data wrangling mainly rely on data frames.

The downside: Data frames and flat file systems like spreadsheets can result in bloated tables, that hold many repetitive values. Worst case, you have a data frame with multiple columns and in each row only a single value is different.

This calls for a better data organization by utilizing the resemblance between data frames and database tables, which consist of columns and rows, too. The elements are just called differently:

Data Frame Table
Column Attribute
Row Tuple

Therefore, the separation into multiple tables is a first step that helps data quality. But without an associated data model you don’t take full advantage. For example, joining is more complicated than it should be. This is illustrated above.

With {dm} you can have the best of both worlds: Manage your data as linked tables, then flatten multiple tables into one for your analysis with {dplyr} on an as-needed basis.

2) Model

A data model shows the structure between multiple tables that can be linked together. The nycflights13 relations can be transferred into the following graphical representation:

The flights table is linked to three other tables: airlines, planes and airports. By using directed arrows the visualization explicitly shows the connection between different columns/attributes. For example: The column carrier in flights can be joined with the column carrier from the airlines table. Further Reading: The {dm} methods for visualizing data models.

The links between the tables are established through primary keys and foreign keys.

3) Primary Keys

In a relational data model every table needs to have one column/attribute that uniquely identifies a row. This column is called primary key (abbreviated with pk). The primary key column has unique values and can’t contain NA or NULL values. If no such column exists, it is common practice to create a synthetic column of numeric or globally unique identifiers (surrogate key).

In the airlines table of nycflights13 the column carrier is the primary key.

Further Reading: The {dm} package offers several function for dealing with primary keys.

4) Foreign Keys

The counterpart of a primary key in one table is the foreign key in another table. In order to join two tables, the primary key of the first table needs to be available in the second table, too. This second column is called the foreign key (abbreviated with fk).

For example, if you want to link the airlines table in the nycflights13 data to the flights table, the primary key in the airlines table is carrier which is present as foreign key carrier in the flights table.

Further Reading: The {dm} functions for working with foreign keys.

5) Normalization

One main goal is to keep the data organization as clean and simple as possible by avoiding redundant data entries. Normalization is the technical term that describes this central design principle of a relational data model: splitting data into multiple tables. A normalized data schema consists of several relations (tables) that are linked with attributes (columns) with primary and foreign keys.

For example, if you want to change the name of one airport in nycflights13, you have to change only a single data entry. Sometimes, this principle is called “single point of truth”.

See the Wikipedia article on database normalization for more details. Consider reviewing the Simple English version for a gentle introduction.

6) Relational Databases

dm is built upon relational data models, but it is not a database itself. Databases are systems for data management and many of them are constructed as relational databases, e.g. SQLite, MySQL, MSSQL, Postgres. As you can guess from the names of the databases SQL, the structured querying language plays an important role: It was invented for the purpose of querying relational databases.

Therefore, {dm} can copy data from and to databases, and works transparently with both in-memory data and with relational database systems.

Features

This package helps with many challenges that arise when working with relational data models.

Compound object

The dm class manages several related tables. It stores both the data and the metadata in a compound object, and defines operations on that object. These operations either affect the data (e.g., a filter), or the metadata (e.g., definition of keys or creation of a new table), or both.

  • data: a table source storing all tables
  • metadata: table names, column names, primary and foreign keys

This concept helps separating the join logic from the code: declare your relationships once, as part of your data, then use them in your code without repeating yourself.

Storage agnostic

The {dm} package augments {dplyr}/{dbplyr} workflows. Generally, if you can use {dplyr} on your data, it’s likely that you can use {dm} too. This includes local data frames, relational database systems, and many more.

Data preparation

A battery of utilities helps with creating a tidy relational data model.

  • Splitting and rejoining tables
  • Determining key candidates
  • Checking keys and cardinalities

Example

A readymade dm object with preset keys is included in the package:

library(dm)

cdm_nycflights13()
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   3 primary keys
#>   3 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

The cdm_draw() function creates a visualization of the entity relationship model:

cdm_nycflights13(cycle = TRUE) %>%
  cdm_draw()

Filtering and joining

Similarly to dplyr::filter(), a filtering function cdm_filter() is available for dm objects. You need to provide the dm object, the table whose rows you want to filter, and the filter expression. The actual effect of the filtering will only be realized once you use cdm_apply_filters. Before that, the filter conditions are merely stored within the dm. After using cdm_apply_filters() a dm object is returned whose tables only contain rows that are related to the reduced rows in the filtered table. This currently only works for cycle-free relationships between the tables.

cdm_nycflights13(cycle = FALSE) %>%
  cdm_get_tables() %>%
  map_int(nrow)
#> airlines airports  flights   planes  weather 
#>       16     1458   336776     3322    26115

cdm_nycflights13(cycle = FALSE) %>%
  cdm_filter(planes, year == 2000, manufacturer == "BOEING") %>%
  cdm_apply_filters() %>%
  cdm_get_tables() %>%
  map_int(nrow)
#> airlines airports  flights   planes  weather 
#>        4        3     7301      134    26115

For joining two tables using their relationship defined in the dm, you can use cdm_join_tbl():

cdm_nycflights13(cycle = FALSE) %>%
  cdm_join_to_tbl(airports, flights, join = semi_join)
#> # A tibble: 336,776 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1     1      517            515         2      830
#>  2  2013     1     1      533            529         4      850
#>  3  2013     1     1      542            540         2      923
#>  4  2013     1     1      544            545        -1     1004
#>  5  2013     1     1      554            600        -6      812
#>  6  2013     1     1      554            558        -4      740
#>  7  2013     1     1      555            600        -5      913
#>  8  2013     1     1      557            600        -3      709
#>  9  2013     1     1      557            600        -3      838
#> 10  2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

In our dm, the origin column of the flights table points to the airports table. Since all nycflights13-flights depart from New York, only these airports are included in the semi-join.

From and to databases

In order to transfer an existing dm object to a DB, you can call cdm_copy_to() with the target DB and the dm object:

src_sqlite <- src_sqlite(":memory:", create = TRUE)
src_sqlite
#> src:  sqlite 3.29.0 [:memory:]
#> tbls:
nycflights13_remote <- cdm_copy_to(src_sqlite, cdm_nycflights13(cycle = TRUE))
nycflights13_remote
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  sqlite 3.29.0 [:memory:]
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   3 primary keys
#>   4 references
#> ── Filters ────────────────────────────────────────────────────────────────
#> None

The key constraints from the original object are also copied to the newly created object. With the default setting set_key_constraints = TRUE for cdm_copy_to(), key constraints are also established on the target DB. Currently this feature is only supported for MSSQL and Postgres database management systems (DBMS).

It is also possible to automatically create a dm object from the permanent tables of a DB. Again, for now just MSSQL and Postgres are supported for this feature, so the next chunk is not evaluated. The support for other DBMS will be implemented in a future update.

src_postgres <- src_postgres()
nycflights13_from_remote <- cdm_learn_from_db(src_postgres)

More information

If you would like to learn more about {dm}, the Intro article is a good place to start. Further resources:

Standing on the shoulders of giants

This package follows the tidyverse principles:

  • dm objects are immutable (your data will never be overwritten in place)
  • many functions used on dm objects are pipeable (i.e., return new dm objects)
  • tidy evaluation is used (unquoted function parameters are supported)

The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!

The {polyply} package has a similar intent with a slightly different interface.

The {data.cube} package has quite the same intent using array-like interface.

Articles in the {rquery} package discuss join controllers and join dependency sorting, with the intent to move the declaration of table relationships from code to data.

The {tidygraph} package stores a network as two related tables of nodes and edges, compatible with {dplyr} workflows.

In object-oriented programming languages, object-relational mapping is a similar concept that attempts to map a set of related tables to a class hierarchy.

Installation

Once on CRAN, the package can be installed with

install.packages("dm")

Install the latest development version with

# install.packages("devtools")
devtools::install_github("krlmlr/dm")

License: MIT © cynkra GmbH.

Funded by:

energie360° cynkra


Please note that the ‘dm’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Functions in dm

Name Description
cdm_get_pk Retrieve the name of the column marked as primary key of a table of a dm object
reunite_parent_child Merge two tables linked by a foreign key relation
reexports Objects exported from other packages
cdm_flatten_to_tbl Flatten part of a dm into a wide table
cdm_get_all_fks Retrieve all foreign key constraints in a dm
cdm_select_tbl Select and rename tables
cdm_rm_tbl Remove tables from a dm
cdm_select Select and/or rename one or more columns of a dm table
dm Data model class
cdm_get_referencing_tables Get the names of a dm's tables referencing a given table.
check_cardinality Test if the relation between two tables of a data model meet the requirements
enum_pk_candidates Which columns are candidates for a primary key column?
cdm_is_referenced Is a dm's table referenced by another table?
cdm_get_fk Retrieve the name of the column marked as foreign key, pointing from one table of a dm to another
cdm_draw Draw schema of a dm-object's data model
cdm_get_all_pks Get all primary keys of a dm object
check_set_equality Test if the value sets of two different columns in two different tables are the same
cdm_rm_fk Remove reference(s) from one table of a dm to another
cdm_rm_pk Remove primary key from a table in a dm object
decompose_table Decompose a table into two linked tables
cdm_has_fk Does a reference from one table of a dm to another exist?
cdm_nycflights13 Creates a dm object for the nycflights13 data
cdm_join_to_tbl Perform a join between two tables of a dm
cdm_has_pk Does a table of a dm object have a column set as primary key?
cdm_nrow Number of rows
cdm_learn_from_db Learn a dm object from a database (DB)
check_key Test if column (combination) is unique key of table
cdm_rename Rename one or more columns of a dm table
check_if_subset Test if values of one column are a subset of values of another column
cdm_enum_fk_candidates Find foreign key candidates in a table
cdm_filter Filtering a dm object
cdm_add_pk Mark a column of a table in a dm object as its primary key
cdm_add_tbl Add tables to a dm
cdm_add_fk Add a reference from one table of a dm to another
cdm_disambiguate_cols Avoid column name conflicts
cdm_check_constraints Validate your dm: are all key constraints met?
cdm_copy_to Copy a dm-object to a src/con
browse_docs Browse package documentation
No Results!

Vignettes of dm

Name
abstracts/rstudio-conf-2020.md
abstracts/user-2019.md
wip/dm-db.Rmd
wip/dm-shiny-dynafilter.Rmd
wip/dm-tips-and-tricks.Rmd
dm-class-and-basic-operations.Rmd
dm-filtering.Rmd
dm-introduction-relational-data-models.Rmd
dm-joining.Rmd
dm-low-level.Rmd
dm-visualization.Rmd
dm.Rmd
setting_up_dm.Rmd
No Results!

Details

Date 2019-10-13
License MIT + file LICENSE
VignetteBuilder knitr
Remotes bergant/datamodelr
Encoding UTF-8
LazyData true
Roxygen list(markdown = TRUE, roclets = c("collate", "namespace", "rd"))
RoxygenNote 6.1.1

Include our badge in your README

[![Rdoc](http://www.rdocumentation.org/badges/version/dm)](http://www.rdocumentation.org/packages/dm)