Learn R Programming

⚠️There's a newer version (0.2.8) of this package.Take me there.

dm

{dm} is an R package that provides tools for working with multiple related tables, stored as data frames or in a relational database.

Contents

Background

Relational databases and flat tables, like data frames or spreadsheets, present data in fundamentally different ways.

In data frames and spreadsheets, all data is presented together in one large table with many rows and columns. This means that the data is accessible in one location but has the disadvantage that the same values may be repeated multiple times, resulting in bloated tables with redundant data. In the worst case scenario, a data frame may have many rows and columns but only a single value different in each row.

Relational databases, on the other hand, do not keep all data together but split it into multiple smaller tables. That separation into sub-tables has several advantages:

  • all information is stored only once, avoiding repetition and conserving memory
  • all information is updated only once and in one place, improving consistency and avoiding errors that may result from updating the same value in multiple locations
  • all information is organized by topic and segmented into smaller tables that are easier to handle

Separation of data, thus, helps with data quality, and explains the continuing popularity of relational databases in production-level data management.

The downside of this approach is that it is harder to merge together information from different data sources and to identify which entities refer to the same object, a common task when modelling or plotting data. To be mapped uniquely, the entities would need to be designated as keys, and the separate tables collated together through a process called joining.

In R, there already exist packages that support handling inter-linked tables but the code is complex and requires multiple command sequences. The goal of the {dm} package is to simplify the data management processes in R while keeping the advantages of relational data models and the core concept of splitting one table into multiple tables. In this way, you can have the best of both worlds: manage your data as a collection of linked tables, then flatten multiple tables into one for an analysis with {dplyr} or other packages, on an as-needed basis.

Although {dm} is built upon relational data models, it is not a database itself. It can work transparently with both relational database systems and in-memory data, and copy data from and to databases.

Example

As an example, consider the nycflights13 dataset about the flights that departed New York City airports in 2013. The dataset contains five tables: the main flights table with links to the airlines, planes and airports tables, and the weather table without explicit links.

Assume that your task is to merge all tables, except the weather table.

In the classical {dplyr} notation, you would need three left_join() calls to merge the flights table gradually to the airlines, planes and airports tables to create one wide data frame:

library(tidyverse)
library(nycflights13)

flights %>%
  left_join(airlines, by = "carrier") %>%
  left_join(planes, by = "tailnum") %>%
  left_join(airports, by = c("origin" = "faa"))

With the {dm} package, you would create a dm object. After that you would be able to use the links between the tables as often as you wish, without explicitly referring to the relations ever again.

For the example data set, you can use dm_nycflights13() to create the dm object, and a single command for merging the tables. The task of joining the four flights, airlines, planes and airports tables then boils down to:

library(dm)

dm_nycflights13() %>%
  dm_flatten_to_tbl(start = flights)
#> Renamed columns:
#> * year -> flights$flights.year, planes$planes.year
#> * name -> airlines$airlines.name, airports$airports.name

The example data model for {nycflights13} is integrated in {dm} and defines primary and foreign keys to identify the common points between the tables. For data other than the example data, the dm object would need to be created by using the dm() constructor and by adding keys using dm_add_pk() and dm_add_fk().

Features

The {dm} package helps with the challenges that arise with working with relational data models.

Compound object

The dm class manages several related tables. It stores both the data and the metadata in a compound object.

  • data: a table source storing all tables
  • metadata: table names, column names, primary and foreign keys

This concept helps separating the join logic from the code: declare your relationships once, as part of your data, then use them in your code without repeating yourself.

Various operations on dm objects are implemented. They either affect the data (e.g., a filter), or the metadata (e.g., definition of keys), or both (e.g., creation of a new table).

Storage agnostic

The {dm} package augments {dplyr}/{dbplyr} workflows. Generally, if you can use {dplyr} on your data, it’s likely that you can use {dm} too. This includes local data frames, relational database systems, and many more.

Data preparation

A battery of utilities helps with creating a tidy relational data model.

  • Splitting and rejoining tables
  • Determining key candidates
  • Checking keys and cardinalities

Filtering and joining

Similarly to dplyr::filter(), a filtering function dm_filter() is available for dm objects. You need to provide the dm object, the table whose rows you want to filter, and the filter expression. The actual effect of the filtering will only be realized once you use dm_apply_filters. Before that, the filter conditions are merely stored within the dm. After using dm_apply_filters() a dm object is returned whose tables only contain rows that are related to the reduced rows in the filtered table. This currently only works for cycle-free relationships between the tables.

dm_nycflights13(cycle = FALSE) %>%
  dm_get_tables() %>%
  map_int(nrow)
#> airlines airports  flights   planes  weather 
#>       16     1458   336776     3322    26115

dm_nycflights13(cycle = FALSE) %>%
  dm_filter(planes, year == 2000, manufacturer == "BOEING") %>%
  dm_apply_filters() %>%
  dm_get_tables() %>%
  map_int(nrow)
#> airlines airports  flights   planes  weather 
#>        4        3     7301      134    26115

For joining two tables using their relationship defined in the dm, you can use dm_join_to_tbl():

dm_nycflights13(cycle = FALSE) %>%
  dm_join_to_tbl(airports, flights, join = semi_join)

In our dm, the origin column of the flights table points to the airports table. Since all nycflights13-flights depart from New York, only these airports are included in the semi-join.

From and to databases

In order to transfer an existing dm object to a DB, you can call dm_copy_to() with the target DB and the dm object:

src_sqlite <- src_sqlite(":memory:", create = TRUE)
src_sqlite
#> src:  sqlite 3.29.0 [:memory:]
#> tbls:
nycflights13_remote <- dm_copy_to(src_sqlite, dm_nycflights13(cycle = TRUE))
nycflights13_remote

The key constraints from the original object are also copied to the newly created object. With the default setting set_key_constraints = TRUE for dm_copy_to(), key constraints are also established on the target DB. Currently this feature is only supported for MSSQL and Postgres database management systems (DBMS).

It is also possible to automatically create a dm object from the permanent tables of a DB. Again, for now just MSSQL and Postgres are supported for this feature, so the next chunk is not evaluated. The support for other DBMS will be implemented in a future update.

src_postgres <- src_postgres()
nycflights13_from_remote <- dm_learn_from_db(src_postgres)

Installation

The latest (development) version of {dm} can be installed from GitHub.

# install.packages("devtools")
devtools::install_github("krlmlr/dm")

The {dm} package will also be made available on CRAN, from where it can be installed with the command

install.packages("dm")

More information

For an introduction into relational data models and to jump-start working with {dm}, please see the article “Introduction to Relational Data Models”.

If you would like to learn more about {dm}, the Intro article is a good place to start.

Further resources:

Standing on the shoulders of giants

The {dm} package follows the tidyverse principles:

  • dm objects are immutable (your data will never be overwritten in place)
  • many functions used on dm objects are pipeable (i.e., return new dm objects)
  • tidy evaluation is used (unquoted function parameters are supported)

The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!

The {polyply} package has a similar intent with a slightly different interface.

The {data.cube} package has quite the same intent using array-like interface.

Articles in the {rquery} package discuss join controllers and join dependency sorting, with the intent to move the declaration of table relationships from code to data.

The {tidygraph} package stores a network as two related tables of nodes and edges, compatible with {dplyr} workflows.

In object-oriented programming languages, object-relational mapping is a similar concept that attempts to map a set of related tables to a class hierarchy.


License: MIT © cynkra GmbH.

Funded by:


Please note that the ‘dm’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Copy Link

Version

Install

install.packages('dm')

Monthly Downloads

2,933

Version

0.0.6.9000

License

MIT + file LICENSE

Maintainer

Kirill M<c3><bc>ller

Last Published

April 8th, 2022

Functions in dm (0.0.6.9000)

dm_flatten_to_tbl

Flatten a part of a dm into a wide table
dm_get_all_pks

dm_get_all_fks

dm_add_pk

dm_filter

dm_check_constraints

Validate your dm: are all key constraints met?
dm_add_tbl

dm_get_fk

dm_draw

Draw a diagram of a dm-object's data model
dm_enum_fk_candidates

Find foreign key candidates in a table
dm_get_pk

Retrieve the name of the primary key columns of a dm table
dm_learn_from_db

Learn a dm object from a database (DB)
dm_rm_pk

dm_rm_fk

dm_join_to_tbl

dm_paste

dm_has_pk

Does a table of a dm object have columns set as primary key?
dm_select_tbl

Select and rename tables
dm_is_referenced

Is a table of a dm referenced by another table?
dm_nycflights13

Creates a dm object for the nycflights13 data
pull_tbl

Retrieve a table from a dm or a zoomed_dm
dm_zoom_to_tbl

Single out a table of a dm
dm_nrow

Number of rows
enum_pk_candidates

Which columns are candidates for a primary key column?
dm_rename

reexports

Objects exported from other packages
dm_disambiguate_cols

Avoid conflicts in column names
dm_get_referencing_tables

Get the names of the tables of a dm that reference a given table.
dm_copy_to

Copy a dm-object to a different data source
reunite_parent_child

Merge two tables that are linked by a foreign key relation
dm_has_fk

Does a reference exist?
dm_select

dm_rm_tbl

deprecated

Deprecated functions
decompose_table

Decompose a table into two linked tables
check_set_equality

Test if the value sets of two different columns in two different tables are the same
dm_add_fk

dm

Data model class
check_if_subset

Test if the values of one column are a subset of the values of another column
check_key

Test if a column (combination) is unique key of a table
browse_docs

Browse package documentation
check_cardinality

Test if the relation between two tables of a data model meet the requirements
check_subset

Test if the values of one column are a subset of the values of another column
copy_dm_to

Copy a dm-object to a different data source
dm_zoom_to

Single out a table of a dm
dm_examine_constraints

Validate your dm: are all key constraints met?
dm_get_filters

Get filter expressions
examine_cardinality

Test if the relation between two tables of a data model meet the requirements