dm
The goal of {dm} is to provide tools for working with multiple tables.
Skip to the Features section if you are familiar with relational data models.
- Why? gives a short motivation, especially for {dplyr} users
- Good to Know explains important terms of relational data models
- Features gives a one-page overview over the scope of this package
- Example outlines some of the features in a short example
- More information offers links to more detailed articles
- Standing on the shoulders of giants shows related work
- Installation describes how to install the package
Why?
The motivation for the {dm} package is a more sophisticated data management. {dm} uses the relational data model and its core concept of splitting one table into multiple tables.
This has a hugh advantage: The code becomes simpler.
Example
As an example, we consider the
nycflights13
dataset. This
dataset contains five tables: the main flights
table with links into
the airlines
, planes
and airports
tables, and a weather
table
without an explicit link.
Assume your task is to merge all tables (except the weather
table).
This cross-referencing is a common first step when modelling or plotting
data.
In {dm} the basic element is a dm
object.
You can create it with cdm_nycflights13()
for the example data. After
that you can use the links between the tables as often as you wish -
without explicitly referring to the relations ever again. The task of
joining four tables (flights
, airlines
, planes
and airports
)
boils down to:
cdm_nycflights13() %>%
cdm_flatten_to_tbl(start = flights)
#> Renamed columns:
#> * year -> flights$flights.year, planes$planes.year
#> * name -> airlines$airlines.name, airports$airports.name
In contrast, using the classical {dplyr} notation you need three
left_join()
calls to merge the flights
table gradually to
airlines
, planes
and airports
tables to create one wide data
frame.
library(tidyverse)
library(nycflights13)
flights %>%
left_join(airlines, by = "carrier") %>%
left_join(planes, by = "tailnum") %>%
left_join(airports, by = c("origin" = "faa"))
You can find more information and important terms to jump-start working with {dm} in the article “Introduction to Relational Data Models”.
The Advantages in Brief
The separation into multiple tables achieves several goals:
- Avoid repetition, conserve memory: the information related to
each airline, airport, and airplane are stored only once
- name of each airline
- name, location and altitude of each airport
- manufacturer and number of seats for each airplane
- Improve consistency: for updating any information (e.g. the name of an airport), it is sufficient to update in only one place
- Segmentation: information is organized by topic, individual tables are smaller and easier to handle
Good to Know
Multiple, linked tables are a common concept in database management. Since many R users have a background in other disciplines, we present six important terms in relational data modeling to jump-start working with {dm}.
1) Data frames and tables
A data frame is a fundamental data structure in R. If you imagine it visually, the result is a typical table structure. That’s why working with data from spreadsheets is so convenient and users of the the popular {dplyr} package for data wrangling mainly rely on data frames.
The downside: Data frames and flat file systems like spreadsheets can result in bloated tables, that hold many repetitive values. Worst case, you have a data frame with multiple columns and in each row only a single value is different.
This calls for a better data organization by utilizing the resemblance between data frames and database tables, which consist of columns and rows, too. The elements are just called differently:
Data Frame | Table |
---|---|
Column | Attribute |
Row | Tuple |
Therefore, the separation into multiple tables is a first step that helps data quality. But without an associated data model you don’t take full advantage. For example, joining is more complicated than it should be. This is illustrated above.
With {dm} you can have the best of both worlds: Manage your data as linked tables, then flatten multiple tables into one for your analysis with {dplyr} on an as-needed basis.
2) Model
A data model shows the structure between multiple tables that can be
linked together. The nycflights13
relations can be transferred into
the following graphical representation:
The flights
table is linked to three other tables: airlines
,
planes
and airports
. By using directed arrows the visualization
explicitly shows the connection between different columns/attributes.
For example: The column carrier
in flights
can be joined with the
column carrier
from the airlines
table. Further Reading: The {dm}
methods for visualizing data
models.
The links between the tables are established through primary keys and foreign keys.
3) Primary Keys
In a relational data model every table needs to have one
column/attribute that uniquely identifies a row. This column is called
primary key (abbreviated with pk). The primary key column has unique
values and can’t contain NA
or NULL
values. If no such column
exists, it is common practice to create a synthetic column of numeric or
globally unique identifiers (surrogate key).
In the airlines
table of nycflights13
the column carrier
is the
primary key.
Further Reading: The {dm} package offers several function for dealing with primary keys.
4) Foreign Keys
The counterpart of a primary key in one table is the foreign key in another table. In order to join two tables, the primary key of the first table needs to be available in the second table, too. This second column is called the foreign key (abbreviated with fk).
For example, if you want to link the airlines
table in the
nycflights13
data to the flights
table, the primary key in the
airlines
table is carrier
which is present as foreign key carrier
in the flights
table.
Further Reading: The {dm} functions for working with foreign keys.
5) Normalization
One main goal is to keep the data organization as clean and simple as possible by avoiding redundant data entries. Normalization is the technical term that describes this central design principle of a relational data model: splitting data into multiple tables. A normalized data schema consists of several relations (tables) that are linked with attributes (columns) with primary and foreign keys.
For example, if you want to change the name of one airport in
nycflights13
, you have to change only a single data entry. Sometimes,
this principle is called “single point of truth”.
See the Wikipedia article on database normalization for more details. Consider reviewing the Simple English version for a gentle introduction.
6) Relational Databases
dm
is built upon relational data models, but it is not a database
itself. Databases are systems for data management and many of them are
constructed as relational databases, e.g. SQLite, MySQL, MSSQL,
Postgres. As you can guess from the names of the databases SQL, the
structured querying language plays an important role: It was
invented for the purpose of querying relational databases.
Therefore, {dm} can copy data from and to databases, and works transparently with both in-memory data and with relational database systems.
Features
This package helps with many challenges that arise when working with relational data models.
Compound object
The dm
class manages several related tables. It stores both the
data and the metadata in a compound object, and defines
operations on that object. These operations either affect the data
(e.g., a filter), or the metadata (e.g., definition of keys or creation
of a new table), or both.
- data: a table source storing all tables
- metadata: table names, column names, primary and foreign keys
This concept helps separating the join logic from the code: declare your relationships once, as part of your data, then use them in your code without repeating yourself.
Storage agnostic
The {dm} package augments {dplyr}/{dbplyr} workflows. Generally, if you can use {dplyr} on your data, it’s likely that you can use {dm} too. This includes local data frames, relational database systems, and many more.
Data preparation
A battery of utilities helps with creating a tidy relational data model.
- Splitting and rejoining tables
- Determining key candidates
- Checking keys and cardinalities
Example
A readymade dm
object with preset keys is included in the package:
library(dm)
cdm_nycflights13()
The cdm_draw()
function creates a visualization of the entity
relationship model:
cdm_nycflights13(cycle = TRUE) %>%
cdm_draw()
Filtering and joining
Similarly to dplyr::filter()
, a filtering function cdm_filter()
is
available for dm
objects. You need to provide the dm
object, the
table whose rows you want to filter, and the filter expression. The
actual effect of the filtering will only be realized once you use
cdm_apply_filters
. Before that, the filter conditions are merely
stored within the dm
. After using cdm_apply_filters()
a dm
object
is returned whose tables only contain rows that are related to the
reduced rows in the filtered table. This currently only works for
cycle-free relationships between the tables.
cdm_nycflights13(cycle = FALSE) %>%
cdm_get_tables() %>%
map_int(nrow)
#> airlines airports flights planes weather
#> 16 1458 336776 3322 26115
cdm_nycflights13(cycle = FALSE) %>%
cdm_filter(planes, year == 2000, manufacturer == "BOEING") %>%
cdm_apply_filters() %>%
cdm_get_tables() %>%
map_int(nrow)
#> airlines airports flights planes weather
#> 4 3 7301 134 26115
For joining two tables using their relationship defined in the dm
, you
can use cdm_join_tbl()
:
cdm_nycflights13(cycle = FALSE) %>%
cdm_join_to_tbl(airports, flights, join = semi_join)
In our dm
, the origin
column of the flights
table points to the
airports
table. Since all nycflights13
-flights depart from New York,
only these airports are included in the semi-join.
From and to databases
In order to transfer an existing dm
object to a DB, you can call
cdm_copy_to()
with the target DB and the dm
object:
src_sqlite <- src_sqlite(":memory:", create = TRUE)
src_sqlite
#> src: sqlite 3.29.0 [:memory:]
#> tbls:
nycflights13_remote <- cdm_copy_to(src_sqlite, cdm_nycflights13(cycle = TRUE))
nycflights13_remote
The key constraints from the original object are also copied to the
newly created object. With the default setting set_key_constraints =
TRUE
for cdm_copy_to()
, key constraints are also established on the
target DB. Currently this feature is only supported for MSSQL and
Postgres database management systems (DBMS).
It is also possible to automatically create a dm
object from the
permanent tables of a DB. Again, for now just MSSQL and Postgres are
supported for this feature, so the next chunk is not evaluated. The
support for other DBMS will be implemented in a future update.
src_postgres <- src_postgres()
nycflights13_from_remote <- cdm_learn_from_db(src_postgres)
More information
If you would like to learn more about {dm}, the Intro article is a good place to start. Further resources:
- Function reference
- Introduction to Relational Data Models article
- Joining article
- Filtering article
- Class ‘dm’ and basic operations article
- Visualizing ‘dm’ objects article
- Low-level operations article
Standing on the shoulders of giants
This package follows the tidyverse principles:
dm
objects are immutable (your data will never be overwritten in place)- many functions used on
dm
objects are pipeable (i.e., return newdm
objects) - tidy evaluation is used (unquoted function parameters are supported)
The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!
The {polyply} package has a similar intent with a slightly different interface.
The {data.cube} package has
quite the same intent using array
-like interface.
Articles in the {rquery} package discuss join controllers and join dependency sorting, with the intent to move the declaration of table relationships from code to data.
The {tidygraph} package stores
a network as two related tables of nodes
and edges
, compatible with
{dplyr} workflows.
In object-oriented programming languages, object-relational mapping is a similar concept that attempts to map a set of related tables to a class hierarchy.
Installation
Once on CRAN, the package can be installed with
install.packages("dm")
Install the latest development version with
# install.packages("devtools")
devtools::install_github("krlmlr/dm")
License: MIT © cynkra GmbH.
Funded by:
Please note that the ‘dm’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.