starschemar

The multidimensional data model was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure.

Transformations can be carried out using professional ETL (extract, transform and load) tools. Recently, tools aimed at end users have emerged, which are also aimed at performing transformation operations. All these tools are very useful to carry out the transformation process, they provide a development environment to define the transformation operations in a general way.

Frequently, the operations to be performed aim to transform a flat table (with data that comes from operational systems) into a star schema (which implements a multidimensional system). With the tools mentioned above, this transformation can be carried out, but it requires a lot of work. I am not aware of any tools with operations designed to specifically support this transformation process.

The goal of starschemar is to define transformations that allow you to easily obtain star schemas from flat tables. In addition, it includes basic data cleaning, dimension enrichment, incremental data refresh and query operations, adapted to this context.

Based on this package, the rolap package has been developed. It incorporates the main functionalities for which starschemar was initially intended and it has a new data model defined with the aim of facilitating its maintenance and evolution.

Installation

You can install the released version of starschemar from CRAN with:

install.packages("starschemar")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("josesamos/starschemar")

Example

To illustrate how the package works we will use a small part of the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System data set in the form of a flat table, the first rows of which are shown below.

YearWEEKWeek Ending DateREGIONStateCityPneumonia and Influenza DeathsAll DeathsOther Deaths
196211962-01-061CTBridgeport34643
196221962-01-131CTBridgeport24341
196231962-01-201CTBridgepor24038
196211962-01-069WATacoma45046
NANA1962-01-139WATacoma24543
196231962-01-209WATacoma03939

The original data has been modified to have some erroneous or missing data.

The transformation to obtain a star schema from the flat table using starschemar package is as follows:

library(starschemar)

# columns to consider in the definition
dput(colnames(ft))
#> c("Year", "WEEK", "Week Ending Date", "REGION", "State", "City", 
#> "Pneumonia and Influenza Deaths", "All Deaths", "Other Deaths"
#> )

dm <- dimensional_model() |>
  define_fact(
    name = "mrs_cause",
    measures = c(
      "Pneumonia and Influenza Deaths",
      "Other Deaths"
    ),
  ) |>
  define_dimension(
    name = "when",
    attributes = c(
      "Week Ending Date",
      "WEEK",
      "Year"
    )
  ) |>
  define_dimension(
    name = "where",
    attributes = c(
      "REGION",
      "State",
      "City"
    )
  )

st <- star_schema(ft, dm) |>
  snake_case() |>
  character_dimensions(
    NA_replacement_value = "Unknown",
    length_integers = list(week = 2)
  )

The tables of dimensions and facts of the obtained star schema are shown below.

when_keyweek_ending_dateweekyear
11962-01-06011962
21962-01-13021962
31962-01-13UnknownUnknown
41962-01-20031962
where_keyregionstatecity
11CTBridgepor
21CTBridgeport
39WATacoma
when_keywhere_keypneumonia_and_influenza_deathsother_deathsnrow_agg
123431
134461
222411
332431
412381
430391

The tables show the erroneous and missing data. We are going to perform some data cleaning operations to correct them.

where <- st |>
  get_dimension("where")

when <- st |>
  get_dimension("when")

updates <- record_update_set() |>
  update_selection(
    dimension = where,
    columns = c("city"),
    old_values = c("Bridgepor"),
    new_values = c("Bridgeport")
  ) |>
  match_records(dimension = when,
                old = 3,
                new = 2)

st <- st |>
  modify_dimension_records(updates)

The new dimension and fact tables are shown below.

when_keyweek_ending_dateweekyear
11962-01-06011962
21962-01-13021962
31962-01-20031962
where_keyregionstatecity
11CTBridgeport
29WATacoma
where_keywhen_keypneumonia_and_influenza_deathsother_deathsnrow_agg
113431
122411
132381
214461
222431
230391

In addition to the operations in the examples shown, starschemar offers support for defining role playing and role dimensions in a star schema, fact constellations with conformed dimensions, dimension enrichment, incremental data refresh operations, the ability to export results in various tibble-based formats, and query operations, adapted to this context.

Copy Link

Version

Down Chevron

Install

install.packages('starschemar')

Monthly Downloads

209

Version

1.2.4

License

MIT + file LICENSE

Issues

Pull Requests

Stars

Forks

Maintainer

Last Published

January 8th, 2024

Functions in starschemar (1.2.4)