starschemar v1.2.0

0

Monthly downloads

0th

Percentile

Obtaining Star Schemas from Flat Tables

Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure. Frequently, the operations to be performed aim to transform a flat table into a star schema. Transformations can be carried out using professional ETL (extract, transform and load) tools or tools intended for data transformation for end users. With the tools mentioned, this transformation can be carried out, but it requires a lot of work. The main objective this package is to define transformations that allow obtaining star schemas from flat tables easily. In addition, it includes basic data cleaning, dimension enrichment, incremental data refresh and query operations, adapted to this context.

Readme

starschemar

Travis build
status

The multidimensional data model was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure.

Transformations can be carried out using professional ETL (extract, transform and load) tools. Recently, tools aimed at end users have emerged, which are also aimed at performing transformation operations. All these tools are very useful to carry out the transformation process, they provide a development environment to define the transformation operations in a general way.

Frequently, the operations to be performed aim to transform a flat table (with data that comes from operational systems) into a star schema (which implements a multidimensional system). With the tools mentioned above, this transformation can be carried out, but it requires a lot of work.

The goal of starschemar is to define transformations that allow you to easily obtain star schemas from flat tables. In addition, it includes basic data cleaning, dimension enrichment, incremental data refresh and query operations, adapted to this context.

Installation

You can install the released version of starschemar from CRAN with:

install.packages("starschemar")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("josesamos/starschemar")

Example

To illustrate how the package works we will use a small part of the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System data set in the form of a flat table, the first rows of which are shown below.

Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths Other Deaths
1962 1 1962-01-06 1 CT Bridgeport 3 46 43
1962 2 1962-01-13 1 CT Bridgeport 2 43 41
1962 3 1962-01-20 1 CT Bridgepor 2 40 38
1962 1 1962-01-06 9 WA Tacoma 4 50 46
NA NA 1962-01-13 9 WA Tacoma 2 45 43
1962 3 1962-01-20 9 WA Tacoma 0 39 39

The original data has been modified to have some erroneous or missing data.

The transformation to obtain a star schema from the flat table using starschemar package is as follows:

library(starschemar)
library(tidyr)

# columns to consider in the definition
dput(colnames(ft))
#> c("Year", "WEEK", "Week Ending Date", "REGION", "State", "City", 
#> "Pneumonia and Influenza Deaths", "All Deaths", "Other Deaths"
#> )

dm <- dimensional_model() %>%
  define_fact(
    name = "mrs_cause",
    measures = c(
      "Pneumonia and Influenza Deaths",
      "Other Deaths"
    ),
  ) %>%
  define_dimension(
    name = "when",
    attributes = c(
      "Week Ending Date",
      "WEEK",
      "Year"
    )
  ) %>%
  define_dimension(
    name = "where",
    attributes = c(
      "REGION",
      "State",
      "City"
    )
  )

st <- star_schema(ft, dm) %>%
  snake_case() %>%
  character_dimensions(
    NA_replacement_value = "Unknown",
    length_integers = list(week = 2)
  )

The tables of dimensions and facts of the obtained star schema are shown below.

when_key week_ending_date week year
1 1962-01-06 01 1962
2 1962-01-13 02 1962
3 1962-01-13 Unknown Unknown
4 1962-01-20 03 1962
where_key region state city
1 1 CT Bridgepor
2 1 CT Bridgeport
3 9 WA Tacoma
when_key where_key pneumonia_and_influenza_deaths other_deaths nrow_agg
1 2 3 43 1
1 3 4 46 1
2 2 2 41 1
3 3 2 43 1
4 1 2 38 1
4 3 0 39 1

The tables show the erroneous and missing data. We are going to perform some data cleaning operations to correct them.

where <- st %>%
  get_dimension("where")

when <- st %>%
  get_dimension("when")

updates <- record_update_set() %>%
  update_selection(
    dimension = where,
    columns = c("city"),
    old_values = c("Bridgepor"),
    new_values = c("Bridgeport")
  ) %>%
  match_records(dimension = when,
                old = 3,
                new = 2)

st <- st %>%
  modify_dimension_records(updates)

The new dimension and fact tables are shown below.

when_key week_ending_date week year
1 1962-01-06 01 1962
2 1962-01-13 02 1962
3 1962-01-20 03 1962
where_key region state city
1 1 CT Bridgeport
2 9 WA Tacoma
where_key when_key pneumonia_and_influenza_deaths other_deaths nrow_agg
1 1 3 43 1
1 2 2 41 1
1 3 2 38 1
2 1 4 46 1
2 2 2 43 1
2 3 0 39 1

In addition to the operations in the examples shown, starschemar offers support for defining role playing and role dimensions in a star schema, fact constellations with conformed dimensions, dimension enrichment, incremental data refresh operations, the ability to export results in various tibble-based formats, and query operations, adapted to this context.

Functions in starschemar

Name Description
ct_mrs_test Constellation for Mortality Reporting System Test
incremental_refresh_fact Incrementally refresh a fact table with another
set_dimension_type Set the dimension type
dimensional_model dimensional_model S3 class
character_dimensions Transform dimension numeric attributes to character
character_dimension Transform a dimension numeric attributes to character
new_multistar multistar S3 class
get_star_schema Get star schema
define_dimension Define dimensions in a dimensional_model object
constellation_as_tibble_list Export a constellation as a tibble list
enrich_dimension_export Export selected attributes of a dimension
dimensional_query dimensional_query S3 class
get_dimension_name Get the dimension name
ct_mrs Constellation for Mortality Reporting System
get_measure_names Get measure names
delete_unused_foreign_keys Delete unused foreign keys
find_values Find values in a dimension
homogenize Homogenize a dimension
dereference_dimension Dereference a dimension
ft_usa_states USA States
get_dimension Get dimension
incremental_refresh_dimension Incrementally refresh a dimension with another
get_dimension_attribute_names Get dimension attribute names
enrich_dimension_import Import tibble to enrich a dimension
get_dimension_names Get dimension names
incremental_refresh_constellation Incrementally refresh a constellation with a star schema
get_role_playing_dimension_name Get the associated role-playing dimension name
ft_datagov_uk Modelling the long-term health impacts of air pollution in London
new_fact_table fact_table S3 class
get_role_dimension_names Get role dimension names associated to a role-playing dimension
mrs_cause_w_test Mortality Reporting System by Cause for Week Test
ms_mrs_test Multistar for Mortality Reporting System Test
mrs_age_test Mortality Reporting System by Age Test
define_fact Define facts in a dimensional_model object
get_name_of_role_playing_dimensions Get the name of the role playing dimensions
multistar_as_flat_table Export a multistar as a flat table
set_fact_name Set fact name
is_role_playing_dimension Is it role-playing dimension?
get_name_of_uniquely_implemented_dimensions Get name of uniquely implemented dimensions
mrs_age_w10 Mortality Reporting System by Age for Week 10
filter_dimension Filter dimension
update_facts_with_role_dimension Update facts with a role dimension
get_dimension_type Get the dimension type
update_dimension Apply dimension record update operations to a dimension
delete_records Delete records
modify_conformed_dimension_records Apply dimension record update operations to conformed dimensions
snake_case Transform names according to the snake case style
new_record_update record_update S3 class
update_dimensions Apply update operations to dimensions
role_playing_dimension Define a role playing dimension in a star_schema object
modify_dimension_records Apply dimension record update operations
ft_usa_city_county USA City and County
replace_dimension_in_facts Replace in facts a star schema dimension
replace_general_dimension_in_facts Replace in facts a star schema general dimension
union_of_dimensions Perform union of dimensions
define_selected_dimensions Define selected dimensions
set_dimension_name Set the dimension name
snake_case_dimension Transform names according to the snake case style in a dimension
ft_london_boroughs London Boroughs
ms_mrs Multistar for Mortality Reporting System
get_conformed_dimension_names Get conformed dimension names
group_table Group the records in the table
rename_measures Rename measures
update_facts_with_dimensions Update facts with a list of modified dimensions
replace_dimension Replace a star schema dimension
role_dimension Transform a dimension into a role dimension
get_conformed_dimension_names_st Get conformed dimension names
new_record_update_set record_update_set S3 class
purge_dimensions_star_schema Purge dimensions
st_mrs_cause Star Schema for Mortality Reporting System by Cause
mrs_cause Mortality Reporting System by Cause
group_records Group records
constellation constellation S3 class
get_all_dimensions Get all dimensions
dm_mrs_cause Star Definition for Mortality Reporting System by Cause
dm_mrs_age Star Definition for Mortality Reporting System by Age
is_dimension_in_updates Is dimension in set of updates?
constellation_as_multistar Export a constellation as a multistar
st_mrs_cause_test Star Schema for Mortality Reporting System by Cause Test
new_star_schema star_schema S3 class
st_mrs_age_w11 Star Schema for Mortality Reporting System by Age for Week 11
st_mrs_age Star Schema for Mortality Reporting System by Age
filter_fact_rows Filter fact rows
get_attribute_names Get attribute names
set_dimension_type_conformed Set the type of a conformed dimension
mrs_age_w11 Mortality Reporting System by Age for Week 11
prepare_join Transform a tibble to join
get_fact_name Get fact name
filter_selected_instances Filter selected instances
star_schema_as_multistar Export a star schema as a multistar
define_selected_facts Define selected facts
conform_all_dimensions Conform all dimensions of a constellation
incremental_refresh_star_schema Incrementally refresh a star schema with another
update_facts_with_general_dimension Update facts with a general dimension
snake_case_fact Transform names according to the snake case style in a fact table
is_role_dimension Is it role dimension?
mrs_cause_w10 Mortality Reporting System by Cause for Week 10
mrs_age_w_test Mortality Reporting System by Age for Week Test
st_mrs_age_w_test Star Schema for Mortality Reporting System by Age for Week Test
mrs_cause_w11 Mortality Reporting System by Cause for Week 11
enrich_dimension_import_test Import tibble to test to enrich a dimension
conform_dimensions Conform dimensions of given name
mrs Mortality Reporting System
replace_records Replace records
purge_dimensions_constellation Purge dimensions in a constellation
group_facts Group facts
record_update_set record_update_set S3 class
match_records Make a dimension record equal to another
run_query Run query
rename_dimension Rename dimension
get_conformed_dimension Get conformed dimension
get_star_schema_names Get star schema names
new_constellation constellation S3 class
select_dimension Select dimension
new_dimensional_model dimensional_model S3 class
is_conformed_dimension Is it conformed dimension?
mrs_cause_test Mortality Reporting System by Cause Test
reference_dimension Reference a dimension
update_record Update a dimension record with a set of values
mrs_age Mortality Reporting System by Age
star_schema star_schema S3 class
remove_duplicate_dimension_rows Remove duplicate dimension rows
replace_role_dimension_in_facts Replace in facts a star schema role dimension
set_dimension_type_role_playing Set the type of a role-playing dimension
set_role_playing_dimension_name Set the associated role-playing dimension name
st_mrs_age_test Star Schema for Mortality Reporting System by Age Test
new_dimensional_query dimensional_query S3 class
new_dimension_table dimension_table S3 class
rename_dimension_attributes Rename dimension attributes
rename_fact Rename fact
update_selection_general Update dimension records with a set of values in given columns
select_fact Select fact
star_schema_as_tl Export a star schema as a tibble list (common)
selection_bit_map Generate a record selection bitmap
st_mrs_age_w10 Star Schema for Mortality Reporting System by Age for Week 10
update_selection Update dimension records with a set of values
starschemar Obtaining Star Schemas from Flat Tables
unify_facts_by_grain Unify facts by grain
typed_value Transform a value according to its type
st_mrs_cause_w_test Star Schema for Mortality Reporting System by Cause for Week Test
st_mrs_cause_w10 Star Schema for Mortality Reporting System by Cause for Week 10
star_schema_as_flat_table Export a star schema as a flat table
st_mrs_cause_w11 Star Schema for Mortality Reporting System by Cause for Week 11
updates_st_mrs_age_test Updates for the Star Schema for Mortality Reporting System by Age Test
star_schema_as_mst Star schema as multistar export (common)
star_schema_as_tibble_list Export a star schema as a tibble list
updates_st_mrs_age Updates for the Star Schema for Mortality Reporting System by Age
No Results!

Vignettes of starschemar

Name
bibliography.bib
starschemar.Rmd
No Results!

Last month downloads

Details

License MIT + file LICENSE
Encoding UTF-8
Language en-GB
LazyData true
RoxygenNote 7.1.1
VignetteBuilder knitr
NeedsCompilation no
Packaged 2020-09-25 20:56:52 UTC; jsamos
Repository CRAN
Date/Publication 2020-09-25 21:30:02 UTC

Include our badge in your README

[![Rdoc](http://www.rdocumentation.org/badges/version/starschemar)](http://www.rdocumentation.org/packages/starschemar)