semantic_enrichment: Semantic enrichment

Description

Enriches a dataset with additional (meta-)variables derived from the semantic commonalities between variables (columns).

Usage

semantic_enrichment(
  data,
  ontology,
  mapping_file,
  mode = "in",
  root,
  label_attr = "name",
  ...
)

Value

Semantically enriched dataset

Arguments

data

Required. Numeric data frame or matrix containing variables present in the mapping file.

ontology

Required. One of:

Edge table in data frame format
Graph containing the chosen ontology - must be in tidygraph format or coercible to this format.
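
As a rough illustration of the edge-table form, the sketch below builds a small ontology as a two-column data frame. The column names `from` and `to`, the node names, and the parent-to-child edge direction are all assumptions made for this example; they are not taken from the package's example data.

```r
# Hypothetical ontology: each row is one edge between two entities.
# Column names (`from`, `to`), node names, and edge direction are
# illustrative assumptions only.
edge_tbl <- data.frame(
  from = c("root", "root", "clinical", "clinical", "genetic"),
  to   = c("clinical", "genetic", "tumour_size", "tumour_stage", "snp"),
  stringsAsFactors = FALSE
)

# Every entity in the ontology appears in at least one of the two columns.
nodes <- unique(c(edge_tbl$from, edge_tbl$to))
print(nodes)
```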

mapping_file

Required. Path to a CSV file, or a data frame, containing mapping information. It should contain exactly two columns: the first holds column names present in `data`; the second holds the names of the corresponding entities in the ontology object.
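
A minimal sketch of a mapping in data frame form, together with a check that both columns refer to things that exist. The column names `variable` and `onto_entity`, and the toy data and node names, are hypothetical and used only for illustration.

```r
# Toy data with two numeric variables (hypothetical names).
data <- data.frame(tumour_size = c(12, 30), tumour_stage = c(1, 3))

# Column 1: names of columns in `data`.
# Column 2: names of entities in the ontology.
mapping <- data.frame(
  variable    = c("tumour_size", "tumour_stage"),
  onto_entity = c("tumour_size", "tumour_stage"),
  stringsAsFactors = FALSE
)

# Hypothetical set of node names present in the ontology.
ontology_nodes <- c("root", "clinical", "tumour_size", "tumour_stage")

# Both checks should be TRUE for a usable mapping.
vars_ok <- all(mapping[[1]] %in% colnames(data))
ents_ok <- all(mapping[[2]] %in% ontology_nodes)
print(c(vars_ok = vars_ok, ents_ok = ents_ok))
```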

mode

Character constant specifying the directionality of the edges. One of: "in" or "out". Default: "in"

root

Required. Identifier of the root node (as it appears in column 1) from which node depth is calculated.
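
To make "node depth" concrete, the sketch below counts the number of edges between a root and every reachable node using a level-by-level traversal in base R. It is not the package's implementation; the helper `node_depth()`, the `from`/`to` column names, and the parent-to-child edge direction are assumptions for this example.

```r
# Hypothetical edge table with edges pointing from parent to child.
edges <- data.frame(
  from = c("root", "root", "clinical", "clinical"),
  to   = c("clinical", "genetic", "tumour_size", "tumour_stage"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of eHDPrep): breadth-first depth from `root`.
node_depth <- function(edges, root) {
  depth <- setNames(0L, root)   # the root itself has depth 0
  frontier <- root
  level <- 0L
  while (length(frontier) > 0) {
    level <- level + 1L
    # nodes reached by one more edge, excluding those already visited
    nxt <- unique(edges$to[edges$from %in% frontier])
    nxt <- setdiff(nxt, names(depth))
    depth[nxt] <- level
    frontier <- nxt
  }
  depth
}

depth <- node_depth(edges, root = "root")
print(depth)
```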

label_attr

Node attribute containing labels used for column names when creating metavariable aggregations. Default: "name"

...

Additional arguments to pass to read_csv when reading `mapping_file`.

Details

Semantic enrichment generates meta-variables from the aggregation of data variables (columns) via their most informative common ancestor. Meta-variables are labelled using the syntax: MV_[label_attr]_[Aggregation function]. The data variables are aggregated row-wise by their maximum, minimum, mean, sum, and product. Meta-variables with zero entropy (no information) are not appended to the data. See the "Semantic Enrichment" section in the vignette of 'eHDPrep' for more information: vignette("Introduction_to_eHDPrep", package = "eHDPrep")
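
The aggregation and filtering described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the ancestor label `clinical`, the suffix spellings (`max`, `min`, `mean`, `sum`, `prod`), and the helper `shannon_entropy()` are hypothetical.

```r
# Two variables assumed to share a common ancestor labelled "clinical".
vars <- data.frame(a = c(1, 0, 1, 0), b = c(1, 1, 1, 1))
label <- "clinical"  # hypothetical value of the label attribute

# Row-wise aggregation by maximum, minimum, mean, sum, and product.
agg_funs <- list(max = max, min = min, mean = mean, sum = sum, prod = prod)
mv <- as.data.frame(lapply(agg_funs, function(f) apply(vars, 1, f)))
names(mv) <- paste("MV", label, names(agg_funs), sep = "_")

# Illustrative helper: Shannon entropy (in bits) of a vector's values.
shannon_entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# Drop meta-variables that are constant and so carry no information.
keep <- vapply(mv, function(col) shannon_entropy(col) > 0, logical(1))
mv <- mv[, keep, drop = FALSE]
print(names(mv))
```

In this toy input the row-wise maximum is 1 in every row, so that meta-variable has zero entropy and is the only one dropped.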

Examples

require(magrittr)
require(dplyr)
data(example_ontology)
data(example_mapping_file)
data(example_data)

# define datatypes
tibble::tribble(~"var", ~"datatype",
"patient_id", "id",
"tumoursize", "numeric",
"t_stage", "ordinal_tstage",
"n_stage", "ordinal_nstage",
"diabetes_merged", "character",
"hypertension", "factor",
"rural_urban", "factor",
"marital_status", "factor",
"SNP_a", "genotype",
"SNP_b", "genotype",
"free_text", "freetext") -> data_types

# create post-QC data
example_data %>%
  merge_cols(diabetes_type, diabetes, "diabetes_merged", rm_in_vars = TRUE) %>%
  apply_quality_ctrl(patient_id, data_types,
                     bin_cats = c("No" = "Yes", "rural" = "urban"),
                     to_numeric_matrix = TRUE) %>%
  suppressMessages() ->
  post_qc_data

# minimal example on the first four columns of example data:
semantic_enrichment(post_qc_data[1:10,1:4],
                    dplyr::slice(example_ontology, 1:7,24),
                    example_mapping_file[1:3,], root = "root") -> res
# see Note section of documentation for information on possible warnings.

# summary of result:
tibble::glimpse(res)

# \donttest{
# full example:
res <- semantic_enrichment(post_qc_data, example_ontology,
                           example_mapping_file, root = "root")
# see Note section of documentation for information on possible warnings.
# }