sdc_descriptives: Disclosure control for descriptive statistics

Description

Checks the number of distinct entities and the (n, k) dominance rule for your descriptive statistics.

By default, sdc_descriptives() checks whether there are at least 5 distinct entities and whether the largest 2 entities account for 85% or more of val_var. These thresholds can be changed using options. For details see vignette("options", package = "sdcLog").
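The distinct-entities criterion can be illustrated with a minimal base R sketch. The toy data, the helper name count_distinct_ids(), and the choice to ignore rows with a missing id or a missing value are assumptions made for illustration only; this is not sdcLog's internal code. The threshold of 5 is the value named above.

```r
# Hypothetical toy data, not shipped with sdcLog
toy <- data.frame(
  id  = c("A", "A", "B", "C", "D", NA),
  val = c(10, 20, 5, 7, NA, 3)
)

# Count distinct ids among rows where both id and value are present
# (assumption: rows with a missing id or value do not count)
count_distinct_ids <- function(data, id_var, val_var) {
  keep <- !is.na(data[[id_var]]) & !is.na(data[[val_var]])
  length(unique(data[[id_var]][keep]))
}

n_ids <- count_distinct_ids(toy, "id", "val")
enough_ids <- n_ids >= 5  # 5 is the threshold named in the description

n_ids       # 3 (A, B, C)
enough_ids  # FALSE: fewer than 5 distinct entities
```

In this toy case a statistic computed on val would rest on only three entities and so would not meet the criterion.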

Usage

sdc_descriptives(
  data,
  id_var = getOption("sdc.id_var"),
  val_var = NULL,
  by = NULL,
  zero_as_NA = NULL,
  fill_id_var = FALSE
)

Arguments

data

data.frame from which the descriptive statistics are calculated.

id_var

character The name of the id variable. Defaults to getOption("sdc.id_var") so that you can provide options(sdc.id_var = "my_id_var") at the top of your script.

val_var

character vector of value variables on which descriptive statistics are computed.

by

character vector of grouping variables.

zero_as_NA

logical If TRUE, zeros in 'val_var' are treated as NA.

fill_id_var

logical Only for very specific use cases. For example:

id_var contains NA values that represent missing values in the sense that values identifying the entity actually exist but are unknown (or were deleted for privacy reasons).
id_var contains NA values which result from the fact that an observation features more than one confidential identifier and not all of these identifiers are present in each observation. Examples for such identifiers are the role of a broker in a security transaction or the role of a collateral giver in a credit relationship.

If TRUE, NA values within id_var will internally be filled with <filled_[i]>, assuming that all NA values of id_var can be treated as different small entities for statistical disclosure control purposes. Set this to TRUE only if that is a reasonable assumption.

Defaults to FALSE.
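The effect of fill_id_var = TRUE on the number of distinct entities can be sketched in base R. The helper fill_na_ids() below is hypothetical and only mimics the <filled_[i]> placeholder pattern described above; how sdcLog performs the filling internally is not shown here.

```r
# Hypothetical id vector with two missing identifiers
ids <- c("A", NA, "B", NA, "A")

# Replace each NA with its own placeholder so that every missing id
# is treated as a separate entity (assumed numbering: order of appearance)
fill_na_ids <- function(ids) {
  na_pos <- which(is.na(ids))
  ids[na_pos] <- paste0("<filled_", seq_along(na_pos), ">")
  ids
}

filled <- fill_na_ids(ids)
filled                  # "A" "<filled_1>" "B" "<filled_2>" "A"
length(unique(filled))  # 4 distinct entities after filling
```

Without filling, only the two known ids (A and B) are available to count, which is why the option should be used only when treating each NA as a separate small entity is defensible.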

Value

A list of class sdc_descriptives with detailed information about options, settings, and compliance with the criteria distinct entities and dominance.

Details

The general form of the (n, k) dominance rule can be formulated as:

$$\sum_{i=1}^{n} x_i > \frac{k}{100} \sum_{i=1}^{N} x_i$$

where $x_1 \geq x_2 \geq \cdots \geq x_N$. n denotes the number of largest contributions to be considered, x_n the n-th largest contribution, k the maximal percentage these n contributions may account for, and N is the total number of observations.

If the statement above is true, the (n, k) dominance rule is violated.
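As a worked illustration, the inequality can be evaluated directly in base R. The function name dominance_violated() and the example vectors are hypothetical, and the sketch assumes non-negative contributions that are already aggregated to one value per contributor; it follows the formula as written above rather than sdcLog's internal implementation.

```r
# Evaluate the (n, k) dominance rule for a vector of contributions x
# (assumption: x is non-negative, one value per contributor)
dominance_violated <- function(x, n = 2, k = 85) {
  x <- sort(x[!is.na(x)], decreasing = TRUE)  # x_1 >= x_2 >= ... >= x_N
  top_n <- sum(head(x, n))                    # sum of the n largest
  top_n > k / 100 * sum(x)                    # TRUE means the rule is violated
}

concentrated <- c(60, 30, 5, 3, 2)  # top 2 hold 90 of 100
balanced     <- c(40, 30, 20, 10)   # top 2 hold 70 of 100

dominance_violated(concentrated)  # TRUE:  90 > 85
dominance_violated(balanced)      # FALSE: 70 > 85 does not hold
```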

Examples


# NOT RUN {
sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1"
)

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1",
  by = "sector"
)

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1",
  by = c("sector", "year")
)

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year")
)

sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year"),
  zero_as_NA = FALSE
)

# }
