PLSrounding: PLS inspired rounding

Description

Small count rounding of the necessary inner cells is performed so that all small frequencies of cross-classifications to be published (publishable cells) are rounded. The publishable cells can be defined from a model formula, from hierarchies, or automatically from data.

Usage

PLSrounding(
  data,
  freqVar = NULL,
  roundBase = 3,
  hierarchies = NULL,
  formula = NULL,
  dimVar = NULL,
  maxRound = roundBase - 1,
  printInc = nrow(data) > 1000,
  output = NULL,
  extend0 = FALSE,
  preAggregate = is.null(freqVar),
  aggregatePackage = "base",
  aggregateNA = TRUE,
  aggregateBaseOrder = FALSE,
  rowGroupsPackage = aggregatePackage,
  ...
)
PLSroundingInner(..., output = "inner")
PLSroundingPublish(..., output = "publish")

Value

When output is NULL (the default), the result is a four-element list with class attribute "PLSrounded", which ensures informative printing and enables the use of FormulaSelection on this object. The list elements are:

inner: Data frame corresponding to input data with the main dimensional variables and with cell frequencies (original, rounded, difference).
publish: Data frame of publishable data with the main dimensional variables and with cell frequencies (original, rounded, difference).
metrics: A named character vector of various statistics calculated from the two output data frames ("inner_" used to distinguish). See examples below and the function HDutility.
freqTable: Matrix of frequencies of cell frequencies and absolute differences. For example, row "rounded" and column "inn.4+" is the number of rounded inner cell frequencies greater than or equal to 4.
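
The summary quantities can be reproduced by hand from the original, rounded and difference columns. The following is a minimal base R sketch on a hypothetical data frame `inner` (made-up values, not output from PLSrounding), assuming that difference is defined as rounded minus original. It recomputes three of the metrics and one count of the kind stored in freqTable.

```r
# Hypothetical inner-cell frequencies (illustrative values only)
inner <- data.frame(original = c(1, 2, 4, 7, 0, 3),
                    rounded  = c(0, 5, 5, 7, 0, 0))
# Assumption: difference is defined as rounded minus original
inner$difference <- inner$rounded - inner$original

max_diff      <- max(abs(inner$difference))      # largest absolute change
mean_abs_diff <- mean(abs(inner$difference))     # average absolute change
root_mean_sq  <- sqrt(mean(inner$difference^2))  # root mean square change

# Count comparable to row "rounded", column "inn.4+" of freqTable:
# number of rounded inner cell frequencies greater than or equal to 4
n_rounded_4plus <- sum(inner$rounded >= 4)
```

The same expressions applied to a$publish or a$inner from a real PLSrounding result should correspond to the matching entries of metrics, as the Examples section shows for the publishable cells.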

Arguments

data: Input data (inner cells), typically a data frame, tibble, or data.table. If data is not a classic data frame, it will be coerced to one internally unless preAggregate is TRUE and aggregatePackage is "data.table".
freqVar: Variable holding counts (inner cells frequencies). When NULL (default), microdata is assumed.
roundBase: Rounding base
hierarchies: List of hierarchies
formula: Model formula defining publishable cells
dimVar: The main dimensional variables and additional aggregating variables. This parameter can be useful when hierarchies and formula are unspecified.
maxRound: Inner cells contributing to original publishable cells equal to or less than maxRound will be rounded
printInc: Printing iteration information to console when TRUE
output: When non-NULL, a single data frame is returned instead of the list. Possible non-NULL values are "input", "inner" and "publish".
extend0: When extend0 is set to TRUE, the data is automatically extended with zero-frequency rows for combinations not present in the input. This is relevant when zeroCandidates = TRUE (see RoundViaDummy). Additionally, extend0 can be specified as a list, representing the varGroups parameter in the Extend0 function. It can also be set to "all", which means that input codes in hierarchies are considered in addition to those in data.
preAggregate: When TRUE, the data will be aggregated beforehand within the function by the dimensional variables.
aggregatePackage: Package used for pre-aggregation. Parameter pkg to aggregate_by_pkg.
aggregateNA: Whether to include NAs in the grouping variables during pre-aggregation. Parameter include_na to aggregate_by_pkg.
aggregateBaseOrder: Parameter base_order to aggregate_by_pkg, used when preAggregate is TRUE. The default is FALSE to avoid unnecessary sorting operations. When TRUE, an attempt is made to return the same result with data.table as with base R. This cannot be guaranteed due to potential variations in sorting behavior across different systems.
rowGroupsPackage: Parameter pkg to RowGroups. The parameter is input to Formula2ModelMatrix via ModelMatrix.
...: Further parameters sent to RoundViaDummy
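
The idea behind preAggregate with microdata input (freqVar = NULL) can be sketched in base R. This is only an illustration on a hypothetical data frame `micro`; the function itself delegates the work to aggregate_by_pkg, whose internals are not reproduced here.

```r
# Hypothetical microdata: one row per counted unit (illustrative values only)
micro <- data.frame(geo  = c("Iceland", "Spain", "Spain", "Portugal", "Spain"),
                    year = c("2018", "2018", "2018", "2019", "2018"),
                    stringsAsFactors = FALSE)

# Give every unit a count of 1, then sum the counts within each
# combination of the dimensional variables
micro$freq <- 1L
counts <- aggregate(freq ~ geo + year, data = micro, FUN = sum)
```

The resulting `counts` has one row per observed combination of geo and year, which is the kind of frequency data that would otherwise be supplied together with freqVar.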

Details

This function is a user-friendly wrapper for RoundViaDummy with data frame output and with computed summary of the results. See RoundViaDummy for more details.
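
The relation between the list output and the single data frame variants in the Usage section can be illustrated as follows. This is a hedged sketch: it assumes the functions come from the SmallCountRounding package and only runs when that package is installed.

```r
# Runs only if the SmallCountRounding package is installed (assumed package name)
inner_df <- NULL
if (requireNamespace("SmallCountRounding", quietly = TRUE)) {
  library(SmallCountRounding)
  z <- SmallCountData("e6")
  # Default: list with elements inner, publish, metrics and freqTable
  full <- PLSrounding(z, "freq", roundBase = 5, formula = ~geo + eu + year)
  # Single data frame, selected through the output argument
  inner_df <- PLSrounding(z, "freq", roundBase = 5, formula = ~geo + eu + year,
                          output = "inner")
  # The wrapper PLSroundingInner sets output = "inner" by default
  inner_df2 <- PLSroundingInner(z, "freq", roundBase = 5,
                                formula = ~geo + eu + year)
}
```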

References

Langsrud, Ø. and Heldal, J. (2018): “An Algorithm for Small Count Rounding of Tabular Data”. Presented at: Privacy in statistical databases, Valencia, Spain. September 26-28, 2018. https://www.researchgate.net/publication/327768398_An_Algorithm_for_Small_Count_Rounding_of_Tabular_Data

Examples

# Small example data set
z <- SmallCountData("e6")
print(z)

# Publishable cells by formula interface
a <- PLSrounding(z, "freq", roundBase = 5,  formula = ~geo + eu + year)
print(a)
print(a$inner)
print(a$publish)
print(a$metrics)
print(a$freqTable)

# Using FormulaSelection()
FormulaSelection(a$publish, ~eu + year)
FormulaSelection(a, ~eu + year) # same as above
FormulaSelection(a)             # just a$publish

# Recalculation of maxdiff, HDutility, meanAbsDiff and rootMeanSquare
max(abs(a$publish[, "difference"]))
HDutility(a$publish[, "original"], a$publish[, "rounded"])
mean(abs(a$publish[, "difference"]))
sqrt(mean((a$publish[, "difference"])^2))

# The five lines below produce equivalent results
# (the ordering of rows can differ)
PLSrounding(z, "freq", dimVar = c("geo", "eu", "year"))
PLSrounding(z, "freq", formula = ~eu * year + geo * year)
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eHrc"))
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eDimList"))
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eDimList"), formula = ~geo * year)

# Define publishable cells differently by making use of formula interface
PLSrounding(z, "freq", formula = ~eu * year + geo)

# Define publishable cells differently by making use of hierarchy interface
eHrc2 <- list(geo = c("EU", "@Portugal", "@Spain", "Iceland"), year = c("2018", "2019"))
PLSrounding(z, "freq", hierarchies = eHrc2)

# Also possible to combine hierarchies and formula
PLSrounding(z, "freq", hierarchies = SmallCountData("eDimList"), formula = ~geo + year)

# Single data frame output
PLSroundingInner(z, "freq", roundBase = 5, formula = ~geo + eu + year)
PLSroundingPublish(z, "freq", roundBase = 5, formula = ~geo + eu + year)

# Microdata input
PLSroundingInner(rbind(z, z), roundBase = 5, formula = ~geo + eu + year)

# Zeros are perturbed because both extend0 = TRUE and zeroCandidates = TRUE
set.seed(12345)
PLSroundingInner(z[sample.int(5, 12, replace = TRUE), 1:3], 
                 formula = ~geo + eu + year, roundBase = 5, 
                 extend0 = TRUE, zeroCandidates = TRUE, printInc = TRUE)

# Parameter avoidHierarchical (see RoundViaDummy and ModelMatrix) 
PLSroundingPublish(z, roundBase = 5, formula = ~geo + eu + year, avoidHierarchical = TRUE)


# To illustrate hierarchical_extend0 
#    (parameter to underlying function, SSBtools::Extend0fromModelMatrixInput)
PLSroundingInner(z[-c(2:3), ], roundBase = 5, formula = ~geo + eu + year, 
   avoidHierarchical = TRUE, zeroCandidates = TRUE, extend0 = TRUE)
PLSroundingInner(z[-c(2:3), ], roundBase = 5, formula = ~geo + eu + year, 
   avoidHierarchical = TRUE, zeroCandidates = TRUE, extend0 = TRUE, 
   hierarchical_extend0 = TRUE)

# Package sdcHierarchies can be used to create hierarchies. 
# The small example code below works if this package is available. 
if (require(sdcHierarchies)) {
  z2 <- cbind(geo = c("11", "21", "22"), z[, 3:4], stringsAsFactors = FALSE)
  h2 <- list(
    geo = hier_compute(inp = unique(z2$geo), dim_spec = c(1, 1), root = "Tot", as = "df"),
    year = hier_convert(hier_create(root = "Total", nodes = c("2018", "2019")), as = "df"))
  PLSrounding(z2, "freq", hierarchies = h2)
}

# Use PLS2way to produce tables as in Langsrud and Heldal (2018) and to demonstrate 
# parameters maxRound, zeroCandidates and identifyNew (see RoundViaDummy).   
# Parameter rndSeed used to ensure same output as in reference.
exPSD <- SmallCountData("exPSD")
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, rndSeed=124)
PLS2way(a, "original")  # Table 1
PLS2way(a)  # Table 2
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, identifyNew = FALSE, rndSeed=124)
PLS2way(a)  # Table 3
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, maxRound = 7)
PLS2way(a)  # Values in col1 rounded
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, zeroCandidates = TRUE)
PLS2way(a)  # (row3, col4): original is 0 and rounded is 5

# Using formula followed by FormulaSelection 
output <- PLSrounding(data = SmallCountData("example1"), 
                      formula = ~age * geo * year + eu * year, 
                      freqVar = "freq", 
                      roundBase = 5)
FormulaSelection(output, ~(age + eu) * year)

# Example similar to the one in the documentation of tables_by_formulas,
# but using PLSroundingPublish with roundBase = 4.
tables_by_formulas(SSBtoolsData("magnitude1"),
                   table_fun = PLSroundingPublish, 
                   table_formulas = list(table_1 = ~region * sector2, 
                                         table_2 = ~region1:sector4 - 1, 
                                         table_3 = ~region + sector4 - 1), 
                   substitute_vars = list(region = c("geo", "eu"), region1 = "eu"), 
                   collapse_vars = list(sector = c("sector2", "sector4")), 
                   roundBase = 4)