Small count rounding of the necessary inner cells is performed so that all small frequencies of the cross-classifications to be published (publishable cells) are rounded. The publishable cells can be defined from a model formula, from hierarchies, or automatically from data.
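To make the goal concrete, here is a minimal base R sketch on a hypothetical 2 x 2 table. The rounded values are hand-picked for illustration and are not output from PLSrounding. The sketch only shows what is being aimed for: after the small inner cells are rounded, neither the inner cells nor the aggregated publishable cells contain a frequency between 1 and roundBase - 1.

```r
roundBase <- 3

# Hypothetical inner cells (all names and numbers are made up for illustration)
inner <- data.frame(
  geo      = c("A", "A", "B", "B"),
  year     = c("2018", "2019", "2018", "2019"),
  original = c(1, 2, 4, 6),
  stringsAsFactors = FALSE
)

# Hand-picked rounding of the two small cells (1 -> 0, 2 -> 3);
# the larger cells are left unchanged
inner$rounded <- c(0, 3, 4, 6)

# Publishable cells here: the inner cells, both margins and the grand total
by_geo  <- aggregate(cbind(original, rounded) ~ geo,  data = inner, FUN = sum)
by_year <- aggregate(cbind(original, rounded) ~ year, data = inner, FUN = sum)
total   <- colSums(inner[, c("original", "rounded")])

# Collect every rounded publishable frequency and check for small values
all_rounded <- c(inner$rounded, by_geo$rounded, by_year$rounded, total[["rounded"]])
any(all_rounded %in% seq_len(roundBase - 1))
```

In this toy case the grand total is unchanged by the rounding, while the year margins move by one. How the package chooses which cells to round up or down is described in the reference listed further down.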
PLSrounding(
  data,
  freqVar = NULL,
  roundBase = 3,
  hierarchies = NULL,
  formula = NULL,
  dimVar = NULL,
  maxRound = roundBase - 1,
  printInc = nrow(data) > 1000,
  output = NULL,
  extend0 = FALSE,
  preAggregate = is.null(freqVar),
  aggregatePackage = "base",
  aggregateNA = TRUE,
  aggregateBaseOrder = FALSE,
  rowGroupsPackage = aggregatePackage,
  ...
)

PLSroundingInner(..., output = "inner")

PLSroundingPublish(..., output = "publish")
Output is a four-element list with class attribute "PLSrounded", which ensures informative printing and enables the use of FormulaSelection on this object.
inner: Data frame corresponding to input data with the main dimensional variables and with cell frequencies (original, rounded, difference).
publish: Data frame of publishable data with the main dimensional variables and with cell frequencies (original, rounded, difference).
metrics: A named character vector of various statistics calculated from the two output data frames ("inner_" used to distinguish). See examples below and the function HDutility.
freqTable: Matrix of frequencies of cell frequencies and absolute differences. For example, row "rounded" and column "inn.4+" is the number of rounded inner cell frequencies greater than or equal to 4.
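The phrase "frequencies of cell frequencies" can be illustrated with a small base R sketch. The vectors and the helper count_bins below are hypothetical, and the column labels are simplified, so this is not the exact layout of the matrix returned by the function. It only shows how a count such as "number of rounded cells with frequency 4 or more" arises.

```r
# Hypothetical original and rounded inner cell frequencies
original <- c(1, 2, 4, 6, 0, 5)
rounded  <- c(0, 3, 4, 6, 0, 5)

# Hypothetical helper: how many cells have frequency 0, 1, 2, 3, or 4 and above
count_bins <- function(x) {
  c("0"  = sum(x == 0),
    "1"  = sum(x == 1),
    "2"  = sum(x == 2),
    "3"  = sum(x == 3),
    "4+" = sum(x >= 4))
}

# One row per frequency type, one column per bin
tab <- rbind(original = count_bins(original), rounded = count_bins(rounded))
print(tab)

# Number of rounded cells with frequency greater than or equal to 4
tab["rounded", "4+"]
```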
data: Input data (inner cells), typically a data frame, tibble, or data.table. If data is not a classic data frame, it will be coerced to one internally unless preAggregate is TRUE and aggregatePackage is "data.table".
freqVar: Variable holding counts (inner cell frequencies). When NULL (default), microdata is assumed.
roundBase: Rounding base.
hierarchies: List of hierarchies.
formula: Model formula defining publishable cells.
dimVar: The main dimensional variables and additional aggregating variables. This parameter can be useful when hierarchies and formula are unspecified.
maxRound: Inner cells contributing to original publishable cells equal to or less than maxRound will be rounded.
printInc: Printing iteration information to console when TRUE.
output: Possible non-NULL values are "input", "inner" and "publish". Then a single data frame is returned.
extend0: When extend0 is set to TRUE, the data is automatically extended. This is relevant when zeroCandidates = TRUE (see RoundViaDummy). Additionally, extend0 can be specified as a list, representing the varGroups parameter in the Extend0 function. Can also be set to "all", which means that input codes in hierarchies are considered in addition to those in data.
preAggregate: When TRUE, the data will be aggregated beforehand within the function by the dimensional variables.
aggregatePackage: Package used for pre-aggregation. Parameter pkg to aggregate_by_pkg.
aggregateNA: Whether to include NAs in the grouping variables when pre-aggregating. Parameter include_na to aggregate_by_pkg.
aggregateBaseOrder: Parameter base_order to aggregate_by_pkg, used when pre-aggregating. The default is set to FALSE to avoid unnecessary sorting operations. When TRUE, an attempt is made to return the same result with data.table as with base R. This cannot be guaranteed due to potential variations in sorting behavior across different systems.
rowGroupsPackage: Parameter pkg to RowGroups. The parameter is input to Formula2ModelMatrix via ModelMatrix.
...: Further parameters sent to RoundViaDummy.
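Two of the data-preparation ideas mentioned above, pre-aggregating microdata and extending the data with zero-frequency combinations, can be sketched in base R. This is only an illustration under simplifying assumptions: the data frame micro is hypothetical, and the code does not reproduce how aggregate_by_pkg or Extend0 are implemented.

```r
# Hypothetical microdata: one row per unit, no frequency variable
micro <- data.frame(
  geo  = c("Iceland", "Iceland", "Spain", "Portugal", "Spain"),
  year = c("2018", "2018", "2018", "2019", "2018"),
  stringsAsFactors = FALSE
)

# Pre-aggregation: count the rows for each combination of the dimensional variables
micro$freq <- 1L
agg <- aggregate(freq ~ geo + year, data = micro, FUN = sum)

# Extension with zeros: add every geo-by-year combination that is absent from the data
full <- expand.grid(
  geo  = unique(agg$geo),
  year = unique(agg$year),
  stringsAsFactors = FALSE
)
ext <- merge(full, agg, by = c("geo", "year"), all.x = TRUE)
ext$freq[is.na(ext$freq)] <- 0L

print(agg)  # only the combinations observed in the microdata
print(ext)  # all combinations, with freq = 0 where nothing was observed
```

Rows with zero frequency matter only when zeros are allowed to be rounded up (zeroCandidates = TRUE); otherwise the extension adds nothing that can change.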
This function is a user-friendly wrapper for RoundViaDummy with data frame output and with computed summary of the results. See RoundViaDummy for more details.
Langsrud, Ø. and Heldal, J. (2018): “An Algorithm for Small Count Rounding of Tabular Data”. Presented at: Privacy in statistical databases, Valencia, Spain. September 26-28, 2018. https://www.researchgate.net/publication/327768398_An_Algorithm_for_Small_Count_Rounding_of_Tabular_Data
RoundViaDummy, PLS2way, ModelMatrix
# Small example data set
z <- SmallCountData("e6")
print(z)
# Publishable cells by formula interface
a <- PLSrounding(z, "freq", roundBase = 5, formula = ~geo + eu + year)
print(a)
print(a$inner)
print(a$publish)
print(a$metrics)
print(a$freqTable)
# Using FormulaSelection()
FormulaSelection(a$publish, ~eu + year)
FormulaSelection(a, ~eu + year) # same as above
FormulaSelection(a) # just a$publish
# Recalculation of maxdiff, HDutility, meanAbsDiff and rootMeanSquare
max(abs(a$publish[, "difference"]))
HDutility(a$publish[, "original"], a$publish[, "rounded"])
mean(abs(a$publish[, "difference"]))
sqrt(mean((a$publish[, "difference"])^2))
# Five lines below produce equivalent results
# Ordering of rows can be different
PLSrounding(z, "freq", dimVar = c("geo", "eu", "year"))
PLSrounding(z, "freq", formula = ~eu * year + geo * year)
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eHrc"))
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eDimList"))
PLSrounding(z[, -2], "freq", hierarchies = SmallCountData("eDimList"), formula = ~geo * year)
# Define publishable cells differently by making use of formula interface
PLSrounding(z, "freq", formula = ~eu * year + geo)
# Define publishable cells differently by making use of hierarchy interface
eHrc2 <- list(geo = c("EU", "@Portugal", "@Spain", "Iceland"), year = c("2018", "2019"))
PLSrounding(z, "freq", hierarchies = eHrc2)
# Also possible to combine hierarchies and formula
PLSrounding(z, "freq", hierarchies = SmallCountData("eDimList"), formula = ~geo + year)
# Single data frame output
PLSroundingInner(z, "freq", roundBase = 5, formula = ~geo + eu + year)
PLSroundingPublish(z, roundBase = 5, formula = ~geo + eu + year)
# Microdata input
PLSroundingInner(rbind(z, z), roundBase = 5, formula = ~geo + eu + year)
# Zero perturbed due to both extend0 = TRUE and zeroCandidates = TRUE
set.seed(12345)
PLSroundingInner(z[sample.int(5, 12, replace = TRUE), 1:3],
                 formula = ~geo + eu + year, roundBase = 5,
                 extend0 = TRUE, zeroCandidates = TRUE, printInc = TRUE)
# Parameter avoidHierarchical (see RoundViaDummy and ModelMatrix)
PLSroundingPublish(z, roundBase = 5, formula = ~geo + eu + year, avoidHierarchical = TRUE)
# To illustrate hierarchical_extend0
# (parameter to underlying function, SSBtools::Extend0fromModelMatrixInput)
PLSroundingInner(z[-c(2:3), ], roundBase = 5, formula = ~geo + eu + year,
                 avoidHierarchical = TRUE, zeroCandidates = TRUE, extend0 = TRUE)
PLSroundingInner(z[-c(2:3), ], roundBase = 5, formula = ~geo + eu + year,
                 avoidHierarchical = TRUE, zeroCandidates = TRUE, extend0 = TRUE,
                 hierarchical_extend0 = TRUE)
# Package sdcHierarchies can be used to create hierarchies.
# The small example code below works if this package is available.
if (require(sdcHierarchies)) {
  z2 <- cbind(geo = c("11", "21", "22"), z[, 3:4], stringsAsFactors = FALSE)
  h2 <- list(
    geo = hier_compute(inp = unique(z2$geo), dim_spec = c(1, 1), root = "Tot", as = "df"),
    year = hier_convert(hier_create(root = "Total", nodes = c("2018", "2019")), as = "df"))
  PLSrounding(z2, "freq", hierarchies = h2)
}
# Use PLS2way to produce tables as in Langsrud and Heldal (2018) and to demonstrate
# parameters maxRound, zeroCandidates and identifyNew (see RoundViaDummy).
# Parameter rndSeed used to ensure same output as in reference.
exPSD <- SmallCountData("exPSD")
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, rndSeed=124)
PLS2way(a, "original") # Table 1
PLS2way(a) # Table 2
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, identifyNew = FALSE, rndSeed=124)
PLS2way(a) # Table 3
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, maxRound = 7)
PLS2way(a) # Values in col1 rounded
a <- PLSrounding(exPSD, "freq", 5, formula = ~rows + cols, zeroCandidates = TRUE)
PLS2way(a) # (row3, col4): original is 0 and rounded is 5
# Using formula followed by FormulaSelection
output <- PLSrounding(data = SmallCountData("example1"),
                      formula = ~age * geo * year + eu * year,
                      freqVar = "freq",
                      roundBase = 5)
FormulaSelection(output, ~(age + eu) * year)
# Example similar to the one in the documentation of tables_by_formulas,
# but using PLSroundingPublish with roundBase = 4.
tables_by_formulas(SSBtoolsData("magnitude1"),
                   table_fun = PLSroundingPublish,
                   table_formulas = list(table_1 = ~region * sector2,
                                         table_2 = ~region1:sector4 - 1,
                                         table_3 = ~region + sector4 - 1),
                   substitute_vars = list(region = c("geo", "eu"), region1 = "eu"),
                   collapse_vars = list(sector = c("sector2", "sector4")),
                   roundBase = 4)