createDat: Dummy Dataset for Record Swapping

Description

`createDat()` returns dummy data to illustrate targeted record swapping. The generated data contain household ids (`hid`), geographic variables (`nuts1`, `nuts2`, `nuts3`, `lau2`) as well as some other household or personal variables.

`recordSwap()` applies targeted record swapping to micro data, taking into account the identification risk of each record as well as the geographic topology.

Usage

createDat(N = 10000)
recordSwap(data, ...)
# S3 method for sdcMicroObj
recordSwap(data, ...)
# S3 method for default
recordSwap(
  data,
  hid,
  hierarchy,
  similar,
  swaprate = 0.05,
  risk = NULL,
  risk_threshold = 0,
  k_anonymity = 3,
  risk_variables = NULL,
  carry_along = NULL,
  return_swapped_id = FALSE,
  log_file_name = "TRS_logfile.txt",
  seed = NULL,
  ...
)

Value

`createDat()`: a `data.table` containing dummy data.

`recordSwap()`: a `data.table` with swapped records.

Arguments

N: integer, number of households to generate
data: must be either a micro data set in the form of a `data.table` or `data.frame`, or an `sdcMicroObj`, see createSdcObj.
...: parameters passed to `recordSwap.default()`
hid: column index or column name in `data` which refers to the household identifier.
hierarchy: column indices or column names of variables in `data` which refer to the geographic hierarchy in the micro data set. For instance county > municipality > district.
similar: vector or list of integer vectors or column names containing similarity profiles, see details for more explanations.
swaprate: double between 0 and 1 defining the proportion of households which should be swapped, see details for more explanations.
risk: either column indices or column names in `data`, or a `data.table`, `data.frame` or `matrix` indicating the risk of each record at each hierarchy level. If a `risk` matrix is supplied, the swapping procedure will not use the k-anonymity rule but the values found in this matrix. ATTENTION: This is NOT fully implemented yet and is currently ignored by the underlying C++ functions until tested properly.
risk_threshold: single numeric value indicating when a household is considered "high risk", i.e. when this household must be swapped. Only used when `risk` is not `NULL`. ATTENTION: This is NOT fully implemented yet and is currently ignored by the underlying C++ functions until tested properly.
k_anonymity: integer defining the threshold below which a household is considered high risk (counts < k) when the k-anonymity rule is used.
risk_variables: column indices or column names of variables in `data` which will be considered for estimating the risk. Only used when k-anonymity rule is applied.
carry_along: integer vector indicating additional variables to swap besides the hierarchy variables. These variables do not interfere with the procedure of finding a record to swap with or with calculating risk. This parameter is only used at the end of the procedure when swapping the hierarchies. Note that the variables used as `carry_along` should be at household level. If they are detected to be at individual level (different values within `hid`), a warning is given.
return_swapped_id: boolean, if `TRUE` the output includes an additional column showing the `hid` with which a record was swapped. The new column will have the name `paste0(hid,"_swapped")`.
log_file_name: character, path for writing a log file. The log file contains a list of household IDs (`hid`) which could not be swapped and is only created if any such households exist.
seed: integer defining the seed for the random number generator, for reproducibility. If `NULL`, a random seed will be set using `sample(1e5,1)`.

Author

Johannes Gussenbauer

Details

The procedure accepts a `data.frame` or `data.table` containing all information necessary for the record swapping, e.g. the variables referenced by the parameters `hid`, `similar`, `hierarchy`, etc. First the micro data in `data` is ordered by `hid` and the identification risk is calculated for each record at each hierarchy level. Currently only counts are used as identification risk, and the inverse of the counts is used as sampling probability. NOTE: It will be possible to supply an identification risk for each record and hierarchy level, which will be passed down to the C++ function. This is, however, not fully implemented.
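To make the phrase "the inverse of counts is used as sampling probability" concrete, here is a minimal base R sketch. The counts and the household labels `h1` to `h4` are made up for illustration, and normalising the weights is an assumption of this sketch, not a description of the package's C++ code.

```r
# Hypothetical counts: number of records sharing a household's key combination
counts <- c(h1 = 1, h2 = 4, h3 = 2, h4 = 4)

# Identification risk as the inverse of the counts: rarer means riskier
risk <- 1 / counts

# Normalise so the weights sum to 1 (sample() would also accept raw weights)
prob <- risk / sum(risk)

# Draw two households without replacement, favouring the risky ones
set.seed(42)
drawn <- sample(names(counts), size = 2, prob = prob)
```

In this toy setting `h1` is unique and therefore receives the largest sampling weight, while `h2` and `h4` share the smallest.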

With the parameter `k_anonymity` a k-anonymity rule is applied to define risky households at each hierarchy level. A household is flagged as risky if counts < `k_anonymity` at any hierarchy level, and the household then needs to be swapped across this hierarchy level. For instance, with a geographic hierarchy of NUTS1 > NUTS2 > NUTS3 the counts are calculated for each geographic variable combined with the defined `risk_variables`. If the count for a record falls below `k_anonymity` at one hierarchy level, say the county level, then this record needs to be swapped across counties. Setting `k_anonymity = 0` disables this feature and no risky households are defined.

After that the targeted record swapping is applied starting from the highest and moving to the lowest hierarchy level, cycling through all possible geographic areas at each hierarchy level, e.g. every county, every municipality in every county, etc.
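The counting rule described above can be sketched in base R as follows. This is a simplified illustration on made-up data, not the package's implementation: it assumes that the count at a given level is the number of records sharing the same geographic area (down to that level) and the same `risk_variables`, and that a household is risky as soon as one of its members is. The helper `count_in_level()` is hypothetical.

```r
# Toy micro data, one row per person (column names chosen for illustration)
toy <- data.frame(
  hid      = c(1, 1, 2, 3, 3, 4, 5, 6),
  nuts1    = c(1, 1, 1, 1, 1, 2, 2, 2),
  nuts2    = c(11, 11, 11, 12, 12, 21, 21, 21),
  ageGroup = c(1, 2, 1, 1, 1, 3, 3, 3)
)
hierarchy      <- c("nuts1", "nuts2")
risk_variables <- "ageGroup"
k_anonymity    <- 2

# Hypothetical helper: for every record, count the records that share the same
# area (hierarchy variables down to `level`) and the same risk variables
count_in_level <- function(data, level) {
  keys  <- c(hierarchy[seq_len(level)], risk_variables)
  group <- interaction(data[keys], drop = TRUE)
  as.integer(ave(seq_len(nrow(data)), group, FUN = length))
}

# One column of counts per hierarchy level
counts <- sapply(seq_along(hierarchy), function(l) count_in_level(toy, l))
colnames(counts) <- hierarchy

# A record is risky at a level if its count falls below k_anonymity
risky_record <- counts < k_anonymity

# Assumption: a household is risky at a level if any of its members is
risky_hh <- sapply(hierarchy, function(h) tapply(risky_record[, h], toy$hid, any))
risky_hh
```

Here only the second member of household 1 has a unique combination, so household 1 is the only one flagged, at both levels.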

At each geographic area a set of records to be swapped is created. At all but the lowest hierarchy level this set consists ONLY of records which do not fulfill the k-anonymity rule and have not already been swapped. Those records are swapped with records not belonging to the same geographic area which have not already been swapped beforehand. Swapping refers to the interchange of the geographic variables defined in `hierarchy`. When a record is swapped, all other records with the same `hid` are swapped as well.

At the lowest hierarchy level, the set of records to be swapped in every geographic area is made up of all records which do not fulfill the k-anonymity rule, plus as many remaining records as are needed for the proportion of swapped records in the geographic area to be consistent with the `swaprate`. If, due to the k-anonymity condition, more records than that have already been swapped in this geographic area, then only the records which do not fulfill the k-anonymity rule are swapped.
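One possible reading of this rule, written as arithmetic: the risky records are always swapped, and further records are only drawn while the area is still below its target. The helper `additional_draws()` and its arguments are hypothetical names for this sketch; the package may count these quantities differently.

```r
# Hypothetical helper: number of non-risky records still to be drawn in an area
#   target          target number of swapped records implied by the swaprate
#   already_swapped records in the area swapped at higher hierarchy levels
#   n_risky         records in the area that violate the k-anonymity rule
additional_draws <- function(target, already_swapped, n_risky) {
  pmax(0, target - already_swapped - n_risky)
}

# Area still below its target: 10 - 3 - 2 = 5 further records are drawn
additional_draws(target = 10, already_swapped = 3, n_risky = 2)

# Target already exceeded: only the risky records are swapped, no extra draws
additional_draws(target = 10, already_swapped = 9, n_risky = 4)
```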

Using the parameter `similar` one can define similarity profiles. `similar` needs to be a list of vectors, with each list entry containing column indices of `data`. These entries are used when searching for donor households, meaning that for a specific record the set of all donor records consists of the records which have the same values in `similar[[1]]`. Note, however, that these variables can only be variables related to households (not persons!). If no suitable donor can be found, the next similarity profile, `similar[[2]]`, is used, and the set of all donors then consists of all records which have the same values in the columns given in `similar[[2]]`. This procedure continues until a donor record is found or all similarity profiles have been used.
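The fallback through similarity profiles can be illustrated with a small base R sketch. The household table, the variable `htype` and the helper `find_donors()` are invented for this example. The sketch works on one row per household and ignores whether a donor has already been swapped, so it is a simplification of the procedure described above.

```r
# Toy household-level data (one row per household, values made up)
households <- data.frame(
  hid   = 1:6,
  nuts1 = c(1, 1, 2, 2, 2, 2),
  hsize = c(4, 2, 4, 3, 3, 2),
  htype = c("a", "b", "b", "a", "c", "b")
)

# Two profiles: first require equal hsize AND htype, then fall back to hsize only
similar <- list(c("hsize", "htype"), "hsize")

# Hypothetical helper: donors for household `target` from outside its own area
find_donors <- function(data, target, similar, area = "nuts1") {
  rec  <- data[data$hid == target, ]
  pool <- data[data[[area]] != rec[[area]], ]
  for (profile in similar) {
    same <- rep(TRUE, nrow(pool))
    for (v in profile) same <- same & pool[[v]] == rec[[v]]
    # Stop at the first profile that yields at least one donor
    if (any(same)) return(pool$hid[same])
  }
  integer(0)  # no profile produced a donor
}

find_donors(households, target = 1, similar)  # first profile fails, second one matches
find_donors(households, target = 2, similar)  # first profile already matches
find_donors(households, target = 4, similar)  # no donor under any profile
```

Ordering the profiles from strict to loose keeps swapped households as alike as possible and only relaxes the match when necessary.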

`swaprate` sets the proportion of households to be swapped, where a single swap counts as swapping 2 households: the sampled household and the corresponding donor. Prior to the procedure the swaprate is applied at the lowest hierarchy level to determine the target number of swapped households in each area of the lowest hierarchy level. If the target numbers are not whole numbers, they are randomly rounded up or down such that the total number of households swapped is consistent with the swaprate.
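The random rounding of the target numbers could look roughly like the following base R sketch. The area sizes are made up, and the specific scheme (rounding down everywhere and then rounding up a random subset of areas, weighted by the fractional part) is an assumption chosen for illustration. The package's C++ code may use a different scheme.

```r
set.seed(1)

# Hypothetical number of households in four areas of the lowest hierarchy level
n_households <- c(A = 130, B = 75, C = 42, D = 213)
swaprate     <- 0.05

# Fractional targets per area
target <- n_households * swaprate

# Round everything down first, then decide how many areas must be rounded up
# so that the overall total matches the rounded overall target
base <- floor(target)
n_up <- round(sum(target)) - sum(base)

# Pick the areas to round up at random, weighted by their fractional part
up <- sample(names(target), size = n_up, prob = target - base)

rounded <- base
rounded[up] <- rounded[up] + 1
rounded
```

Each area ends up within one household of its fractional target, while the sum over all areas equals the rounded overall target.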

Examples

# load packages (sdcMicro provides createDat() and recordSwap())
library(data.table)
library(sdcMicro)

# generate 10000 dummy households
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- sdcMicro::createDat(nhid)

# define parameters for swapping
k_anonymity <- 1
swaprate <- .05 # 5%
similar <- list(c("hsize"))
hier <- c("nuts1", "nuts2")
risk_variables <- c("ageGroup", "national")
hid <- "hid"

# apply record swapping
dat_s <- recordSwap(
  data = dat,
  hid = hid,
  hierarchy = hier,
  similar = similar,
  swaprate = swaprate,
  k_anonymity = k_anonymity,
  risk_variables = risk_variables,
  carry_along = NULL,
  return_swapped_id = TRUE,
  seed = seed
)

# number of swapped households
dat_s[hid != hid_swapped, uniqueN(hid)]

# hierarchies are not consistently swapped:
# nuts3 and lau2 are not part of `hierarchy`, so they are left unchanged
dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]

# use parameter carry_along to swap nuts3 and lau2 together with the hierarchy
dat_s <- recordSwap(
  data = dat,
  hid = hid,
  hierarchy = hier,
  similar = similar,
  swaprate = swaprate,
  k_anonymity = k_anonymity,
  risk_variables = risk_variables,
  carry_along = c("nuts3", "lau2"),
  return_swapped_id = TRUE,
  seed = seed
)

dat_s[hid != hid_swapped, .(nuts1, nuts2, nuts3, lau2)]
