microaggregation: Microaggregation

Description

Function to perform various methods of microaggregation.

Usage

microaggregation(obj,variables=NULL,aggr=3,strata_variables=NULL,method="mdav",
  weights=NULL, nc = 8, clustermethod = "clara", opt = FALSE, measure = "mean",
  trim = 0, varsort = 1, transf = "log")

Arguments

obj

either an object of class sdcMicroObj or a data frame or matrix

variables

variables to microaggregate. If NULL: for an object of class sdcMicroObj, the numerical key variables are chosen by default; for data frames and matrices, all columns are chosen by default.

aggr

aggregation level (default=3)

strata_variables

by-variables for applying microaggregation only within strata defined by the variables

method

microaggregation method (default = "mdav"); one of pca, rmd, onedims, single, simple, clustpca, pppca, clustpppca, mdav, clustmcdpca, influence, mcdpca

nc

number of clusters, if the chosen method performs cluster analysis

weights

sampling weights. If obj is of class sdcMicroObj, the vector of sampling weights is chosen automatically. If weights are supplied, a weighted version of the aggregation measure is used automatically, e.g. the weighted median or weighted mean.

clustermethod

clustering method, used if the chosen method performs cluster analysis (default = "clara")

opt

experimental

measure

aggregation statistic: mean, median, trim or onestep (default = mean)

trim

trimming percentage, used if measure = "trim"

varsort

variable used for sorting, if method = "single"

transf

transformation for data x
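The effect of the measure, trim and weights arguments can be pictured on a single group of records. The following is a minimal base R sketch, not sdcMicro's internal code: the values and weights are made up, and mapping measure = "trim" to mean(x, trim = ...) is an assumption made for illustration only.

```r
# One hypothetical group of records containing an outlier (made-up values)
group <- c(10, 12, 11, 13, 95)
w     <- c(2, 2, 2, 2, 1)   # made-up sampling weights

stats <- c(
  mean     = mean(group),               # arithmetic mean, pulled up by the outlier
  median   = median(group),             # middle value, unaffected by the outlier
  trimmed  = mean(group, trim = 0.2),   # drops the smallest and largest value first
  weighted = weighted.mean(group, w)    # mean that accounts for sampling weights
)
print(stats)

# Every record in the group is replaced by the chosen statistic;
# with the mean, the group total is unchanged.
aggregated <- rep(stats[["mean"]], length(group))
```

With the mean, the outlier shifts the released value for the whole group, whereas the median and the trimmed mean stay close to the bulk of the group.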

Value

If obj is of class sdcMicroObj, the corresponding slots are filled, such as manipNumVars, risk and utility. If obj is of class data.frame or matrix, an object of class micro with the following components is returned:

mx: the aggregated data
x: the original data
method: method
aggr: aggregation level
measure: proximity measure for aggregation
fot: correction factor, necessary if totals are calculated and n divided by aggr is not an integer
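As a short sketch of how the returned components might be inspected for data frame input, the call below reuses the testdata call from the Examples section. It assumes that sdcMicro is installed and that the result behaves like a list with the components described in this section.

```r
library(sdcMicro)

data(testdata)
x  <- testdata[1:100, c("expend", "income", "savings")]
m2 <- microaggregation(x, method = "mdav", aggr = 4)

class(m2)      # class of the returned object
m2$method      # method that was applied
m2$aggr        # aggregation level that was used
head(m2$x)     # original values
head(m2$mx)    # aggregated values, released instead of the originals
```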

Details

The official definition of microaggregation can be found at http://neon.vb.cbs.nl/casc/Glossary.htm: records are grouped based on a proximity measure of variables of interest, and the same small groups of records are used in calculating aggregates for those variables. The aggregates are released instead of the individual record values.

The recommended method is rmd, which forms the proximity using multivariate distances based on robust methods. It is an extension of the well-known method mdav. However, when computational speed is important, mdav is the preferable choice.

While very different concepts can be used for the proximity measure, the aggregation itself is naturally done with the arithmetic mean. Nevertheless, other measures of location can be used for aggregation, especially when the group size for aggregation is larger than 3. Since the median seems to be unsuitable for microaggregation because it is highly robust, the other included measures can be chosen instead. If data from a complex sample survey are microaggregated, the corresponding sampling weights should be supplied so that the values are aggregated by either the weighted arithmetic mean or the weighted median.

The function also offers methods that cluster the data first, using a variety of clustering algorithms. Clustering observations before applying microaggregation might be useful. Note that the data are automatically standardised before clustering. Using the clustering method Mclust requires package mclust02, which must be loaded first. The package is not loaded automatically, since it is not under the GPL but comes with a different licence.

Some projection methods for microaggregation are also included. The robust versions pppca and clustpppca (which clusters first) are fast implementations and almost always provide the best results.

Univariate statistics are preserved best with the individual ranking method (called onedims here), but multivariate statistics are strongly affected. With method simple, microaggregation is applied directly to the (unsorted) data. This is useful as a benchmark for comparison with other methods, i.e. it answers the question of how much sorting the data before aggregation improves the result.
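The difference between individual ranking and aggregating unsorted data can be sketched in base R. The helpers aggregate_blocks, individual_ranking and simple_blocks below are hypothetical illustrations, not sdcMicro functions, and the way leftover records are assigned to the last group is an assumption of this sketch rather than the package's documented behaviour.

```r
# Hypothetical helper (not an sdcMicro function): replace each run of `aggr`
# consecutive values by the run mean; leftover values join the last group.
aggregate_blocks <- function(v, aggr = 3) {
  n <- length(v)
  k <- max(1, n %/% aggr)                       # number of groups
  g <- pmin((seq_len(n) - 1) %/% aggr + 1, k)   # group index per position
  ave(v, g, FUN = mean)
}

# Individual ranking: sort each variable on its own, aggregate,
# then put the aggregated values back in the original record order.
individual_ranking <- function(x, aggr = 3) {
  as.data.frame(lapply(x, function(v) {
    o <- order(v)
    out <- numeric(length(v))
    out[o] <- aggregate_blocks(v[o], aggr)
    out
  }))
}

# Benchmark in the spirit of method "simple": aggregate in the given row order.
simple_blocks <- function(x, aggr = 3) {
  as.data.frame(lapply(x, aggregate_blocks, aggr = aggr))
}

# Small made-up data set
x <- data.frame(a = c(5, 1, 9, 3, 7, 2, 8, 4, 6),
                b = c(10, 40, 20, 90, 30, 70, 50, 80, 60))

ir <- individual_ranking(x, aggr = 3)
sb <- simple_blocks(x, aggr = 3)

# Sum of squared deviations from the original values, per variable
loss <- function(m) colSums((m - x)^2)
loss(ir)
loss(sb)

# Column means are preserved by group-mean aggregation in both cases
colMeans(x); colMeans(ir); colMeans(sb)
```

Because each variable is sorted separately, a record's aggregated values in ir generally come from different groups for a and b, which is why relationships between variables can be distorted even though each variable on its own is well preserved.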

References

Templ, M. and Meindl, B., Robust Statistics Meets SDC: New Disclosure Risk Measures for Continuous Microdata Masking, Lecture Notes in Computer Science, Privacy in Statistical Databases, vol. 5262, pp. 113-126, 2008. http://www.springerlink.com/content/v257655u88w2/?sortorder=asc&p_o=20

Templ, M., Statistical Disclosure Control for Microdata Using the R-Package sdcMicro, Transactions on Data Privacy, vol. 1, number 2, pp. 67-85, 2008. http://www.tdp.cat/issues/abs.a004a08.php

Templ, M., New Developments in Statistical Disclosure Control and Imputation: Robust Statistics Applied to Official Statistics, Suedwestdeutscher Verlag fuer Hochschulschriften, 2009, ISBN: 3838108280, 264 pages.

Templ, M. and Meindl, B., Practical Applications in Statistical Disclosure Control Using R, Privacy and Anonymity in Information Management Systems: New Techniques for New Practical Problems, Springer, pp. 31-62, 2010, ISBN: 978-1-84996-237-7.

Examples

library(sdcMicro)
data(Tarragona)
m1 <- microaggregation(Tarragona, method="onedims", aggr=3)
## summary(m1)
data(testdata)
m2 <- microaggregation(testdata[1:100,c("expend","income","savings")],
  method="mdav", aggr=4)
summary(m2)

## for objects of class sdcMicroObj:
data(testdata2)
sdc <- createSdcObj(testdata2,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'), 
  numVars=c('expend','income','savings'), w='sampling_weight')
sdc <- microaggregation(sdc)
