CDFtest: Comprehensively evaluate and visualize the utility of CDF-generating implementations.

Description

The suite is a system for evaluating the utility of differentially private cumulative distribution function (DP-CDF) algorithm implementations. It can empirically evaluate and visualize several DP-CDF algorithms simultaneously under various parameters, or it can be set to focus strictly on data collection rather than spending time on visualization. It comes with several pre-loaded, adjustable synthetic datasets and can also analyze functions on user-defined datasets. DP-CDF implementations to be tested must take the following as arguments: data, epsilon, granularity, range, and any number of other inputs. Use "?functionH" for an example of an implementation drawing on C++ files through Rcpp. USERS SHOULD NOTE: the included diagnostic functions SkewDiffpdf, KurtDiffpdf, and StdDiffpdf are still under development. They measure the error in the skewness, kurtosis, and standard deviation derived from DP-CDFs, and they occasionally return NA.
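
As a rough illustration of the required interface, a minimal sketch of a testable implementation might look like the following. It is not one of the package's algorithms: naive_dp_cdf is a hypothetical name, the argument names simply mirror the list above, the Laplace noise scale assumes neighbouring datasets differ by adding or removing one record, and the return format (one CDF value per domain bin) is an assumption.

```r
# Hypothetical example; not part of the package.
naive_dp_cdf <- function(data, epsilon, granularity, range, ...) {
  # Domain bins are built from the public range and granularity only.
  bins <- seq(range[1], range[2], by = granularity)
  # Clamp records to the declared range so every record falls in one bin.
  clamped <- pmin(pmax(data, range[1]), range[2])
  counts <- tabulate(findInterval(clamped, bins), nbins = length(bins))
  # Laplace(0, 1/epsilon) noise, drawn as the difference of two exponentials.
  noise <- rexp(length(bins), rate = epsilon) - rexp(length(bins), rate = epsilon)
  noisy <- counts + noise
  # Normalised running sum; it may be non-monotone because noise can be negative.
  cumsum(noisy) / sum(noisy)
}

set.seed(1)
est <- naive_dp_cdf(data = round(runif(1000, 0, 100)), epsilon = 1,
                    granularity = 1, range = c(0, 100))
length(est)  # one value per bin in seq(0, 100, by = 1)
```

Under those assumptions, a function like this could be passed as functlist = list(naive_dp_cdf) with Fnameslist = c("naive").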

Usage

CDFtest(Visualization = TRUE, OutputDirectory = 0, functlist, Fnameslist, epslist = c(0.05, 0.1, 1), datalist, Dnameslist, synthsets = NULL, range, gran = 1, granlist = c(1), samplesize = 0, nlist = (10000), cdfstep = 1, reps = 5, ExtraTests_CDF = list(), ExtraTests_PDF = list(), setseed = -100, comments = "none", SmoothAll = FALSE, EmpiricBounds = FALSE, AnalyticBounds = FALSE, AnalyticProbSleeve = FALSE, SuppressRealCDF = FALSE, SuppressDPCDF = FALSE, SuppressLegends = FALSE, ...)

Arguments

Visualization

Sets the testing suite to Visualization mode (default, Visualization = TRUE) or Data Collection mode (Visualization = FALSE). Visualization mode produces a .csv file containing the mean and median results (across reps iterations) of the diagnostic functions on DP-CDF algorithms for each combination of dataset, function, and epsilon, and a .pdf file containing one example DP-CDF plot for each combination of dataset, function, and epsilon, as well as a set of boxplots showing the distribution of all diagnostic results for all combinations of parameters. Data Collection mode produces a .csv file containing the entire (raw) results (across reps iterations) of the diagnostic functions on DP-CDF algorithms for each combination of dataset and function, looped separately over all epsilons, then all granularities, then all sample sizes.

OutputDirectory

The location of the folder which will hold the output (.csv and .pdf files). This defaults to the tempdir() directory.

functlist

A list of CDF-computing functions to be tested on the CDFtestTrack (if Visualization = TRUE) or CDFtestTrackx (if Visualization = FALSE).

Fnameslist

A vector of function names corresponding to the functions in functlist

epslist

A vector of epsilon values for differential privacy

datalist

A list containing vectors of data, each to be used in a test

Dnameslist

A list of dataset names corresponding to the data/variables being tested; used for labelling the output

synthsets

CDFtest generates pre-defined synthetic datasets upon request and fully incorporates them into testing. To request them, supply a list with one inner list per dataset, each of the form list(type, size, shape). For example, synthsets = list(list(type, size, shape), list(type, size, shape)). There is no limit on the number of datasets. The available options are: type: "age" (which ranges from about 0 to 100, gran = 1) or "wage" (which ranges from 0 to 500k); size: any positive integer, typed in exact numerical form (e.g., for ten thousand use 10000, not 10k and not 10^4); shape: gaussian, sparse, uniform, or bimodal. It is assumed that the data input is rounded to the granularity.
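
A short sketch of building a synthsets value is given here. The two dataset requests are arbitrary examples, and the sanity check is a hypothetical convenience written for this illustration, not something the package provides.

```r
# Each inner list is (type, size, shape), in that order.
synthsets <- list(
  list("age",  10000,  "gaussian"),
  list("wage", 100000, "bimodal")
)

# Hypothetical pre-flight check against the documented options.
valid_types  <- c("age", "wage")
valid_shapes <- c("gaussian", "sparse", "uniform", "bimodal")
ok <- vapply(synthsets, function(s) {
  s[[1]] %in% valid_types &&
    s[[2]] > 0 && s[[2]] == round(s[[2]]) &&
    s[[3]] %in% valid_shapes
}, logical(1))
all(ok)
```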

range

The range of the domain as a vector c(min, max), defined based on the user's intuition. To preserve differential privacy, the domain is constructed using this range. Setting the min too high will bias output upward, and setting the max too low will bias it downward. However, setting the min too low and the max too high could reveal the true limits of your data, compromising some privacy.
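
The bias from a poorly chosen range can be seen in a small sketch. It assumes, purely for illustration, that out-of-range records are clamped to the nearest bound; how CDFtest treats such records internally is not documented here, and clamp is a hypothetical helper.

```r
ages <- c(3, 18, 25, 40, 62, 97)

# Hypothetical helper: pull values outside c(min, max) back to the bounds.
clamp <- function(x, range) pmin(pmax(x, range[1]), range[2])

true_mean <- mean(ages)
high_min  <- mean(clamp(ages, c(20, 100)))  # min too high: low values pulled up
low_max   <- mean(clamp(ages, c(0, 60)))    # max too low: high values pulled down
c(true = true_mean, high_min = high_min, low_max = low_max)
```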

gran

FOR Visualization MODE ONLY; ignored in Data Collection mode, where granlist sets the granularities (and thus domain sizes). The granularity of the domain between the min and max, i.e., if age is measured per 1 year of age, gran = 1. The same granularity is applied to all datasets, so using comparable (or scaled) data is necessary.

granlist

FOR Data Collection MODE ONLY; ignored in Visualization mode, where gran sets the granularity. A list of granularities of the domain between the min and max, i.e., if age is measured per 1 year of age, the granularity is 1.
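
To see how granularity translates into domain size, the sketch below counts the bins for each value in a granlist. It assumes the domain is the sequence from min to max in steps of the granularity, which is an illustrative guess at the construction rather than a statement about the package internals.

```r
range    <- c(1, 500000)
granlist <- c(2500, 1250, 1000, 500)

# Number of domain points for each granularity (assumed seq()-style domain).
domain_sizes <- vapply(
  granlist,
  function(g) length(seq(range[1], range[2], by = g)),
  integer(1)
)
data.frame(gran = granlist, domain_size = domain_sizes)
```

Halving the granularity roughly doubles the number of bins, so finer granularities spread the same data over more noisy counts.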

samplesize

FOR Visualization MODE ONLY; ignored in Data Collection mode, where nlist sets the sample sizes. When set to zero, the entire dataset is used. Otherwise, the specified sample size is randomly selected from each dataset without replacement.

nlist

FOR Data Collection MODE ONLY; ignored in Visualization mode, where samplesize sets the sample size. Sets the absolute sample sizes to draw from each dataset, with replacement. Any vector of integer values is appropriate.
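
The difference between the two sampling schemes can be sketched with base R's sample(). The toy dataset and variable names are made up for this illustration, and the code only mirrors the behaviour described for samplesize and nlist; it is not taken from the package.

```r
set.seed(42)
dataset <- round(rnorm(500, mean = 50, sd = 10))

# Visualization mode (samplesize): zero means "use everything"; otherwise
# draw without replacement, so the sample cannot exceed the dataset.
samplesize <- 100
viz_sample <- if (samplesize == 0) {
  dataset
} else {
  sample(dataset, samplesize, replace = FALSE)
}

# Data Collection mode (nlist): draw with replacement, so a requested size
# may be larger than the number of records available.
nlist <- c(100, 1000)
dc_samples <- lapply(nlist, function(n) sample(dataset, n, replace = TRUE))
lengths(dc_samples)
```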

cdfstep

The step size used in outputting the approximate CDF.

reps

The number of times to repeat each diagnostic. Higher reps gives greater accuracy but consumes more time and computing power. The author recommends reps = 10 for quick examples and reps = 100 for more robust examinations.

ExtraTests_CDF

To add extra diagnostics, use the form ExtraTests_CDF = list(functionName1 = function1, functionName2 = function2). Diagnostic functions should take inputs such as Y for a public CDF, est for a DP representation of that CDF, range, and gran, and should return a single value.
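
A custom diagnostic matching that description could be sketched as follows. MaxAbsErr is a hypothetical name chosen for this example, and the toy vectors stand in for a true CDF and a DP estimate evaluated on the same bins.

```r
# Hypothetical diagnostic: largest absolute gap between the two CDFs.
# range and gran are accepted for interface compatibility but unused here.
MaxAbsErr <- function(Y, est, range, gran, ...) {
  max(abs(Y - est))
}

Y   <- c(0.10, 0.40, 0.75, 1.00)  # true CDF on four bins
est <- c(0.12, 0.35, 0.80, 0.98)  # a DP estimate on the same bins
score <- MaxAbsErr(Y, est, range = c(0, 3), gran = 1)
score
```

It could then be supplied as ExtraTests_CDF = list(MaxAbsErr = MaxAbsErr).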

ExtraTests_PDF

See ExtraTests_CDF; the same format applies to diagnostics run on PDFs.

setseed

In the function, each combination of data, epsilon, and function is executed with a separate seed, which by default is randomly generated and reported. Users interested in replicating specific results can locate the reported seed and parameter combination to replicate tests.
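
Why a reported seed is enough to replicate a run can be shown with a generic sketch. The noisy_count function below is a hypothetical stand-in for a single DP release and says nothing about how CDFtest itself consumes the seed.

```r
# Stand-in for one DP release: Laplace noise built from two exponential draws.
noisy_count <- function(count, epsilon) {
  count + rexp(1, rate = epsilon) - rexp(1, rate = epsilon)
}

set.seed(2024)
first <- noisy_count(100, epsilon = 0.1)

set.seed(2024)  # reusing the recorded seed replays the same random draws
second <- noisy_count(100, epsilon = 0.1)

identical(first, second)
```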

comments

"Comments written here print to a log in excel"

SmoothAll

Applies L2 monotonicity post-processing to every DP-CDF.
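
One common way to perform this kind of post-processing is isotonic regression, which finds the nondecreasing sequence closest to the noisy values in squared error. The sketch below uses isoreg() from base R's stats package; whether CDFtest uses this routine internally is an assumption, and the input values are invented.

```r
# A noisy DP-CDF that dips in places because of added noise (toy values).
noisy_cdf <- c(0.10, 0.05, 0.30, 0.25, 0.60, 0.90, 0.85, 1.00)

# Least-squares projection onto nondecreasing sequences.
smoothed <- isoreg(noisy_cdf)$yf

# Keep the result inside [0, 1] so it remains a valid CDF.
smoothed <- pmin(pmax(smoothed, 0), 1)
smoothed
```

Because this step only transforms the already-private output, it does not consume additional privacy budget.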

EmpiricBounds

FOR Visualization MODE ONLY. When TRUE, outputted graphs depict the minimum and maximum values taken by each bin across reps
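
The per-bin envelope described here can be sketched as a column-wise minimum and maximum over repeated runs. The matrix of values is invented for illustration, and the layout (rows as repetitions, columns as bins) is an assumption.

```r
# Rows are repetitions, columns are domain bins (toy values).
runs <- rbind(
  c(0.10, 0.42, 0.81, 1.00),
  c(0.14, 0.35, 0.77, 0.99),
  c(0.08, 0.40, 0.85, 1.00)
)

lower <- apply(runs, 2, min)  # lowest value each bin took across reps
upper <- apply(runs, 2, max)  # highest value each bin took across reps
rbind(lower, upper)
```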

AnalyticBounds

FOR Visualization MODE ONLY. This is a flag and should be set to TRUE if the functions being tested are expected to output analytical variance bounds. The proper output form for such a function is output = list(DPCDFvector, LowerBoundVector, UpperBoundVector).
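
The expected return shape can be sketched with a small wrapper. Both with_bounds and half_width are hypothetical: half_width is a placeholder for whatever variance bound a real algorithm derives analytically, not a bound computed by this example.

```r
# Hypothetical wrapper that packages a DP-CDF with lower and upper bounds.
with_bounds <- function(dp_cdf, half_width) {
  list(
    dp_cdf,                        # DPCDFvector
    pmax(dp_cdf - half_width, 0),  # LowerBoundVector, kept at or above 0
    pmin(dp_cdf + half_width, 1)   # UpperBoundVector, kept at or below 1
  )
}

out <- with_bounds(c(0.10, 0.45, 0.80, 1.00), half_width = 0.05)
str(out)
```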

AnalyticProbSleeve

FOR Visualization MODE ONLY. When TRUE, outputted DP-CDFs will have a 'fuzzy' analytic sleeve around them, approximating the probability density for each point given by DP. This also requires the function format specified above in the description for AnalyticBounds.

SuppressRealCDF

FOR Visualization MODE ONLY. When TRUE, outputted graphs will not include real (non-private) CDFs.

SuppressDPCDF

FOR Visualization MODE ONLY. When TRUE, outputted graphs will not include DP-CDFs (but if SmoothAll = TRUE, monotonized DP CDFs still appear).

SuppressLegends

FOR Visualization MODE ONLY. When TRUE, outputted graphs will not include legends

...

Optionally add additional parameters. This is primarily used to allow automated execution of varied diagnostic functions.

Value

If Visualization = TRUE, a list containing:

$means: mean diagnostic results for each diagnostic across reps iterations for each parameter combination
$medians: median diagnostic results for each diagnostic across reps iterations for each parameter combination
$yourCDFoutput: a single dpCDF iteration for each parameter combination
$yourPDFoutput: a single dpPDF iteration for each parameter combination
$realCDFoutput: the real (non-DP) CDF output for each relevant parameter combination
$realPDFoutput: the real (non-DP) PDF output for each relevant parameter combination
$databins: the domain used to construct the CDFs
$TestPack_CDF: the definitions of diagnostic functions used on dpCDFs
$TestPack_PDF: the definitions of diagnostic functions used on dpPDFs
$allscores: all raw diagnostic output
$seed: the list of seeds used in the test
$permetric: a rearranged data frame (ordered by parameter combinations) useful for plotting

A .pdf file with boxplots showing the distributions of diagnostic outputs, and categorized plots of DP-CDF function output. Each such graph will show one arbitrary CDF iteration and the empirical boundaries, which are the max and min values reached by that function (and parameters) during the test.

A .csv file containing the mean and median scores of each diagnostic on each combination of data, eps, and function, and the seed list for reproduction.

Notes on Visualization mode: Both the .pdf and .csv components are named with a timestamp index in the form YearMonthDayHourMinuteSecond. To locate particular tests, look at CDFtestindexchart.csv, which automatically records the parameters and index of each test. These files can be found in the folder specified by OutputDirectory, which defaults to the R temporary directory, tempdir().

Alternatively, in Data Collection mode (Visualization = FALSE), a list containing:

$allscores: the output of each combination of parameters. Each eps in epslist is run with granularity and sample size held at the first values specified in granlist and nlist, and the same is done when varying granularity and when varying sample size, so only one variable is varied at a time while the other two are held fixed. All such combinations of parameters are executed on all combinations of data and function (specified within datalist and functlist).
$seed: the list of seeds used in the test

A .csv file containing the entire (raw) results (across reps iterations) of diagnostic functions on DP-CDF algorithms for each combination of dataset and function, looped over epsilon, granularity, and sample size values as described directly above. This mode was designed for collecting metric data for subsequent supervised learning modelling.
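
The one-variable-at-a-time design used in Data Collection mode can be sketched as a small parameter grid. The values are arbitrary, and the data frame layout is an illustration of the scheme described above rather than the structure of $allscores.

```r
epslist  <- c(0.05, 0.1, 1)
granlist <- c(2500, 1250)
nlist    <- c(100, 1000, 10000)

# Vary one parameter at a time; hold the other two at their first value.
grid <- rbind(
  data.frame(varied = "eps",  eps = epslist,    gran = granlist[1], n = nlist[1]),
  data.frame(varied = "gran", eps = epslist[1], gran = granlist,    n = nlist[1]),
  data.frame(varied = "n",    eps = epslist[1], gran = granlist[1], n = nlist)
)
grid
```

Under this scheme the number of settings per dataset and function is the sum of the three list lengths, not their product.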

Examples

# Visualization-mode run of functionH on three synthetic wage datasets
CDFtest(
  Visualization = TRUE, OutputDirectory = 0,
  functlist = c(functionH), Fnameslist = c("H"),
  epslist = c(.1, .01),
  datalist = list(), Dnameslist = c(),
  synthsets = list(list("wage", 100000, "uniform"),
                   list("wage", 100000, "sparse"),
                   list("wage", 100000, "bimodal")),
  range = c(1, 500000), gran = 1000, granlist = c(2500, 1250, 1000, 500),
  samplesize = 0, nlist = c(100, 1000, 10000, 100000, 1000000),
  cdfstep = 0, reps = 5,
  ExtraTests_CDF = list(), ExtraTests_PDF = list(),
  setseed = c(-100), comments = "x",
  SmoothAll = FALSE, EmpiricBounds = FALSE,
  AnalyticBounds = FALSE, AnalyticProbSleeve = FALSE,
  SuppressRealCDF = FALSE, SuppressDPCDF = FALSE, SuppressLegends = FALSE
)
