data, epsilon, granularity, range, and any number of other inputs.
Use "?functionH" for an example of an implementation drawing on C++ files
through Rcpp.
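As a rough illustration of the input convention above, the sketch below defines a hypothetical DP-CDF function in base R that accepts data, epsilon, granularity, and range, plus `...` for any other inputs. The name `noisy_hist_cdf`, the argument names, and the Laplace-noised histogram approach are assumptions made for illustration only; this is not functionH, and it may not match the exact signature CDFtest expects. It also treats the sample size as public when normalizing, which is a simplifying assumption.

```r
# Hypothetical DP-CDF sketch (not part of the package): noisy histogram,
# then a cumulative sum. Argument names are illustrative assumptions.
noisy_hist_cdf <- function(data, epsilon, granularity, range, ...) {
  # Build the domain from the user-supplied range, not from the data
  bins <- seq(range[1], range[2], by = granularity)
  # Clamp values into the declared range before binning
  clamped <- pmin(pmax(data, range[1]), range[2])
  counts <- tabulate(findInterval(clamped, bins), nbins = length(bins))
  # Laplace(0, 1/epsilon) noise, drawn as a difference of two exponentials
  # (assumes neighbouring datasets differ by adding or removing one record)
  noise <- rexp(length(bins), rate = epsilon) - rexp(length(bins), rate = epsilon)
  # Cumulative sum of noisy counts; sample size treated as public (assumption)
  cumsum(counts + noise) / length(data)
}

set.seed(1)
ages <- sample(0:100, 1000, replace = TRUE)
est <- noisy_hist_cdf(ages, epsilon = 0.1, granularity = 1, range = c(0, 100))
```

The result is not guaranteed to be monotone or bounded in [0, 1], since the noise is added independently per bin.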
USERS SHOULD NOTE: the
following included diagnostic functions are under development:
SkewDiffpdf, KurtDiffpdf, StdDiffpdf,
corresponding to error measurements of
skewness, kurtosis, and standard deviation generated from dpCDFs.
This shows up as an occasional NA result.
CDFtest(Visualization = TRUE, OutputDirectory = 0, functlist, Fnameslist, epslist = c(0.05, 0.1, 1), datalist, Dnameslist, synthsets = NULL, range, gran = 1, granlist = c(1), samplesize = 0, nlist = (10000), cdfstep = 1, reps = 5, ExtraTests_CDF = list(), ExtraTests_PDF = list(), setseed = -100, comments = "none", SmoothAll = FALSE, EmpiricBounds = FALSE, AnalyticBounds = FALSE, AnalyticProbSleeve = FALSE, SuppressRealCDF = FALSE, SuppressDPCDF = FALSE, SuppressLegends = FALSE, ...)
Visualization mode (Visualization = TRUE)
or Data Collection mode (Visualization = FALSE).
In Visualization mode (default):
A .csv file containing the mean and median results (across reps
iterations) of diagnostic functions on DP-CDF algorithms for each
combination of data, function, and epsilon.
A .pdf file containing one graphical example DP-CDF for each combination
of dataset, function, and epsilon, as well as a set of boxplots
showing the distribution of all diagnostic results for all
combinations of parameters.
In Data Collection mode (set Visualization = FALSE):
A .csv file containing the entire (raw) results (across reps
iterations) of diagnostic functions on DP-CDF algorithms
for each combination of dataset and function, separately looped over
all epsilons, then all granularities, and all sample sizes.
.csv and .pdf files). This defaults to the tempdir() directory.
CDFtestTrack (if Visualization = TRUE) or CDFtestTrackx (if Visualization = FALSE))
synthsets = list(list(type, size, shape), list(type, size, shape)).
There is no limit on the number of datasets included.
Sets available include:
type: "age" (which ranges from about 0 to 100, gran = 1)
and "wage" (which ranges from 0 to 500k);
size: Any positive integer. Type the exact numerical representation
(e.g., for ten thousand use 10000, not 10k and not 10^4);
shape: gaussian, sparse, uniform, bimodal;
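The synthsets format described above could be assembled and sanity-checked as in the sketch below. The check itself is an illustration written for this page, with hypothetical variable names; it is not something CDFtest is known to perform.

```r
# Two synthetic dataset requests, each given as list(type, size, shape)
synthsets <- list(
  list("age", 10000, "gaussian"),
  list("wage", 10000, "bimodal")
)

# Illustrative check against the types and shapes listed in the text
valid_types <- c("age", "wage")
valid_shapes <- c("gaussian", "sparse", "uniform", "bimodal")
ok <- vapply(synthsets, function(s) {
  s[[1]] %in% valid_types &&
    is.numeric(s[[2]]) && s[[2]] > 0 && s[[2]] == round(s[[2]]) &&
    s[[3]] %in% valid_shapes
}, logical(1))
stopifnot(all(ok))
```

Writing the size as a plain number such as 10000 follows the advice above to avoid shorthand like 10k.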
It is assumed that the data input is rounded to the granularity.
c(min, max). Defined based
on user intuition. To preserve differential privacy, the domain is constructed
using this range. Setting the min too high will bias output upward;
likewise, setting the max too low will bias output downward. However, setting min too low and max
too high could reveal the true limits of your data,
compromising some privacy.
Use granlist
for setting
granularities (thus domain sizes) in Data Collection mode.
This command is irrelevant in Data Collection mode.
The granularity of the domain between the min and max, i.e., if age is
measured per 1 year of age, gran = 1.
The same granularity is applied to all datasets, so using comparable
(or scaled) data is necessary.
Use gran
for selecting
granularity in Visualization mode. This command is irrelevant in
Visualization mode. A list of granularities of the domain between the
min and max, i.e., if age is measured per 1 year of age, gran = 1.
Use nlist
for selecting
sample sizes in Data Collection mode. This command is irrelevant in
Data Collection mode. When set to zero, the entire dataset is used.
Otherwise, the specified sample size is randomly selected from each dataset
without replacement.
Use samplesize
for
selecting sample sizes in Visualization mode. This command is irrelevant in
Visualization mode. Sets the absolute sample sizes to draw from each
dataset, with replacement. Any vector of integer values is appropriate.
Increasing reps
lends
greater accuracy, but consumes time and computing power. The author recommends reps = 10
for quick examples and reps = 100
for more robust examinations.
ExtraTests_CDF = list(functionName1 = function1, functionName2 = function2)
.
Diagnostic functions should take inputs such as Y
for a public CDF,
est
for a DP representation of that CDF,
range
and gran,
and the output should be a single value.
TRUE
if the functions being tested are expected to output
analytical variance bounds. The proper output form for such a function is
output = list(DPCDFvector, LowerBoundVector, UpperBoundVector)
.
If TRUE, outputted
DP-CDFs will have a 'fuzzy' analytic sleeve around them, approximating
probability density for each point given by DP. This also requires the
function format specified above in the description for AnalyticBounds.
If TRUE
, outputted
graphs will not include real (non-private) CDFs.
If TRUE, outputted
graphs will not include DP-CDFs (but if SmoothAll = TRUE, monotonized
DP-CDFs still appear).
If TRUE, outputted
graphs will not include legends.
If Visualization = TRUE
, a list containing:
...$means
Contains mean diagnostic
results for each diagnostic across reps iterations for each parameter combination;
...$medians
Contains median diagnostic
results for each diagnostic across reps iterations for each parameter combination;
...$yourCDFoutput
Containing a single dpCDF iteration for each parameter combination;
...$yourPDFoutput
Containing a single dpPDF iteration for each parameter combination;
...$realCDFoutput
Containing the real (non-DP) CDF output for each relevant parameter combination;
...$realPDFoutput
Containing the real (non-DP) PDF output for each relevant parameter combination;
...$databins
Containing the domain used to construct the CDFs;
...$TestPack_CDF
Containing the definitions of diagnostic functions used on dpCDFs;
...$TestPack_PDF
Containing the definitions of diagnostic functions used on dpPDFs;
...$allscores
Containing all raw diagnostic output;
...$seed
Containing the list of seeds used in the test;
...$permetric
holding a rearranged dataframe (ordered by parameter
combinations) useful for plotting.
A .pdf
file:
with boxplots showing the distributions of diagnostic outputs,
and categorized plots of DP-CDF function output. Each such graph will
show one arbitrary CDF iteration and empirical boundaries.
The empirical boundaries are the max and min values reached by that
function (and parameters) during the test.
A .csv
file:
containing the mean and median scores of each diagnostic on each
combination of data, eps, function, and the seed list for reproduction.
Notes on Visualization mode: Both the .pdf
and .csv
components are named with a timestamp index
in the form YearMonthDayHourMinuteSecond.
To locate particular tests,
look at CDFtestindexchart.csv,
which automatically records the
parameters and index of each test. These can be found in the directory specified by
OutputDirectory,
which defaults to the R temporary directory, tempdir().
Alternatively, in Data Collection mode (Visualization = FALSE
), a list containing:
...$allscores
holding the output of each combination of parameters:
each eps in epslist is run with the first value
specified in granlist and nlist. The same is true for varying
granularity and sample size. In that way, only one variable is varied at a time
while the other two are held fixed. All such combinations of parameters are
executed on all combinations of data and function (specified within
datalist and functlist);
...$seed
holding the list of seeds used in the test.
A .csv
file containing the entire (raw) results (across reps iterations)
of diagnostic functions on DP-CDF algorithms for each combination of
dataset and function, looped over epsilon, granularity, and sample size values
as described directly above.
This mode was designed for collecting metric data for subsequent supervised
learning modelling.
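The one-variable-at-a-time design described for Data Collection mode can be sketched as follows. The helper name `one_at_a_time` and the column names are hypothetical; the code only enumerates parameter settings as the text describes them and is not taken from the CDFtest source.

```r
# Hypothetical helper: list the (eps, gran, n) settings visited when each
# parameter is varied in turn while the other two stay at their first value.
one_at_a_time <- function(epslist, granlist, nlist) {
  rbind(
    data.frame(varied = "eps",  eps = epslist,    gran = granlist[1], n = nlist[1]),
    data.frame(varied = "gran", eps = epslist[1], gran = granlist,    n = nlist[1]),
    data.frame(varied = "n",    eps = epslist[1], gran = granlist[1], n = nlist)
  )
}

grid <- one_at_a_time(
  epslist  = c(0.1, 0.01),
  granlist = c(2500, 1250, 1000, 500),
  nlist    = c(100, 1000, 10000)
)
print(grid)
```

Under this reading, the number of settings grows with the sum of the list lengths rather than their product, which keeps a Data Collection run far smaller than a full grid.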
CDFtest(Visualization = TRUE, OutputDirectory = 0, functlist = c(functionH),
  Fnameslist = c("H"), epslist = c(.1, .01), datalist = list(),
  Dnameslist = c(), synthsets = list(list("wage", 100000, "uniform"),
  list("wage", 100000, "sparse"), list("wage", 100000, "bimodal")),
  range = c(1, 500000), gran = 1000, granlist = c(2500, 1250, 1000, 500),
  samplesize = 0, nlist = c(100, 1000, 10000, 100000, 1000000),
  cdfstep = 0, reps = 5, ExtraTests_CDF = list(), ExtraTests_PDF = list(),
  setseed = c(-100),
  comments = "x", SmoothAll = FALSE, EmpiricBounds = FALSE,
  AnalyticBounds = FALSE, AnalyticProbSleeve = FALSE,
  SuppressRealCDF = FALSE, SuppressDPCDF = FALSE, SuppressLegends = FALSE)
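A custom diagnostic for ExtraTests_CDF might look like the sketch below, following the input convention given in the argument description (Y, est, range, and gran, returning a single value). The name `MaxGap` is hypothetical, and whether CDFtest passes these arguments by name or by position is an assumption here.

```r
# Hypothetical diagnostic: largest absolute gap between the real CDF (Y)
# and its DP estimate (est). range and gran are accepted but unused.
MaxGap <- function(Y, est, range, gran) {
  max(abs(Y - est), na.rm = TRUE)
}

# Small hand-made vectors, used only to exercise the function
Y   <- c(0.20, 0.50, 1.00)
est <- c(0.25, 0.40, 1.05)
gap <- MaxGap(Y, est, range = c(0, 2), gran = 1)

# The named list form expected by ExtraTests_CDF, per the argument description
extra_tests <- list(MaxGap = MaxGap)
```

Such a list could then be supplied as `ExtraTests_CDF = extra_tests` in a CDFtest call.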