# Build train dataset X and y, and test dataset Xtest and ytest
library(DPpack) # provides svmDP and tune_classification_model
N <- 200
K <- 2
X <- data.frame()
y <- data.frame()
for (j in 1:K) {
  t <- seq(-0.25, 0.25, length.out = N)
  if (j==1) m <- stats::rnorm(N, -.2, .1) # class 1: noise with mean -0.2
  if (j==2) m <- stats::rnorm(N, .2, .1)  # class 2: noise with mean +0.2
  Xtemp <- data.frame(x1 = 3*t, x2 = m - t)
  ytemp <- data.frame(matrix(j-1, N, 1)) # label j-1 repeated N times
  X <- rbind(X, Xtemp)
  y <- rbind(y, ytemp)
}
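# X now holds N*K = 400 two-dimensional points (one block of N rows per class);
# y holds the matching 0/1 class labels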
Xtest <- X[seq(1,(N*K),10),]
ytest <- y[seq(1,(N*K),10),,drop=FALSE]
X <- X[-seq(1,(N*K),10),]
y <- y[-seq(1,(N*K),10),,drop=FALSE]
y <- as.matrix(y)
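# Optional sanity check using base graphics (not required for the tuning below):
# visualize the two synthetic classes
plot(X$x1, X$x2, col = y + 1, pch = 20,
     xlab = "x1", ylab = "x2", main = "Synthetic training classes")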
weights <- rep(1, nrow(y)) # Uniform weighting
weights[nrow(y)] <- 0.5 # half weight for last observation
wub <- 1 # Public upper bound for weights
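# The bound must be public (data-independent): it enters the sensitivity
# calculation of the privacy mechanism, so deriving it from the observed
# weights would itself leak information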
# Grid of possible gamma values for tuning the SVM model
grid.search <- c(100, 1, .0001)
# Construct objects for SVM parameter tuning
eps <- 1 # Privacy budget should be the same for all models
svmdp1 <- svmDP$new("l2", eps, grid.search[1], perturbation.method='output')
svmdp2 <- svmDP$new("l2", eps, grid.search[2], perturbation.method='output')
svmdp3 <- svmDP$new("l2", eps, grid.search[3], perturbation.method='output')
models <- list(svmdp1, svmdp2, svmdp3)
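# An equivalent, more compact way to build the candidate list (a sketch; it
# assumes only the same svmDP$new arguments used above):
# models <- lapply(grid.search, function(g)
#   svmDP$new("l2", eps, g, perturbation.method = "output"))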
# Tune using data and bounds for X based on its construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
tuned.model <- tune_classification_model(models, X, y, upper.bounds,
                                         lower.bounds, weights=weights,
                                         weights.upper.bound=wub)
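# The tuning procedure privately selects among the candidates, which is why
# every candidate model must be constructed with the same eps (see above)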
tuned.model$gamma # Reports the gamma value selected by the tuning procedure
# The tuned.model result can be used the same as a trained svmDP model
# Predict new data points
predicted.y <- tuned.model$predict(Xtest)
n.errors <- sum(predicted.y!=ytest) # count test-set misclassifications
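# Illustrative follow-up using base R only: the misclassification rate on the
# 40-point test set
error.rate <- n.errors/nrow(Xtest)
error.rate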