kmm: Kernel mean matching approach to density ratio estimation

Description

Kernel mean matching approach to density ratio estimation

Usage

kmm(
  df_numerator,
  df_denominator,
  scale = "numerator",
  constrained = FALSE,
  nsigma = 10,
  sigma_quantile = NULL,
  sigma = NULL,
  ncenters = 200,
  centers = NULL,
  cv = TRUE,
  nfold = 5,
  parallel = FALSE,
  nthreads = NULL,
  progressbar = TRUE,
  osqp_settings = NULL,
  cluster = NULL
)

Value

kmm-object, containing all information to calculate the density ratio using optimal sigma and optimal weights.

Arguments

df_numerator: data.frame with exclusively numeric variables with the numerator samples
df_denominator: data.frame with exclusively numeric variables with the denominator samples (must have the same variables as df_numerator)
scale: "numerator", "denominator", or NULL, indicating whether to standardize each numeric variable according to the numerator means and standard deviations, the denominator means and standard deviations, or apply no standardization at all.
constrained: Logical; FALSE to use unconstrained optimization, TRUE to use constrained optimization. Defaults to FALSE.
nsigma: Integer indicating the number of sigma values (bandwidth parameter of the Gaussian kernel gram matrix) to use in cross-validation.
sigma_quantile: NULL or numeric vector with probabilities to calculate the quantiles of the distance matrix to obtain sigma values. If NULL, nsigma values between 0.25 and 0.75 are used.
sigma: NULL or a scalar value to determine the bandwidth of the Gaussian kernel gram matrix. If NULL, nsigma values between 0.25 and 0.75 are used.
ncenters: Maximum number of Gaussian centers in the kernel gram matrix. Defaults to 200.
centers: Option to specify the Gaussian centers manually.
cv: Logical indicating whether or not to do cross-validation.
nfold: Number of cross-validation folds used in order to calculate the optimal sigma value (default is 5-fold cv).
parallel: logical indicating whether to use parallel processing in the cross-validation scheme.
nthreads: NULL or integer indicating the number of threads to use for parallel processing. If parallel processing is enabled, it defaults to the number of available threads minus one.
progressbar: Logical indicating whether or not to display a progress bar.
osqp_settings: Optional: settings to pass to the osqp solver for constrained optimization.
cluster: Optional: a cluster object to use for parallel processing, see parallel::makeCluster.
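
The roles of scale, sigma_quantile, nsigma and sigma can be illustrated with a small base R sketch. It is not the package's internal code: the toy data, the pooling of both samples into one distance matrix, and the kernel form exp(-d^2 / (2 * sigma^2)) are assumptions made for illustration only.

```r
set.seed(1)
# Toy data standing in for df_numerator and df_denominator (hypothetical values)
nu <- data.frame(x1 = rnorm(50), x2 = rnorm(50, 1))
de <- data.frame(x1 = rnorm(60, 0.5), x2 = rnorm(60, 1, 2))

# scale = "numerator": standardize both samples with the numerator means and sds
m <- colMeans(nu)
s <- apply(nu, 2, sd)
nu_s <- scale(as.matrix(nu), center = m, scale = s)
de_s <- scale(as.matrix(de), center = m, scale = s)

# Candidate bandwidths: quantiles of the pairwise Euclidean distances
# (assumption: distances are computed on the pooled, standardized samples)
d <- as.matrix(dist(rbind(nu_s, de_s)))
probs <- seq(0.25, 0.75, length.out = 10)  # plays the role of nsigma = 10
sigmas <- quantile(d[lower.tri(d)], probs = probs, names = FALSE)

# Gaussian kernel Gram matrix (denominator rows, numerator columns)
# for one candidate bandwidth; the kernel form is an assumption
sq_dist <- d[51:110, 1:50]^2
K <- exp(-sq_dist / (2 * sigmas[5]^2))
```

Passing a single value through sigma would correspond to skipping the quantile step and using that bandwidth directly.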

References

Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2006). Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, edited by B. Schölkopf, J. Platt and T. Hoffman. Available from https://proceedings.neurips.cc/paper/2006/hash/a2186aa7c086b46ad4e8bf81e2a3a19b-Abstract.html.

Examples

set.seed(123)
# Fit model
dr <- kmm(numerator_small, denominator_small)
# Inspect model object
dr
# Obtain summary of model object
summary(dr)
# Plot model object
plot(dr)
# Plot density ratio for each variable individually
plot_univariate(dr)
# Plot density ratio for each pair of variables
plot_bivariate(dr)
# Predict density ratio and inspect first 6 predictions
head(predict(dr))
# Fit model with custom parameters
kmm(numerator_small, denominator_small,
    nsigma = 5, ncenters = 100, nfold = 10,
    constrained = TRUE)
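
Because df_numerator and df_denominator must both be data.frames with exclusively numeric variables and the same set of variables, it can help to check custom data before fitting. The helper below is a hypothetical sketch in base R (check_kmm_inputs is not part of the package), and the toy data frames are made up for illustration.

```r
# Hypothetical helper, not part of the package: checks the input
# requirements stated in the Arguments section
check_kmm_inputs <- function(df_numerator, df_denominator) {
  stopifnot(is.data.frame(df_numerator), is.data.frame(df_denominator))
  all_numeric <- function(df) all(vapply(df, is.numeric, logical(1)))
  if (!all_numeric(df_numerator) || !all_numeric(df_denominator)) {
    stop("both data frames must contain only numeric variables")
  }
  if (!setequal(names(df_numerator), names(df_denominator))) {
    stop("both data frames must contain the same variables")
  }
  invisible(TRUE)
}

# Toy data frames (made-up values)
a <- data.frame(x = rnorm(5), y = rnorm(5))
b <- data.frame(x = rnorm(8), y = rnorm(8))
check_kmm_inputs(a, b)
```

The helper stops with an error when a requirement is violated and returns TRUE invisibly otherwise.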
