densityratio (version 0.2.1)

kmm: Kernel mean matching approach to density ratio estimation

Description

Kernel mean matching approach to density ratio estimation

Usage

kmm(
  df_numerator,
  df_denominator,
  scale = "numerator",
  constrained = FALSE,
  nsigma = 10,
  sigma_quantile = NULL,
  sigma = NULL,
  ncenters = 200,
  centers = NULL,
  cv = TRUE,
  nfold = 5,
  parallel = FALSE,
  nthreads = NULL,
  progressbar = TRUE,
  osqp_settings = NULL,
  cluster = NULL
)

Value

kmm-object, containing all information to calculate the density ratio using optimal sigma and optimal weights.

Arguments

df_numerator

data.frame with exclusively numeric variables with the numerator samples

df_denominator

data.frame with exclusively numeric variables with the denominator samples (must have the same variables as df_numerator)

scale

"numerator", "denominator", or NULL, indicating whether to standardize each numeric variable according to the numerator means and standard deviations, the denominator means and standard deviations, or apply no standardization at all.
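As a rough illustration of what scale = "numerator" means, the base R sketch below standardizes both samples with the numerator means and standard deviations. The toy data and the object names (nu, de, nu_scaled, de_scaled) are hypothetical, and the package's internal implementation may differ in detail.

```r
# Hypothetical toy data; the column names are illustrative only.
set.seed(1)
nu <- data.frame(x1 = rnorm(50, 1, 2), x2 = rnorm(50))
de <- data.frame(x1 = rnorm(60), x2 = rnorm(60, 0, 3))

# Means and standard deviations are taken from the numerator sample only.
nu_mean <- colMeans(nu)
nu_sd <- apply(nu, 2, sd)

# Apply the same centering and scaling to both samples.
nu_scaled <- scale(nu, center = nu_mean, scale = nu_sd)
de_scaled <- scale(de, center = nu_mean, scale = nu_sd)

colMeans(nu_scaled)      # numerically zero for the numerator sample
apply(nu_scaled, 2, sd)  # unit standard deviation for the numerator sample
```

The denominator sample is shifted and rescaled by the same constants, so its columns generally do not end up with mean 0 and standard deviation 1.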

constrained

Logical: FALSE to use unconstrained optimization, TRUE to use constrained optimization. Defaults to FALSE.

nsigma

Integer indicating the number of sigma values (bandwidth parameter of the Gaussian kernel gram matrix) to use in cross-validation.

sigma_quantile

NULL or numeric vector with probabilities to calculate the quantiles of the distance matrix to obtain sigma values. If NULL, nsigma quantile probabilities between 0.25 and 0.75 are used.

sigma

NULL or a scalar value to determine the bandwidth of the Gaussian kernel gram matrix. If NULL, the bandwidth is taken from nsigma quantiles (probabilities between 0.25 and 0.75) of the distance matrix.
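To make the link between quantiles of the distance matrix and sigma concrete, here is a minimal base R sketch. It rests on several assumptions that the documentation does not confirm: distances are computed over the pooled samples, the probabilities are equally spaced, and the Gaussian kernel is parameterized as exp(-d^2 / (2 * sigma^2)). All object names are hypothetical.

```r
set.seed(1)
nu <- matrix(rnorm(40), ncol = 2)  # hypothetical numerator sample
de <- matrix(rnorm(60), ncol = 2)  # hypothetical denominator sample

# Pairwise Euclidean distances between all pooled observations
# (assumption: the package may use a different set of pairs).
d <- as.matrix(dist(rbind(nu, de)))
d_pairs <- d[upper.tri(d)]

# Candidate bandwidths from quantiles of the distances
# (assumption: equally spaced probabilities between 0.25 and 0.75).
nsigma <- 10
probs <- seq(0.25, 0.75, length.out = nsigma)
sigma_grid <- unname(quantile(d_pairs, probs))

# Gaussian kernel Gram matrix for one candidate bandwidth
# (assumption: kernel parameterized with 2 * sigma^2 in the denominator).
sigma <- sigma_grid[5]
K <- exp(-d^2 / (2 * sigma^2))
```

Supplying sigma directly skips the quantile step and fixes the bandwidth at a single value.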

ncenters

Maximum number of Gaussian centers in the kernel gram matrix. Defaults to 200, as shown in the usage above.

centers

Option to specify the Gaussian centers manually.

cv

Logical indicating whether or not to do cross-validation.

nfold

Number of cross-validation folds used in order to calculate the optimal sigma value (default is 5-fold cv).
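For readers unfamiliar with k-fold cross-validation, the following base R sketch shows one common way to split observations into nfold folds. It is a generic illustration, not the package's internal scheme; n and the object names are hypothetical.

```r
set.seed(1)
n <- 23     # hypothetical sample size
nfold <- 5

# Randomly assign each observation to one of nfold roughly equal folds.
fold_id <- sample(rep_len(seq_len(nfold), n))

for (k in seq_len(nfold)) {
  test_idx <- which(fold_id == k)   # held-out observations for fold k
  train_idx <- which(fold_id != k)  # observations used for fitting
  # Fit on train_idx and score each candidate sigma on test_idx here.
}

table(fold_id)  # fold sizes differ by at most one observation
```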

parallel

Logical indicating whether to use parallel processing in the cross-validation scheme.

nthreads

NULL or integer indicating the number of threads to use for parallel processing. If parallel processing is enabled, it defaults to the number of available threads minus one.

progressbar

Logical indicating whether or not to display a progressbar.

osqp_settings

Optional: settings to pass to the osqp solver for constrained optimization.

cluster

Optional: a cluster object to use for parallel processing, see parallel::makeCluster.
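A possible way to supply your own cluster is sketched below. It uses only arguments listed in the usage section and the example data from the Examples section. The choice of two workers is arbitrary, and the call is guarded so the snippet still runs where densityratio is not installed.

```r
library(parallel)

# Two workers keep the example light; adjust to your machine.
cl <- makeCluster(2)

# Guarded so the snippet still runs where densityratio is not installed.
if (requireNamespace("densityratio", quietly = TRUE)) {
  library(densityratio)
  dr <- kmm(numerator_small, denominator_small,
            parallel = TRUE, cluster = cl, progressbar = FALSE)
}

# Always release the workers when done.
stopCluster(cl)
```

Creating the cluster yourself lets you reuse it across several model fits and control when the workers are shut down.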

References

Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2006). Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, edited by B. Schölkopf, J. Platt and T. Hoffman. Available from https://proceedings.neurips.cc/paper/2006/hash/a2186aa7c086b46ad4e8bf81e2a3a19b-Abstract.html.

Examples

set.seed(123)
# Fit model
dr <- kmm(numerator_small, denominator_small)
# Inspect model object
dr
# Obtain summary of model object
summary(dr)
# Plot model object
plot(dr)
# Plot density ratio for each variable individually
plot_univariate(dr)
# Plot density ratio for each pair of variables
plot_bivariate(dr)
# Predict density ratio and inspect first 6 predictions
head(predict(dr))
# Fit model with custom parameters
kmm(numerator_small, denominator_small,
    nsigma = 5, ncenters = 100, nfold = 10,
    constrained = TRUE)
