This function calculates maximum-similarity cross-validated bandwidths for the
distance using kernel summation similarity. This implementation uses the method
described in Ghashti (2024) for mixed-type data that includes any of numeric
(continuous), factor (nominal), and ordered factor (ordinal) variables.
mscv.dkss calculates the bandwidths associated with each kernel function
for variable types and returns a numeric vector of bandwidths that can be used
with the dkss pairwise distance calculation.
mscv.dkss(df, nstart = NULL, ckernel = "c_gaussian", ukernel = "u_aitken",
okernel = "o_wangvanryzin", verbose = FALSE)
mscv.dkss returns a list object with the following components:
a \(p\)-variate vector of bandwidth values, intended to be used for the dkss pairwise distance calculation
a numeric value of the MSCV objective function, obtained using the optim function for constrained multivariate optimization
df
a \(p\)-variate data frame. The data types may be continuous (numeric), nominal (factor), ordinal (ordered), or any combination thereof. Columns of df should be set to the appropriate data type class.
nstart
integer number of restarts for the process of finding extrema of the mscv function from random initial bandwidth parameters (starting points). If the default of NULL is used, then the number of restarts will be \(\min(3,\text{ncol(df)})\).
ckernel
character string specifying the continuous kernel function. Options include c_gaussian, c_epanechnikov, c_uniform, c_triangle, c_biweight, c_triweight, c_tricube, c_cosine, c_logistic, c_sigmoid, and c_silverman. Note that if np is used for bw selection, continuous kernel types are restricted to c_gaussian, c_epanechnikov, or c_uniform. Defaults to c_gaussian. See details.
ukernel
character string specifying the nominal kernel function for unordered factors. Options include u_aitken and u_aitchisonaitken. Defaults to u_aitken. See details.
okernel
character string specifying the ordinal kernel function for ordered factors. Options include o_aitken, o_aitchisonaitken, o_habbema, o_wangvanryzin, and o_liracine. Note that if np is used for bw selection, ordinal kernel types are restricted to o_wangvanryzin or o_liracine. Defaults to o_wangvanryzin. See details.
verbose
a logical value specifying whether to print progress: the index \(i\) of the current restart out of the total number of restarts (nstart), and whether the optimization procedure converged. Defaults to FALSE.
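The restart scheme controlled by nstart and verbose can be sketched in base R as follows. This is a minimal illustration and not the package's internal code: the helper multistart_max, its arguments, the names in the returned list, the choice of method = "L-BFGS-B", and the toy objective are all assumptions made for the example.

```r
# Hypothetical helper: box-constrained maximization from several random starts.
multistart_max <- function(fn, lower, upper, nstart = 3, verbose = FALSE) {
  best <- NULL
  for (i in seq_len(nstart)) {
    # Draw a random starting point inside the box constraints
    init <- runif(length(lower), lower, upper)
    fit <- optim(init, fn, method = "L-BFGS-B",
                 lower = lower, upper = upper,
                 control = list(fnscale = -1))  # fnscale = -1 turns optim into a maximizer
    if (verbose) {
      message("start ", i, " of ", nstart,
              ": convergence code ", fit$convergence)
    }
    # Keep the run with the highest objective value
    if (is.null(best) || fit$value > best$value) best <- fit
  }
  list(bw = best$par, value = best$value)
}

# Toy concave objective (an assumption) with its maximum at (0.3, 0.7)
toy_objective <- function(l) -sum((l - c(0.3, 0.7))^2)

set.seed(1)
res <- multistart_max(toy_objective, lower = c(0.01, 0.01), upper = c(1, 1))
res$bw
```

Keeping only the best of several runs guards against a single start ending in a poor local optimum, which is the usual motivation for random restarts.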
John R. J. Thompson john.thompson@ubc.ca, Jesse S. Ghashti jesse.ghashti@ubc.ca
mscv.dkss implements the maximum-similarity cross-validation (MSCV)
bandwidth selection technique for the dkss function, described
by Ghashti (2024). This approach uses summation kernels for continuous,
nominal and ordinal data, which are then summed over all variable types to
return the pairwise distance between mixed-type data.
The maximization procedure for bandwidth selection is based on the objective \(\arg\max_{\boldsymbol{\lambda}}\left\{\frac{1}{n}\sum_{i=1}^n\log\left(\frac{1}{n-1}\sum_{\substack{j=1 \\ j \ne i}}^n s_{\text{KSS}}(\textbf{x}_i,\textbf{x}_j \mid \boldsymbol{\lambda})\right)\right\},\) where
\(s_{\text{KSS}}(\textbf{x}_i, \textbf{x}_j \mid \boldsymbol{\lambda}) = \sum_{k=1}^{p_c}K(x_{i,k}^c, x_{j,k}^c, \lambda_k^c) + \sum_{k=1}^{p_u}L(x_{i,k}^u,x_{j,k}^u,\lambda_k^u) + \sum_{k=1}^{p_o}\ell(x_{i,k}^o,x_{j,k}^o,\lambda_k^o).\)
\(K(\cdot)\), \(L(\cdot)\), and \(\ell(\cdot)\) are the continuous,
nominal, and ordinal kernel functions, respectively, with each \(\lambda_k\)
representing the kernel-specific bandwidth for the \(k\)-th variable, and
\(p_c\), \(p_u\), \(p_o\) the numbers of continuous, nominal, and ordinal
variables in the data frame df. The bw vector returned contains the
bandwidths that yield the highest objective function value.
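To make the objective concrete, the sketch below evaluates it in base R for one fixed bandwidth vector. It is illustrative only: the helper names are hypothetical, and the kernels are textbook forms (a standard Gaussian density, the Aitchison and Aitken kernel, and the Wang and van Ryzin kernel) that may differ in scaling or parameterization from the ones mscv.dkss uses.

```r
# Textbook kernel forms, assumed here for illustration only.
k_gauss <- function(a, b, lambda) dnorm((a - b) / lambda)
k_aa <- function(a, b, lambda, n_levels) {
  ifelse(a == b, 1 - lambda, lambda / (n_levels - 1))
}
k_wvr <- function(a, b, lambda) {
  d <- abs(a - b)
  ifelse(d == 0, 1 - lambda, 0.5 * (1 - lambda) * lambda^d)
}

# Hypothetical helper: MSCV objective for a fixed bandwidth vector `lambda`,
# given in the same order as the columns of `df`.
mscv_objective <- function(df, lambda) {
  n <- nrow(df)
  S <- matrix(0, n, n)  # pairwise kernel summation similarity
  for (k in seq_along(df)) {
    x <- df[[k]]
    if (is.ordered(x)) {          # test ordered first: ordered factors are factors too
      xi <- as.integer(x)
      S <- S + outer(xi, xi, k_wvr, lambda = lambda[k])
    } else if (is.factor(x)) {
      xi <- as.integer(x)
      S <- S + outer(xi, xi, k_aa, lambda = lambda[k], n_levels = nlevels(x))
    } else {
      S <- S + outer(x, x, k_gauss, lambda = lambda[k])
    }
  }
  diag(S) <- 0                    # leave-one-out: drop the j == i terms
  mean(log(rowSums(S) / (n - 1)))
}

set.seed(1)
toy <- data.frame(
  x = rnorm(20),
  g = factor(sample(c("A", "B", "C"), 20, TRUE)),
  o = ordered(sample(c("Low", "Medium", "High"), 20, TRUE),
              levels = c("Low", "Medium", "High"))
)
val <- mscv_objective(toy, lambda = c(0.5, 0.3, 0.3))
```

A bandwidth selector would maximize this value over lambda. Here it is only evaluated once, to show how the leave-one-out average of log similarities is assembled.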
Data contained in the data frame df may constitute any combination of
continuous, nominal, or ordinal data, which is to be specified in the data
frame df using numeric for continuous data, factor for nominal data, and
ordered for ordinal data. Data can be entered in an arbitrary order and data
types will be detected automatically. User-inputted vectors of bandwidths bw
must be defined in the same order as the variables in the data frame df, to
ensure they are sorted accordingly by the routine.
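As a brief illustration of setting column classes before calling the function, the base R sketch below coerces raw character columns into numeric, factor, and ordered columns and then reports how each would be classified. The column names and values are invented for the example, and the class checks only mimic the kind of automatic detection described above, not the package's actual code.

```r
# Raw data as it might arrive from a CSV: everything is character
raw <- data.frame(
  income = c("12.5", "30.1", "18.0"),
  region = c("N", "S", "N"),
  rating = c("Low", "High", "Medium"),
  stringsAsFactors = FALSE
)

# Set each column to the class matching its measurement type
df <- data.frame(
  income = as.numeric(raw$income),                       # continuous
  region = factor(raw$region),                           # nominal
  rating = factor(raw$rating,
                  levels = c("Low", "Medium", "High"),
                  ordered = TRUE)                        # ordinal
)

# Classify columns; is.ordered() must be checked before is.factor()
col_type <- vapply(df, function(x) {
  if (is.ordered(x)) "ordinal" else if (is.factor(x)) "nominal" else "continuous"
}, character(1))
col_type
```

Giving the ordinal levels explicitly matters: without the levels argument, R would order them alphabetically ("High", "Low", "Medium").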
There are many kernels which can be specified by the user. Continuous kernel functions may be found in Cameron and Trivedi (2005), Härdle et al. (2004) or Silverman (1986). Nominal kernels use a variation of Aitchison and Aitken's (1976) kernel. Ordinal kernels use a variation of the Wang and van Ryzin (1981) kernel. All nominal and ordinal kernel functions can be found in Li and Racine (2007), Li and Racine (2003), Ouyang et al. (2006), and Titterington and Bowman (1985).
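For orientation, a few of the compactly supported continuous kernels named above can be written in their usual textbook form, as functions of the scaled difference u = (x_i - x_j) / lambda. The function names below are hypothetical, and the package's own definitions may use a different support or normalization.

```r
# Textbook forms on the support |u| <= 1 (assumed, not taken from the package)
k_epanechnikov <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
k_uniform      <- function(u) ifelse(abs(u) <= 1, 0.5, 0)
k_triangle     <- function(u) ifelse(abs(u) <= 1, 1 - abs(u), 0)

# Evaluate each kernel on a small grid of scaled differences
u <- c(-1.5, -0.5, 0, 0.5, 1.5)
kernel_values <- rbind(
  epanechnikov = k_epanechnikov(u),
  uniform      = k_uniform(u),
  triangle     = k_triangle(u)
)
kernel_values
```

All three vanish outside the support, so observations further apart than one bandwidth contribute nothing to the similarity for that variable, unlike the Gaussian kernel, which is positive everywhere.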
Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method”, Biometrika, 63, 413-420.
Cameron, A. and P. Trivedi (2005), “Microeconometrics: Methods and Applications”, Cambridge University Press.
Ghashti, J.S. (2024), “Similarity Maximization and Shrinkage Approach in Kernel Metric Learning for Clustering Mixed-type Data”, University of British Columbia.
Härdle, W., M. Müller, S. Sperlich and A. Werwatz (2004), “Nonparametric and Semiparametric Models”, (Vol. 1). Berlin: Springer.
Li, Q. and J.S. Racine (2007), “Nonparametric Econometrics: Theory and Practice”, Princeton University Press.
Li, Q. and J.S. Racine (2003), “Nonparametric estimation of distributions with categorical and continuous data”, Journal of Multivariate Analysis, 86, 266-292.
Ouyang, D., Q. Li and J.S. Racine (2006), “Cross-validation and the estimation of probability distributions with categorical data”, Journal of Nonparametric Statistics, 18, 69-100.
Silverman, B.W. (1986), “Density Estimation”, London: Chapman and Hall.
Titterington, D.M. and A.W. Bowman (1985), “A comparative study of smoothing procedures for ordered categorical data”, Journal of Statistical Computation and Simulation, 21(3-4), 291-312.
Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions”, Biometrika, 68, 301-309.
mscv.dkps, dkss, dkps
library(kdml)  # mscv.dkss is assumed to be provided by the kdml package
set.seed(1)    # arbitrary seed so the simulated data are reproducible

# example data frame with mixed numeric, nominal, and ordinal data
lvls <- c("Low", "Medium", "High")
df <- data.frame(
  x1 = runif(100, 0, 100),
  x2 = factor(sample(c("A", "B", "C"), 100, TRUE)),
  x3 = factor(sample(c("A", "B", "C"), 100, TRUE)),
  x4 = rnorm(100, 10, 3),
  x5 = ordered(sample(lvls, 100, TRUE), levels = lvls),
  x6 = ordered(sample(lvls, 100, TRUE), levels = lvls))

# minimal call requires just the data frame; other arguments use their defaults
bw <- mscv.dkss(df = df)

# specify the number of starts and the kernel functions
bw2 <- mscv.dkss(df = df, nstart = 5, ckernel = "c_triangle",
                 ukernel = "u_aitken", okernel = "o_liracine")