Learn Python and AI for free! One week only. No credit card needed.
Ends in:
Maximum (cross-validation) likelihood estimation for fitting kernel density estimator for a variety of possible kernels, by treating it as a mixture model.
fkden(x, linit = NULL, bwinit = NULL, kernel = "gaussian",
extracentres = NULL, add.jitter = FALSE, factor = 0.1, amount = NULL,
std.err = TRUE, method = "BFGS", control = list(maxit = 10000),
finitelik = TRUE, ...)lkden(x, lambda = NULL, bw = NULL, kernel = "gaussian",
extracentres = NULL, log = TRUE)
nlkden(lambda, x, bw = NULL, kernel = "gaussian", extracentres = NULL,
finitelik = FALSE)
vector of sample data
initial value for bandwidth (as kernel half-width) or NULL
initial value for bandwidth (as kernel standard deviations) or NULL
kernel name (default = "gaussian"
)
extra kernel centres used in KDE,
but likelihood contribution not evaluated, or NULL
logical, whether jitter is needed for rounded kernel centres
see jitter
see jitter
logical, should standard errors be calculated
optimisation method (see optim
)
optimisation control list (see optim
)
logical, should log-likelihood return finite value for invalid parameters
optional inputs passed to optim
bandwidth for kernel (as half-width of kernel) or NULL
bandwidth for kernel (as standard deviations of kernel) or NULL
logical, if TRUE
then log-likelihood rather than likelihood is output
Log-likelihood is given by lkden
and it's
wrappers for negative log-likelihood from nlkden
.
Fitting function fkden
returns a simple list with the
following elements
call : |
optim call |
x : |
(jittered) data vector x |
kerncentres : |
actual kernel centres used x |
init : |
linit for lambda |
optim : |
complete optim output |
mle : |
vector of MLE of bandwidth |
cov : |
variance of MLE of bandwidth |
se : |
standard error of MLE of bandwidth |
nllh : |
minimum negative cross-validation log-likelihood |
n : |
total sample size |
lambda : |
MLE of lambda (kernel half-width) |
bw : |
MLE of bw (kernel standard deviations) |
kernel : |
kernel name |
Two important practical issues arise with MLE for the kernel bandwidth:
1) Cross-validation likelihood is needed for the KDE bandwidth parameter
as the usual likelihood degenerates, so that the MLE
This problem occassionally rears its ugly head for data which has been heavily rounded,
as even when using cross-validation the density can be non-zero even if the bandwidth is zero.
To overcome this issue an option to add a small jitter should be added to the data
(x
only) has been included in the fitting inputs, using the
jitter
function, to remove the ties. The default options red in the
jitter
are specified above, but the user can override these.
Notice the default scaling factor=0.1
, which is a tenth of the default value in the
jitter
function itself.
A warning message is given if the data appear to be rounded (i.e. more than 5 data rounding is the likely culprit. Only use the jittering when the MLE of the bandwidth is far too small.
2) For heavy tailed populations the bandwidth is positively biased, giving oversmoothing
(see example). The bias is due to the distance between the upper (or lower) order statistics not
necessarily decaying to zero as the sample size tends to infinity. Essentially, as the distance
between the two largest (or smallest) sample datapoints does not decay to zero, some smoothing between
them is required (i.e. bandwidth cannot be zero). One solution to this problem is to trim
the data at a suitable threshold to remove the problematic tail from the inference for the bandwidth,
using either the fkdengpd
function for a single heavy tail
or the fgkg
function
if both tails are heavy. See MacDonald et al (2013).
See Acknowledgments in
fnormgpd
, type help fnormgpd
. Based on code
by Anna MacDonald produced for MATLAB.
The kernel density estimator (KDE) with one of possible kernels is fitted to the entire dataset using maximum (cross-validation) likelihood estimation. The estimated bandwidth, variance and standard error are automatically output.
The alternate bandwidth definitions are discussed in the
kernels
, with the lambda
used here but
bw
also output. The bw
specification is the same as used in the
density
function.
The possible kernels are also defined in kernels
help
documentation with the "gaussian"
as the default choice.
Missing values (NA
and NaN
) are assumed to be invalid data so are ignored.
Cross-validation likelihood is used for kernel density component, obtained by
leaving each point out in turn and evaluating the KDE at the point left out:
Normally for likelihood estimation of the bandwidth the kernel centres and
the data where the likelihood is evaluated are the same. However, when using
KDE for extreme value mixture modelling the likelihood only those data in the
bulk of the distribution should contribute to the likelihood, but all the
data (including those beyond the threshold) should contribute to the density
estimate. The extracentres
option allows the use to specify extra
kernel centres used in estimating the density, but not evaluated in the
likelihood. Suppose the first nb
data are below the threshold, followed
by nu
exceedances of the threshold, so extracentres=NULL
.
The following functions are provided:
fkden
- maximum (cross-validation) likelihood fitting with all the above options;
lkden
- cross-validation log-likelihood;
nlkden
- negative cross-validation log-likelihood;
The log-likelihood functions are provided for wider usage, e.g. constructing profile likelihood functions.
The log-likelihood and negative log-likelihood are also provided for wider
usage, e.g. constructing your own extreme value
mixture models or profile likelihood functions. The parameter
lambda
must be specified in the negative log-likelihood
nlkden
.
Log-likelihood calculations are carried out in
lkden
, which takes bandwidths as inputs in
the same form as distribution functions. The negative log-likelihood is a
wrapper for lkden
, designed towards making
it useable for optimisation (e.g. lambda
given as first input).
Defaults values for the bandwidth linit
and lambda
are given in the fitting
fkden
and cross-validation likelihood functions
lkden
. The bandwidth linit
must be specified in
the negative log-likelihood function nlkden
.
Missing values (NA
and NaN
) are assumed to be invalid data so are ignored,
which is inconsistent with the evd
library which assumes the
missing values are below the threshold.
The function lkden
carries out the calculations
for the log-likelihood directly, which can be exponentiated to give actual
likelihood using (log=FALSE
).
The default optimisation algorithm is "BFGS", which requires a finite negative
log-likelihood function evaluation finitelik=TRUE
. For invalid
parameters, a zero likelihood is replaced with exp(-1e6)
. The "BFGS"
optimisation algorithms require finite values for likelihood, so any user
input for finitelik
will be overridden and set to finitelik=TRUE
if either of these optimisation methods is chosen.
It will display a warning for non-zero convergence result comes from
optim
function call or for common indicators of lack
of convergence (e.g. estimated bandwidth equal to initial value).
If the hessian is of reduced rank then the variance covariance (from inverse hessian)
and standard error of parameters cannot be calculated, then by default
std.err=TRUE
and the function will stop. If you want the parameter estimates
even if the hessian is of reduced rank (e.g. in a simulation study) then
set std.err=FALSE
.
http://en.wikipedia.org/wiki/Kernel_density_estimation
http://en.wikipedia.org/wiki/Cross-validation_(statistics)
Scarrott, C.J. and MacDonald, A. (2012). A review of extreme value threshold estimation and uncertainty quantification. REVSTAT - Statistical Journal 10(1), 33-59. Available from http://www.ine.pt/revstat/pdf/rs120102.pdf
Hu Y. and Scarrott, C.J. (2018). evmix: An R Package for Extreme Value Mixture Modeling, Threshold Estimation and Boundary Corrected Kernel Density Estimation. Journal of Statistical Software 84(5), 1-27. doi: 10.18637/jss.v084.i05.
Bowman, A.W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71(2), 353-360.
Duin, R.P.W. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers C25(11), 1175-1179.
MacDonald, A., Scarrott, C.J., Lee, D., Darlow, B., Reale, M. and Russell, G. (2011). A flexible extreme value mixture model. Computational Statistics and Data Analysis 55(6), 2137-2157.
MacDonald, A., C. J. Scarrott, and D. S. Lee (2011). Boundary correction, consistency and robustness of kernel densities using extreme value theory. Submitted. Available from: http://www.math.canterbury.ac.nz/~c.scarrott.
Wand, M. and Jones, M.C. (1995). Kernel Smoothing. Chapman && Hall.
kernels
, kfun
,
jitter
, density
and
bw.nrd0
Other kden kdengpd kdengpdcon bckden bckdengpd bckdengpdcon
fkden fkdengpd fkdengpdcon fbckden fbckdengpd fbckdengpdcon: bckdengpdcon
,
bckdengpd
, bckden
,
fbckden
, kdengpdcon
,
kdengpd
, kden
# NOT RUN {
set.seed(1)
par(mfrow = c(1, 1))
nk=50
x = rnorm(nk)
xx = seq(-5, 5, 0.01)
fit = fkden(x)
hist(x, nk/5, freq = FALSE, xlim = c(-5, 5), ylim = c(0,0.6))
rug(x)
for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = fit$lambda)*0.05)
lines(xx,dnorm(xx), col = "black")
lines(xx, dkden(xx, x, lambda = fit$lambda), lwd = 2, col = "red")
lines(density(x), lty = 2, lwd = 2, col = "green")
lines(density(x, bw = fit$bw), lwd = 2, lty = 2, col = "blue")
legend("topright", c("True Density", "KDE fitted evmix",
"KDE Using density, default bandwidth", "KDE Using density, c-v likelihood bandwidth"),
lty = c(1, 1, 2, 2), lwd = c(1, 2, 2, 2), col = c("black", "red", "green", "blue"))
par(mfrow = c(2, 1))
# bandwidth is biased towards oversmoothing for heavy tails
nk=100
x = rt(nk, df = 2)
xx = seq(-8, 8, 0.01)
fit = fkden(x)
hist(x, seq(floor(min(x)), ceiling(max(x)), 0.5), freq = FALSE, xlim = c(-8, 10))
rug(x)
for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = fit$lambda)*0.05)
lines(xx,dt(xx , df = 2), col = "black")
lines(xx, dkden(xx, x, lambda = fit$lambda), lwd = 2, col = "red")
legend("topright", c("True Density", "KDE fitted evmix, c-v likelihood bandwidth"),
lty = c(1, 1), lwd = c(1, 2), col = c("black", "red"))
# remove heavy tails from cv-likelihood evaluation, but still include them in KDE within likelihood
# often gives better bandwidth (see MacDonald et al (2011) for justification)
nk=100
x = rt(nk, df = 2)
xx = seq(-8, 8, 0.01)
fit2 = fkden(x[(x > -4) & (x < 4)], extracentres = x[(x <= -4) | (x >= 4)])
hist(x, seq(floor(min(x)), ceiling(max(x)), 0.5), freq = FALSE, xlim = c(-8, 10))
rug(x)
for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = fit2$lambda)*0.05)
lines(xx,dt(xx , df = 2), col = "black")
lines(xx, dkden(xx, x, lambda = fit2$lambda), lwd = 2, col = "red")
lines(xx, dkden(xx, x, lambda = fit$lambda), lwd = 2, col = "blue")
legend("topright", c("True Density", "KDE fitted evmix, tails removed",
"KDE fitted evmix, tails included"),
lty = c(1, 1, 1), lwd = c(1, 2, 2), col = c("black", "red", "blue"))
# }
# NOT RUN {
# }
Run the code above in your browser using DataLab