An industrial-strength implementation of maximum likelihood estimation (MLE) for the parameters of any distribution in the Generalized Kumaraswamy (GKw) family. This function incorporates multiple advanced numerical techniques including trust region methods, eigenvalue-based regularization, adaptive scaling, and sophisticated line search to ensure robust convergence even for challenging datasets or extreme parameter values.
nrgkw(
start = NULL,
data = as.numeric(c()),
family = "gkw",
tol = 1e-06,
max_iter = 100L,
verbose = FALSE,
optimization_method = "trust-region",
enforce_bounds = TRUE,
min_param_val = 1e-05,
max_param_val = 1e+05,
adaptive_scaling = TRUE,
use_stochastic_perturbation = TRUE,
get_num_hess = FALSE,
multi_start_attempts = 3L,
eigenvalue_hessian_reg = TRUE,
max_backtrack = 20L,
initial_trust_radius = 1
)
A list object of class gkw_fit
containing the following components:
A named numeric vector with the estimated parameters.
The maximized value of the log-likelihood function.
Number of iterations performed.
Logical flag indicating whether the algorithm converged successfully.
A matrix where rows represent iterations and columns represent parameter values.
A vector of log-likelihood values at each iteration.
The gradient vector at the final parameter estimates.
The analytical Hessian matrix at the final parameter estimates.
A named numeric vector of approximate standard errors for the parameters.
Akaike Information Criterion.
Bayesian Information Criterion.
AIC with small sample correction.
The sample size.
A character string indicating the termination status.
A named numeric vector of Z-statistics for parameter significance testing.
A named numeric vector of two-sided p-values corresponding to the Z-statistics.
A character vector of the parameter names.
The distribution family used.
The optimization method used.
The numerically approximated Hessian at the solution (if requested).
The condition number of the final Hessian, a measure of parameter identifiability.
The scaling factors used for parameters (if adaptive scaling was enabled).
start: A numeric vector containing initial values for the parameters. If NULL, automatic initialization is used based on the dataset characteristics. The length and order must correspond to the selected family (e.g., c(alpha, beta, gamma, delta, lambda) for "gkw"; c(alpha, beta) for "kw"; c(gamma, delta) for "beta").
data: A numeric vector containing the observed data. All values must be strictly between 0 and 1.
family: A character string specifying the distribution family. One of "gkw", "bkw", "kkw", "ekw", "mc", "kw", or "beta". Default: "gkw".
tol: Convergence tolerance. The algorithm stops when the Euclidean norm of the gradient is below this value, or when the relative changes in the parameters or in the negative log-likelihood fall below this threshold across consecutive iterations. Default: 1e-6.
max_iter: Maximum number of iterations allowed. Default: 100.
verbose: Logical; if TRUE, prints detailed progress information during optimization, including the iteration number, negative log-likelihood, gradient norm, and step adjustment details. Default: FALSE.
optimization_method: Character string specifying the optimization method: "trust-region" (default), "newton-raphson", or "hybrid" (starts with trust-region, then switches to newton-raphson near convergence).
enforce_bounds: Logical; if TRUE, parameter values are constrained to stay within [min_param_val, max_param_val] (and \(\delta \ge 0\)) during optimization. Default: TRUE.
min_param_val: Minimum allowed value for parameters constrained to be strictly positive (\(\alpha, \beta, \gamma, \lambda\)). Default: 1e-5.
max_param_val: Maximum allowed value for all parameters. Default: 1e5.
adaptive_scaling: Logical; if TRUE, parameters are automatically scaled to improve numerical stability. Default: TRUE.
use_stochastic_perturbation: Logical; if TRUE, applies random perturbations when optimization stalls. Default: TRUE.
get_num_hess: Logical; if TRUE, computes and returns a numerical approximation of the Hessian at the solution. Default: FALSE.
multi_start_attempts: Integer specifying the number of different starting points to try if the initial optimization fails to converge. Default: 3.
eigenvalue_hessian_reg: Logical; if TRUE, uses eigenvalue-based regularization for the Hessian matrix. Default: TRUE.
max_backtrack: Integer specifying the maximum number of backtracking steps in the line search. Default: 20.
initial_trust_radius: Initial radius for the trust region method. Default: 1.0.
Although this implementation is highly robust, fitting complex distributions can still be challenging. For best results:
Try multiple starting values if results seem suboptimal
Examine diagnostic information carefully, especially condition numbers and standard errors
Be cautious of parameter estimates at or very near boundaries
Consider model simplification if convergence is consistently problematic
For the full GKw model with 5 parameters, convergence may be sensitive to starting values
High condition numbers (>1e6) may indicate parameter redundancy or weak identifiability
Enhanced by Lopes, J. E.
This enhanced algorithm provides robust parameter estimation for the Generalized Kumaraswamy distribution and its subfamilies. The function implements several advanced numerical optimization techniques to maximize the likelihood function reliably even in difficult cases.
The Generalized Kumaraswamy (GKw) distribution, introduced by Carrasco, Ferrari, and Cordeiro (2010), is a flexible five-parameter continuous distribution defined on the standard unit interval (0,1). Its probability density function is given by:
$$f(x; \alpha, \beta, \gamma, \delta, \lambda) = \frac{\lambda\alpha\beta x^{\alpha-1}}{B(\gamma, \delta+1)}(1-x^{\alpha})^{\beta-1}[1-(1-x^{\alpha})^{\beta}]^{\gamma\lambda-1}\{1-[1-(1-x^{\alpha})^{\beta}]^{\lambda}\}^{\delta}$$
where \(\alpha, \beta, \gamma, \lambda > 0\) and \(\delta \geq 0\) are the model parameters, and \(B(\gamma, \delta+1)\) is the beta function.
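For concreteness, the density can be evaluated directly in R. The sketch below simply transcribes the formula above; it is illustrative only and is not the package's exported density function.

# Minimal R transcription of the GKw density; illustrative only.
dgkw_sketch <- function(x, alpha, beta, gamma, delta, lambda) {
  stopifnot(all(x > 0 & x < 1))
  B <- exp(lbeta(gamma, delta + 1))   # beta function B(gamma, delta + 1)
  u <- 1 - x^alpha
  v <- 1 - u^beta                     # 1 - (1 - x^alpha)^beta
  w <- 1 - v^lambda
  lambda * alpha * beta * x^(alpha - 1) / B *
    u^(beta - 1) * v^(gamma * lambda - 1) * w^delta
}
# Sanity check: with alpha = beta = lambda = 1 this reduces to a Beta(gamma, delta + 1) density
all.equal(dgkw_sketch(0.3, 1, 1, 2, 4, 1), dbeta(0.3, 2, 5))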
The GKw distribution encompasses several important special cases:
GKw (5 parameters): \(\alpha, \beta, \gamma, \delta, \lambda\)
BKw (4 parameters): \(\alpha, \beta, \gamma, \delta\) (with \(\lambda = 1\))
KKw (4 parameters): \(\alpha, \beta, \delta, \lambda\) (with \(\gamma = 1\))
EKw (3 parameters): \(\alpha, \beta, \lambda\) (with \(\gamma = 1, \delta = 0\))
Mc (3 parameters): \(\gamma, \delta, \lambda\) (with \(\alpha = 1, \beta = 1\))
Kw (2 parameters): \(\alpha, \beta\) (with \(\gamma = 1, \delta = 0, \lambda = 1\))
Beta (2 parameters): \(\gamma, \delta\) (with \(\alpha = 1, \beta = 1, \lambda = 1\))
The trust region approach restricts parameter updates to a region where the quadratic approximation of the objective function is trusted to be accurate. This algorithm implements the Levenberg-Marquardt variant, which solves the subproblem:
$$\min_p m_k(p) = -\nabla \ell(\theta_k)^T p + \frac{1}{2}p^T H_k p$$ $$\text{subject to } \|p\| \leq \Delta_k$$
where \(\nabla \ell(\theta_k)\) is the gradient, \(H_k\) is the Hessian, and \(\Delta_k\) is the trust region radius at iteration \(k\).
The Levenberg-Marquardt approach adds a regularization parameter \(\lambda\) to the Hessian, solving:
$$(H_k + \lambda I)p = -\nabla \ell(\theta_k)$$
The parameter \(\lambda\) controls the step size and direction:
When \(\lambda\) is large, the step approaches a scaled steepest descent direction.
When \(\lambda\) is small, the step approaches the Newton direction.
The algorithm dynamically adjusts \(\lambda\) based on the agreement between the quadratic model and the actual function:
$$\rho_k = \frac{f(\theta_k) - f(\theta_k + p_k)}{m_k(0) - m_k(p_k)}$$
The trust region radius is updated according to:
If \(\rho_k < 0.25\), reduce the radius: \(\Delta_{k+1} = 0.5\Delta_k\)
If \(\rho_k > 0.75\) and \(\|p_k\| \approx \Delta_k\), increase the radius: \(\Delta_{k+1} = 2\Delta_k\)
Otherwise, leave the radius unchanged: \(\Delta_{k+1} = \Delta_k\)
The step is accepted if \(\rho_k > \eta\) (typically \(\eta = 0.1\)).
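The update logic can be sketched in R as follows. Here negll, grad, and hess are assumed to be user-supplied functions returning the negative log-likelihood, its gradient, and its Hessian; none of these names (nor tr_step itself) are part of nrgkw()'s internal API.

# One schematic Levenberg-Marquardt trust-region update (illustrative only).
tr_step <- function(theta, negll, grad, hess, radius, eta = 0.1) {
  g <- grad(theta)
  H <- hess(theta)
  lambda <- 1e-3
  p <- NULL
  for (i in 1:50) {                      # increase lambda until the step fits the region
    p <- tryCatch(solve(H + lambda * diag(length(theta)), -g),
                  error = function(e) NULL)
    if (!is.null(p) && sqrt(sum(p^2)) <= radius) break
    lambda <- lambda * 10
  }
  pred   <- -(sum(g * p) + 0.5 * drop(t(p) %*% H %*% p))  # predicted decrease m_k(0) - m_k(p)
  actual <- negll(theta) - negll(theta + p)                # actual decrease
  rho <- actual / pred
  if (is.finite(rho) && rho < 0.25) {
    radius <- 0.5 * radius               # poor agreement: shrink the region
  } else if (is.finite(rho) && rho > 0.75 && sqrt(sum(p^2)) >= 0.9 * radius) {
    radius <- 2 * radius                 # good agreement and a near-full step: expand
  }
  if (is.finite(rho) && rho > eta) theta <- theta + p      # accept the step
  list(theta = theta, radius = radius, rho = rho)
}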
For the Newton-Raphson method to converge, the Hessian matrix must be positive definite. This algorithm uses eigendecomposition to create a positive definite approximation that preserves the Hessian's structure:
$$H = Q\Lambda Q^T$$
where \(Q\) contains the eigenvectors and \(\Lambda\) is a diagonal matrix of eigenvalues.
The regularized Hessian is constructed by:
$$\tilde{H} = Q\tilde{\Lambda}Q^T$$
where \(\tilde{\Lambda}\) contains modified eigenvalues:
$$\tilde{\lambda}_i = \max(\lambda_i, \epsilon)$$
with \(\epsilon\) being a small positive threshold (typically \(10^{-6}\)).
This approach is superior to diagonal loading (\(H + \lambda I\)) as it:
Preserves the eigenvector structure of the original Hessian
Minimally modifies the eigenvalues needed to ensure positive definiteness
Better maintains the directional information in the Hessian
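A minimal R sketch of this construction (illustrative names, not the internal routine):

# Eigenvalue-based regularization: floor the eigenvalues, keep the eigenvectors.
regularize_hessian <- function(H, eps = 1e-6) {
  ed  <- eigen((H + t(H)) / 2, symmetric = TRUE)  # symmetrize, then H = Q Lambda Q^T
  lam <- pmax(ed$values, eps)                     # modified eigenvalues
  ed$vectors %*% diag(lam, nrow = length(lam)) %*% t(ed$vectors)
}
# Example: an indefinite 2 x 2 matrix becomes positive definite
H <- matrix(c(2, 3, 3, 1), 2, 2)
eigen(regularize_hessian(H), symmetric = TRUE)$values   # all eigenvalues >= eps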
When parameters have widely different magnitudes, optimization can become numerically unstable. The adaptive scaling system transforms the parameter space to improve conditioning:
$$\theta_i^{scaled} = s_i \theta_i$$
where scaling factors \(s_i\) are determined by:
For large parameters (\(|\theta_i| > 100\)): \(s_i = 100/|\theta_i|\)
For small parameters (\(0 < |\theta_i| < 0.01\)): \(s_i = 0.01/|\theta_i|\)
Otherwise: \(s_i = 1\)
The optimization is performed in the scaled space, with appropriate transformations for the gradient and Hessian:
$$\nabla \ell(\theta^{scaled})_i = \frac{1}{s_i}\nabla \ell(\theta)_i$$ $$H(\theta^{scaled})_{ij} = \frac{1}{s_i s_j}H(\theta)_{ij}$$
The final results are back-transformed to the original parameter space before being returned.
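The scaling rules above can be sketched in a few lines of R (illustrative only):

# Compute scaling factors s_i following the rules above.
scale_factors <- function(theta) {
  s <- rep(1, length(theta))
  big   <- abs(theta) > 100
  small <- abs(theta) > 0 & abs(theta) < 0.01
  s[big]   <- 100  / abs(theta[big])
  s[small] <- 0.01 / abs(theta[small])
  s
}
theta <- c(alpha = 2500, beta = 0.004, gamma = 1.2)
s <- scale_factors(theta)
theta_scaled <- s * theta   # optimization proceeds in this space
# The gradient transforms as g / s and the Hessian as H / (s %o% s),
# matching the formulas above.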
The line search procedure ensures sufficient decrease in the objective function when taking a step in the search direction. The implementation uses Wolfe conditions which include both:
Sufficient decrease (Armijo) condition: $$f(\theta_k + \alpha p_k) \leq f(\theta_k) + c_1 \alpha \nabla f(\theta_k)^T p_k$$
Curvature condition: $$|\nabla f(\theta_k + \alpha p_k)^T p_k| \leq c_2 |\nabla f(\theta_k)^T p_k|$$
where \(0 < c_1 < c_2 < 1\), typically \(c_1 = 10^{-4}\) and \(c_2 = 0.9\).
The step length \(\alpha\) is determined using polynomial interpolation:
First iteration: quadratic interpolation
Subsequent iterations: cubic interpolation using function and derivative values
The cubic polynomial model has the form: $$a\alpha^3 + b\alpha^2 + c\alpha + d$$
The algorithm computes coefficients from values at two points, then finds the minimizer of this polynomial subject to bounds to ensure convergence.
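A deliberately simplified sketch, enforcing only the Armijo condition with plain step halving (the implementation described here additionally checks the curvature condition and chooses the step by quadratic/cubic interpolation); f and g are assumed to return the objective and its gradient:

# Backtracking line search with the sufficient-decrease (Armijo) test only.
backtrack <- function(theta, p, f, g, c1 = 1e-4, max_backtrack = 20) {
  alpha <- 1
  f0    <- f(theta)
  slope <- sum(g(theta) * p)      # directional derivative; negative for a descent direction
  for (i in seq_len(max_backtrack)) {
    if (f(theta + alpha * p) <= f0 + c1 * alpha * slope) return(alpha)
    alpha <- alpha / 2
  }
  alpha                           # return the smallest trial step if none was accepted
}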
When analytical derivatives are unreliable, the algorithm uses numerical differentiation with adaptive step sizes based on parameter magnitudes:
$$h_i = \max(h_{min}, \min(h_{base}, h_{base} \cdot |\theta_i|))$$
where \(h_{min}\) is a minimum step size (typically \(10^{-8}\)), \(h_{base}\) is a base step size (typically \(10^{-5}\)), and \(\theta_i\) is the parameter value.
For computing diagonal Hessian elements, the central difference formula is used:
$$\frac{\partial^2 f}{\partial \theta_i^2} \approx \frac{f(\theta + h_i e_i) - 2f(\theta) + f(\theta - h_i e_i)}{h_i^2}$$
For mixed partial derivatives:
$$\frac{\partial^2 f}{\partial \theta_i \partial \theta_j} \approx \frac{f(\theta + h_i e_i + h_j e_j) - f(\theta + h_i e_i - h_j e_j) - f(\theta - h_i e_i + h_j e_j) + f(\theta - h_i e_i - h_j e_j)}{4h_i h_j}$$
The algorithm validates numerical differentiation by comparing with existing gradients and adaptively adjusts step sizes when discrepancies are detected.
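The central-difference Hessian with these adaptive step sizes can be sketched as follows (f is an assumed objective function; this mirrors the formulas above but is not the internal code):

# Numerical Hessian by central differences with per-parameter step sizes.
num_hessian <- function(f, theta, h_base = 1e-5, h_min = 1e-8) {
  k <- length(theta)
  h <- pmax(h_min, pmin(h_base, h_base * abs(theta)))
  H <- matrix(0, k, k)
  E <- diag(k)                    # columns are the unit vectors e_i
  for (i in 1:k) {
    for (j in i:k) {
      if (i == j) {
        H[i, i] <- (f(theta + h[i] * E[, i]) - 2 * f(theta) +
                      f(theta - h[i] * E[, i])) / h[i]^2
      } else {
        H[i, j] <- H[j, i] <-
          (f(theta + h[i] * E[, i] + h[j] * E[, j]) -
             f(theta + h[i] * E[, i] - h[j] * E[, j]) -
             f(theta - h[i] * E[, i] + h[j] * E[, j]) +
             f(theta - h[i] * E[, i] - h[j] * E[, j])) / (4 * h[i] * h[j])
      }
    }
  }
  H
}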
To escape flat regions or local minima, the algorithm implements controlled stochastic perturbation when progress stalls (detected by monitoring gradient norm changes):
$$\theta_i^{new} = \theta_i + \Delta\theta_i$$
where the perturbation \(\Delta\theta_i\) combines:
A directed component opposite to the gradient: \(-\text{sign}(\nabla \ell_i) \cdot 0.05 \cdot |\theta_i|\)
A random noise component: \(U(-0.05|\theta_i|, 0.05|\theta_i|)\)
The perturbation is applied when:
The relative change in gradient norm is below a threshold for several consecutive iterations
The algorithm appears to be stuck in a non-optimal region
The perturbation is accepted only if it improves the objective function value.
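In R, such a perturbation step could look like the following sketch (grad and negll are assumed user-supplied functions; the 0.05 factors follow the description above):

# Directed-plus-random perturbation, accepted only if it improves the objective.
perturb <- function(theta, grad, negll) {
  g         <- grad(theta)
  directed  <- -sign(g) * 0.05 * abs(theta)
  noise     <- runif(length(theta), -0.05 * abs(theta), 0.05 * abs(theta))
  candidate <- theta + directed + noise
  if (negll(candidate) < negll(theta)) candidate else theta
}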
For particularly challenging optimization landscapes, the algorithm can employ multiple starting points:
Initial values are generated using moment-based estimation and domain knowledge about each distribution family
Each initial point is randomly perturbed to explore different regions of the parameter space
Independent optimization runs are performed from each starting point
The best result (based on likelihood value and convergence status) is returned
This approach increases the probability of finding the global optimum or a high-quality local optimum, particularly for complex models with many parameters.
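nrgkw() performs this internally (see multi_start_attempts), but the idea can be sketched as a manual loop over jittered starting values; multi_start_fit and the jitter scale below are illustrative, not part of the package:

# Manual multi-start wrapper around nrgkw(); keeps the fit with the best log-likelihood.
multi_start_fit <- function(data, family, base_start, n_starts = 3) {
  best <- NULL
  for (i in seq_len(n_starts)) {
    start_i <- base_start * exp(runif(length(base_start), -0.3, 0.3))  # jittered start
    fit <- tryCatch(nrgkw(start_i, data, family = family),
                    error = function(e) NULL)
    if (!is.null(fit) && (is.null(best) || fit$loglik > best$loglik)) best <- fit
  }
  best
}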
Intelligent starting values are critical for convergence in complex models. The algorithm uses data-driven initialization based on:
Method of moments estimators for beta parameters: $$\alpha + \beta = \frac{\bar{x}(1-\bar{x})}{s^2} - 1$$ $$\alpha = (\alpha + \beta)\bar{x}$$
Transformations to Kumaraswamy parameters: $$a_{Kw} = \sqrt{\alpha_{Beta}}$$ $$b_{Kw} = \sqrt{\beta_{Beta}}$$
Adjustments based on data skewness (detected via mean relative to 0.5)
Corrections based on data dispersion (range relative to (0,1) interval)
The transformations between beta and Kumaraswamy parameters leverage the similarities between these distributions while accounting for their parametric differences.
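A sketch of this initialization for the Kumaraswamy case (the 0.1 floor is an added assumption to keep the starting values strictly positive):

# Method-of-moments Beta fit, then the square-root mapping to Kumaraswamy.
init_kw <- function(x) {
  m  <- mean(x)
  v  <- var(x)
  ab <- m * (1 - m) / v - 1            # alpha + beta of the Beta fit
  alpha_beta <- max(ab * m, 0.1)       # Beta alpha, floored
  beta_beta  <- max(ab * (1 - m), 0.1) # Beta beta, floored
  c(alpha = sqrt(alpha_beta), beta = sqrt(beta_beta))
}
set.seed(123)
init_kw(rbeta(200, 2, 5))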
The algorithm can dynamically switch between trust region and Newton-Raphson methods based on optimization progress:
Early iterations: trust region method for stability and global convergence properties
Later iterations (when close to optimum): Newton-Raphson with line search for quadratic convergence rate
The switching criteria are based on iteration count and gradient norm, with additional logic to handle cases where one method consistently fails.
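Schematically, the switch reduces to a predicate on the iteration count and gradient norm; the thresholds below are placeholders, not the values used internally:

# Decide whether to hand over from trust region to Newton-Raphson.
use_newton <- function(iter, grad_norm, min_iter = 5, g_tol = 1e-3) {
  iter >= min_iter && grad_norm < g_tol
}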
Carrasco, J. M. F., Ferrari, S. L. P., & Cordeiro, G. M. (2010). A new generalized Kumaraswamy distribution. arXiv preprint arXiv:1004.0911.
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.
Conn, A. R., Gould, N. I. M., & Toint, P. L. (2000). Trust Region Methods. MPS-SIAM Series on Optimization.
Kumaraswamy, P. (1980). A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2), 79-88.
# \donttest{
# Generate sample data from a Beta(2,5) distribution for testing
set.seed(123)
sample_data <- rbeta_(200, 2, 5)
# Automatic initialization (recommended for beginners)
result_auto <- nrgkw(NULL, sample_data, family = "beta", verbose = FALSE)
print(result_auto$parameters)
print(result_auto$loglik)
# Compare different optimization methods
methods <- c("trust-region", "newton-raphson", "hybrid")
results <- list()
for (method in methods) {
  results[[method]] <- nrgkw(NULL, sample_data, family = "beta",
                             optimization_method = method)
  cat(sprintf("Method: %s, AIC: %.4f\n", method, results[[method]]$aic))
}
# Fit the full GKw model with diagnostic information
gkw_result <- nrgkw(NULL, sample_data, family = "gkw",
                    verbose = FALSE, get_num_hess = TRUE)
# Examine parameter identifiability through condition number
cat(sprintf("Condition number: %.2e\n", gkw_result$condition_number))
print(gkw_result)
# Compare with simpler models using information criteria
cat("Information criteria comparison:\n")
cat(sprintf("GKw: AIC=%.4f, BIC=%.4f\n", gkw_result$aic, gkw_result$bic))
cat(sprintf("Beta: AIC=%.4f, BIC=%.4f\n",
results[["trust-region"]]$aic, results[["trust-region"]]$bic))
# }