An industrial-strength implementation of maximum likelihood estimation (MLE) for the parameters of any distribution in the Generalized Kumaraswamy (GKw) family. This function incorporates multiple advanced numerical techniques including trust region methods, eigenvalue-based regularization, adaptive scaling, and sophisticated line search to ensure robust convergence even for challenging datasets or extreme parameter values.
nrgkw(
start = NULL,
data = as.numeric(c()),
family = "gkw",
tol = 1e-06,
max_iter = 100L,
verbose = FALSE,
optimization_method = "trust-region",
enforce_bounds = TRUE,
min_param_val = 1e-05,
max_param_val = 1e+05,
adaptive_scaling = TRUE,
use_stochastic_perturbation = TRUE,
get_num_hess = FALSE,
multi_start_attempts = 3L,
eigenvalue_hessian_reg = TRUE,
max_backtrack = 20L,
initial_trust_radius = 1
)
A list object of class gkw_fit
containing the following components:
A named numeric vector with the estimated parameters.
The maximized value of the log-likelihood function.
Number of iterations performed.
Logical flag indicating whether the algorithm converged successfully.
A matrix where rows represent iterations and columns represent parameter values.
A vector of log-likelihood values at each iteration.
The gradient vector at the final parameter estimates.
The analytical Hessian matrix at the final parameter estimates.
A named numeric vector of approximate standard errors for the parameters.
Akaike Information Criterion.
Bayesian Information Criterion.
AIC with small sample correction.
The sample size.
A character string indicating the termination status.
A named numeric vector of Z-statistics for parameter significance testing.
A named numeric vector of two-sided p-values corresponding to the Z-statistics.
A character vector of the parameter names.
The distribution family used.
The optimization method used.
The numerically approximated Hessian at the solution (if requested).
The condition number of the final Hessian, a measure of parameter identifiability.
The scaling factors used for parameters (if adaptive scaling was enabled).
start: A numeric vector containing initial values for the parameters. If NULL, automatic initialization is used based on the dataset characteristics. The length and order must correspond to the selected family (e.g., c(alpha, beta, gamma, delta, lambda) for "gkw"; c(alpha, beta) for "kw"; c(gamma, delta) for "beta").
data: A numeric vector containing the observed data. All values must be strictly between 0 and 1.
family: A character string specifying the distribution family. One of "gkw", "bkw", "kkw", "ekw", "mc", "kw", or "beta". Default: "gkw".
tol: Convergence tolerance. The algorithm stops when the Euclidean norm of the gradient is below this value, or when the relative changes in the parameters or in the negative log-likelihood fall below this threshold across consecutive iterations. Default: 1e-6.
max_iter: Maximum number of iterations allowed. Default: 100.
verbose: Logical; if TRUE, prints detailed progress information during optimization, including the iteration number, negative log-likelihood, gradient norm, and step adjustment details. Default: FALSE.
optimization_method: Character string specifying the optimization method: "trust-region" (default), "newton-raphson", or "hybrid" (starts with trust-region, then switches to newton-raphson near convergence).
enforce_bounds: Logical; if TRUE, parameter values are constrained to stay within [min_param_val, max_param_val] (and \(\delta \ge 0\)) during optimization. Default: TRUE.
min_param_val: Minimum allowed value for parameters constrained to be strictly positive (\(\alpha, \beta, \gamma, \lambda\)). Default: 1e-5.
max_param_val: Maximum allowed value for all parameters. Default: 1e5.
adaptive_scaling: Logical; if TRUE, parameters are automatically scaled to improve numerical stability. Default: TRUE.
use_stochastic_perturbation: Logical; if TRUE, applies random perturbations when optimization stalls. Default: TRUE.
get_num_hess: Logical; if TRUE, computes and returns a numerical approximation of the Hessian at the solution. Default: FALSE.
multi_start_attempts: Integer specifying the number of different starting points to try if the initial optimization fails to converge. Default: 3.
eigenvalue_hessian_reg: Logical; if TRUE, uses eigenvalue-based regularization for the Hessian matrix. Default: TRUE.
max_backtrack: Integer specifying the maximum number of backtracking steps in the line search. Default: 20.
initial_trust_radius: Initial radius for the trust region method. Default: 1.0.
Although this implementation is highly robust, fitting complex distributions can still be challenging. For best results:
Try multiple starting values if results seem suboptimal
Examine diagnostic information carefully, especially condition numbers and standard errors
Be cautious of parameter estimates at or very near boundaries
Consider model simplification if convergence is consistently problematic
For the full GKw model with 5 parameters, convergence may be sensitive to starting values
High condition numbers (>1e6) may indicate parameter redundancy or weak identifiability
Enhanced by Lopes, J. E.
This enhanced algorithm provides robust parameter estimation for the Generalized Kumaraswamy distribution and its subfamilies. The function implements several advanced numerical optimization techniques to maximize the likelihood function reliably even in difficult cases.
The Generalized Kumaraswamy (GKw) distribution, introduced by Carrasco, Ferrari, and Cordeiro (2010), is a flexible five-parameter continuous distribution defined on the standard unit interval (0,1). Its probability density function is given by:
$$f(x; \alpha, \beta, \gamma, \delta, \lambda) = \frac{\lambda\alpha\beta x^{\alpha-1}}{B(\gamma, \delta+1)}(1-x^{\alpha})^{\beta-1}[1-(1-x^{\alpha})^{\beta}]^{\gamma\lambda-1}\{1-[1-(1-x^{\alpha})^{\beta}]^{\lambda}\}^{\delta}$$
where \(\alpha, \beta, \gamma, \lambda > 0\) and \(\delta \geq 0\) are the model parameters, and \(B(\gamma, \delta+1)\) is the beta function.
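For concreteness, the density can be evaluated directly in R. The sketch below simply transcribes the formula above; it is illustrative only and is not the package's exported density function.

# Minimal R transcription of the GKw density; illustrative only.
dgkw_sketch <- function(x, alpha, beta, gamma, delta, lambda) {
  stopifnot(all(x > 0 & x < 1))
  B <- exp(lbeta(gamma, delta + 1))   # beta function B(gamma, delta + 1)
  u <- 1 - x^alpha
  v <- 1 - u^beta                     # 1 - (1 - x^alpha)^beta
  w <- 1 - v^lambda
  lambda * alpha * beta * x^(alpha - 1) / B *
    u^(beta - 1) * v^(gamma * lambda - 1) * w^delta
}
# Sanity check: with alpha = beta = lambda = 1 this reduces to a Beta(gamma, delta + 1) density
all.equal(dgkw_sketch(0.3, 1, 1, 2, 4, 1), dbeta(0.3, 2, 5))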
The GKw distribution encompasses several important special cases:
GKw (5 parameters): \(\alpha, \beta, \gamma, \delta, \lambda\)
BKw (4 parameters): \(\alpha, \beta, \gamma, \delta\) (with \(\lambda = 1\))
KKw (4 parameters): \(\alpha, \beta, \delta, \lambda\) (with \(\gamma = 1\))
EKw (3 parameters): \(\alpha, \beta, \lambda\) (with \(\gamma = 1, \delta = 0\))
Mc (3 parameters): \(\gamma, \delta, \lambda\) (with \(\alpha = 1, \beta = 1\))
Kw (2 parameters): \(\alpha, \beta\) (with \(\gamma = 1, \delta = 0, \lambda = 1\))
Beta (2 parameters): \(\gamma, \delta\) (with \(\alpha = 1, \beta = 1, \lambda = 1\))
The trust region approach restricts parameter updates to a region where the quadratic approximation of the objective function is trusted to be accurate. This algorithm implements the Levenberg-Marquardt variant, which solves the subproblem:
$$\min_p m_k(p) = -\nabla \ell(\theta_k)^T p + \frac{1}{2}p^T H_k p$$ $$\text{subject to } \|p\| \leq \Delta_k$$
where \(\nabla \ell(\theta_k)\) is the gradient, \(H_k\) is the Hessian, and \(\Delta_k\) is the trust region radius at iteration \(k\).
The Levenberg-Marquardt approach adds a regularization parameter \(\lambda\) to the Hessian, solving:
$$(H_k + \lambda I)p = -\nabla \ell(\theta_k)$$
The parameter \(\lambda\) controls the step size and direction:
When \(\lambda\) is large, the step approaches a scaled steepest descent direction.
When \(\lambda\) is small, the step approaches the Newton direction.
The algorithm dynamically adjusts \(\lambda\) based on the agreement between the quadratic model and the actual function:
$$\rho_k = \frac{f(\theta_k) - f(\theta_k + p_k)}{m_k(0) - m_k(p_k)}$$
The trust region radius is updated according to:
If \(\rho_k < 0.25\), reduce the radius: \(\Delta_{k+1} = 0.5\Delta_k\)
If \(\rho_k > 0.75\) and \(\|p_k\| \approx \Delta_k\), increase the radius: \(\Delta_{k+1} = 2\Delta_k\)
Otherwise, leave the radius unchanged: \(\Delta_{k+1} = \Delta_k\)
The step is accepted if \(\rho_k > \eta\) (typically \(\eta = 0.1\)).
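The update logic can be sketched in R as follows. Here negll, grad, and hess are assumed to be user-supplied functions returning the negative log-likelihood, its gradient, and its Hessian; none of these names (nor tr_step itself) are part of nrgkw()'s internal API.

# One schematic Levenberg-Marquardt trust-region update (illustrative only).
tr_step <- function(theta, negll, grad, hess, radius, eta = 0.1) {
  g <- grad(theta)
  H <- hess(theta)
  lambda <- 1e-3
  p <- NULL
  for (i in 1:50) {                      # increase lambda until the step fits the region
    p <- tryCatch(solve(H + lambda * diag(length(theta)), -g),
                  error = function(e) NULL)
    if (!is.null(p) && sqrt(sum(p^2)) <= radius) break
    lambda <- lambda * 10
  }
  pred   <- -(sum(g * p) + 0.5 * drop(t(p) %*% H %*% p))  # predicted decrease m_k(0) - m_k(p)
  actual <- negll(theta) - negll(theta + p)                # actual decrease
  rho <- actual / pred
  if (is.finite(rho) && rho < 0.25) {
    radius <- 0.5 * radius               # poor agreement: shrink the region
  } else if (is.finite(rho) && rho > 0.75 && sqrt(sum(p^2)) >= 0.9 * radius) {
    radius <- 2 * radius                 # good agreement and a near-full step: expand
  }
  if (is.finite(rho) && rho > eta) theta <- theta + p      # accept the step
  list(theta = theta, radius = radius, rho = rho)
}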
For the Newton-Raphson method to converge, the Hessian matrix must be positive definite. This algorithm uses eigendecomposition to create a positive definite approximation that preserves the Hessian's structure:
$$H = Q\Lambda Q^T$$
where \(Q\) contains the eigenvectors and \(\Lambda\) is a diagonal matrix of eigenvalues.
The regularized Hessian is constructed by:
$$\tilde{H} = Q\tilde{\Lambda}Q^T$$
where \(\tilde{\Lambda}\) contains modified eigenvalues:
$$\tilde{\lambda}_i = \max(\lambda_i, \epsilon)$$
with \(\epsilon\) being a small positive threshold (typically \(10^{-6}\)).
This approach is superior to diagonal loading (\(H + \lambda I\)) as it:
Preserves the eigenvector structure of the original Hessian
Minimally modifies the eigenvalues needed to ensure positive definiteness
Better maintains the directional information in the Hessian
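A minimal R sketch of this construction (illustrative names, not the internal routine):

# Eigenvalue-based regularization: floor the eigenvalues, keep the eigenvectors.
regularize_hessian <- function(H, eps = 1e-6) {
  ed  <- eigen((H + t(H)) / 2, symmetric = TRUE)  # symmetrize, then H = Q Lambda Q^T
  lam <- pmax(ed$values, eps)                     # modified eigenvalues
  ed$vectors %*% diag(lam, nrow = length(lam)) %*% t(ed$vectors)
}
# Example: an indefinite 2 x 2 matrix becomes positive definite
H <- matrix(c(2, 3, 3, 1), 2, 2)
eigen(regularize_hessian(H), symmetric = TRUE)$values   # all eigenvalues >= eps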
When parameters have widely different magnitudes, optimization can become numerically unstable. The adaptive scaling system transforms the parameter space to improve conditioning:
$$\theta_i^{scaled} = s_i \theta_i$$
where scaling factors \(s_i\) are determined by:
For large parameters (\(|\theta_i| > 100\)): \(s_i = 100/|\theta_i|\)
For small parameters (\(0 < |\theta_i| < 0.01\)): \(s_i = 0.01/|\theta_i|\)
Otherwise: \(s_i = 1\)
The optimization is performed in the scaled space, with appropriate transformations for the gradient and Hessian:
$$\nabla \ell(\theta^{scaled})_i = \frac{1}{s_i}\nabla \ell(\theta)_i$$ $$H(\theta^{scaled})_{ij} = \frac{1}{s_i s_j}H(\theta)_{ij}$$
The final results are back-transformed to the original parameter space before being returned.
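The scaling rules above can be sketched in a few lines of R (illustrative only):

# Compute scaling factors s_i following the rules above.
scale_factors <- function(theta) {
  s <- rep(1, length(theta))
  big   <- abs(theta) > 100
  small <- abs(theta) > 0 & abs(theta) < 0.01
  s[big]   <- 100  / abs(theta[big])
  s[small] <- 0.01 / abs(theta[small])
  s
}
theta <- c(alpha = 2500, beta = 0.004, gamma = 1.2)
s <- scale_factors(theta)
theta_scaled <- s * theta   # optimization proceeds in this space
# The gradient transforms as g / s and the Hessian as H / (s %o% s),
# matching the formulas above.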
The line search procedure ensures sufficient decrease in the objective function when taking a step in the search direction. The implementation uses Wolfe conditions which include both:
Sufficient decrease (Armijo) condition: $$f(\theta_k + \alpha p_k) \leq f(\theta_k) + c_1 \alpha \nabla f(\theta_k)^T p_k$$
Curvature condition: $$|\nabla f(\theta_k + \alpha p_k)^T p_k| \leq c_2 |\nabla f(\theta_k)^T p_k|$$
where \(0 < c_1 < c_2 < 1\), typically \(c_1 = 10^{-4}\) and \(c_2 = 0.9\).
The step length \(\alpha\) is determined using polynomial interpolation:
First iteration: quadratic interpolation
Subsequent iterations: cubic interpolation using function and derivative values
The cubic polynomial model has the form: $$a\alpha^3 + b\alpha^2 + c\alpha + d$$
The algorithm computes coefficients from values at two points, then finds the minimizer of this polynomial subject to bounds to ensure convergence.
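A deliberately simplified sketch, enforcing only the Armijo condition with plain step halving (the implementation described here additionally checks the curvature condition and chooses the step by quadratic/cubic interpolation); f and g are assumed to return the objective and its gradient:

# Backtracking line search with the sufficient-decrease (Armijo) test only.
backtrack <- function(theta, p, f, g, c1 = 1e-4, max_backtrack = 20) {
  alpha <- 1
  f0    <- f(theta)
  slope <- sum(g(theta) * p)      # directional derivative; negative for a descent direction
  for (i in seq_len(max_backtrack)) {
    if (f(theta + alpha * p) <= f0 + c1 * alpha * slope) return(alpha)
    alpha <- alpha / 2
  }
  alpha                           # return the smallest trial step if none was accepted
}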
When analytical derivatives are unreliable, the algorithm uses numerical differentiation with adaptive step sizes based on parameter magnitudes:
$$h_i = \max(h_{min}, \min(h_{base}, h_{base} \cdot |\theta_i|))$$
where \(h_{min}\) is a minimum step size (typically \(10^{-8}\)), \(h_{base}\) is a base step size (typically \(10^{-5}\)), and \(\theta_i\) is the parameter value.
For computing diagonal Hessian elements, the central difference formula is used:
$$\frac{\partial^2 f}{\partial \theta_i^2} \approx \frac{f(\theta + h_i e_i) - 2f(\theta) + f(\theta - h_i e_i)}{h_i^2}$$
For mixed partial derivatives:
$$\frac{\partial^2 f}{\partial \theta_i \partial \theta_j} \approx \frac{f(\theta + h_i e_i + h_j e_j) - f(\theta + h_i e_i - h_j e_j) - f(\theta - h_i e_i + h_j e_j) + f(\theta - h_i e_i - h_j e_j)}{4h_i h_j}$$
The algorithm validates numerical differentiation by comparing with existing gradients and adaptively adjusts step sizes when discrepancies are detected.
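The central-difference Hessian with these adaptive step sizes can be sketched as follows (f is an assumed objective function; this mirrors the formulas above but is not the internal code):

# Numerical Hessian by central differences with per-parameter step sizes.
num_hessian <- function(f, theta, h_base = 1e-5, h_min = 1e-8) {
  k <- length(theta)
  h <- pmax(h_min, pmin(h_base, h_base * abs(theta)))
  H <- matrix(0, k, k)
  E <- diag(k)                    # columns are the unit vectors e_i
  for (i in 1:k) {
    for (j in i:k) {
      if (i == j) {
        H[i, i] <- (f(theta + h[i] * E[, i]) - 2 * f(theta) +
                      f(theta - h[i] * E[, i])) / h[i]^2
      } else {
        H[i, j] <- H[j, i] <-
          (f(theta + h[i] * E[, i] + h[j] * E[, j]) -
             f(theta + h[i] * E[, i] - h[j] * E[, j]) -
             f(theta - h[i] * E[, i] + h[j] * E[, j]) +
             f(theta - h[i] * E[, i] - h[j] * E[, j])) / (4 * h[i] * h[j])
      }
    }
  }
  H
}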
To escape flat regions or local minima, the algorithm implements controlled stochastic perturbation when progress stalls (detected by monitoring gradient norm changes):
$$\theta_i^{new} = \theta_i + \Delta\theta_i$$
where the perturbation \(\Delta\theta_i\) combines:
A directed component opposite to the gradient: \(-\text{sign}(\nabla \ell_i) \cdot 0.05 \cdot |\theta_i|\)
A random noise component: \(U(-0.05|\theta_i|, 0.05|\theta_i|)\)
The perturbation is applied when:
The relative change in gradient norm is below a threshold for several consecutive iterations
The algorithm appears to be stuck in a non-optimal region
The perturbation is accepted only if it improves the objective function value.
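In R, such a perturbation step could look like the following sketch (grad and negll are assumed user-supplied functions; the 0.05 factors follow the description above):

# Directed-plus-random perturbation, accepted only if it improves the objective.
perturb <- function(theta, grad, negll) {
  g         <- grad(theta)
  directed  <- -sign(g) * 0.05 * abs(theta)
  noise     <- runif(length(theta), -0.05 * abs(theta), 0.05 * abs(theta))
  candidate <- theta + directed + noise
  if (negll(candidate) < negll(theta)) candidate else theta
}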
For particularly challenging optimization landscapes, the algorithm can employ multiple starting points:
Initial values are generated using moment-based estimation and domain knowledge about each distribution family
Each initial point is randomly perturbed to explore different regions of the parameter space
Independent optimization runs are performed from each starting point
The best result (based on likelihood value and convergence status) is returned
This approach increases the probability of finding the global optimum or a high-quality local optimum, particularly for complex models with many parameters.
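nrgkw() performs this internally (see multi_start_attempts), but the idea can be sketched as a manual loop over jittered starting values; multi_start_fit and the jitter scale below are illustrative, not part of the package:

# Manual multi-start wrapper around nrgkw(); keeps the fit with the best log-likelihood.
multi_start_fit <- function(data, family, base_start, n_starts = 3) {
  best <- NULL
  for (i in seq_len(n_starts)) {
    start_i <- base_start * exp(runif(length(base_start), -0.3, 0.3))  # jittered start
    fit <- tryCatch(nrgkw(start_i, data, family = family),
                    error = function(e) NULL)
    if (!is.null(fit) && (is.null(best) || fit$loglik > best$loglik)) best <- fit
  }
  best
}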
Intelligent starting values are critical for convergence in complex models. The algorithm uses data-driven initialization based on:
Method of moments estimators for beta parameters: $$\alpha + \beta = \frac{\bar{x}(1-\bar{x})}{s^2} - 1$$ $$\alpha = (\alpha + \beta)\bar{x}$$
Transformations to Kumaraswamy parameters: $$a_{Kw} = \sqrt{\alpha_{Beta}}$$ $$b_{Kw} = \sqrt{\beta_{Beta}}$$
Adjustments based on data skewness (detected via mean relative to 0.5)
Corrections based on data dispersion (range relative to (0,1) interval)
The transformations between beta and Kumaraswamy parameters leverage the similarities between these distributions while accounting for their parametric differences.
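A sketch of this initialization for the Kumaraswamy case (the 0.1 floor is an added assumption to keep the starting values strictly positive):

# Method-of-moments Beta fit, then the square-root mapping to Kumaraswamy.
init_kw <- function(x) {
  m  <- mean(x)
  v  <- var(x)
  ab <- m * (1 - m) / v - 1            # alpha + beta of the Beta fit
  alpha_beta <- max(ab * m, 0.1)       # Beta alpha, floored
  beta_beta  <- max(ab * (1 - m), 0.1) # Beta beta, floored
  c(alpha = sqrt(alpha_beta), beta = sqrt(beta_beta))
}
set.seed(123)
init_kw(rbeta(200, 2, 5))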
The algorithm can dynamically switch between trust region and Newton-Raphson methods based on optimization progress:
Early iterations: trust region method for stability and global convergence properties
Later iterations (when close to optimum): Newton-Raphson with line search for quadratic convergence rate
The switching criteria are based on iteration count and gradient norm, with additional logic to handle cases where one method consistently fails.
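Schematically, the switch reduces to a predicate on the iteration count and gradient norm; the thresholds below are placeholders, not the values used internally:

# Decide whether to hand over from trust region to Newton-Raphson.
use_newton <- function(iter, grad_norm, min_iter = 5, g_tol = 1e-3) {
  iter >= min_iter && grad_norm < g_tol
}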
Carrasco, J. M. F., Ferrari, S. L. P., & Cordeiro, G. M. (2010). A new generalized Kumaraswamy distribution. arXiv preprint arXiv:1004.0911.
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.
Conn, A. R., Gould, N. I. M., & Toint, P. L. (2000). Trust Region Methods. MPS-SIAM Series on Optimization.
Kumaraswamy, P. (1980). A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2), 79-88.
# \donttest{
# Generate sample data from a Beta(2,5) distribution for testing
set.seed(123)
sample_data <- rbeta_(200, 2, 5)
# Automatic initialization (recommended for beginners)
result_auto <- nrgkw(NULL, sample_data, family = "beta", verbose = FALSE)
print(result_auto$parameters)
print(result_auto$loglik)
# Compare different optimization methods
methods <- c("trust-region", "newton-raphson", "hybrid")
results <- list()
for (method in methods) {
  results[[method]] <- nrgkw(NULL, sample_data, family = "beta",
                             optimization_method = method)
  cat(sprintf("Method: %s, AIC: %.4f\n", method, results[[method]]$aic))
}
# Fit the full GKw model with diagnostic information
gkw_result <- nrgkw(NULL, sample_data, family = "gkw",
                    verbose = FALSE, get_num_hess = TRUE)
# Examine parameter identifiability through condition number
cat(sprintf("Condition number: %.2e\n", gkw_result$condition_number))
print(gkw_result)
# Compare with simpler models using information criteria
cat("Information criteria comparison:\n")
cat(sprintf("GKw: AIC=%.4f, BIC=%.4f\n", gkw_result$aic, gkw_result$bic))
cat(sprintf("Beta: AIC=%.4f, BIC=%.4f\n",
results[["trust-region"]]$aic, results[["trust-region"]]$bic))
# }