hpaBinary: Semi-nonparametric single index binary choice model estimation

Description

This function performs semi-nonparametric (SNP) maximum likelihood estimation of single index binary choice model using Hermite polynomial based approximating function proposed by Gallant and Nychka in 1987. Please, see dhpa 'Details' section to get more information concerning this approximating function.

Usage

hpaBinary(
  formula,
  data,
  K = 1L,
  mean_fixed = NA_real_,
  sd_fixed = NA_real_,
  constant_fixed = 0,
  coef_fixed = TRUE,
  is_x0_probit = TRUE,
  is_sequence = FALSE,
  x0 = numeric(0),
  cov_type = "sandwich",
  boot_iter = 100L,
  is_parallel = FALSE,
  opt_type = "optim",
  opt_control = NULL,
  is_validation = TRUE
)

Value

This function returns an object of class "hpaBinary".

An object of class "hpaBinary" is a list containing the following components:

optim - optim function output. If opt_type = "GA" then it is the list containing optim and ga functions outputs.
x1 - numeric vector of distribution parameters estimates.
mean - mean (mu) parameter of density function estimate.
sd - sd (sigma) parameter of density function estimate.
pol_coefficients - polynomial coefficients estimates.
pol_degrees - the same as K input parameter.
coefficients - regression (single index) coefficients estimates.
cov_mat - covariance matrix estimate.
marginal_effects - marginal effects matrix where columns are variables and rows are observations.
results - numeric matrix representing estimation results.
log-likelihood - value of Log-Likelihood function.
AIC - AIC value.
errors_exp - random error expectation estimate.
errors_var - random error variance estimate.
dataframe - data frame containing variables mentioned in formula without NA values.
model_Lists - lists containing information about fixed parameters and parameters indexes in x1.
n_obs - number of observations.
z_latent - latent variable (single index) estimates.
z_prob - probabilities of positive outcome (i.e. 1) estimates.

Arguments

formula: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. All variables in formula should be numeric vectors of the same length.
data: data frame containing the variables in the model.
K: non-negative integer representing polynomial degree (order).
mean_fixed: numeric value for binary choice equation random error density mean parameter. Set it to NA (default) if this parameter should be estimated rather than fixed.
sd_fixed: numeric value for binary choice equation random error density sd parameter. Set it to NA (default) if this parameter should be estimated rather than fixed.
constant_fixed: numeric value for binary choice equation constant parameter. Set it to NA (default) if this parameter should be estimated rather than fixed.
coef_fixed: logical value indicating whether binary equation first independent variable coefficient should be fixed (TRUE) or estimated (FALSE).
is_x0_probit: logical; if TRUE (default) then initial points for optimization routine will be obtained by probit model estimated via glm function.
is_sequence: if TRUE then function calculates models with polynomial degrees from 0 to K each time using initial values obtained from the previous step. In this case function will return the list of models where i-th list element correspond to model calculated under K=(i-1).
x0: numeric vector of optimization routine initial values. Note that x0 = c(pol_coefficients[-1], mean, sd, coefficients).
cov_type: character determining the type of covariance matrix to be returned and used for summary. If cov_type = "hessian" then negative inverse of Hessian matrix will be applied. If cov_type = "gop" then inverse of Jacobian outer products will be used. If cov_type = "sandwich" (default) then sandwich covariance matrix estimator will be applied. If cov_type = "bootstrap" then bootstrap with boot_iter iterations will be used. If cov_type = "hessianFD" or cov_type = "sandwichFD" then (probably) more accurate but computationally demanding central difference Hessian approximation will be calculated for the inverse Hessian and sandwich estimators correspondingly. Central differences are computed via analytically provided gradient. This Hessian matrix estimation approach seems to be less accurate than BFGS approximation if polynomial order is high (usually greater then 5).
boot_iter: the number of bootstrap iterations for cov_type = "bootstrap" covariance matrix estimator type.
is_parallel: if TRUE then multiple cores will be used for some calculations. It usually provides speed advantage for large enough samples (about more than 1000 observations).
opt_type: string value determining the type of the optimization routine to be applied. The default is "optim" meaning that BFGS method from the optim function will be applied. If opt_type = "GA" then ga function will be additionally applied.
opt_control: a list containing arguments to be passed to the optimization routine depending on opt_type argument value. Please see details to get additional information.
is_validation: logical value indicating whether function input arguments should be validated. Set it to FALSE for slight performance boost (default value is TRUE).

Details

Densities Hermite polynomial approximation approach has been proposed by A. Gallant and D. W. Nychka in 1987. The main idea is to approximate unknown distribution density with scaled Hermite polynomial. For more information please refer to the literature listed below.

Let's use notations introduced in dhpa 'Details' section. Function hpaBinary maximizes the following quasi log-likelihood function: $$\ln L(\gamma_{0}, \gamma, \alpha, \mu, \sigma; x) = \sum\limits_{i:z_{i}=1} \ln\left(\overline{F}_{\xi} (-(\gamma_{0}+\gamma x_{i}), \infty;\alpha, \mu, \sigma)\right) +$$ $$ +\sum\limits_{i:z_{i}=0} \ln\left(\overline{F}_{\xi} (-\infty, -(\gamma_{0} + x_{i}\gamma);\alpha, \mu, \sigma)\right),$$

where (in addition to previously defined notations):

$x_{i}$ - is row vector of regressors derived from data according to formula.

$\gamma$ - is column vector of regression coefficients.

$\gamma_{0}$ - constant.

$z_{i}$ - binary (0 or 1) dependent variable defined in formula.

Note that $\xi$ is one dimensional and K corresponds to $K=K_{1}$.

The first polynomial coefficient (zero powers) set to 1 for identification purposes i.e. $\alpha_{0}=1$.

If coef_fixed is TRUE then the coefficient for the first independent variable in formula will be fixed to 1 i.e. $\gamma_{1}=1$.

If mean_fixed is not NA then $\mu$=mean_fixed fixed.

If sd_fixed is not NA then $\sigma$=mean_fixed fixed. However if is_x0_probit = TRUE then parameter $\sigma$ will be scale adjusted in order to provide better initial point for optimization routine. Please, extract $\sigma$ adjusted value from the function's output list. The same is for mean_fixed.

Rows in data corresponding to variables mentioned in formula which have at least one NA value will be ignored.

All variables mentioned in formula should be numeric vectors.

The function calculates standard errors via sandwich estimator and significance levels are reported taking into account quasi maximum likelihood estimator (QMLE) asymptotic normality. If one wants to switch from QMLE to semi-nonparametric estimator (SNPE) during hypothesis testing then covariance matrix should be estimated again using bootstrap.

This function maximizes (quasi) log-likelihood function via optim function setting its method argument to "BFGS". If opt_type = "GA" then genetic algorithm from ga function will be additionally (after optim putting its solution (par) into suggestions matrix) applied in order to perform global optimization. Note that global optimization takes much more time (usually minutes but sometimes hours or even days). The number of iterations and population size of the genetic algorithm will grow linearly along with the number of estimated parameters. If it seems that global maximum has not been found then it is possible to continue the search restarting the function setting its input argument x0 to x1 output value. Note that if cov_type = "bootstrap" then ga function will not be used for bootstrap iterations since it may be extremely time consuming.

If opt_type = "GA" then opt_control should be the list containing the values to be passed to ga function. It is possible to pass arguments lower, upper, popSize, pcrossover, pmutation, elitism, maxiter, suggestions, optim, optimArgs, seed and monitor. Note that it is possible to set population, selection, crossover and mutation arguments changing ga default parameters via gaControl function. These arguments information reported in ga. In order to provide manual values for lower and upper bounds please follow parameters ordering mentioned above for the x0 argument. If these bounds are not provided manually then they (except those related to the polynomial coefficients) will depend on the estimates obtained by local optimization via optim function (this estimates will be in the middle between lower and upper). Specifically for each sd parameter lower (upper) bound is 5 times lower (higher) than this parameter optim estimate. For each mean and regression coefficient parameter its lower and upper bounds deviate from corresponding optim estimate by two absolute values of this estimate. Finally, lower and upper bounds for each polynomial coefficient are -10 and 10 correspondingly (do not depend on their optim estimates).

The following arguments are differ from their defaults in ga:

pmutation = 0.2,
optim = TRUE,
optimArgs = list("method" = "Nelder-Mead", "poptim" = 0.2, "pressel" = 0.5),
seed = 8,
elitism = 2 + round(popSize * 0.1).

Let's denote by n_reg the number of regressors included into the formula. The arguments popSize and maxiter of ga function have been set proportional to the number of estimated polynomial coefficients and independent variables:

popSize = 10 + 5 * (K + 1) + 2 * n_reg
maxiter = 50 * (1 + K) + 10 * n_reg

References

A. Gallant and D. W. Nychka (1987) <doi:10.2307/1913241>

Examples

Run this code

# \donttest{
## Estimate survival probability on Titanic

library("titanic")

# Prepare data set converting  
# all variables to numeric vectors
h <- data.frame("male" = as.numeric(titanic_train$Sex == "male"))
	h$class_1 <- as.numeric(titanic_train$Pclass == 1)
	h$class_2 <- as.numeric(titanic_train$Pclass == 2)
	h$class_3 <- as.numeric(titanic_train$Pclass == 3)
	h$sibl <- titanic_train$SibSp
	h$survived <- titanic_train$Survived
	h$age <- titanic_train$Age
	h$parch <- titanic_train$Parch
	h$fare <- titanic_train$Fare
	
# Estimate model parameters
model_hpa_1 <- hpaBinary(survived ~class_1 + class_2 +
	male + age + sibl + parch + fare,
	K = 3, data = h)
#get summary
summary(model_hpa_1)

# Get predicted probabilities
pred_hpa_1 <- predict(model_hpa_1)

# Calculate number of correct predictions
hpa_1_correct_0 <- sum((pred_hpa_1 < 0.5) & 
                       (model_hpa_1$dataframe$survived == 0))
hpa_1_correct_1 <- sum((pred_hpa_1 >= 0.5) & 
                       (model_hpa_1$dataframe$survived == 1))
hpa_1_correct <- hpa_1_correct_1 + hpa_1_correct_0

# Plot random errors density approximation
plot(model_hpa_1)
# }

# \donttest{
## Estimate parameters on data simulated from Student distribution

library("mvtnorm")
set.seed(123)

# Simulate independent variables from normal distribution
n <- 5000
X <- rmvnorm(n=n, mean = c(0,0), 
sigma = matrix(c(1,0.5,0.5,1), ncol=2))

# Simulate random errors from Student distribution
epsilon <- rt(n, 5) * (3 / sqrt(5))

# Calculate latent and observable variables values
z_star <- 1 + X[, 1] + X[, 2] + epsilon
z <- as.numeric((z_star > 0))

# Store the results into data frame
h <- as.data.frame(cbind(z,X))
names(h) <- c("z", "x1", "x2")

# Estimate model parameters
model <- hpaBinary(formula = z ~ x1 + x2, data=h, K = 3)
summary(model)

# Get predicted probabilities of 1 values
predict(model)

# Plot density function approximation
plot(model)
# }

# \dontshow{
## Estimate parameters on data simulated from Student distribution
set.seed(777)

# Simulate independent variables from uniform
n <- 200
X <- matrix(runif(2 * n, -2, 2), ncol = 2)

# Simulate random errors from Student distribution
epsilon <- rt(n, 5) * (3 / sqrt(5))

# Calculate latent and observable variables values
z_star <- 1 + X[, 1] + X[, 2] + epsilon
z <- as.numeric((z_star > 0))

# Store the results into data frame
h <- as.data.frame(cbind(z, X))
names(h) <- c("z", "x1", "x2")

# Estimate model parameters
model <- hpaBinary(formula = z ~ x1 + x2, data=h, K = 3)
summary(model)

# Get predicted probabilities of 1 values
predict(model)

# Plot random errors density approximation
plot(model)
# }

Run the code above in your browser using DataLab