sfaselectioncross is a symbolic formula-based function for estimating the stochastic frontier model in the presence of sample selection. The model accommodates cross-sectional and pooled cross-sectional data, and can be estimated using different quadrature approaches or maximum simulated likelihood (MSL). See Greene (2010).
Only the half-normal distribution is currently available for the one-sided error term. Eleven optimization algorithms are available.
The function also accounts for heteroscedasticity in both the one-sided and two-sided error terms, as in Reifschneider and Stevenson (1991), Caudill and Ford (1993), Caudill et al. (1995) and Hadri (1999).
sfaselectioncross(
selectionF,
frontierF,
uhet,
vhet,
modelType = "greene10",
logDepVar = TRUE,
data,
subset,
weights,
wscale = TRUE,
S = 1L,
udist = "hnormal",
start = NULL,
method = "bfgs",
hessianType = 2L,
lType = "ghermite",
Nsub = 100,
uBound = Inf,
simType = "halton",
Nsim = 100,
prime = 2L,
burn = 10,
antithetics = FALSE,
seed = 12345,
itermax = 2000,
printInfo = FALSE,
intol = 1e-06,
tol = 1e-12,
gradtol = 1e-06,
stepmax = 0.1,
qac = "marquardt"
)

# S3 method for sfaselectioncross
print(x, ...)
# S3 method for sfaselectioncross
bread(x, ...)
# S3 method for sfaselectioncross
estfun(x, ...)
sfaselectioncross returns a list of class 'sfaselectioncross' containing the following elements:
The matched call.
The selection equation formula.
The frontier equation formula.
The argument 'S'. See the section ‘Arguments’.
Character string. 'Stochastic Production/Profit Frontier, e = v - u' when S = 1 and 'Stochastic Cost Frontier, e = v + u' when S = -1.
Number of initial observations in all samples.
Number of observations used for optimization.
Number of explanatory variables in the production or cost frontier.
The argument 'logDepVar'. See the section ‘Arguments’.
Number of variables explaining heteroscedasticity in the one-sided error term.
Number of variables explaining heteroscedasticity in the two-sided error term.
Total number of parameters estimated.
The argument 'udist'. See the section ‘Arguments’.
Numeric vector. Starting values for the M(S)L estimation.
A data frame (tibble format) containing information on the data used for optimization, along with residuals and fitted values of the OLS and M(S)L estimations, and the individual observation log-likelihood. When argument weights is specified, an additional variable is provided in dataTable.
Linear probability model used for initializing the first step probit model.
Probit model. Object of class 'maxLik' and 'maxim'.
Numeric vector. OLS second step estimates for selection correction. The inverse Mills ratio is introduced as an additional explanatory variable.
Numeric vector. Standard errors of OLS second step estimates.
Numeric. Estimated variance of OLS second step random error.
Numeric. Log-likelihood value of OLS second step estimation.
Numeric. Skewness of the residuals of the OLS second step estimation.
Logical. Indicates whether the residuals of the OLS second step estimation have the expected skewness.
Coelli's test for OLS residuals skewness. (See Coelli, 1995).
D'Agostino's test for OLS residuals skewness. (See D'Agostino and Pearson, 1973).
Logical. If TRUE, the weighted log-likelihood is maximized.
Type of likelihood estimated. See the section ‘Arguments’.
Optimization algorithm used.
Number of iterations of the ML estimation.
Optimization algorithm termination message.
Log-likelihood at the starting values.
Log-likelihood value of the M(S)L estimation.
Parameters obtained from M(S)L estimation.
Gradient of the M(S)L estimation for each variable.
Matrix. Individual observation gradients of the M(S)L estimation for each variable.
Gradient norm of the M(S)L estimation.
Covariance matrix of the parameters obtained from the M(S)L estimation.
The argument 'hessianType'. See the section ‘Arguments’.
Date and time of the estimated model.
The argument 'simType', only if lType = 'msl'. See the section ‘Arguments’.
The argument 'Nsim', only if lType = 'msl'. See the section ‘Arguments’.
Matrix of random draws used for MSL, only if lType = 'msl'.
List. Gauss-Hermite quadrature rule as provided by gaussHermiteData. Only if lType = 'ghermite'.
Number of subdivisions used for quadrature approaches.
Upper bound for the inefficiency component when solving integrals using quadrature approaches, except Gauss-Hermite for which the upper bound is automatically infinite (Inf).
Integration tolerance for quadrature approaches except Gauss-Hermite.
A symbolic (formula) description of the selection equation.
A symbolic (formula) description of the outcome (frontier) equation.
A one-part formula to consider heteroscedasticity in the one-sided error variance (see section ‘Details’ and the sketch after this argument list).
A one-part formula to consider heteroscedasticity in the two-sided error variance (see section ‘Details’).
Character string. Model used to solve the selection bias. Only the model discussed in Greene (2010) is currently available.
Logical. Informs whether the dependent variable is logged (TRUE) or not (FALSE). Default = TRUE.
The data frame containing the data.
An optional vector specifying a subset of observations to be used in the optimization process.
An optional vector of weights to be used for weighted log-likelihood. Should be NULL or a numeric vector with positive values. When NULL, a numeric vector of ones is used.
Logical. When weights is not NULL, a scaling transformation is used such that the weights sum to the sample size. Default = TRUE. When FALSE, no scaling is used.
If S = 1 (default), a production (profit) frontier is estimated: \(\epsilon_i = v_i-u_i\). If S = -1, a cost frontier is estimated: \(\epsilon_i = v_i+u_i\).
Character string. Distribution specification for the one-sided error term. Only the half-normal distribution 'hnormal' is currently implemented.
Numeric vector. Optional starting values for the maximum likelihood (ML) estimation.
Optimization algorithm used for the estimation. Default = 'bfgs'. 11 algorithms are available:
- 'bfgs', for Broyden-Fletcher-Goldfarb-Shanno (see maxBFGS)
- 'bhhh', for Berndt-Hall-Hall-Hausman (see maxBHHH)
- 'nr', for Newton-Raphson (see maxNR)
- 'nm', for Nelder-Mead (see maxNM)
- 'cg', for Conjugate Gradient (see maxCG)
- 'sann', for Simulated Annealing (see maxSANN)
- 'ucminf', for a quasi-Newton type optimization with BFGS updating of the inverse Hessian and soft line search with a trust region type monitoring of the input to the line search algorithm (see ucminf)
- 'mla', for general-purpose optimization based on the Marquardt-Levenberg algorithm (see mla)
- 'sr1', for Symmetric Rank 1 (see trust.optim)
- 'sparse', for trust regions and sparse Hessian (see trust.optim)
- 'nlminb', for optimization using PORT routines (see nlminb)
Integer. If 1, the analytic Hessian is returned. If 2, the BHHH Hessian (\(g'g\)) is estimated. The BHHH Hessian is estimated by default because the estimation is conducted in two steps.
Specifies the way the likelihood is estimated. Five possibilities are available: 'kronrod' for Gauss-Kronrod quadrature (see integrate), 'hcubature' and 'pcubature' for adaptive integration over hypercubes (see hcubature and pcubature), 'ghermite' for Gauss-Hermite quadrature (see gaussHermiteData), and 'msl' for maximum simulated likelihood. Default = 'ghermite'.
Integer. Number of subdivisions/nodes used for quadrature approaches. Default Nsub = 100.
Numeric. Upper bound for the inefficiency component when solving integrals using quadrature approaches, except Gauss-Hermite for which the upper bound is automatically infinite (Inf). Default uBound = Inf.
Character string. If simType = 'halton' (default), Halton draws are used for maximum simulated likelihood (MSL). If simType = 'ghalton', Generalized-Halton draws are used. If simType = 'sobol', Sobol draws are used. If simType = 'uniform', uniform draws are used (see section ‘Details’).
Number of draws for MSL (default 100).
Prime number considered for Halton and Generalized-Halton draws. Default = 2.
Number of initial draws discarded (burned) in the case of Halton draws. Default = 10.
Logical. Default = FALSE. If TRUE, the antithetic counterpart of the uniform draws is computed (see section ‘Details’).
Numeric. Seed for the random draws.
Maximum number of iterations allowed for optimization. Default = 2000.
Logical. Print information during optimization. Default = FALSE.
Numeric. Integration tolerance for quadrature approaches (kronrod, hcubature, pcubature).
Numeric. Convergence tolerance. Default = 1e-12.
Numeric. Convergence tolerance for gradient. Default = 1e-06.
Numeric. Maximum step length for the ucminf algorithm. Default = 0.1.
Character. Quadratic Approximation Correction for the 'bhhh' and 'nr' algorithms. If 'stephalving', the step length is decreased but the direction is kept. If 'marquardt' (default), the step length is decreased while also moving closer to the pure gradient direction. See maxBHHH and maxNR.
An object of class 'sfaselectioncross' (returned by the function sfaselectioncross).
Additional arguments of frontier are passed to sfaselectioncross; additional arguments of the print, bread, estfun and nobs methods are currently ignored.
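As referenced in the uhet entry above, here is a minimal sketch of a call combining heteroscedasticity terms with MSL options. The variable size and the data frame mydata are hypothetical placeholders, not objects from this documentation, and the call is not run.

## Sketch: heteroscedastic one-sided and two-sided error variances,
## estimated by maximum simulated likelihood with Halton draws.
## 'size' and 'mydata' are hypothetical; adapt to your own data.
fitHet <- sfaselectioncross(selectionF = d ~ z1 + z2,
                            frontierF = y ~ x1 + x2,
                            uhet = ~ size, vhet = ~ size,
                            lType = 'msl', simType = 'halton',
                            Nsim = 300, burn = 10, seed = 12345,
                            data = mydata)
summary(fitHet)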
The current model is an extension of the Heckman (1976, 1979) sample selection model to nonlinear models, in particular the stochastic frontier model. The model was first discussed in Greene (2010), and an application can be found in Dakpo et al. (2022). Practically, we have:
$$ y_{1i} = \left\{ \begin{array}{ll} 1 & \mbox{if} \quad y_{1i}^* > 0 \\ 0 & \mbox{if} \quad y_{1i}^* \leq 0 \\ \end{array} \right. $$
where
$$ y_{1i}^*=\mathbf{Z}_{si}^{\prime} \mathbf{\gamma} + w_i, \quad w_i \sim \mathcal{N}(0, 1) $$
and
$$ y_{2i} = \left\{ \begin{array}{ll} y_{2i}^* & \mbox{if} \quad y_{1i}^* > 0 \\ NA & \mbox{if} \quad y_{1i}^* \leq 0 \\ \end{array} \right. $$
where
$$ y_{2i}^*=\mathbf{x_{i}^{\prime}} \mathbf{\beta} + v_i - Su_i, \quad v_i = \sigma_vV_i \quad \wedge \quad V_i \sim \mathcal{N}(0, 1), \quad u_i = \sigma_u|U_i| \quad \wedge \quad U_i \sim \mathcal{N}(0, 1) $$
\(y_{1i}\) describes the selection equation while \(y_{2i}\) represents the frontier equation. The selection bias arises from the correlation between the two symmetric random components \(v_i\) and \(w_i\):
$$ (v_i, w_i) \sim \mathcal{N}_2\left\lbrack(0,0), (1, \rho \sigma_v, \sigma_v^2) \right\rbrack $$
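For intuition, a minimal R sketch (with illustrative parameter values; not part of the estimation routine) of drawing \((v_i, w_i)\) with this covariance structure:

rho <- 0.5; sigma_v <- 0.3; N <- 1000 # illustrative values
w <- rnorm(N) # selection error, Var(w) = 1
v <- sigma_v * (rho * w + sqrt(1 - rho^2) * rnorm(N)) # Cov(v, w) = rho * sigma_v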
Conditionally on \(|U_i|\), the probability associated with each observation is:
$$ Pr \left\lbrack y_{1i}^* \leq 0 \right\rbrack^{1-y_{1i}} \cdot \left\lbrace f(y_{2i}|y_{1i}^* > 0) \times Pr\left\lbrack y_{1i}^* > 0 \right\rbrack \right\rbrace^{y_{1i}} $$
Using the conditional probability formula:
$$ P\left(A\cap B\right) = P(A) \cdot P(B|A) = P(B) \cdot P(A|B) $$
Therefore:
$$ f(y_{2i}|y_{1i}^* \geq 0) \cdot Pr\left\lbrack y_{1i}^* \geq 0\right\rbrack = f(y_{2i}) \cdot Pr(y_{1i}^* \geq 0|y_{2i}) $$
Using the properties of a bivariate normal distribution, we have:
$$ y_{1i}^* | y_{2i} \sim \mathcal{N}\left(\mathbf{Z_{si}^{\prime}} \bm{\gamma}+\frac{\rho}{\sigma_v}v_i, 1-\rho^2\right) $$
Hence conditionally on \(|U_i|\), we have:
$$ f(y_{2i}|y_{1i}^* \geq 0) \cdot Pr\left\lbrack y_{1i}^* \geq 0\right\rbrack = \frac{1}{\sigma_v}\phi\left(\frac{v_i}{\sigma_v}\right)\Phi\left(\frac{ \mathbf{Z_{si}^{\prime}} \bm{\gamma}+\frac{\rho}{\sigma_v}v_i}{ \sqrt{1-\rho^2}}\right) $$
The conditional likelihood is equal to:
$$ L_i\big||U_i| = \Phi(-\mathbf{Z_{si}^{\prime}} \bm{\gamma})^{1-y_{1i}} \times \left\lbrace \frac{1}{\sigma_v}\phi\left(\frac{y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|}{\sigma_v}\right)\Phi\left(\frac{ \mathbf{Z_{si}^{\prime}} \bm{\gamma}+\frac{\rho}{\sigma_v}\left(y_{2i}- \mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|\right)}{\sqrt{1-\rho^2}} \right) \right\rbrace ^{y_{1i}} $$
Since the non-selected observations bring no additional information, the conditional likelihood to be considered is:
$$ L_i\big||U_i| = \frac{1}{\sigma_v}\phi\left(\frac{y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|}{\sigma_v}\right) \Phi\left(\frac{\mathbf{Z_{si}^{\prime}} \bm{\gamma}+\frac{\rho}{\sigma_v}\left(y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|\right)}{\sqrt{1-\rho^2}}\right) $$
The unconditional likelihood is obtained by integrating \(|U_i|\) out of the conditional likelihood. Thus:
$$ L_i = \int_{|U_i|} \frac{1}{\sigma_v}\phi\left(\frac{y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|}{\sigma_v}\right) \Phi\left(\frac{\mathbf{Z_{si}^{\prime}} \bm{\gamma}+ \frac{\rho}{\sigma_v}\left(y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|\right)}{\sqrt{1-\rho^2}}\right)p\left(|U_i|\right)d|U_i| $$
To simplify the estimation, the likelihood can be estimated using a two-step approach. In the first step, the probit model is run and an estimate of \(\bm{\gamma}\) is obtained. Then, in the second step, the following model is estimated:
$$ L_i = \int_{|U_i|} \frac{1}{\sigma_v}\phi\left(\frac{y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|}{\sigma_v}\right) \Phi\left(\frac{a_i + \frac{\rho}{\sigma_v}\left(y_{2i}-\mathbf{x_{i}^{\prime}} \bm{\beta} + S\sigma_u|U_i|\right)}{\sqrt{1-\rho^2}}\right)p\left(|U_i|\right)d|U_i| $$
where \(a_i = \mathbf{Z_{si}^{\prime}} \hat{\bm{\gamma}}\). This likelihood can be estimated using five different approaches: Gauss-Kronrod quadrature, adaptive integration over hypercubes (hcubature and pcubature), Gauss-Hermite quadrature, and maximum simulated likelihood. We also use the BHHH estimator to obtain the asymptotic standard errors for the parameter estimators.
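To make the Gauss-Hermite route concrete, here is a minimal sketch (not the package's internal code) of evaluating the integral above for a single selected observation. It assumes the fastGHQuad package (which provides gaussHermiteData) and illustrative parameter values; eps stands for \(y_{2i}-\mathbf{x_{i}^{\prime}}\bm{\beta}\) and a for \(a_i\).

library(fastGHQuad)
rule <- gaussHermiteData(100) # nodes rule$x, weights rule$w

Li_ghermite <- function(eps, a, sigma_u, sigma_v, rho, S = 1) {
  u <- sqrt(2) * abs(rule$x) # |U_i| after the change of variables x = sqrt(2) t
  r <- eps + S * sigma_u * u # argument of the normal density
  g <- dnorm(r / sigma_v) / sigma_v *
       pnorm((a + rho / sigma_v * r) / sqrt(1 - rho^2))
  sum(rule$w * g) / sqrt(pi) # unconditional likelihood L_i
}

Li_ghermite(eps = 0.2, a = 0.5, sigma_u = 0.4, sigma_v = 0.3, rho = 0.5)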
sfaselectioncross allows for the maximization of a weighted log-likelihood. When option weights is specified and wscale = TRUE, the weights are scaled as:
$$ new_{weights} = sample_{size} \times \frac{old_{weights}}{\sum(old_{weights})} $$
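In R terms, this scaling is simply (a one-line sketch, where w is a user-supplied weights vector):

w_scaled <- length(w) * w / sum(w) # scaled weights sum to the sample size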
For complex problems, non-gradient methods (e.g. nm or sann) can be used to warm start the optimization and zoom into the neighborhood of the solution. A gradient-based method is then recommended in the second step. In the case of sann, we recommend significantly increasing the iteration limit (e.g. itermax = 20000). The Conjugate Gradient (cg) can also be used in the first stage.
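A possible two-stage workflow is sketched below, reusing the simulated data from the Examples section; it assumes that coef() returns the ML parameter vector in the order expected by start.

## Stage 1: derivative-free warm start with Nelder-Mead
warm <- sfaselectioncross(selectionF = d ~ z1 + z2, frontierF = y ~ x1 + x2,
                          data = as.data.frame(data), method = 'nm')
## Stage 2: gradient-based refinement from the stage-1 estimates
fit <- sfaselectioncross(selectionF = d ~ z1 + z2, frontierF = y ~ x1 + x2,
                         data = as.data.frame(data), method = 'bfgs',
                         start = coef(warm))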
A set of extractor functions for fitted model objects is available for objects of class 'sfaselectioncross', including methods for the generic functions print, summary, coef, fitted, logLik, residuals, vcov, efficiencies, ic, marginal, estfun and bread (from the sandwich package), and lmtest::coeftest() (from the lmtest package).
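For instance, after fitting selecRes1 as in the Examples section below, typical extractor calls look like (sketch):

coef(selecRes1) # estimated parameters
logLik(selecRes1) # log-likelihood value
efficiencies(selecRes1) # (in-)efficiency estimates
vcov(selecRes1) # variance-covariance matrix
lmtest::coeftest(selecRes1) # coefficient table via the lmtest package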
Caudill, S. B., and Ford, J. M. 1993. Biases in frontier estimation due to heteroscedasticity. Economics Letters, 41(1), 17--20.
Caudill, S. B., Ford, J. M., and Gropper, D. M. 1995. Frontier estimation and firm-specific inefficiency measures in the presence of heteroscedasticity. Journal of Business & Economic Statistics, 13(1), 105--111.
Coelli, T. 1995. Estimators and hypothesis tests for a stochastic frontier function - a Monte-Carlo analysis. Journal of Productivity Analysis, 6:247--268.
D'Agostino, R., and E.S. Pearson. 1973. Tests for departure from normality. Empirical results for the distributions of \(b_2\) and \(\sqrt{b_1}\). Biometrika, 60:613--622.
Dakpo, K. H., Latruffe, L., Desjeux, Y., Jeanneaux, P., 2022. Modeling heterogeneous technologies in the presence of sample selection: The case of dairy farms and the adoption of agri-environmental schemes in France. Agricultural Economics, 53(3), 422-438.
Greene, W., 2010. A stochastic frontier model with correction for sample selection. Journal of Productivity Analysis. 34, 15--24.
Hadri, K. 1999. Estimation of a doubly heteroscedastic stochastic frontier cost function. Journal of Business & Economic Statistics, 17(3), 359--363.
Heckman, J., 1976. Discrete, qualitative and limited dependent variables. Ann Econ Soc Meas. 4, 475--492.
Heckman, J., 1979. Sample Selection Bias as a Specification Error. Econometrica. 47, 153--161.
Reifschneider, D., and Stevenson, R. 1991. Systematic departures from the frontier: A framework for the analysis of firm inefficiency. International Economic Review, 32(3), 715--723.
print for printing the sfaselectioncross object.
summary for creating and printing summary results.
coef for extracting coefficients of the estimation.
efficiencies for computing (in-)efficiency estimates.
fitted for extracting the fitted frontier values.
ic for extracting information criteria.
logLik for extracting log-likelihood value(s) of the estimation.
marginal for computing marginal effects of inefficiency drivers.
residuals for extracting residuals of the estimation.
vcov for computing the variance-covariance matrix of the coefficients.
bread for the bread of the sandwich estimator.
estfun for gradient extraction for each observation.
if (FALSE) {
## Simulated example
N <- 2000 # sample size
set.seed(12345)
z1 <- rnorm(N)
z2 <- rnorm(N)
v1 <- rnorm(N)
v2 <- rnorm(N)
e1 <- v1
e2 <- 0.7071 * (v1 + v2) # Corr(e1, e2) = 0.7071
ds <- z1 + z2 + e1 # latent selection variable
d <- ifelse(ds > 0, 1, 0) # selection indicator
u <- abs(rnorm(N)) # half-normal inefficiency term
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- x1 + x2 + e2 - u # frontier output
data <- cbind(y = y, x1 = x1, x2 = x2, z1 = z1, z2 = z2, d = d)
## Estimation using quadrature (Gauss-Kronrod)
selecRes1 <- sfaselectioncross(selectionF = d ~ z1 + z2, frontierF = y ~ x1 + x2,
modelType = 'greene10', method = 'bfgs',
logDepVar = TRUE, data = as.data.frame(data),
S = 1L, udist = 'hnormal', lType = 'kronrod', Nsub = 100, uBound = Inf,
simType = 'halton', Nsim = 300, prime = 2L, burn = 10, antithetics = FALSE,
seed = 12345, itermax = 2000, printInfo = FALSE)
summary(selecRes1)
## Estimation using maximum simulated likelihood
selecRes2 <- sfaselectioncross(selectionF = d ~ z1 + z2, frontierF = y ~ x1 + x2,
modelType = 'greene10', method = 'bfgs',
logDepVar = TRUE, data = as.data.frame(data),
S = 1L, udist = 'hnormal', lType = 'msl', Nsub = 100, uBound = Inf,
simType = 'halton', Nsim = 300, prime = 2L, burn = 10, antithetics = FALSE,
seed = 12345, itermax = 2000, printInfo = FALSE)
summary(selecRes2)
}