selection: Heckman-style selection models

Description

This is the frontend for estimating Heckman-style selection models either with one or two outcomes (also known as generalized tobit models). It supports binary outcomes in the single-outcome case.

For model specification and more details, see Henningsen and Toomet (2008) and the included vignette “Sample Selection Models”.

Usage

selection(selection, outcome, data = sys.frame(sys.parent()), weights = NULL, subset, method = "ml", start = NULL,  ys = FALSE, xs = FALSE, yo = FALSE, xo = FALSE, mfs = FALSE, mfo = FALSE, print.level = 0, ...)
heckit( selection, outcome, data = sys.frame(sys.parent()), method = "2step", ... )

Arguments

selection

formula, the selection equation.

outcome

the outcome equation(s). Either a single equation (for tobit 2 models), or a list of two equations (tobit 5 models).

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which selection is called.

weights

an optional vector of ‘prior weights’ to be used in the fitting process. Should be NULL or a numeric vector. Weights are currently only supported in type-2 models.

subset

an optional index vector specifying a subset of observations to be used in the fitting process.

method

how to estimate the model. Either "ml" for Maximum Likelihood, "2step" for 2-step estimation, or "model.frame" for returning the model frame (only).

start

vector, initial values for the ML estimation. If start does not have names, names are constructed based on the model frame.

ys, yo, xs, xo, mfs, mfo

logicals. If true, the response (y), model matrix (x) or the model frame (mf) of the selection (s) or outcome (o) equation(s) are returned.

print.level

integer. Various debugging information, higher value gives more information.

...

additional parameters for the corresponding fitting functions tobit2fit, tobit5fit, heckit2fit, heckit5fit, and tobit2Bfit.

Value

probit: object of class 'probit' that contains the results of the 1st step (probit estimation) (only for two-step estimations).
twoStep: (only if initial values not given) results of the 2-step estimation, used for initial values
start: initial values for ML estimation
termsS, termsO: terms for the selection and outcome equation
ys, xs, yo, xo, mfs, mfo: response, matrix and frame of the selection- and outcome equations (as a list of two for the latter). NULL, if not requested. The response is represented internally as 0/1 integer vector with 0 denoting either the unobservable outcome (tobit 2) or the first selection (tobit 5).
coefficients: estimated coefficients, the complete model. coefficient for the Inverse Mills ratio is treated as a parameter ($= rho * sigma$).
vcov: variance covariance matrix of the estimated coefficients.
param: a list with following components: index, a list of numeric vectors: where in the coef the component are located; oIntercept, a logical: whether the outcome equation includes intercept; N0, N1, integer, number of observations with unobserved and observed outcomes; nObs, integer, number of valid observations; nParam, integer, number of the parameters in the model (not all are independent); df, integer, degrees of freedom. Note this is not equal to nObs - nParam because of the parameters are not independent in all the cases; levels levels for the response of the selection equation. levels[1] corresponds to the outcome 1, levels[2] to the outcome 2.
lm, lm1, lm2: objects of class 'lm' that contain the results of the 2nd step estimation(s) of the outcome equation(s). Note: the standard errors of this estimation are biased, because they do not account for the estimation of $\gamma$ in the 1st step estimation (the correct standard errors are returned by summary and they are contained in vcov component).
sigma, sigma1, sigma2: the standard error(s) of the error terms of the outcome equation(s).
rho, rho1, rho2: the estimated correlation coefficient(s) between the error term of the selection equation and the outcome equation(s).
invMillsRatio: the inverse Mills Ratios calculated from the results of the 1st step probit estimation.
imrDelta: the $\delta$s calculated from the inverse Mills Ratios and the results of the 1st step probit estimation.
binaryOutcome: logical value indicating whether the dependent variable of the outcome equation is binary.

Details

The endogenous variable of the argument 'selection' must have exactly two levels (e.g. 'FALSE' and 'TRUE', or '0' and '1'). By default the levels are sorted in increasing order ('FALSE' is before 'TRUE', and '0' is before '1'). This also applies for the binary outcome equation. For continuous-oucome cases, the dependent variable(s) should be numeric.

For tobit-2 (sample selection) models, only those observations are included in the second step estimation (argument 'outcome'), where this variable equals the second element of its levels (e.g. 'TRUE' or '1').

For tobit-5 (switching regression) models, in the second step the first outcome equation (first element of argument 'outcome') is estimated only for those observations, where this endogenous variable of the selections equation equals the first element of its levels (e.g. 'FALSE' or '0'). The second outcome equation is estimated only for those observations, where this variable equals the second element of its levels (e.g. 'TRUE' or '1').

NA-s are allowed in the data. These are ignored if the corresponding outcome is unobserved, otherwise observations which contain NA (either in selection or outcome) are removed. These selection models assume a known (multivariate normal) distribution of error terms. Because of this, the instruments (exclusion restrictions) are not necessary. However, if no instruments are supplied, the results are based solely on the assumption on multivariate normality. This may or may not be an appropriate assumption for a particular problem. Note also that standard errors tend to be large without a strong exclusion restriction. If argument method is equal to "ml" (the default), the estimation is done by the maximum likelihood method, where the Newton-Raphson algorithm is used by default. Argument maxMethod (see tobit2fit) can be used to chose other algorithms for the maximisation of the (log) likelihood function.

Methods that can be applied to objects returned by selection() are described on the help page selection-methods.

References

Cameron, A. C. and Trivedi, P. K. (2005) Microeconometrics: Methods and Applications, Cambridge University Press.

Greene, W. H. (2003) Econometric Analysis, Fifth Edition, Prentice Hall.

Heckman, J. (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models, Annals of Economic and Social Measurement, 5(4), p. 475-492.

Johnston, J. and J. DiNardo (1997) Econometric Methods, Fourth Edition, McGraw-Hill.

Lee, L., G. Maddala and R. Trost (1980) Asymetric covariance matrices of two-stage probit and two-stage tobit methods for simultaneous equations models with selectivity. Econometrica, 48, p. 491-503.

Toomet, O. and A. Henningsen, (2008) Sample Selection Models in R: Package sampleSelection. Journal of Statistical Software 27(7), http://www.jstatsoft.org/v27/i07/

Wooldridge, J. M. (2003) Introductory Econometrics: A Modern Approach, 2e, Thomson South-Western.

Examples

Run this code

## Greene( 2003 ): example 22.8, page 786
data( Mroz87 )
Mroz87$kids  <- ( Mroz87$kids5 + Mroz87$kids618 > 0 )
# Two-step estimation
summary( heckit( lfp ~ age + I( age^2 ) + faminc + kids + educ,
   wage ~ exper + I( exper^2 ) + educ + city, Mroz87 ) )
# ML estimation
summary( selection( lfp ~ age + I( age^2 ) + faminc + kids + educ,
   wage ~ exper + I( exper^2 ) + educ + city, Mroz87 ) )

## Example using binary outcome for selection model.
## We estimate the probability of womens' education on their
## chances to get high wage (> $5/hr in 1975 USD), using PSID data
## We use education as explanatory variable
## and add age, kids, and non-work income as exclusion restrictions.
data(Mroz87)
m <- selection(lfp ~ educ + age + kids5 + kids618 + nwifeinc,
   wage >= 5 ~ educ, data = Mroz87 )
summary(m)


## example using random numbers
library( "mvtnorm" )
nObs <- 1000
sigma <- matrix( c( 1, -0.7, -0.7, 1 ), ncol = 2 )
errorTerms <- rmvnorm( nObs, c( 0, 0 ), sigma )
myData <- data.frame( no = c( 1:nObs ), x1 = rnorm( nObs ), x2 = rnorm( nObs ),
   u1 = errorTerms[ , 1 ], u2 =  errorTerms[ , 2 ] )
myData$y <- 2 + myData$x1 + myData$u1
myData$s <- ( 2 * myData$x1 + myData$x2 + myData$u2 - 0.2 ) > 0
myData$y[ !myData$s ] <- NA
myOls <- lm( y ~ x1, data = myData)
summary( myOls )
myHeckit <- heckit( s ~ x1 + x2, y ~ x1, myData, print.level = 1 )
summary( myHeckit )

## example using random numbers with IV/2SLS estimation
library( "mvtnorm" )
nObs <- 1000
sigma <- matrix( c( 1, 0.5, 0.1, 0.5, 1, -0.3, 0.1, -0.3, 1 ), ncol = 3 )
errorTerms <- rmvnorm( nObs, c( 0, 0, 0 ), sigma )
myData <- data.frame( no = c( 1:nObs ), x1 = rnorm( nObs ), x2 = rnorm( nObs ),
   u1 = errorTerms[ , 1 ], u2 = errorTerms[ , 2 ], u3 = errorTerms[ , 3 ] )
myData$w <- 1 + myData$x1 + myData$u1
myData$y <- 2 + myData$w + myData$u2
myData$s <- ( 2 * myData$x1 + myData$x2 + myData$u3 - 0.2 ) > 0
myData$y[ !myData$s ] <- NA
myHeckit <- heckit( s ~ x1 + x2, y ~ w, data = myData )
summary( myHeckit )  # biased!
myHeckitIv <- heckit( s ~ x1 + x2, y ~ w, data = myData, inst = ~ x1 )
summary( myHeckitIv ) # unbiased

## tobit-5 example
N <- 500
   library(mvtnorm)
   vc <- diag(3)
   vc[lower.tri(vc)] <- c(0.9, 0.5, 0.6)
   vc[upper.tri(vc)] <- vc[lower.tri(vc)]
   eps <- rmvnorm(N, rep(0, 3), vc)
   xs <- runif(N)
   ys <- xs + eps[,1] > 0
   xo1 <- runif(N)
   yo1 <- xo1 + eps[,2]
   xo2 <- runif(N)
   yo2 <- xo2 + eps[,3]
   a <- selection(ys~xs, list(yo1 ~ xo1, yo2 ~ xo2))
   summary(a)

## tobit2 example
   vc <- diag(2)
   vc[2,1] <- vc[1,2] <- -0.7
   eps <- rmvnorm(N, rep(0, 2), vc)
   xs <- runif(N)
   ys <- xs + eps[,1] > 0
   xo <- runif(N)
   yo <- (xo + eps[,2])*(ys > 0)
   a <- selection(ys~xs, yo ~xo)
   summary(a)

Run the code above in your browser using DataLab