mvoprobit: Multivariate ordered probit model with heteroscedasticity and (non-random) sample selection.

Description

This function allows to estimate parameters of multivariate ordered probit model and its extensions. It is possible to account for heteroscedastic variances, non-normal marginal distributions of random errors (under Gaussian copula) and (non-random) sample selection i.e. when some categories of particular dependent variables are observable only under some specific values of other dependent variables. Also it is possible to include continuous equations to get multivariate generalization of endogenous switching model. In this case both maximum-likelihood and two-step (similar to Heckman's method) estimation procedures are implemented.

Usage

mvoprobit(
  formula,
  formula2 = NULL,
  data = NULL,
  groups = NULL,
  groups2 = NULL,
  marginal = list(),
  opt_type = "optim",
  opt_args = NULL,
  start = NULL,
  estimator = "ml",
  cov_type = ifelse(estimator == "ml", "sandwich", "parametric"),
  degrees = NULL,
  n_sim = 1000,
  n_cores = 1,
  control = NULL,
  regularization = NULL
)

Value

This function returns an object of class 'mvoprobit' which is a list containing the following elements:

par - vector of parameters' estimates.
coef - list which j-the element coef[[j]] is a vector of regression coefficients estimates of the j-th ordered equation i.e. $\hat{\gamma}_{j}$.
coef_var - list which j-the element coef_var[[j]] is a vector of regression coefficients estimates of the variance part of the j-th ordered equation i.e. $\hat{\gamma}_{j}^{*}$.
coef2 - list which j-the element coef2[[j]] is a matrix of regression coefficients estimates of the j-th continuous equation. Wherein i-th row of this matrix contains estimates of regression coefficients corresponding to the i-th regime of j-th continuous variable.
sigma - estimate of the covariance matrix of random errors of ordered equations i.e. $\widehat{\Sigma}$.
var2 - estimates of the variances of random errors of continuous equations.
sigma2 - estimates of covariances between random errors of continuous equations.
cov2 - list which j-th element cov_y[[j]] contains estimates of covariances between random errors of j-th continuous equation in different regimes.
cuts - list which j-the element cuts[[j]] is a vector of cuts estimates of the j-th equation i.e. $\hat{c}_{j}$.
ind - list containing some indexes partition of the model (not intended for users).
logLik - log-likelihood value.
regressors - numeric matrix which j-th element regressors[[j]] is a regressors matrix of the j-th equation i.e. $w_{j}$.
regressors2 - list which j-th element regressors2[[j]] is a regressors matrix of the j-th variance equation i.e. $w_{j}^{*}$.
dependent - numeric matrix which j-th column dependent[, j] is a vector of dependent variable $z_{j}$ values.
control_lnL - some additional variables to be passed to likelihood function (not intended for users).
formula - the same as formula input argument but all elements are coerced to formula type.
lambda - matrix such that lambda[i, j] corresponds to $\hat{\lambda}_{ij}$. predict.mvoprobit for more information.
data_list - list which j-th element data_list[[j]] is a dataframe containing regressors and dependent variable of the j-th equation.
data - the same as data input argument but without missing values.
cov - estimate of the covariance matrix of parameters' estimator.
cov_type - type of the asymptotic covariance matrix estimator.
cov_2step - estimate of the covariance matrix of parameters' estimator associated with the second step parameters i.e. when estimator = "2step".
sd - standard errors of the estimates.
p_value - p-values of the tests on significance of the parameters where null hypothesis is that corresponding parameter equals zero.
tbl - special table used to create a summary (not intended for users).
groups - the same as groups input argument or automatically generated matrix representing the structure of the system of equations. Please, see 'Details' section above for more information.
groups2 - the same as groups2 input argument or automatically generated matrix representing the structure of the system of equations. Please, see 'Details' section above for more information.
marginal - the same as marginal input argument.
degrees - the same as degrees input argument.
model1 - first step estimation results when estimator = "2step".
coef_lambda - estimates of coefficients of lambdas.
other - list of additional variables not intended for the user.

It is highly recommended to get estimates via coef.mvoprobit function.

Arguments

formula: list which i-th element is an object of class "formula" describing the form of the linear index for the i-th ordered equation. Mean and variance equations should be separated by '|' symbol.
formula2: list which i-th element is an object of class "formula" describing the form of the linear index for the i-th continuous equation.
data: data frame containing the variables in the model.
groups: matrix which (i, j)-th element is j-th ordered category (value starting from 0) of i-th dependent ordered variable. Each row of this matrix describes observable (in data) combination of categories i.e. values of dependent variables. Special category '-1' means that variable in j-th column is unobservable when other dependent variables have particular values i.e. given in the same row. See 'Details' for additional information.
groups2: the same as groups argument but for the continuous dependent variables from formula2. See 'Details' for additional information.
marginal: list such that marginal[[i]] represents parameters of marginal distribution of the random error of the i-th ordered equation and names(marginal)[i] is a name of this distribution. Marginal distributions are the same as in pmnorm.
opt_type: character representing optimization function to be used. If opt_type = "optim" then optim will be used. If opt_type = "gena" then gena will be applied i.e. genetic algorithm. If opt_type = "pso" then pso will be used i.e. particle swarm optimization.
opt_args: a list of input arguments for the optimization function selected via opt_type argument. See 'Details' for information.
start: numeric vector of initial parameters' values. It will be used as a starting point for optimization purposes. It is also possible to provide an object of class 'mvoprobit' then its 'par' element will be used as a starting point.
estimator: character determining estimation method. If estimator = "ml" then maximum-likelihood method will be used. If estimator = "2step" then two-step estimation procedure similar to Heckman's method will be applied.
cov_type: character determining the type of covariance matrix to be returned and used for summary. First, suppose that estimator = "ml" then the following estimators are available. If cov_type = "hessian" then negative inverse of Hessian matrix will be applied. If cov_type = "gop" then inverse of Jacobian outer products will be used. If cov_type = "sandwich" (default) then sandwich covariance matrix estimator will be applied. Second, suppose that estimator = "2step" then by default sandwich estimator will be used for the first step parameters and the following estimators are available for the second step parameters. If cov_type = "parametric" then parametric estimator will be used on the second step which assumes joint normality of random errors. If cov_type = "nonparametric" then nonparametric estimator will be used. Also cov_type may be a character vector such that cov_type[i] determines the covariance matrix estimator of the i-th step parameters.
degrees: vector of non-negative integers such that degrees[i] represents degree of polynomial which elements are selectivity correction terms associated with the i-th ordered equation. See 'Details' for additional information.
n_sim: integer representing the number of GHK draws when there are more than 3 ordered equations. Otherwise alternative (much more efficient) algorithms will be used to calculate multivariate normal probabilities.
n_cores: positive integer representing the number of CPU cores used for parallel computing. If possible it is highly recommend to set it equal to the number of available physical cores especially when the system of ordered equations has 2 or 3 equations.
control: a list of control parameters. See 'Details'.
regularization: a list of control parameters for regularization. Element ridge_ind is a vector of indexes of parameters subject to regularization according to quadratic (ridge) penalty function. These indexes correspond to parameters from par output element. Set show_ind argument of summary.mvoprobit to TRUE to see these indexes. Element ridge_scale is a numeric vector of weights of ridge penalty function. Element ridge_location is a numeric vector of values to be subtracted from parameters before they pass into penalty function. Elements lasso_ind, lasso_scale and lasso_location are the same but for the lasso penalty term.

Details

Multivariate ordered probit model with heteroscedastic random errors has the following form: $$z_{ji}^{*} = w_{ji}\gamma_{j} + \sigma_{ji}u_{ji},$$ $$\sigma_{ji} = \exp(w_{ji}^{*}\gamma_{j}^{*}), \quad u_{i}\sim N\left(\begin{bmatrix}0\\ \vdots\\ 0\end{bmatrix}, \Sigma\right), i.i.d.,$$ $$\Sigma = \begin{bmatrix} 1 & \rho_{12} & \rho_{13} & ... & \rho_{1J}\\ \rho_{12} & 1 & \rho_{23} & ... & \rho_{2J}\\ \rho_{13} & \rho_{23} & 1 & ... & \rho_{3J}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \rho_{1J} & \rho_{2J} & \rho_{3J} & ... & 1 \end{bmatrix},$$ $$z_{ji}=\begin{cases} 0\text{, if }z_{ji}^{*}\leq c_{j1}\\ 1\text{, if }c_{j1}<z_{ji}^{*}\leq c_{j2}\\ 2\text{, if }c_{j2}<z_{ji}^{*}\leq c_{j3}\\ \vdots\\ m_{j}\text{, if }z_{ji}^{*}>c_{jm_{j}}\\ \end{cases},$$ $$z_{i}=(z_{1i},...,z_{Ji})^{T},\quad u_{i} = (u_{1i},u_{2i},...,u_{Ji})^{T},$$ $$i\in\{1,2,...,n\},\quad j\in\{1,2,...,J\}.$$ Where:

$n$ - the number of observations. If there are no omitted observations then $n$ equals to nrow(data).
$J$ - the number of equations i.e. length(formula).
$z_{ji}^{*}$ - unobservable (latent) value of the j-th dependent variable.
$z_{ji}$ - observable (ordered) value of the j-th dependent variable.
$(m_{j} + 1)$ - the number of categories of $z_{ji}\in\{0,1,...,m_{j}\}$.
$c_{jk}$ - $k$-th cut of the j-th dependent variable.
$w_{ji}$ - regressors of the $j$-th mean equation which should be described in formula[[j]].
$\gamma_{j}$ - regression coefficients of the $j$-th mean equation.
$w_{ji}\gamma_{j}$ - linear index of the $j$-th mean equation.
$u_{i}$ - multivariate normal random vector which elements are standard normal random variables.
$\Sigma$ - correlation matrix of $u_{i}$ i.e. $\Sigma_{t_{1}t_{2}}=\rho_{\min(t_{1},t_{2}),\max(t_{1}t_{2})}= \text{corr}\left(u_{it_{1}}, u_{it_{2}}\right)$.
$\sigma_{ji}$ - heteroscedastic standard deviation.
$\sigma_{ji}u_{ji}$ - heteroscedastic random errors.
$w_{ji}^{*}$ - regressors of the $j$-th variance equation which should be described in formula[[j]] after '|' symbol.
$\gamma_{j}^{*}$ - regression coefficients of the $j$-th variance equation.
$w_{ji}^{*}\gamma_{j}^{*}$ - linear index of the $j$-th variance equation.

Parameters of this model are estimated via maximum-likelihood method using numeric optimization approach provided through opt_type argument. The type of covariance matrix estimator may be provided through cov_type argument.

To account for (non-random) sample selection unobservable values of dependent variables should be coded as -1. For example if $z_{1}$ is a binary variable for employment status (0 - unemployed, 1 - employed) and $z_{2}$ is ordered variable (ranging from 0 to 2) for job satisfaction (0 - unsatisfied, 1- satisfied, 2 - highly satisfied) then $z_{2}$ is observable only when $z_{1}$ equals 1 since job satisfaction observable only for working individuals. Consequently $z_{2}$ should be equal to -1 (minus one) whenever $z_{1}$ equals to 0. If variables are coded in this way then groups matrix will be created automatically. Otherwise user may provide manual structure of selection mechanism by mentioning all possible combinations of $z_{1}$ and $z_{2}$ values as a rows of groups matrix. In this particular example matrix groups will have the following form (no need to provide it manually): $$\text{groups} = \begin{bmatrix} 1 & 2\\ 1 & 1\\ 1 & 0\\ 0 & -1 \end{bmatrix}.$$ Again, please, insure that all $z_{2}$ equal to -1 for all $z_{1}$ which equal to 0 in your data. Then matrix groups will automatically have the aforementioned structure (accounting for non-random sample selection).

If some variables $z_{ji}$ are missing i.e. take NA value then contribution of other dependent variables (for the i-th observation) still may be included into the likelihood function by manually substituting NA with -1 in your data. However insure that this particular (missing) $z_{ji}$ is not a regressor for other dependent variable (that may happen in hierarchical systems).

Constant terms (intercepts) are excluded from the model for identification purposes. If $z_{ji}$ is a binary variable then $-c_{j1}$ may be interpreted as a constant term of the $j$-th equation. If all $z_{ji}$ are binary variables then the model becomes multivariate probit.

It is possible to estimate sample-selection and endogenous switching models with continuous dependent variables by providing the form of corresponding equations via formula2 argument. Selection (switching) mechanism will be determined by the aforementioned multivariate ordered probit model.

First, consider sample selection model with one continuous equation ($y$ - wage) and two ordered equations from the previous example ($z_{1}$ - employment, $z_{2}$ - job satisfaction):

$$y_{i}^{*}=x_{i}\beta+\varepsilon_{i},$$ $$y_{i} = \begin{cases} y_{i}^{*}\text{, if }z_{1i}=1\text{ and }z_{2i}\geq 1\\ \text{unobservable, otherwise} \end{cases},$$

where $y_{i}$ and $x_{i}$ are continuous dependent variable and the vector of exogenous variables correspondingly. These variables should be described in formula2. Random errors $\varepsilon_{i}$ and $u_{i}$ are multivariate normal. In this example it is assumed that information on wages $y_{i}$ available only for employed $z_{1i} = 1$ individuals with high enough job satisfaction $z_{2i} \geq 1$ (suppose that unsatisfied workers where not asked wage question or surely refuse to answer).

To estimate this model it is also necessarily to set unobservable values of $y_{i}$ in data to Inf to distinguish them from NA values representing observations omitted by random. Finally one needs to manually specify the structure of equations via groups and groups2 arguments by providing all possible combinations of ordered and continuous equations values:

$$\text{groups} = \begin{bmatrix} 1 & 2\\ 1 & 1\\ 1 & 0\\ 0 & -1 \end{bmatrix}, \text{groups2} = \begin{bmatrix} 0\\ 0\\ -1\\ -1 \end{bmatrix}.$$

Where 0 category in group2 indicates that continuous dependent variable $y_{i}$ is observable. For example groups2[2] = 0 indicates that $y_{i}$ is observable when groups[2, ] = c(1, 1) i.e. $z_{1i} = 1$ and $z_{2i} = 1$.

Further suppose that we assume that wage equations are different for satisfied ($z_{2i} = 1$) and highly satisfied ($z_{2i} = 2$) workers:

$$y_{0i}^{*} = x_{i}\beta_{0} + \varepsilon_{0i},$$ $$y_{1i}^{*} = x_{i}\beta_{1} + \varepsilon_{1i},$$ $$y_{i} = \begin{cases} y_{0i}^{*}\text{, if }z_{1i}=1\text{ and }z_{2i} = 1\\ y_{1i}^{*}\text{, if }z_{1i}=1\text{ and }z_{2i} = 2\\ \text{unobservable, otherwise} \end{cases}.$$

To estimate this endogenous switching model arguments groups and groups2 should be specified as follows:

$$\text{groups} = \begin{bmatrix} 1 & 2\\ 1 & 1\\ 1 & 0\\ 0 & -1 \end{bmatrix}, \text{groups2} = \begin{bmatrix} 1\\ 0\\ -1\\ -1 \end{bmatrix}.$$

For example groups2[1] = 1 indicates that when groups[1, ] = c(1, 2) i.e. $z_{1i} = 1$ and $z_{2i} = 2$ we observe $y_{i}$ in regime 1 corresponding to the wage of highly satisfied workers. Similarly groups2[2] = 0 indicates that when groups[2, ] = c(1, 1) i.e. $z_{1i} = 1$ and $z_{2i} = 1$ we observe $y_{i}$ in regime 0 corresponding to the wage of satisfied workers.

Therefore by specifying groups and groups2 arguments in the aforementioned way it is possible to estimate various sample selection and endogenous switching models. Furthermore one may specify several continuous equations. Indeed, consider additional continuous equation ($y^{h}$ - working hours):

$$y_{i}^{h*}=x_{i}^{h}\beta^{h}+\varepsilon_{i}^{h},$$ $$y_{i}^{h} = \begin{cases} y_{i}^{h*}\text{, if }z_{1i}=1\\ \text{unobservable, otherwise} \end{cases}.$$

Then to estimate the system accounting for this additional continuous equation simply substitute all $y_{i}^{h*}$ (such that $z_{1i}=0$) in data with Inf and specify:

$$\text{groups} = \begin{bmatrix} 1 & 2\\ 1 & 1\\ 1 & 0\\ 0 & -1 \end{bmatrix}, \text{groups2} = \begin{bmatrix} 1 & 0\\ 0 & 0\\ -1 & 0\\ -1 & -1 \end{bmatrix},$$

where groups2[, 1] describes regimes of the wage equation $y_{i}$ while groups[, 2] contains regimes of the hours equation $y_{i}^{h}$. Note that formula of the first equation (wage) should be specified in formula2[[1]] and formula of the second equation should be provided via formula2[[2]] i.e. as the first and the second elements in a formula2 list correspondingly.

By default all the models are estimated via maximum likelihood method. However if estimator = "2step" then models with one continuous equation (but voluntary number of regimes of this equation) will be estimated via two-step procedure proposed by E. Kossova and B. Potanin (2018). The idea is similar to classical Heckman's method i.e. to substitute conditional expectation of random error into continuous equation with it's consistent estimator. For simplicity suppose that there is only one regime. Then regression equation may be represented in the following form:

$$ y_{i}= x_{i}\beta+ \sum\limits_{j=1}^{J} \rho_{j}\sigma\lambda_{ji}+ \varepsilon_{i}^{*}, $$

where: $$ \varepsilon_{i}^{*} = \varepsilon_{i} - E(\varepsilon_{i}|z_{1i},...z_{Ji})= \varepsilon_{i} - \sum\limits_{j=1}^{J} \rho_{j}\sigma\lambda_{ji}, $$ $$ \lambda_{ji}= \lambda_{ji}^{(1)} + \lambda_{ji}^{(2)}, $$ $$ \lambda_{ji}^{(1)}= \begin{cases} 0\text{, if }z_{ji} = 0\\ -\frac{\partial \ln P_{i}^{*}}{\partial a_{ji}}\text{, otherwise } \end{cases}, \qquad \lambda_{ji}^{(2)}= \begin{cases} 0\text{, if }z_{ji} = m_{j}\\ -\frac{\partial \ln P_{i}^{*}}{\partial b_{ji}}\text{, otherwise } \end{cases}, $$ $$ P_{i}^{*}(a_{1i},...,a_{Ji};b_{1i},...,b_{Ji})= P(a_{1i}\leq u_{1i}\leq b_{1i},....,a_{Ji}\leq u_{Ji}\leq b_{Ji}), $$ $$ a_{ji}= \begin{cases} -\infty\text{, if }z_{ji} = 0\\ \frac{c_{jz_{ji}}-w_{ji}\gamma_{j}}{\sigma_{ji}}\text{, otherwise } \end{cases}, \qquad b_{ji}= \begin{cases} \infty\text{, if }z_{ji} = m_{j}\\ \frac{c_{j(z_{ji}+1)}-w_{ji}\gamma_{j}}{\sigma_{ji}}\text{, otherwise } \end{cases}. $$

On the first step $\hat{\lambda}_{ji}$ are calculated using estimates of multivariate ordered probit model. On the second step $\hat{\lambda}_{ji}$ are used as the regressors instead of $\lambda_{ji}$ in a least squares estimation of $y_{i}$ equation. If cov_type = "parametric" then asymptotic covariance matrix estimator proposed by E. Kossova and B. Potanin (2018) is used. If cov_type = "nonparametric" then robust covariance matrix estimator of Newey (2009) is applied. To relax normality assumption of $\varepsilon_{i}$ it is possible to use multivariate generalization of Newey (2009) method described in E. Kossova and B. Potanin (2020). The idea is to use polynomials of $\lambda_{ji}$ on the second step:

$$ y_{i}= x_{i}\beta+ \sum\limits_{j=1}^{J} \sum\limits_{t=1}^{d_{j}}\tau_{jt}\lambda_{ji}^{t}+ \varepsilon_{i}^{*}, $$

where $\tau_{jt}$ are polynomial coefficients. Polynomial order $d_{j}$ is determined by degrees[j] value. If there are more than one regime then degrees should be a matrix such that degrees[r, j] is $d_{j}$ corresponding to the $r$-th regime. However if there are more than one regime and degrees is a vector it will be transformed into a matrix which rows are the same as degrees.

If estimator = "2step" then it is possible to precalculate first step model with mvoprobit function (setting formula2 = NULL) and pass it through the formula argument. It allows to experiment with various formula2 and degrees specifications without extra computational burden associated with the first step estimation.

Function pmnorm is used internally for calculation of multivariate normal probabilities, densities and their derivatives. Marginal distribution of $u_{i}$ may be determined with marginal argument that is similar to the same argument in pmnorm. Note that joint distribution of $u$ (random errors of ordered equations) and $\varepsilon$ (random errors of continuous equations) will be determined by Gaussian copula.

Optimization always starts with optim. If opt_type = "gena" or opt_type = "pso" then gena or pso is used to proceed optimization starting from initial point provided by optim. Manual arguments to optim should be provided in a form of a list through opt_args$optim. Similarly opt_args$gena and opt_args$pso provide manual arguments to gena and pso. For example to provide Nelder-Mead optimizer to optim and restrict the number of genetic algorithm iterations to $10$ make opt_args = list(optim = list(method = "Nelder-Mead"), gena = list(maxiter = 10)).

For more information on multivariate sample selection and endogenous switching models see E. Kossova and B. Potanin (2018), E. Kossova, L. Kupriianova, and B. Potanin (2020) and E. Kossova and B. Potanin (2022).

Currently control has no input arguments intended for the users. This argument is used for some internal purposes of the package.

References

W. K. Newey (2009). Two-step series estimation of sample selection models. The Econometrics Journal, vol. 12(1), pages 217-229.

E. Kossova, B. Potanin (2018). Heckman method and switching regression model multivariate generalization. Applied Econometrics, vol. 50, pages 114-143.

E. Kossova, L. Kupriianova, B. Potanin (2020). Parametric and semiparametric multivariate sample selection models estimators' accuracy: Comparative analysis on simulated data. Applied Econometrics, vol. 57, pages 119-139.

E. Kossova, B. Potanin (2022). Estimation of Gaussian multinomial endogenous switching model. Applied Econometrics, vol. 67, pages 121-143.

Examples

Run this code

# \donttest{
# -------------------------------
# CPS data example
# -------------------------------

# Set seed for reproducibility
set.seed(123)

# Upload data
data(cps)

# Prepare ordered variable for education
cps$educ <- NA
cps$educ[cps$basic == 1] <- 0
cps$educ[cps$bachelor == 1] <- 1
cps$educ[cps$master == 1] <- 2

# Labor supply (probit) model
f_work <- work ~ age + I(age ^ 2) + bachelor + master + health + 
                 slwage + nchild
model1 <- mvoprobit(f_work, data = cps)
summary(model1)
    
# Education choice (ordered probit) model
f_educ <- educ ~ age + I(age ^ 2) + sbachelor + smaster
model2 <- mvoprobit(f_educ, data = cps)
summary(model2)   

# Labor supply with endogenous education 
# treatment (recursive or hierarchical ordered probit) model
model3 <- mvoprobit(list(f_work, f_educ), data = cps)  
summary(model3)         

# Sample selection (on employment) Heckman's model
f_lwage <- lwage ~ age + I(age ^ 2) + bachelor + master + health
cps$lwage[cps$work == 0] <- Inf
model4 <- mvoprobit(f_work, f_lwage, data = cps)
summary(model4)

# Endogenous education treatment with non-random sample selection
model5 <- mvoprobit(list(f_work, f_educ), f_lwage, data = cps)
summary(model5)
 
# Endogenous switching model with non-random sample selection
groups <- cbind(c(1, 1, 1, 0, 0, 0),
                c(0, 1, 2, 0, 1, 2))
groups2 <- matrix(c(0, 1, 2, -1, -1, -1), ncol = 1)
f_lwage2 <- lwage ~ age + I(age ^ 2) + health
model6 <- mvoprobit(list(f_work, f_educ), f_lwage2,
                    groups = groups, groups2 = groups2,
                    data = cps)
summary(model6)
# }
 
# -------------------------------
# Simulated data example 1
# Ordered probit model
# -------------------------------
# ---
# Step 1
# Simulation of data
# ---

# Load required package
library("mnorm")

# Set seed for reproducibility
set.seed(123)

# The number of observations
n <- 1000

# Regressors (covariates)
w1 <- runif(n = n, min = -1, max = 1)
w2 <- runif(n = n, min = -1, max = 1)

# Random errors
u <- rnorm(n = n, mean = 0, sd = 1)

# Coefficients
gamma <- c(-1, 2)

# Linear index
li <- gamma[1] * w1 + gamma[2] * w2

# Latent variable
z_star <- li + u

# Cuts
cuts <- c(-1, 0.5, 2)

# Observable ordered outcome
z <- rep(0, n)
z[(z_star > cuts[1]) & (z_star <= cuts[2])] <- 1
z[(z_star > cuts[2]) & (z_star <= cuts[3])] <- 2
z[z_star > cuts[3]] <- 3
table(z)

# Data
data <- data.frame(w1 = w1, w2 = w2, z = z)

# ---
# Step 2
# Estimation of parameters
# ---

# Estimation
model <- mvoprobit(z ~ w1 + w2,
                   data = data)
summary(model)
# \donttest{
# Compare estimates and true values of parameters
  # regression coefficients
cbind(true = gamma, estimate = model$coef[[1]])
  # cuts
cbind(true = cuts, estimate = model$cuts[[1]])   

# ---
# Step 3
# Estimation of probabilities and marginal effects
# ---

# Predict probability of dependent variable
# equals to 2 for every observation in a sample.
# P(z = 2)
prob <- predict(model, group = 2, type = "prob")
head(prob)

# Calculate mean marginal effect of w2 on P(z = 1)
mean(predict(model, group = 1, type = "prob", me = "w2"))

# Calculate probabilities and marginal effects
# for manually provided observations.
  # new data
newdata <- data.frame(z = c(1, 1), 
                      w1 = c(0.5, 0.2), 
                      w2 = c(-0.3, 0.8))
  # probability P(z = 2)
predict(model, group = 2, type = "prob", newdata = newdata)
  # linear index
predict(model, type = "li", newdata = newdata)  
  # marginal effect of w1 on P(z = 2)
predict(model, group = 2, type = "prob", newdata = newdata, me = "w1")
  # marginal effect of w1 and w2 on P(z = 3)
predict(model, group = 3, type = "prob", 
        newdata = newdata, me = c("w1", "w2"))
  # marginal effect of w2 on the linear index
predict(model, group = 2, type = "li", newdata = newdata, me = "w2")
  # discrete marginal effect i.e. P(z = 2 | w1 = 0.5) - P(z = 2 | w1 = 0.2)
predict(model, group = 2, type = "prob", newdata = newdata, 
        me = "w2", eps = c(0.2, 0.5))
  # adjusted conditional expectation for endogenous switching and 
  # sample selection models with continuous outcome with random error 'e'
  # E(e | z == 2) / cov(e, u)
  # where joint distribution of 'e' and 'u' determined by
  # Gaussian copula and 'e' is normally distributed
predict(model, group = 2, type = "lambda", newdata = newdata)        

# ---
# Step 4
# Ordered logit model
# ---

# Estimate ordered logit model with a unit variance
# that is just a matter of reparametrization i.e.
# do not affect signs and significance of coefficients
# and dot not affect at all marginal effects
logit <- mvoprobit(z ~ w1 + w2,
                   data = data,
                   marginal = "logistic")
summary(logit)

# Compare ordered probit and ordered logit models
# using Akaike and Bayesian information criteria
  # AIC
c(probit = AIC(model), logit = AIC(logit))
  # BIC
c(probit = BIC(model), logit = BIC(logit))

# Calculation of probabilities and marginal effects is identical
# to the previous example
  # probability P(z = 1)
predict(logit, group = 1, type = "prob", newdata = newdata)
  # marginal effect of w2 on P(z = 1)
predict(logit, group = 1, type = "prob", newdata = newdata, me = "w2")
  # E(e | z == 1) / cov(e, u)
predict(logit, group = 1, type = "lambda", newdata = newdata)    

# ---
# Step 5
# Semiparametric model with Gallant and Nychka distribution
# ---

pgn <- mvoprobit(z ~ w1 + w2,
                 data = data,
                 marginal = list("PGN" = 3))
summary(pgn)

# Calculation of probabilities and marginal effects is identical
# to the previous example
  # probability P(z = 3)
predict(pgn, group = 3, type = "prob", newdata = newdata)
  # marginal effect of w2 on P(z = 3)
predict(pgn, group = 3, type = "prob", newdata = newdata, me = "w2")
  # E(e | z == 3) / cov(e, u)
predict(pgn, group = 3, type = "lambda", newdata = newdata)  

# Test normality assumption via likelihood ratio test
lrtest(model, pgn)
# }
 
# \donttest{
# -------------------------------
# Simulated data example 2
# Heteroscedastic ordered 
# probit model
# -------------------------------

# Load required package
library("mnorm")

# ---
# Step 1
# Simulation of data
# ---

# Set seed for reproducibility
set.seed(123)

# The number of observations
n <- 1000

# Regressors (covariates)
w1 <- runif(n = n, min = -1, max = 1)
w2 <- runif(n = n, min = -1, max = 1)
w3 <- runif(n = n, min = -1, max = 1)

# Random errors
u <- rnorm(n, mean = 0, sd = 1)

# Coefficients of mean equation
gamma <- c(-1, 2)

# Coefficients of variance equation
gamma_het <- c(0.5, -1)

# Linear index of mean equation
li <- gamma[1] * w1 + gamma[2] * w2

# Linear index of variance equation
li_het <- gamma_het[1] * w2 + gamma_het[2] * w3

# Heteroscedastic stdandard deviation
# i.e. value of variance equation
sd_het <- exp(li_het)

# Latent variable
z_star <- li + u * sd_het

# Cuts
cuts <- c(-1, 0.5, 2)

# Observable ordered outcome
z <- rep(0, n)
z[(z_star > cuts[1]) & (z_star <= cuts[2])] <- 1
z[(z_star > cuts[2]) & (z_star <= cuts[3])] <- 2
z[z_star > cuts[3]] <- 3
table(z)

# Data
data <- data.frame(w1 = w1, w2 = w2, w3 = w3, z = z)

# ---
# Step 2
# Estimation of parameters
# ---

# Estimation
model <- mvoprobit(z ~ w1 + w2 | w2 + w3,
                   data = data)
summary(model)

# Compare estimates and true values of parameters
  # regression coefficients of mean equation
cbind(true = gamma, estimate = model$coef[[1]])
  # regression coefficients of variance equation
cbind(true = gamma_het, estimate = model$coef_var[[1]])  
  # cuts
cbind(true = cuts, estimate = model$cuts[[1]])   

# Test for homoscedasticity
model0 <- mvoprobit(z ~ w1 + w2, data = data)
lrtest(model, model0)

# ---
# Step 3
# Estimation of probabilities and marginal effects
# ---

# Predict probability of dependent variable
# equals to 2 for every observation in a sample.
# P(z = 2)
prob <- predict(model, group = 2, type = "prob")
head(prob)

# Calculate mean marginal effect of w2 on P(z = 1)
mean(predict(model, group = 1, type = "prob", me = "w2"))

# Calculate corresponding probability, linear
# index and heteroscedastic standard deviations for 
# manually provided observations.
  # new data
newdata <- data.frame(z = c(1, 1), 
                      w1 = c(0.5, 0.2), 
                      w2 = c(-0.3, 0.8),
                      w3 = c(0.6, 0.1))
  # probability P(z = 2)
predict(model, group = 2, type = "prob", newdata = newdata)
  # linear index
predict(model, type = "li", newdata = newdata)
  # standard deviation
predict(model, type = "sd", newdata = newdata)
  # marginal effect of w3 on P(Z = 3)
predict(model, group = 3, type = "prob", newdata = newdata, me = "w3")
  # marginal effect of w2 on the standard error
predict(model, group = 2, type = "sd", newdata = newdata, me = "w2")
  # discrete marginal effect i.e. P(Z = 2 | w1 = 0.5) - P(Z = 2 | w1 = 0.2)
predict(model, group = 2, type = "prob", newdata = newdata,
        me = "w2", eps = c(0.2, 0.5))
# }

# \donttest{
# -------------------------------
# Simulated data example 3
# Bivariate ordered probit model
# with heteroscedastic second
# equation
# -------------------------------

# Load required package
library("mnorm")

# ---
# Step 1
# Simulation of data
# ---

# Set seed for reproducibility
set.seed(123)

# The number of observations
n <- 1000

# Regressors (covariates)
w1 <- runif(n = n, min = -1, max = 1)
w2 <- runif(n = n, min = -1, max = 1)
w3 <- runif(n = n, min = -1, max = 1)
w4 <- runif(n = n, min = -1, max = 1)

# Covariance matrix of random errors
rho <- 0.5
sigma <- matrix(c(1,   rho,
                  rho, 1), 
                nrow = 2)

# Random errors
u <- mnorm::rmnorm(n = n, mean = c(0, 0), sigma = sigma)

# Coefficients
gamma1 <- c(-1, 2)
gamma2 <- c(1, 1.5)

# Coefficients of variance equation
gamma2_het <- c(0.5, -1)

# Linear indexes
li1 <- gamma1[1] * w1 + gamma1[2] * w2
li2 <- gamma2[1] * w2 + gamma2[2] * w3

# Linear index of variance equation
li2_het <- gamma2_het[1] * w2 + gamma2_het[2] * w4

# Heteroscedastic stdandard deviation
# i.e. value of variance equation
sd2_het <- exp(li2_het)

# Latent variables
z1_star <- li1 + u[, 1]
z2_star <- li2 + u[, 2] * sd2_het

# Cuts
cuts1 <- c(-1, 0.5, 2)
cuts2 <- c(-2, 0)

# Observable ordered outcome
  # first outcome
z1 <- rep(0, n)
z1[(z1_star > cuts1[1]) & (z1_star <= cuts1[2])] <- 1
z1[(z1_star > cuts1[2]) & (z1_star <= cuts1[3])] <- 2
z1[z1_star > cuts1[3]] <- 3
  # second outcome
z2 <- rep(0, n)
z2[(z2_star > cuts2[1]) & (z2_star <= cuts2[2])] <- 1
z2[z2_star > cuts2[2]] <- 2
  # distribution
table(z1, z2)

# Data
data <- data.frame(w1 = w1, w2 = w2, 
                   w3 = w3, w4 = w4,
                   z1 = z1, z2 = z2)

# ---
# Step 2
# Estimation of parameters
# ---

# Estimation
model <- mvoprobit(list(z1 ~ w1 + w2,
                        z2 ~ w2 + w3 | w2 + w4),
                   data = data)
summary(model)

# Compare estimates and true values of parameters
  # regression coefficients of the first equation
cbind(true = gamma1, estimate = model$coef[[1]])
  # regression coefficients of the second equation
cbind(true = gamma2, estimate = model$coef[[2]])
  # cuts of the first equation
cbind(true = cuts1, estimate = model$cuts[[1]])   
  # cuts of the second equation
cbind(true = cuts2, estimate = model$cuts[[2]]) 
  # correlation coefficients
cbind(true = rho, estimate = model$sigma[1, 2])  
  # regression coefficients of variance equation
cbind(true = gamma2_het, estimate = model$coef_var[[2]])

# ---
# Step 3
# Estimation of probabilities and linear indexes
# ---

# Predict probability P(z1 = 2, z2 = 0)
prob <- predict(model, group = c(2, 0), type = "prob")
head(prob)

# Calculate mean marginal effect of w2 on:
  # P(z1 = 1)
mean(predict(model, group = c(1, -1), type = "prob", me = "w2"))
  # P(z1 = 1, z2 = 0)
mean(predict(model, group = c(1, 0), type = "prob", me = "w2"))

# Calculate corresponding probability and linear 
# index for manually provided observations.
  # new data
newdata <- data.frame(z1 = c(1, 1), 
                      z2 = c(1, 1),
                      w1 = c(0.5, 0.2), 
                      w2 = c(-0.3, 0.8),
                      w3 = c(0.6, 0.1),
                      w4 = c(0.3, -0.5))
  # probability P(z1 = 2, z2 = 0)
predict(model, group = c(2, 0), type = "prob", newdata = newdata)
  # linear index
predict(model, type = "li", newdata = newdata)  
  # marginal probability P(z2 = 1)
predict(model, group = c(-1, 1), type = "prob", newdata = newdata) 
  # marginal probability P(z1 = 3)
predict(model, group = c(3, -1), type = "prob", newdata = newdata)
  # conditional probability P(z1 = 2 | z2 = 0)
predict(model, group = c(2, 0), given_ind = 2,
        type = "prob", newdata = newdata) 
  # conditional probability P(z2 = 1 | z1 = 3)
predict(model, group = c(3, 1), given_ind = 1,
        type = "prob", newdata = newdata) 
  # marginal effect of w4 on P(Z2 = 2)
predict(model, group = c(-1, 2),
        type = "prob", newdata = newdata, me = "w4")   
  # marginal effect of w4 on P(z1 = 3, Z2 = 2)
predict(model, group = c(3, 2),
        type = "prob", newdata = newdata, me = "w4") 
  # marginal effect of w4 on P(z1 = 3 | z2 = 2)
predict(model, group = c(3, 2), given_ind = 2,
        type = "prob", newdata = newdata, me = "w4")         

# ---
# Step 4
# Replication under non-random sample selection
# ---

# Suppose that z2 is unobservable when z1 = 1 or z1 = 3
z2[(z1 == 1) | (z1 == 3)] <- -1
data$z2 <- z2

# Replicate estimation procedure
model <- mvoprobit(list(z1 ~ w1 + w2,
                        z2 ~ w2 + w3 | w2 + w4),
                   cov_type = "GOP",
                   data = data)
summary(model)

# Compare estimates and true values of parameters
  # regression coefficients of the first equation
cbind(true = gamma1, estimate = model$coef[[1]])
  # regression coefficients of the second equation
cbind(true = gamma2, estimate = model$coef[[2]])
  # cuts of the first equation
cbind(true = cuts1, estimate = model$cuts[[1]])   
  # cuts of the second equation
cbind(true = cuts2, estimate = model$cuts[[2]]) 
  # correlation coefficients
cbind(true = rho, estimate = model$sigma[1, 2])  
  # regression coefficients of variance equation
cbind(true = gamma2_het, estimate = model$coef_var[[2]])

# ---
# Step 5
# Semiparametric model with marginal logistic and PGN distributions
# ---

# Estimate the model
model <- mvoprobit(list(z1 ~ w1 + w2,
                        z2 ~ w2 + w3 | w2 + w4),
                   data = data,
                   marginal = list(PGN = 3, logistic = NULL))
summary(model)
# }

# \donttest{
# -------------------------------
# Simulated data example 4
# Heckman model with
# ordered selection mechanism
# -------------------------------

# Load required package
library("mnorm")

# ---
# Step 1
# Simulation of data
# ---

# Set seed for reproducibility
set.seed(123)

# The number of observations
n <- 1000

# Regressors (covariates)
w1 <- runif(n = n, min = -1, max = 1)
w2 <- runif(n = n, min = -1, max = 1)
w3 <- runif(n = n, min = -1, max = 1)

# Random errors
rho <- 0.5
var.y <- 0.3
sd.y <- sqrt(var.y)
sigma <- matrix(c(1,          rho * sd.y,
                  rho * sd.y, var.y),
                nrow = 2)
errors <- mnorm::rmnorm(n = n, mean = c(0, 0), sigma = sigma)
u <- errors[, 1]
eps <- errors[, 2]

# Coefficients
gamma <- c(-1, 2)
beta <- c(1, -1, 1)

# Linear index
li <- gamma[1] * w1 + gamma[2] * w2
li.y <- beta[1] + beta[2] * w1 + beta[3] * w3

# Latent variable
z_star <- li + u
y_star <- li.y + eps

# Cuts
cuts <- c(-1, 0.5, 2)

# Observable ordered outcome
z <- rep(0, n)
z[(z_star > cuts[1]) & (z_star <= cuts[2])] <- 1
z[(z_star > cuts[2]) & (z_star <= cuts[3])] <- 2
z[z_star > cuts[3]] <- 3
table(z)

# Observable continuous outcome such
# that outcome 'y' is observable only 
# when 'z > 1' and unobservable otherwise
# i.e. when 'z <= 1' we code 'y' as 'Inf'
y <- y_star
y[z <= 1] <- Inf

# Data
data <- data.frame(w1 = w1, w2 = w2, w3 = w3, 
                   z = z, y = y)

# ---
# Step 2
# Estimation of parameters
# ---

# Estimation
model <- mvoprobit(z ~ w1 + w2,
                   y ~ w1 + w3,
                   data = data)
summary(model)

# Compare estimates and true values of parameters
  # regression coefficients of ordered equation
cbind(true = gamma, estimate = model$coef[[1]])
  # cuts
cbind(true = cuts, estimate = model$cuts[[1]])   
  # regression coefficients of continuous equation
cbind(true = beta, estimate = as.numeric(model$coef2[[1]]))  
  # variance
cbind(true = var.y, estimate = as.numeric(model$var2[[1]]))
  # covariance
cbind(true = rho * sd.y, estimate = as.numeric(model$cov2[[1]]))

# ---
# Step 3
# Estimation of expectations and marginal effects
# ---

# New data
newdata <- data.frame(z = 1, 
                      y = 1,
                      w1 = 0.1, 
                      w2 = 0.2,
                      w3 = 0.3)

# Predict unconditional expectation of the dependent variable
predict(model, group2 = 0, newdata = newdata)

# Predict conditional expectations of the dependent variable
  # E(y | z == 2)
predict(model, group = 2, group2 = 0, newdata = newdata)
  # E(y | z == 0)
predict(model, group = 0, group2 = 0, newdata = newdata)

# ---
# Step 4
# Classical Heckman's two-step estimation procedure
# ---

# Predict adjusted conditional expectations
lambda2 <- predict(model, group = 2, type = "lambda")
lambda3 <- predict(model, group = 3, type = "lambda")

# Construct variable responsible for adjusted
# conditional expectation in linear regression equation
data$lambda <- NA
data$lambda[model$data$z == 2] <- lambda2[model$data$z == 2]
data$lambda[model$data$z == 3] <- lambda3[model$data$z == 3]

# Alternatively simply get this variable from the estimation output
# of a selection part of the model
model_probit <- mvoprobit(z ~ w1 + w2, data = data)
data$lambda <- model_probit$lambda

# Estimate model via classical least squares
model_lm <- lm(y ~ w1 + w3, data = data[!is.infinite(data$y), ])
summary(model_lm)

# Estimate model via two-step procedure
model_ts <- lm(y ~ w1 + w3 + lambda, data = data[!is.infinite(data$y), ])
summary(model_ts)

# Automatic estimation of two-step model with robust standard errors
model_ts <- mvoprobit(z ~ w1 + w2,
                      y ~ w1 + w3,
                      data = data, 
                      estimator = "2step")
summary(model_ts)

# Check estimates accuracy
tbl <- cbind(true = beta, 
             ls = coef(model_lm),
             ml = model$coef2[[1]][1, ],
             twostep = model_ts$coef2[[1]][1, ])  
print(tbl)

# ---
# Step 5
# Semiparametric estimation procedure
# ---

# Estimate the model using Lee's method 
# assuming logistic distribution of
# random errors of the selection equation
model_Lee <- mvoprobit(z ~ w1 + w2, 
                       y ~ w1 + w3,
                       data = data, 
                       marginal = list(logistic = NULL),
                       estimator = "2step")
summary(model_Lee)

# One step estimation is also available as well
# as more complex marginal distributions.
# Consider random errors in selection equation
# following PGN distribution with three parameters.
model_sp <- mvoprobit(z ~ w1 + w2, 
                      y ~ w1 + w3,
                      data = data, 
                      marginal = list(PGN = 3))
summary(model_sp)

# To additionally relax normality assumption of
# random error of continuous equation it is possible
# to use Newey's two-step procedure.
model_Newey <- mvoprobit(z ~ w1 + w2, 
                         y ~ w1 + w3,
                         data = data,
                         marginal = list(PGN = 3),
                         estimator = "2step",
                         degrees = 2, 
                         cov_type = "nonparametric")
summary(model_Newey)
# }

# \donttest{
# -------------------------------
# Simulated data example 5
# Endogenous switching model
# with heteroscedastic
# ordered selection mechanism
# -------------------------------

# Load required package
library("mnorm")

# ---
# Step 1
# Simulation of data
# ---

# Set seed for reproducibility
set.seed(123)

# The number of observations
n <- 1000

# Regressors (covariates)
w1 <- runif(n = n, min = -1, max = 1)
w2 <- runif(n = n, min = -1, max = 1)
w3 <- runif(n = n, min = -1, max = 1)

# Random errors
rho_0 <- -0.8
rho_1 <- -0.7
var2_0 <- 0.9
var2_1 <- 1
sd_y_0 <- sqrt(var2_0)
sd_y_1 <- sqrt(var2_1)
cor_y_01 <- 0.7
cov2_01 <- sd_y_0 * sd_y_1 * cor_y_01
cov2_z_0 <- rho_0 * sd_y_0
cov2_z_1 <- rho_1 * sd_y_1
sigma <- matrix(c(1,        cov2_z_0, cov2_z_1,
                  cov2_z_0, var2_0,   cov2_01,
                  cov2_z_1, cov2_01,  var2_1),
                nrow = 3)
errors <- mnorm::rmnorm(n = n, mean = c(0, 0, 0), sigma = sigma)
u <- errors[, 1]
eps_0 <- errors[, 2]
eps_1 <- errors[, 3]

# Coefficients
gamma <- c(-1, 2)
gamma_het <- c(0.5, -1)
beta_0 <- c(1, -1, 1)
beta_1 <- c(2, -1.5, 0.5)

# Linear index of ordered equation
  # mean
li <- gamma[1] * w1 + gamma[2] * w2
  # variance
li_het <- gamma_het[1] * w2 + gamma_het[2] * w3

# Linear index of continuous equation
  # regime 0
li_y_0 <- beta_0[1] + beta_0[2] * w1 + beta_0[3] * w3
  # regime 1
li_y_1 <- beta_1[1] + beta_1[2] * w1 + beta_1[3] * w3

# Latent variable
z_star <- li + u * exp(li_het)
y_0_star <- li_y_0 + eps_0
y_1_star <- li_y_1 + eps_1

# Cuts
cuts <- c(-1, 0.5, 2)

# Observable ordered outcome
z <- rep(0, n)
z[(z_star > cuts[1]) & (z_star <= cuts[2])] <- 1
z[(z_star > cuts[2]) & (z_star <= cuts[3])] <- 2
z[z_star > cuts[3]] <- 3
table(z)

# Observable continuous outcome such
# that outcome 'y' is 
# in regime 1 when 'z == 1', 
# in regime 0 when 'z <= 1',
# unobservable when 'z == 0'
y <- rep(NA, n)
y[z == 0] <- Inf
y[z == 1] <- y_0_star[z == 1]
y[z > 1] <- y_1_star[z > 1]

# Data
data <- data.frame(w1 = w1, w2 = w2, w3 = w3, 
                   z = z, y = y)

# ---
# Step 2
# Estimation of parameters
# ---

# Assign groups
groups <- matrix(0:3, ncol = 1)
groups2 <- matrix(c(-1, 0, 1, 1), ncol = 1)

# Estimation
model <- mvoprobit(list(z ~ w1 + w2 | w2 + w3),
                   list(y ~ w1 + w3),
                   groups = groups, groups2 = groups2,
                   data = data)
summary(model)

# Compare estimates and true values of parameters
  # regression coefficients of ordered equation
cbind(true = gamma, estimate = model$coef[[1]])
cbind(true = gamma_het, estimate = model$coef_var[[1]])
  # cuts
cbind(true = cuts, estimate = model$cuts[[1]])   
  # regression coefficients of continuous equation
cbind(true = beta_0, estimate = model$coef2[[1]][1, ]) 
cbind(true = beta_1, estimate = model$coef2[[1]][2, ])
  # variances
cbind(true = c(var2_0, var2_1), estimate = model$var2[[1]]) 
  # covariances
cbind(true = c(cov2_z_0, cov2_z_1), estimate = model$cov2[[1]])

# ---
# Step 3
# Estimation of expectations and marginal effects
# ---

# New data
newdata <- data.frame(z = 1, 
                      y = 1,
                      w1 = 0.1, 
                      w2 = 0.2,
                      w3 = 0.3)

# Predict unconditional expectation of the dependent variable
  # regime 0 
predict(model, group2 = 0, newdata = newdata)
  # regime 1
predict(model, group2 = 1, newdata = newdata)

# Predict conditional expectations of the dependent variable
# given condition 'z == 0' for regime 1 i.e. E(y1 | z = 0)
predict(model, group = 0, group2 = 1, newdata = newdata)
# }

# \donttest{
# -------------------------------
# Simulated data example 6
# Endogenous switching model with
# multivariate heteroscedastic
# ordered selection mechanism
# -------------------------------

# Load required package
library("mnorm")

# ---
# Step 1
# Simulation of data
# ---

# Set seed for reproducibility
set.seed(123)

# The number of observations
n <- 1000

# Regressors (covariates)
w1 <- runif(n = n, min = -1, max = 1)
w2 <- runif(n = n, min = -1, max = 1)
w3 <- runif(n = n, min = -1, max = 1)
w4 <- runif(n = n, min = -1, max = 1)

# Random errors
rho_z1_z2 <- 0.5
rho_y0_z1 <- 0.6
rho_y0_z2 <- 0.7
rho_y1_z1 <- 0.65
rho_y1_z2 <- 0.75
var20 <- 0.9
var21 <- 1
sd_y0 <- sqrt(var20)
sd_y1 <- sqrt(var21)
cor_y01 <- 0.7
cov201 <- sd_y0 * sd_y1 * cor_y01
cov20_z1 <- rho_y0_z1 * sd_y0
cov21_z1 <- rho_y1_z1 * sd_y1
cov20_z2 <- rho_y0_z2 * sd_y0
cov21_z2 <- rho_y1_z2 * sd_y1
sigma <- matrix(c(1,         rho_z1_z2,  cov20_z1,  cov21_z1, 
                  rho_z1_z2, 1,          cov20_z2,  cov21_z2,
                  cov20_z1,  cov20_z2,   var20,     cov201,
                  cov21_z1,  cov21_z2,   cov201,    var21),
                nrow = 4)
errors <- mnorm::rmnorm(n = n, mean = c(0, 0, 0, 0), sigma = sigma)
u1 <- errors[, 1]
u2 <- errors[, 2]
eps0 <- errors[, 3]
eps1 <- errors[, 4]

# Coefficients
gamma1 <- c(-1, 2)
gamma1_het <- c(0.5, -1)
gamma2 <- c(1, 1)
beta0 <- c(1, -1, 1, -1.2)
beta1 <- c(2, -1.5, 0.5, 1.2)

# Linear index of ordered equation
  # mean
li1 <- gamma1[1] * w1 + gamma1[2] * w2
li2 <- gamma2[1] * w1 + gamma2[2] * w3
  # variance
li1_het <- gamma1_het[1] * w2 + gamma1_het[2] * w3

# Linear index of continuous equation
  # regime 0
li_y0 <- beta0[1] + beta0[2] * w1 + beta0[3] * w3 + beta0[4] * w4
  # regime 1
li_y1 <- beta1[1] + beta1[2] * w1 + beta1[3] * w3 + beta1[4] * w4

# Latent variables
z1_star <- li1 + u1 * exp(li1_het)
z2_star <- li2 + u2
y0_star <- li_y0 + eps0
y1_star <- li_y1 + eps1

# Cuts
cuts1 <- c(-1, 1)
cuts2 <- c(0)

# Observable ordered outcome
  # first
z1 <- rep(0, n)
z1[(z1_star > cuts1[1]) & (z1_star <= cuts1[2])] <- 1
z1[z1_star > cuts1[2]] <- 2
  # second
z2 <- rep(0, n)
z2[z2_star > cuts2[1]] <- 1
table(z1, z2)

# Observable continuous outcome such
# that outcome 'y' is 
# in regime 0 when 'z1 == 1', 
# in regime 1 when 'z1 == 0' or 'z1 == 2',
# unobservable when 'z2 == 0'
y <- rep(NA, n)
y[z1 == 1] <- y0_star[z1 == 1]
y[z1 != 1] <- y1_star[z1 != 1]
y[z2 == 0] <- Inf

# Data
data <- data.frame(w1 = w1, w2 = w2, w3 = w3, w4 = w4,
                   z1 = z1, z2 = z2, y = y)

# ---
# Step 2
# Estimation of parameters
# ---

# Assign groups
groups <- matrix(c(0, 0,
                   0, 1,
                   1, 0,
                   1, 1,
                   2, 0,
                   2, 1), 
                 byrow = TRUE, ncol = 2)
groups2 <- matrix(c(-1, 1, -1, 0, -1, 1), ncol = 1)

# Estimation
model <- mvoprobit(list(z1 ~ w1 + w2 | w2 + w3,
                        z2 ~ w1 + w3),
                   list(y ~ w1 + w3 + w4),
                   groups = groups, groups2 = groups2,
                   data = data)
summary(model)

# Compare estimates and true values of parameters
  # regression coefficients of the first ordered equation
cbind(true = gamma1, estimate = model$coef[[1]])
cbind(true = gamma1_het, estimate = model$coef_var[[1]])
  # regression coefficients of the second ordered equation
cbind(true = gamma2, estimate = model$coef[[2]])
  # cuts
cbind(true = cuts1, estimate = model$cuts[[1]])   
cbind(true = cuts2, estimate = model$cuts[[2]])  
  # regression coefficients of continuous equation
cbind(true = beta0, estimate = model$coef2[[1]][1, ]) 
cbind(true = beta1, estimate = model$coef2[[1]][2, ])
  # variances
cbind(true = c(var20, var21), estimate = model$var2[[1]]) 
  # covariances
cbind(true = c(cov20_z1, cov20_z2), estimate = model$cov2[[1]][1, ])
cbind(true = c(cov21_z1, cov21_z2), estimate = model$cov2[[1]][2, ])

# ---
# Step 3
# Estimation of expectations and marginal effects
# ---

# New data
newdata <- data.frame(z1 = 1, 
                      z2 = 1,
                      y = 1,
                      w1 = 0.1, 
                      w2 = 0.2,
                      w3 = 0.3,
                      w4 = 0.4)

# Predict unconditional expectation of the dependent variable
  # regime 0 
predict(model, group2 = 0, newdata = newdata)
  # regime 1
predict(model, group2 = 1, newdata = newdata)

# Predict conditional expectations of the dependent variable
# E(y1 | z1 = 2, z2 = 1)
predict(model, group = c(2, 1), group2 = 1, newdata = newdata)

# Marginal effect of w3 on E(y1 | z1 = 2, z2 = 1)
predict(model, group = c(2, 1), group2 = 1, newdata = newdata, me = "w3")

# ---
# Step 4
# Two-step estimation procedure
# ---

# For comparison reasons let's estimate the model
# via least squres
# Estimate model via classical least squares for a benchmark
model.ls.0 <- lm(y ~ w1 + w3 + w4,
                 data = data[!is.infinite(data$y) & (data$z1 == 1), ])
model.ls.1 <- lm(y ~ w1 + w3 + w4,
                 data = data[!is.infinite(data$y) & (data$z1 != 1), ])


# Apply two-step procedure
model_ts <-  mvoprobit(list(z1 ~ w1 + w2 | w2 + w3,
                            z2 ~ w1 + w3),
                       y ~ w1 + w3 + w4,
                       groups = groups, groups2 = groups2,
                       estimator = "2step",
                       data = data)
summary(model_ts)

# Use two-step procedure with logistic marginal distributions
# that is multivariate generalization of Lee's method
model_Lee <-  mvoprobit(list(z1 ~ w1 + w2 | w2 + w3,
                             z2 ~ w1 + w3),
                             y ~ w1 + w3 + w4,
                        marginal = list(logistic = NULL, logistic = NULL),
                        groups = groups, groups2 = groups2,
                        estimator = "2step",
                        data = data)
                        
# Apply multivariate generalization of Newey's method
model_Newey <-  mvoprobit(list(z1 ~ w1 + w2 | w2 + w3,
                               z2 ~ w1 + w3),
                               y ~ w1 + w3 + w4,
                          marginal = list(logistic = NULL, logistic = NULL),
                          degrees = c(2, 3),
                          groups = groups, groups2 = groups2,
                          estimator = "2step",
                          data = data)

# Compare accuracy of the methods
  # beta0
tbl <- cbind(true = beta0, 
             ls = coef(model.ls.0),
             ml = model$coef2[[1]][1, ],
             twostep = model_ts$coef2[[1]][1, ],
             Lee = model_Lee$coef2[[1]][1, ],
             Newey = model_Newey$coef2[[1]][1, ])  
print(tbl)
  # beta1
tbl <- cbind(true = beta1, 
             ls = coef(model.ls.1),
             ml = model$coef2[[1]][2, ],
             twostep = model_ts$coef2[[1]][2, ],
             Lee = model_Lee$coef2[[1]][2, ],
             Newey = model_Newey$coef2[[1]][2, ])  
print(tbl)
# }


# \donttest{
# -------------------------------
# Simulated data example 7
# Endogenous multivariate
# switching model with
# multivariate heteroscedastic
# ordered selection mechanism
# -------------------------------

# Load required package
library("mnorm")

# ---
# Step 1
# Simulation of data
# ---

# Set seed for reproducibility
set.seed(123)

# The number of observations
n <- 1000

# Regressors (covariates)
w1 <- runif(n = n, min = -1, max = 1)
w2 <- runif(n = n, min = -1, max = 1)
w3 <- runif(n = n, min = -1, max = 1)
w4 <- runif(n = n, min = -1, max = 1)
w5 <- runif(n = n, min = -1, max = 1)

# Random errors
var20 <- 0.9
var21 <- 1
var_g0 <- 1.1
var_g1 <- 1.2
var_g2 <- 1.3
A <- rWishart(1, 7, diag(7))[, , 1]
B <- diag(sqrt(c(1, 1, var20, var21, 
                 var_g0, var_g1, var_g2)))
sigma <-  B  %*% cov2cor(A) %*% B
errors <- mnorm::rmnorm(n = n, mean = rep(0, nrow(sigma)), sigma = sigma)
u1 <- errors[, 1]
u2 <- errors[, 2]
eps0_y <- errors[, 3]
eps1_y <- errors[, 4]
eps0_g <- errors[, 5]
eps1_g <- errors[, 6]
eps2_g <- errors[, 7]

# Coefficients
gamma1 <- c(-1, 2)
gamma1_het <- c(0.5, -1)
gamma2 <- c(1, 1)
beta0_y <- c(1, -1, 1, -1.2)
beta1_y <- c(2, -1.5, 0.5, 1.2)
beta0_g <- c(-1, 1, 1, 1)
beta1_g <- c(1, -1, 1, 1)
beta2_g <- c(1, 1, -1, 1)

# Linear index of ordered equation
  # mean
li1 <- gamma1[1] * w1 + gamma1[2] * w2
li2 <- gamma2[1] * w1 + gamma2[2] * w3
  # variance
li1_het <- gamma1_het[1] * w2 + gamma1_het[2] * w3

# Linear index of the first continuous equation
  # regime 0
li_y0 <- beta0_y[1] + beta0_y[2] * w1 + beta0_y[3] * w3 + beta0_y[4] * w4
  # regime 1
li_y1 <- beta1_y[1] + beta1_y[2] * w1 + beta1_y[3] * w3 + beta1_y[4] * w4

# Linear index of the second continuous equation
  # regime 0
li_g0 <- beta0_g[1] + beta0_g[2] * w2 + beta0_g[3] * w3 + beta0_g[4] * w5
  # regime 1
li_g1 <- beta1_g[1] + beta1_g[2] * w2 + beta1_g[3] * w3 + beta1_g[4] * w5
  # regime 2
li_g2 <- beta2_g[1] + beta2_g[2] * w2 + beta2_g[3] * w3 + beta2_g[4] * w5

# Latent variables
z1_star <- li1 + u1 * exp(li1_het)
z2_star <- li2 + u2
y0_star <- li_y0 + eps0_y
y1_star <- li_y1 + eps1_y
g0_star <- li_g0 + eps0_g
g1_star <- li_g1 + eps1_g
g2_star <- li_g2 + eps2_g

# Cuts
cuts1 <- c(-1, 1)
cuts2 <- c(0)

# Observable ordered outcome
  # first
z1 <- rep(0, n)
z1[(z1_star > cuts1[1]) & (z1_star <= cuts1[2])] <- 1
z1[z1_star > cuts1[2]] <- 2
  # second
z2 <- rep(0, n)
z2[z2_star > cuts2[1]] <- 1
table(z1, z2)

# Observable continuous outcome such
# that outcome 'y' is 
# in regime 0 when 'z1 == 1', 
# in regime 1 when 'z1 == 0' or 'z1 == 2',
# unobservable when 'z2 == 0'
y <- rep(NA, n)
y[z1 == 1] <- y0_star[z1 == 1]
y[z1 != 1] <- y1_star[z1 != 1]
y[z2 == 0] <- Inf

#' # Observable continuous outcome such
# that outcome 'g' is 
# in regime 0 when 'z1 == z2', 
# in regime 1 when 'z1 > z2',
# in regime 2 when 'z1 < z2',
g <- rep(NA, n)
g[z1 == z2] <- g0_star[z1 == z2]
g[z1 > z2] <- g1_star[z1 > z2]
g[z1 < z2] <- g2_star[z1 < z2]

# Data
data <- data.frame(w1 = w1, w2 = w2, w3 = w3, w4 = w4, w5 = w5,
                   z1 = z1, z2 = z2, y = y, g = g)

# ---
# Step 2
# Estimation of parameters
# ---

# Assign groups
groups <- matrix(c(0, 0,
                   0, 1,
                   1, 0,
                   1, 1,
                   2, 0,
                   2, 1), 
                 byrow = TRUE, ncol = 2)
groups2 <- matrix(NA, nrow = nrow(groups), ncol = 2)
groups2[groups[, 1] == 1, 1] <- 0
groups2[(groups[, 1] == 0) | (groups[, 1] == 2), 1] <- 1
groups2[groups[, 2] == 0, 1] <- -1
groups2[groups[, 1] == groups[, 2], 2] <- 0
groups2[groups[, 1] > groups[, 2], 2] <- 1
groups2[groups[, 1] < groups[, 2], 2] <- 2
cbind(groups, groups2)

# Estimation
model <- mvoprobit(list(z1 ~ w1 + w2 | w2 + w3,
                        z2 ~ w1 + w3),
                   list(y ~ w1 + w3 + w4,
                        g ~ w2 + w3 + w5),
                   groups = groups, groups2 = groups2,
                   data = data)
summary(model)

# Compare estimates and true values of parameters
  # regression coefficients of the first ordered equation
cbind(true = gamma1, estimate = model$coef[[1]])
cbind(true = gamma1_het, estimate = model$coef_var[[1]])
  # regression coefficients of the second ordered equation
cbind(true = gamma2, estimate = model$coef[[2]])
  # cuts
cbind(true = cuts1, estimate = model$cuts[[1]])   
cbind(true = cuts2, estimate = model$cuts[[2]])  
  # regression coefficients of the first continuous equation
cbind(true = beta0_y, estimate = model$coef2[[1]][1, ]) 
cbind(true = beta1_y, estimate = model$coef2[[1]][2, ])
  # regression coefficients of the second continuous equation
cbind(true = beta0_g, estimate = model$coef2[[2]][1, ]) 
cbind(true = beta1_g, estimate = model$coef2[[2]][2, ])
cbind(true = beta2_g, estimate = model$coef2[[2]][3, ])
  # variances
cbind(true = c(var20, var21), estimate = model$var2[[1]]) 
cbind(true = c(var_g0, var_g1, var_g2), estimate = model$var2[[2]]) 
  # correlation between ordered equations
cbind(true = c(sigma[1, 2]), estimate = model$sigma[1, 2])
  # covariances between continious and ordered equations
cbind(true = sigma[1:2, 3], estimate = model$cov2[[1]][1, ])
cbind(true = sigma[1:2, 4], estimate = model$cov2[[1]][2, ])
cbind(true = sigma[1:2, 5], estimate = model$cov2[[2]][1, ])
cbind(true = sigma[1:2, 6], estimate = model$cov2[[2]][2, ])
cbind(true = sigma[1:2, 7], estimate = model$cov2[[2]][3, ])
  # covariances between continuous equations
cbind(true = c(sigma[4, 7], sigma[3, 5], sigma[4, 6]), 
      estimate = model$sigma2[[1]]) 
      
# ---
# Step 3
# Estimation of expectations and marginal effects
# ---

# New data
newdata <- data.frame(z1 = 1, 
                      z2 = 1,
                      y = 1,
                      g = 1,
                      w1 = 0.1, 
                      w2 = 0.2,
                      w3 = 0.3,
                      w4 = 0.4,
                      w5 = 0.5)

# Predict unconditional expectation of the dependent variable
  # regime 0 for 'y' and regime 1 for 'g' i.e. E(y0), E(g1)
predict(model, group2 = c(0, 1), newdata = newdata)

# Predict conditional expectations of the dependent variable
# E(y0 | z1 = 2, z2 = 1), E(g1 | z1 = 2, z2 = 1)
predict(model, group = c(2, 1), group2 = c(0, 1), newdata = newdata)

# Marginal effect of w3 on E(y1 | z1 = 2, z2 = 1) and E(g1 | z1 = 2, z2 = 1)
predict(model, group = c(2, 1), group2 = c(0, 1), 
        newdata = newdata, me = "w3")
# }

Run the code above in your browser using DataLab