ecoML: Fitting Parametric Models and Quantifying Missing Information for Ecological Inference in 2x2 Tables

Description

ecoML fits parametric models for ecological inference in $2 \times 2$ tables via Expectation-Maximization (EM) algorithms. The data are specified as proportions. In its most basic setting, the algorithm assumes that the individual-level proportions (i.e., $W_1$ and $W_2$) are distributed bivariate normally (after logit transformations). The function calculates point estimates of the parameters for models based on different assumptions. The standard errors of the point estimates are computed via the Supplemented EM (SEM) algorithm. Moreover, ecoML quantifies the amount of missing information associated with each parameter and allows researchers to examine the impact of missing information on parameter estimation in ecological inference. The models and algorithms are described in Imai, Lu and Strauss (Forthcoming).
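
As a rough illustration of the basic model described above, the following base-R sketch simulates logit-scale rates from a bivariate normal distribution, maps them back to proportions, and forms the column margin through the standard accounting identity $Y = X W_1 + (1 - X) W_2$. The parameter values, the variable names, and the uniform draw for $X$ are assumptions made for the example; this is not the package's internal code.

```r
set.seed(1)
n <- 200                          # number of 2x2 tables (hypothetical)
mu <- c(0, 0)                     # means of logit(W1), logit(W2) (assumed)
Sigma <- matrix(c(1.0, 0.3,
                  0.3, 1.0), 2, 2)  # covariance on the logit scale (assumed)

# draw bivariate normal logits using the Cholesky factor of Sigma
Z <- matrix(rnorm(2 * n), n, 2)
Wstar <- sweep(Z %*% chol(Sigma), 2, mu, "+")

W <- plogis(Wstar)                # inverse logit: back to proportions
X <- runif(n, 0.05, 0.95)         # row margin (uniform only for illustration)
Y <- X * W[, 1] + (1 - X) * W[, 2]  # column margin implied by W1, W2 and X

sim <- data.frame(Y = Y, X = X)
head(sim)
```

A data frame shaped like `sim` is the kind of input the formula interface expects, so a call such as `ecoML(Y ~ X, data = sim)` should be possible once the eco package is loaded.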

Usage

ecoML(formula, data = parent.frame(), N = NULL, supplement = NULL,
      theta.start = c(0,0,1,1,0), fix.rho = FALSE,
      context = FALSE, sem = TRUE, epsilon = 10^(-10),
      maxit = 1000, loglik = TRUE, hyptest = FALSE, verbose = FALSE)

Arguments

formula

A symbolic description of the model to be fit, specifying the column and row margins of $2 \times 2$ ecological tables. Y ~ X specifies Y as the column margin (e.g., turnout) and X (e.g., percent Afri

data

An optional data frame in which to interpret the variables in formula. The default is the environment in which ecoML is called.

N

An optional variable representing the size of the unit; e.g., the total number of voters.

supplement

An optional matrix of supplemental data. The matrix has two columns, which contain additional individual-level data such as survey data for $W_1$ and $W_2$, respectively. If NULL, no additional individual-level data are included

fix.rho

Logical. If TRUE, the correlation (when context=TRUE) or the partial correlation (when context=FALSE) between $W_1$ and $W_2$ is fixed through the estimation. For details, see Imai, Lu and Strauss(2

context

Logical. If TRUE, the contextual effect is also modeled. In this case, the row margin (i.e., X) and the individual-level rates (i.e., $W_1$ and $W_2$) are assumed to be distributed tri-variate normally (after logit transformations

sem

Logical. If TRUE, the standard errors of the parameter estimates, as well as the fraction of missing data, are estimated via the SEM algorithm. The default is TRUE.

theta.start

A numeric vector that specifies the starting values for the mean, variance, and covariance. When context = FALSE, the elements of theta.start correspond to ($E(W_1)$, $E(W_2)$, $var(W_1)$, $var(W_2)$, $cor(W_1,W_2

epsilon

A positive number that specifies the convergence criterion for the EM algorithm. The square root of epsilon is the convergence criterion for the SEM algorithm. The default is 10^(-10).

maxit

A positive integer that specifies the maximum number of iterations allowed before the convergence criterion is met. The default is 1000.

loglik

Logical. If TRUE, the value of the log-likelihood function at each iteration of EM is saved. The default is TRUE.

hyptest

Logical. If TRUE, the model is estimated under the null hypothesis that the means of $W_1$ and $W_2$ are the same. The default is FALSE.

verbose

Logical. If TRUE, the progress of the EM and SEM algorithms is printed to the screen. The default is FALSE.
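
The interplay of epsilon and maxit can be pictured with a generic fixed-point loop. This is only a sketch of how a tolerance and an iteration cap interact: the update function (`cos`), the stopping rule based on the largest absolute change, and the helper name `iterate_until` are assumptions for illustration, not ecoML's internal implementation.

```r
# Hypothetical helper: iterate theta <- update(theta) until the largest
# absolute change falls below epsilon or maxit iterations are used up.
iterate_until <- function(update, theta, epsilon = 10^(-10), maxit = 1000) {
  for (i in seq_len(maxit)) {
    theta_new <- update(theta)
    delta <- max(abs(theta_new - theta))
    theta <- theta_new
    if (delta < epsilon) {
      return(list(theta = theta, iters = i, converged = TRUE))
    }
  }
  list(theta = theta, iters = maxit, converged = FALSE)
}

epsilon <- 10^(-10)
tight <- iterate_until(cos, 1, epsilon = epsilon)        # EM-style tolerance
loose <- iterate_until(cos, 1, epsilon = sqrt(epsilon))  # SEM-style tolerance
c(tight = tight$iters, loose = loose$iters)
```

The looser tolerance, sqrt(epsilon), stops after fewer iterations, which mirrors the relationship between the EM and SEM criteria described under epsilon.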

Value

An object of class ecoML containing the following elements:

call

The matched call.

X

The row margin, $X$.

Y

The column margin, $Y$.

N

The size of each table, $N$.

context

The assumption under which the model is estimated. If context = FALSE, the CAR assumption is adopted and no contextual effect is modeled. If context = TRUE, the NCAR assumption is adopted and the contextual effect is modeled.

sem

Whether the SEM algorithm is used to estimate the standard errors and the observed information matrix for the parameter estimates.

fix.rho

Whether the correlation or the partial correlation between $W_1$ and $W_2$ is fixed in the estimation.

r12

If fix.rho = TRUE, the value at which $corr(W_1, W_2)$ is fixed.

epsilon

The precision criterion for EM convergence. $\sqrt{\epsilon}$ is the precision criterion for SEM convergence.

theta.sem

The ML estimates of $E(W_1)$, $E(W_2)$, $var(W_1)$, $var(W_2)$, and $cov(W_1,W_2)$. If context = TRUE, $E(X)$, $cov(W_1,X)$, and $cov(W_2,X)$ are also reported.

W

In-sample estimates of $W_1$ and $W_2$.

suff.stat

The sufficient statistics for theta.em.

iters.em

Number of EM iterations before convergence is achieved.

iters.sem

Number of SEM iterations before convergence is achieved.

loglik

The log-likelihood of the model when convergence is achieved.

loglik.log.em

A vector saving the value of the log-likelihood function at each iteration of the EM algorithm.

mu.log.em

A matrix saving the unweighted mean estimates of the logit-transformed individual-level proportions (i.e., $W_1$ and $W_2$) at each iteration of the EM process.

Sigma.log.em

A matrix saving the log of the variance estimates of the logit-transformed individual-level proportions (i.e., $W_1$ and $W_2$) at each iteration of the EM process. Note that non-transformed variances are displayed on the screen (when verbose = TRUE).

rho.fisher.em

A matrix saving the Fisher transformation of the estimated correlations between the logit-transformed individual-level proportions (i.e., $W_1$ and $W_2$) at each iteration of the EM process. Note that non-transformed correlations are displayed on the screen (when verbose = TRUE).

Moreover, when sem = TRUE, ecoML also outputs the following values:

DM

The matrix characterizing the rates of convergence of the EM algorithms. This information is also used to calculate the observed-data information matrix.

Icom

The (expected) complete-data information matrix estimated via the SEM algorithm. When context = FALSE and fix.rho = TRUE, Icom is 4 by 4. When context = FALSE and fix.rho = FALSE, Icom is 5 by 5. When context = TRUE, Icom is 9 by 9.

Iobs

The observed information matrix. The dimension of Iobs is the same as that of Icom.

Imiss

The difference between Icom and Iobs. The dimension of Imiss is the same as that of Icom.

Vobs

The (symmetrized) variance-covariance matrix of the ML parameter estimates. The dimension of Vobs is the same as that of Icom.

Vcom

The (expected) complete-data variance-covariance matrix. The dimension of Vcom is the same as that of Icom.

Vobs.original

The estimated variance-covariance matrix of the ML parameter estimates. The dimension of Vobs.original is the same as that of Icom.

Fmis

The fraction of missing information associated with each parameter estimate.

VFmis

The proportion of increased variance associated with each parameter estimate due to observed data.

Ieigen

The largest eigenvalue of Imiss.

Icom.trans

The complete-data information matrix for the Fisher-transformed parameters.

Iobs.trans

The observed-data information matrix for the Fisher-transformed parameters.

Fmis.trans

The fractions of missing information associated with the Fisher-transformed parameters.
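
The relationships among these matrices can be sketched with small, made-up inputs. The numbers below are purely illustrative, and the diagonal-ratio formulas used for the two missing-information summaries are an assumption about one simple way to define them, not a statement of what ecoML computes internally.

```r
# Made-up 2x2 information matrices for two parameters (illustrative only)
Icom <- matrix(c(10, 2,
                  2, 8), 2, 2)   # complete-data information
Iobs <- matrix(c( 6, 1,
                  1, 5), 2, 2)   # observed-data information

Imiss <- Icom - Iobs             # missing information
Vobs  <- solve(Iobs)             # variance-covariance of the estimates
Vcom  <- solve(Icom)             # complete-data variance-covariance
se    <- sqrt(diag(Vobs))        # standard errors from the observed information

# Assumed diagonal-ratio summaries of missing information
Fmis  <- 1 - diag(Iobs) / diag(Icom)
VFmis <- 1 - diag(Vcom) / diag(Vobs)

# largest eigenvalue of the missing-information matrix
Ieigen <- max(eigen(Imiss, symmetric = TRUE)$values)
round(rbind(se = se, Fmis = Fmis, VFmis = VFmis), 3)
```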

Details

When sem = TRUE, ecoML computes the observed-data information matrix for the parameters of interest using the Supplemented EM (SEM) algorithm. The inverse of the observed-data information matrix can be used to estimate the variance-covariance matrix of the parameters estimated by the EM algorithm. In addition, ecoML computes the expected complete-data information matrix. Based on these two measures, one can further calculate the fraction of missing information associated with each parameter. See Imai, Lu and Strauss (2006) for more details about the fraction of missing information. Moreover, when hyptest = TRUE, ecoML estimates the parametric model under the null hypothesis that $\mu_1 = \mu_2$. One can then construct a likelihood ratio test to assess the hypothesis of equal means. The fraction of missing information associated with the test statistic can also be calculated. For details, see Imai, Lu and Strauss (2006).
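
A likelihood ratio test of equal means could be assembled along the following lines. The two log-likelihood values are hypothetical placeholders standing in for the loglik element of a fit with hyptest = FALSE and a fit with hyptest = TRUE. The chi-squared reference distribution with one degree of freedom reflects the single restriction $\mu_1 = \mu_2$ and is the usual asymptotic approximation, assumed here rather than taken from this page.

```r
loglik_full <- -512.3   # hypothetical: unrestricted fit (hyptest = FALSE)
loglik_null <- -515.1   # hypothetical: restricted fit (hyptest = TRUE)

# likelihood ratio statistic and its asymptotic p-value
lr_stat <- 2 * (loglik_full - loglik_null)
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
c(statistic = lr_stat, p.value = p_value)
```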

References

Imai, Kosuke, Ying Lu and Aaron Strauss. (Forthcoming). "eco: R Package for Ecological Inference in 2x2 Tables." Journal of Statistical Software, available at http://imai.princeton.edu/research/eco.html

Imai, Kosuke, Ying Lu and Aaron Strauss. (Forthcoming). "Bayesian and Likelihood Inference for 2 x 2 Ecological Tables: An Incomplete Data Approach." Political Analysis, available at http://imai.princeton.edu/research/eiall.html

Examples

## load Robinson's census data
data(census)

## NOTE: convergence has not been properly assessed for the following
## examples. See Imai, Lu and Strauss (2006) for more complete analyses.
## In the first example below, in the interest of time, only part of the
## data set is analyzed and the convergence requirement is less stringent
## than the default setting.

## In the second example, the program is arbitrarily halted 100 iterations
## into the simulation, before convergence.

## fit the parametric model with the default model specifications
res <- ecoML(Y ~ X, data = census[1:100, ], N = census[1:100, 3],
             epsilon = 10^(-6), verbose = TRUE)
## summarize the results
summary(res)

## obtain out-of-sample prediction
out <- predict(res, verbose = TRUE)
## summarize the results
summary(out)

## fit the parametric model with some individual-level data
## using the default prior specification
surv <- 1:600
res1 <- ecoML(Y ~ X, context = TRUE, data = census[-surv, ],
              supplement = census[surv, c(4:5, 1)], maxit = 100, verbose = TRUE)
## summarize the results
summary(res1)
