lfe-package: Linear Group Fixed Effects

Description

The package uses the Method of Alternating Projections to estimate linear models with multiple group fixed effects, a generalization of the within estimator. It is thread-parallelized and intended for large problems.

Arguments

concept

Method of Alternating Projections
Kaczmarz Method
Fixed Effect Estimator
Within Estimator
Multiple Fixed Effects

Details

This package is intended for linear models with multiple group fixed effects, i.e. with two or more factors with a large number of levels. It performs no functions beyond those of lm, but it uses a special method for projecting out multiple group fixed effects from the normal equations, which makes it faster. It is a generalization of the within estimator. Such a method may be required if the groups have high cardinality (many levels), resulting in tens or hundreds of thousands of dummy variables. It is also useful if one only wants to control for the group effects, without actually estimating them. The package may optionally compute standard errors for the group effects by bootstrapping, but this is a very time- and memory-consuming process compared to finding the point estimates.

As of version 1.6, projecting out interactions between continuous covariates and factors is also supported, i.e. individual slopes, not only individual intercepts.

The estimation is done in two steps. First, the other coefficients are estimated with the function felm by centering on all the group means, followed by an OLS (similar to lm). Then the group effects are extracted (if needed) with the function getfe. This method is described in Gaure (2013), but also appears in Guimaraes and Portugal (2010), disguised as the Gauss-Seidel algorithm.
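The first step can be illustrated in base R without the package. The following is a minimal sketch, not the lfe implementation: the helper center() is hypothetical, and the simulated data, seed and iteration limit are assumptions made for the illustration. It alternately subtracts the group means of each factor until the vector stops changing, then runs OLS on the centered data and compares the result with the dummy-variable regression.

```r
# Sketch only: base R, not the code used inside felm.
set.seed(1)
n  <- 500
f1 <- factor(sample(8, n, replace = TRUE))
f2 <- factor(sample(5, n, replace = TRUE))
x  <- rnorm(n)
y  <- 2 * x + rnorm(nlevels(f1))[f1] + rnorm(nlevels(f2))[f2] + rnorm(n)

# Hypothetical helper: sweep over the factors, subtracting group means,
# until one full sweep changes the vector by less than eps.
center <- function(v, factors, eps = 1e-8, maxit = 1000) {
  for (i in seq_len(maxit)) {
    old <- v
    for (f in factors) v <- v - ave(v, f)
    if (sqrt(sum((v - old)^2)) < eps) break
  }
  v
}

fl <- list(f1, f2)
yc <- center(y, fl)
xc <- center(x, fl)

# OLS on the centered data (no intercept) ...
beta_centered <- unname(coef(lm(yc ~ xc - 1)))
# ... and the same coefficient from the regression with explicit dummies
beta_dummies  <- unname(coef(lm(y ~ x + f1 + f2))["x"])

print(c(centered = beta_centered, dummies = beta_dummies))
```

The two estimates should agree up to the centering tolerance. With only a handful of levels the dummy-variable regression is cheap; the centering approach is aimed at cases where it is not.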

There's also a function demeanlist which just does the centering on an arbitrary matrix, and a function compfactor which computes the connected components. These components are used for interpreting the group effects when there are only two factors (see the Abowd et al. references); they are also returned by getfe.

The centering on the means is done with a tolerance which is set by options(lfe.eps=1e-8) (the default). This is a somewhat conservative tolerance; in many cases I'd guess 1e-6 may be sufficient, which may speed up the centering. In the other direction, setting options(lfe.eps=0) will provide maximum accuracy at the cost of computing time and warnings about convergence failure.

The package is threaded, that is, it may use more than one CPU. The number of threads is fetched upon loading the package from the environment variable LFE_THREADS (or OMP_NUM_THREADS) and stored by options(lfe.threads=n). This option may be changed prior to calling felm, if so desired. Note that lfe is typically limited by memory bandwidth, not CPU speed, so fast memory and a large cache are more important than clock frequency. It is therefore not always true that running on all available cores is much better than running on half of them.
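As a small illustration, the tolerance and the thread count can be changed with the standard options mechanism. This is a sketch: the values 1e-6 and 2 are arbitrary choices for the example, and the options should be set after lfe has been loaded, since the thread count is stored when the package loads.

```r
# Set after library(lfe); the values here are arbitrary illustrations.
old <- options(lfe.eps = 1e-6, lfe.threads = 2)

eps_in_use     <- getOption("lfe.eps")      # tolerance felm would pick up
threads_in_use <- getOption("lfe.threads")  # thread count for the centering
print(c(eps = eps_in_use, threads = threads_in_use))

# ... calls to felm would go here ...

options(old)  # restore the previous settings
```

Keeping the return value of options() and restoring it afterwards avoids leaving a loosened tolerance in effect for later estimations in the same session.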

Threading is only done for the centering; the extraction of the group effects is not threaded. The default method for extracting the group coefficients is the iterative Kaczmarz method, whose tolerance is also set by the lfe.eps option.
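For readers unfamiliar with the Kaczmarz method, here is a generic textbook sketch in base R. It is not the code used by getfe: the function kaczmarz(), the toy dummy matrix D and the effect values are all assumptions made for the illustration. Each step projects the current guess onto the hyperplane defined by one row of the system, cycling through the rows until a full sweep changes the guess by less than eps.

```r
# Generic cyclic Kaczmarz iteration for a consistent system A z = b.
# Sketch only; not the implementation used by getfe.
kaczmarz <- function(A, b, eps = 1e-8, maxit = 10000) {
  z <- numeric(ncol(A))
  rownorm2 <- rowSums(A^2)
  for (it in seq_len(maxit)) {
    old <- z
    for (i in seq_len(nrow(A))) {
      # Project z onto the hyperplane sum(A[i, ] * z) == b[i]
      z <- z + (b[i] - sum(A[i, ] * z)) / rownorm2[i] * A[i, ]
    }
    if (sqrt(sum((z - old)^2)) < eps) break
  }
  z
}

# Toy system: each row has one dummy for "id" (3 levels) and one for
# "firm" (2 levels), so each equation is id effect + firm effect = b[i].
id   <- c(1, 1, 2, 2, 3, 3)
firm <- c(1, 2, 1, 2, 1, 2)
D <- cbind(diag(3)[id, ], diag(2)[firm, ])
b <- drop(D %*% c(1, 2, 3, -1, 1))

z <- kaczmarz(D, b)
print(max(abs(D %*% z - b)))  # residual of the computed solution
```

In this toy system the individual effects are not unique (a constant can be moved between the id and firm effects), so the iteration returns one solution of the system rather than the values used to build b.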

For some datasets the Kaczmarz method converges very slowly. In this case it may be replaced with the conjugate gradient method of Rcgmin by setting options(lfe.usecg=TRUE).

The package has been tested on datasets with approx. 20,000,000 observations with 15 covariates and approx. 2,300,000 and 270,000 group levels (felm took about 50 minutes on 8 CPUs, getfe took 5 minutes). Beware, though, that not only the size of the dataset matters, but also its structure.

The package will work with any positive number of grouping factors, but if there are more than two, the interpretation of the group coefficients is in general not well understood, so one should make sure that the coefficients are estimable.

In the exec-directory there is a perl-script lfescript which is used at the author's site for creating R-scripts from a simple specification file. The format is documented in doc/lfeguide.txt.

lfe is similar in function, though not in method, to the Stata modules a2reg and felsdvreg.

References

Abowd, J.M., F. Kramarz and D.N. Margolis (1999) High Wage Workers and High Wage Firms. Econometrica, 67(2):251--333. http://dx.doi.org/10.1111/1468-0262.00020

Abowd, J.M., R. Creecy and F. Kramarz (2002) Computing Person and Firm Effects Using Linked Longitudinal Employer-Employee Data. Technical Report TP-2002-06, U.S. Census Bureau. http://lehd.did.census.gov/led/library/techpapers/tp-2002-06.pdf

Andrews, M., L. Gill, T. Schank and R. Upward (2008) High wage workers and low wage firms: negative assortative matching or limited mobility bias? J.R. Stat. Soc.(A) 171(3), 673--697. http://dx.doi.org/10.1111/j.1467-985X.2007.00533.x

Cornelissen, T. (2008) The Stata command felsdvreg to fit a linear model with two high-dimensional fixed effects. Stata Journal, 8(2):170--189. http://econpapers.repec.org/RePEc:tsj:stataj:v:8:y:2008:i:2:p:170-189

Gaure, S. (2013) OLS with Multiple High Dimensional Category Variables. Computational Statistics and Data Analysis, 66:8--18. http://dx.doi.org/10.1016/j.csda.2013.03.024

Guimaraes, P. and Portugal, P. (2010) A simple feasible procedure to fit models with high-dimensional fixed effects. The Stata Journal, 10(4):629--649. http://www.stata-journal.com/article.html?article=st0212

Ouazad, A. (2008) A2REG: Stata module to estimate models with two fixed effects. Statistical Software Components S456942, Boston College Department of Economics. http://ideas.repec.org/c/boc/bocode/s456942.html

Examples

library(lfe)

# Simulate two covariates, three grouping factors and their effects
x <- rnorm(1000)
x2 <- rnorm(length(x))
id <- factor(sample(10, length(x), replace=TRUE))
firm <- factor(sample(3, length(x), replace=TRUE, prob=c(2,1.5,1)))
year <- factor(sample(10, length(x), replace=TRUE, prob=c(2,1.5,rep(1,8))))
id.eff <- rnorm(nlevels(id))
firm.eff <- rnorm(nlevels(firm))
year.eff <- rnorm(nlevels(year))
y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] +
     year.eff[year] + rnorm(length(x))

# Estimate the coefficients on x and x2, projecting out the three factors
est <- felm(y ~ x+x2+G(id)+G(firm)+G(year))
summary(est)

# Extract the group effects, with standard errors
getfe(est, se=TRUE)

# Compare with an ordinary lm
summary(lm(y ~ x+x2+id+firm+year-1))
