felm: Fit a linear model with multiple group fixed effects

Description

'felm' is used to fit linear models with multiple group fixed effects, similarly to lm. It uses the Method of Alternating projections to sweep out multiple group effects from the normal equations before estimating the remaining coefficients with OLS.

Usage

felm(
  formula,
  data,
  exactDOF = FALSE,
  subset,
  na.action,
  contrasts = NULL,
  weights = NULL,
  ...
)

Value

felm returns an object of class

"felm". It is quite similar to an "lm" object, but not entirely compatible.

The generic summary-method will yield a summary which may be print'ed. The object has some resemblance to an 'lm' object, and some postprocessing methods designed for lm may happen to work. It may however be necessary to coerce the object to succeed with this.

The "felm" object is a list containing the following fields:

coefficients: a numerical vector. The estimated coefficients.
N: an integer. The number of observations
p: an integer. The total number of coefficients, including those projected out.
response: a numerical vector. The response vector.
fitted.values: a numerical vector. The fitted values.
residuals: a numerical vector. The residuals of the full system, with dummies. For IV-estimations, this is the residuals when the original endogenous variables are used, not their predictions from the 1st stage.
r.residuals: a numerical vector. Reduced residuals, i.e. the residuals resulting from predicting without the dummies.
iv.residuals: numerical vector. When using instrumental variables, residuals from 2. stage, i.e. when predicting with the predicted endogenous variables from the 1st stage.
weights: numeric. The square root of the argument weights.
cfactor: factor of length N. The factor describing the connected components of the two first terms in the second part of the model formula.
vcv: a matrix. The variance-covariance matrix.
fe: list of factors. A list of the terms in the second part of the model formula.
stage1: The 'felm' objects for the IV 1st stage, if used. The 1st stage has multiple left hand sides if there are more than one instrumented variable.
iv1fstat: list of numerical vectors. For IV 1st stage, F-value for excluded instruments, the number of parameters in restricted model and in the unrestricted model.
X: matrix. The expanded data matrix, i.e. from the first part of the formula. To save memory with large datasets, it is only included if felm(keepX=TRUE) is specified. Must be included if bccorr() or fevcov() is to be used for correcting limited mobility bias.
cX, cY: matrix. The centred expanded data matrix. Only included if felm(keepCX=TRUE).
boot: The result of a replicate applied to the bootexpr (if used).

Arguments

formula

an object of class '"formula"' (or one that can be coerced to that class): a symbolic description of the model to be fitted. Similarly to 'lm'. See Details.

data

a data frame containing the variables of the model.

exactDOF

logical. If more than two factors, the degrees of freedom used to scale the covariance matrix (and the standard errors) is normally estimated. Setting exactDOF=TRUE causes felm to attempt to compute it, but this may fail if there are too many levels in the factors. exactDOF='rM' will use the exact method in Matrix::rankMatrix(), but this is slower. If neither of these methods works, it is possible to specify exactDOF='mc', which utilizes a Monte-Carlo method to estimate the expectation E(x' P x) = tr(P), the trace of a certain projection, a method which may be more accurate than the default guess.

If the degrees of freedom for some reason are known, they can be specified like exactDOF=342772.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The 'factory-fresh' default is na.omit. Another possible value is NULL, no action. na.exclude is currently not supported.

contrasts

an optional list. See the contrasts.arg of model.matrix.default.

weights

an optional vector of weights to be used in the fitting process. Should be 'NULL' or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used.

...

other arguments.

cmethod character. Which clustering method to use. Known arguments are 'cgm' (the default), 'cgm2' (or 'reghdfe', its alias). These alternate methods will generally yield equivalent results, except in the case of multiway clustering with few clusters along at least one dimension.
keepX logical. To include a copy of the expanded data matrix in the return value, as needed by bccorr() and fevcov() for proper limited mobility bias correction.
keepCX logical. Keep a copy of the centred expanded data matrix in the return value. As list elements cX for the explanatory variables, and cY for the outcome.
keepModel logical. Keep a copy of the model frame.
nostats logical. Don't include covariance matrices in the output, just the estimated coefficients and various descriptive information. For IV, nostats can be a logical vector of length 2, with the last value being used for the 1st stages.
psdef logical. In case of multiway clustering, the method of Cameron, Gelbach and Miller may yield a non-definite variance matrix. Ordinarily this is forced to be semidefinite by setting negative eigenvalues to zero. Setting psdef=FALSE will switch off this adjustment. Since the variance estimator is asymptotically correct, this should only have an effect when the clustering factors have very few levels.
kclass character. For use with instrumental variables. Use a k-class estimator rather than 2SLS/IV. Currently, the values 'nagar', 'b2sls', 'mb2sls', 'liml' are accepted, where the names are from Kolesar et al (2014), as well as a numeric value for the 'k' in k-class. With kclass='liml', felm also accepts the argument fuller=<numeric>, for using a Fuller adjustment of the liml-estimator.
Nboot, bootexpr, bootcluster Since felm has quite a bit of overhead in the creation of the model matrix, if one wants confidence intervals for some function of the estimated parameters, it is possible to bootstrap internally in felm. That is, the model matrix is resampled Nboot times and estimated, and the bootexpr is evaluated inside an sapply. The estimated coefficients and the left hand side(s) are available by name. Any right hand side variable x is available by the name var.x. The "felm"-object for each estimation is available as est. If a bootcluster is specified as a factor, entire levels are resampled. bootcluster can also be a function with no arguments, it should return a vector of integers, the rows to use in the sample. It can also be the string 'model', in which case the cluster is taken from the model. bootexpr should be an expression, e.g. like quote(x/x2 * abs(x3)/mean(y)). It could be wise to specify nostats=TRUE when bootstrapping, unless the covariance matrices are needed in the bootstrap. If you need the covariance matrices in the full estimate, but not in the bootstrap, you can specify it in an attribute "boot" as nostats=structure(FALSE, boot=TRUE).
iv, clustervar deprecated. These arguments will be removed at a later time, but are still supported in this field. Users are STRONGLY encouraged to use multipart formulas instead. In particular, not all functionality is supported with the deprecated syntax; iv-estimations actually run a lot faster if multipart formulas are used, due to new algorithms which I didn't bother to shoehorn in place for the deprecated syntax.

Details

This function is intended for use with large datasets with multiple group effects of large cardinality. If dummy-encoding the group effects results in a manageable number of coefficients, you are probably better off by using lm().

The formula specification is a response variable followed by a four part formula. The first part consists of ordinary covariates, the second part consists of factors to be projected out. The third part is an IV-specification. The fourth part is a cluster specification for the standard errors. I.e. something like y ~ x1 + x2 | f1 + f2 | (Q|W ~ x3+x4) | clu1 + clu2 where y is the response, x1,x2 are ordinary covariates, f1,f2 are factors to be projected out, Q and W are covariates which are instrumented by x3 and x4, and clu1,clu2 are factors to be used for computing cluster robust standard errors. Parts that are not used should be specified as 0, except if it's at the end of the formula, where they can be omitted. The parentheses are needed in the third part since | has higher precedence than ~. Multiple left hand sides like y|w|x ~ x1 + x2 |f1+f2|... are allowed.

Interactions between a covariate x and a factor f can be projected out with the syntax x:f. The terms in the second and fourth parts are not treated as ordinary formulas, in particular it is not possible with things like y ~ x1 | x*f, rather one would specify y ~ x1 + x | x:f + f. Note that f:x also works, since R's parser does not keep the order. This means that in interactions, the factor must be a factor, whereas a non-interacted factor will be coerced to a factor. I.e. in y ~ x1 | x:f1 + f2, the f1 must be a factor, whereas it will work as expected if f2 is an integer vector.

In older versions of lfe the syntax was `felm(y ~ x1 + x2 + G(f1)

G(f2), iv=list(Q ~ x3+x4, W ~ x3+x4), clustervar=c('clu1','clu2'))`. This syntax still works, but yields a warning. Users are strongly encouraged to change to the new multipart formula syntax. The old syntax will be removed at a later time.

The standard errors are adjusted for the reduced degrees of freedom coming from the dummies which are implicitly present. (An exception occurs in the case of clustered standard errors and, specifically, where clusters are nested within fixed effects; see here.) In the case of two factors, the exact number of implicit dummies is easy to compute. If there are more factors, the number of dummies is estimated by assuming there's one reference-level for each factor, this may be a slight over-estimation, leading to slightly too large standard errors. Setting exactDOF='rM' computes the exact degrees of freedom with rankMatrix() in package Matrix.

For the iv-part of the formula, it is only necessary to include the instruments on the right hand side. The other explanatory covariates, from the first and second part of formula, are added automatically in the first stage regression. See the examples.

The contrasts argument is similar to the one in lm(), it is used for factors in the first part of the formula. The factors in the second part are analyzed as part of a possible subsequent getfe() call.

The cmethod argument may affect the clustered covariance matrix (and thus regressor standard errors), either directly or via adjustments to a degrees of freedom scaling factor. In particular, Cameron, Gelbach and Miller (CGM2011, sec. 2.3) describe two possible small cluster corrections that are relevant in the case of multiway clustering.

The first approach adjusts each component of the cluster-robust variance estimator (CRVE) by its own \(c_i\) adjustment factor. For example, the first component (with \(G\) clusters) is adjusted by \(c_1=\frac{G}{G-1}\frac{N-1}{N-K}\), the second component (with \(H\) clusters) is adjusted by \(c_2=\frac{H}{H-1}\frac{N-1}{N-K}\), etc.
The second approach applies the same adjustment to all CRVE components: \(c=\frac{J}{J-1}\frac{N-1}{N-K}\), where \(J=\min(G,H)\) in the case of two-way clustering, for example.

Any differences resulting from these two approaches are likely to be minor, and they will obviously yield exactly the same results when there is only one cluster dimension. Still, CGM2011 adopt the former approach in their own paper and simulations. This is also the default method that felm uses (i.e. cmethod = 'cgm'). However, the latter approach has since been adopted by several other packages that allow for robust inference with multiway clustering. This includes the popular Stata package reghdfe, as well as the FixedEffectModels.jl implementation in Julia. To match results from these packages exactly, use cmethod = 'cgm2' (or its alias, cmethod = 'reghdfe'). It is possible that some residual differences may still remain; see discussion here.

The old syntax with a single part formula with the G() syntax for the factors to transform away is still supported, as well as the clustervar and iv arguments, but users are encouraged to move to the new multi part formulas as described here. The clustervar and iv arguments have been moved to the ... argument list. They will be removed in some future update.

References

Cameron, A.C., J.B. Gelbach and D.L. Miller (2011) Robust inference with multiway clustering, Journal of Business & Economic Statistics 29 (2011), no. 2, 238--249. tools:::Rd_expr_doi("10.1198/jbes.2010.07136")

Kolesar, M., R. Chetty, J. Friedman, E. Glaeser, and G.W. Imbens (2014) Identification and Inference with Many Invalid Instruments, Journal of Business & Economic Statistics (to appear). tools:::Rd_expr_doi("10.1080/07350015.2014.978175")

Examples

Run this code


## Default is to use all cores. We'll limit it to 2 for this example.
oldopts <- options("lfe.threads")
options(lfe.threads = 2)

## Simulate data
set.seed(42)
n <- 1e3

d <- data.frame(
  # Covariates
  x1 = rnorm(n),
  x2 = rnorm(n),
  # Individuals and firms
  id = factor(sample(20, n, replace = TRUE)),
  firm = factor(sample(13, n, replace = TRUE)),
  # Noise
  u = rnorm(n)
)

# Effects for individuals and firms
id.eff <- rnorm(nlevels(d$id))
firm.eff <- rnorm(nlevels(d$firm))

# Left hand side
d$y <- d$x1 + 0.5 * d$x2 + id.eff[d$id] + firm.eff[d$firm] + d$u

## Estimate the model and print the results
est <- felm(y ~ x1 + x2 | id + firm, data = d)
summary(est)
# Compare with lm
summary(lm(y ~ x1 + x2 + id + firm - 1, data = d))

## Example with 'reverse causation' (IV regression)

# Q and W are instrumented by x3 and the factor x4.
d$x3 <- rnorm(n)
d$x4 <- sample(12, n, replace = TRUE)
d$Q <- 0.3 * d$x3 + d$x1 + 0.2 * d$x2 + id.eff[d$id] + 0.3 * log(d$x4) - 0.3 * d$y +
  rnorm(n, sd = 0.3)
d$W <- 0.7 * d$x3 - 2 * d$x1 + 0.1 * d$x2 - 0.7 * id.eff[d$id] + 0.8 * cos(d$x4) -
  0.2 * d$y + rnorm(n, sd = 0.6)

# Add them to the outcome variable
d$y <- d$y + d$Q + d$W

## Estimate the IV model and report robust SEs
ivest <- felm(y ~ x1 + x2 | id + firm | (Q | W ~ x3 + factor(x4)), data = d)
summary(ivest, robust = TRUE)
condfstat(ivest)
# Compare with the not instrumented fit:
summary(felm(y ~ x1 + x2 + Q + W | id + firm, data = d))

## Example with multiway clustering

# Create a large cluster group (500 clusters) and a small one (20 clusters)
d$cl1 <- factor(sample(rep(1:500, length.out = n)))
d$cl2 <- factor(sample(rep(1:20, length.out = n)))
# Function for adding clustered noise to our outcome variable
cl_noise <- function(cl) {
  obs_per_cluster <- n / nlevels(cl)
  unlist(replicate(nlevels(cl),
    rnorm(obs_per_cluster, mean = rnorm(1), sd = runif(1)),
    simplify = FALSE
  ))
}

# New outcome variable
d$y_cl <- d$x1 + 0.5 * d$x2 + id.eff[d$id] + firm.eff[d$firm] +
  cl_noise(d$cl1) + cl_noise(d$cl2)

## Estimate and print the model with cluster-robust SEs (default)
est_cl <- felm(y_cl ~ x1 + x2 | id + firm | 0 | cl1 + cl2, data = d)
summary(est_cl)

# Print ordinary standard errors:
summary(est_cl, robust = FALSE)
# Match cluster-robust SEs from Stata's reghdfe package:
summary(felm(y_cl ~ x1 + x2 | id + firm | 0 | cl1 + cl2,
  data = d,
  cmethod = "reghdfe"
))

## Restore default options
options(oldopts)

Run the code above in your browser using DataLab