The formula specification is a response variable followed by a four-part
formula. The first part consists of ordinary covariates. The second part
consists of factors to be projected out. The third part is an
IV specification. The fourth part is a cluster specification for the
standard errors. For example, in y ~ x1 + x2 | f1 + f2 |
(Q|W ~ x3+x4) | clu1 + clu2, y is the response,
x1 and x2 are ordinary covariates, f1 and f2 are factors to be
projected out, Q and W are covariates which are
instrumented by x3 and x4, and clu1 and clu2 are
factors to be used for computing cluster-robust standard errors.
A part that is not used should be specified as 0, unless it is
at the end of the formula, where it can be omitted. The parentheses
are needed in the third part since | has higher precedence than ~.
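The effect of the parentheses can be checked with base R alone. The sketch below only inspects how R parses the two spellings. It does not call felm, and the variable names are the placeholders used above.

```r
# Without parentheses, the second ~ becomes the top-level operator,
# because | binds more tightly than ~.
unparenthesized <- quote(y ~ x1 | f1 | Q ~ x3 | clu1)

# With parentheses, the IV specification stays one self-contained part.
parenthesized <- quote(y ~ x1 | f1 | (Q ~ x3) | clu1)

# Element 2 of a two-sided ~ call is its left operand.
print(unparenthesized[[2]])  # y ~ x1 | f1 | Q
print(parenthesized[[2]])    # y
```

In the first spelling the response is no longer y alone, which is why the third part has to be wrapped in parentheses.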
As of lfe version 2.0, multiple left hand sides like
y|w|x ~ x1 + x2 |f1+f2|... are allowed.

Interactions between a covariate x and a factor f can be
projected out with the syntax x:f.
The terms in the second and fourth parts are not treated as
ordinary formulas. In particular, a specification like
y ~ x1 | x*f is not possible; one would instead specify y ~ x1 + x | x:f + f.
Note that f:x also works, since R's formula handling does not keep the
order of the variables in an interaction. Because the position cannot be
used to tell which variable is the factor, the factor in an interaction
must already be a factor, whereas a non-interacted factor will be coerced
to a factor. For example, in y ~ x1 | x:f1 + f2, f1 must be a
factor, whereas it will work as expected if f2 is an integer vector.
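The loss of operand order can be seen with base R's terms(). This is only an illustration: whether felm uses terms() internally for the second part is an assumption, and the data frame d is made up for the example.

```r
# The same interaction written in both orders yields the same term label.
labels_xf <- attr(terms(~ f + x:f), "term.labels")
labels_fx <- attr(terms(~ f + f:x), "term.labels")
print(identical(labels_xf, labels_fx))

# Since position carries no information, make the factor explicit
# before using it in an interaction such as x:f.
d <- data.frame(x = c(0.5, 1.2, 0.3, 2.0), f = c(1L, 1L, 2L, 2L))
d$f <- factor(d$f)
print(is.factor(d$f))
```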
In older versions of lfe the syntax was felm(y ~ x1 + x2 + G(f1)
+ G(f2), iv=list(Q ~ x3+x4, W ~ x3+x4),
clustervar=c('clu1','clu2')). This syntax still works, but yields a
warning. Users are strongly encouraged to change to the new
multipart formula syntax. The old syntax will be removed at a later time.
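When migrating calls from the old syntax, the pieces that used to be passed as G() terms, iv and clustervar can be assembled into one multipart formula. The helper below is a hypothetical sketch in base R, not part of lfe. It writes unused parts as 0 and drops unused parts at the end, following the rule stated earlier.

```r
# Hypothetical helper (not part of lfe): build a multipart formula
# from character vectors describing each part.
build_felm_formula <- function(lhs, covariates = NULL, factors = NULL,
                               iv = NULL, clusters = NULL) {
  collapse <- function(v) {
    if (length(v) == 0) "0" else paste(v, collapse = " + ")
  }
  parts <- c(
    collapse(covariates),
    collapse(factors),
    if (is.null(iv)) "0" else paste0("(", iv, ")"),  # IV part needs parentheses
    collapse(clusters)
  )
  # Drop unused parts at the end, but always keep the first part.
  while (length(parts) > 1 && parts[length(parts)] == "0") {
    parts <- parts[-length(parts)]
  }
  as.formula(paste(lhs, "~", paste(parts, collapse = " | ")))
}

# Old style: y ~ x1 + x2 + G(f1) + G(f2), iv = list(Q ~ x3+x4, W ~ x3+x4),
# clustervar = c('clu1', 'clu2')
full <- build_felm_formula("y", c("x1", "x2"), c("f1", "f2"),
                           iv = "Q | W ~ x3 + x4",
                           clusters = c("clu1", "clu2"))
print(full)

# Clustering only: the unused middle parts must be written as 0.
clustered <- build_felm_formula("y", "x1", clusters = "clu1")
print(clustered)
```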
The standard errors are adjusted for the reduced degrees of freedom
coming from the dummies which are implicitly present. In the case of
two factors, the exact number of implicit dummies is easy to compute. If there
are more factors, the number of dummies is estimated by assuming there is
one reference level for each factor. This may be a slight overestimate,
leading to slightly too large standard errors. Setting exactDOF='rM'
computes the exact degrees of freedom with rankMatrix() in package Matrix.
Note that version 1.1-0 of Matrix has a bug in rankMatrix()
for sparse matrices which may cause it to return the wrong value. A fix is underway.
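The difference between the estimated and the exact number of dummies can be illustrated on a small made-up data set. The sketch uses base R's dense qr() as a stand-in for rankMatrix(), and the way the estimate is formed here is an assumption that mimics the rule described above rather than felm's actual code.

```r
f1 <- factor(c(1, 1, 2, 2, 3, 3, 4, 4))
f2 <- factor(c(1, 2, 1, 2, 1, 2, 1, 2))
f3 <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))  # nested in f1, hence fully collinear

# One indicator column per level of each factor.
d1 <- model.matrix(~ f1 - 1)
d2 <- model.matrix(~ f2 - 1)
d3 <- model.matrix(~ f3 - 1)

# Exact count for the first two factors, then one reference level
# assumed for the third factor.
two_factor_rank <- qr(cbind(d1, d2))$rank
estimate <- two_factor_rank + (nlevels(f3) - 1)

# Exact count for all three factors.
exact_rank <- qr(cbind(d1, d2, d3))$rank

print(c(estimate = estimate, exact = exact_rank))
```

Because f3 adds no new columns to the span of f1, the estimate counts one dummy too many, which corresponds to the slightly too large standard errors mentioned above.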
For the IV part of the formula, it is only necessary to include the instruments on the
right hand side. The other explanatory covariates, from the first and
second parts of the formula, are added automatically
to the first stage regression. See the examples.
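What "added automatically" means can be sketched as a manual two-stage fit with lm(). This is an illustration on simulated data, not felm's implementation: the data-generating process is invented, the factor is entered as explicit dummies instead of being projected out, and the standard errors of the manual second stage are not valid.

```r
set.seed(42)
n  <- 2000
f1 <- factor(sample(1:10, n, replace = TRUE))
x1 <- rnorm(n)
x3 <- rnorm(n)                      # instrument
u  <- rnorm(n)                      # unobserved confounder
Q  <- x3 + 0.5 * x1 + u + rnorm(n)  # endogenous covariate
y  <- x1 + 2 * Q + as.integer(f1) / 10 - u + rnorm(n)
d  <- data.frame(y, x1, x3, Q, f1)

# In the multipart syntax this model would be y ~ x1 | f1 | (Q ~ x3).
# Only x3 appears in the IV part; x1 and f1 join the first stage anyway.
stage1  <- lm(Q ~ x3 + x1 + f1, data = d)
d$Q_fit <- fitted(stage1)
stage2  <- lm(y ~ x1 + Q_fit + f1, data = d)

iv_est  <- unname(coef(stage2)["Q_fit"])
ols_est <- unname(coef(lm(y ~ x1 + Q + f1, data = d))["Q"])
print(c(iv = iv_est, ols = ols_est))  # true coefficient on Q is 2
```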
The contrasts argument is similar to the one in lm(); it
is used for factors in the first part of the formula. The factors in the
second part are analyzed as part of a possible subsequent getfe() call.
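For reference, this is how a contrasts list behaves in lm(). The example uses lm() only, on made-up data; that felm accepts the same list form for first-part factors is an assumption based on the description above.

```r
d <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "a", "b", "b", "c", "c"))
)

# Default treatment contrasts: the intercept is the mean of level "a".
fit_treatment <- lm(y ~ g, data = d)

# Sum-to-zero contrasts: the intercept is the mean of the group means.
fit_sum <- lm(y ~ g, data = d, contrasts = list(g = "contr.sum"))

print(coef(fit_treatment))
print(coef(fit_sum))
```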
The old syntax, a single-part formula with G() marking the factors to transform
away, is still supported, as are the clustervar and iv
arguments, but users are encouraged to move to the new multipart
formulas as described here. The clustervar
and iv arguments have been moved to the ... argument list
and will be removed in a future update.