The nauf_contrasts function returns a list of contrasts applied to
factors in an object created using a function in the nauf package.
See 'Details'.
nauf_contrasts(object, inc_ordered = FALSE)
object: A nauf.terms object, a model frame made with
nauf_model.frame, a nauf.glm model (see
nauf_glm), or a mixed effects model fit with nauf_lmer,
nauf_glmer, or nauf_glmer.nb.
inc_ordered: A logical indicating whether or not ordered factor
contrasts should also be returned (default FALSE).
A named list of contrasts for all unordered factors in object,
and also optionally contrasts for ordered factors in object. See
'Details'.
In the nauf package, NA values are used to encode when an
unordered factor is truly not applicable. This is different from
"not available" or "missing at random". The concept applies only to
unordered factors, and indicates that the factor is simply not meaningful
for an observation, or that while the observation may technically be
definable by one of the factor levels, the interpretation of its belonging to
that level isn't the same as for other observations. For imbalanced
observational data, coding unordered factors as NA may also be used to
control for a factor that is only contrastive within a subset of the data due
to the sampling scheme. To understand the output of the
nauf_contrasts function, the treatment of unordered factor contrasts
in the nauf package will first be discussed, using the
plosives dataset included in the package as an example.
In the plosives dataset, the factor ling is coded as
either Monolingual, indicating the observation is from a monolingual
speaker of Spanish, or Bilingual, indicating the observation is from a
Spanish-Quechua bilingual speaker. The dialect factor indicates the
city the speaker is from (one of Cuzco, Lima, or
Valladolid). The Cuzco dialect has both monolingual and bilingual
speakers, but the Lima and Valladolid dialects have only monolingual
speakers. In the case of Valladolid, the dialect is not in contact with
Quechua, and so being monolingual in Valladolid does not mean the same
thing as it does in Cuzco, where it indicates
monolingual as opposed to bilingual. Lima has Spanish-Quechua
bilingual speakers, but the research questions the dataset serves to answer
are specific to monolingual speakers of Spanish in Lima. If we leave the
ling factor coded as is in the dataset and use
named_contr_sum to create the contrasts, we obtain
the following:
| dialect | ling | dialectCuzco | dialectLima | lingBilingual |
| Cuzco | Bilingual | 1 | 0 | 1 |
| Cuzco | Monolingual | 1 | 0 | -1 |
| Lima | Monolingual | 0 | 1 | -1 |
| Valladolid | Monolingual | -1 | -1 | -1 |
With these contrasts, the regression coefficient dialectLima would not
represent the difference between the intercept and the mean of the Lima
dialect; the mean of the Lima dialect would instead be
(Intercept) + dialectLima - lingBilingual. The interpretation of the
lingBilingual coefficient is similarly affected, and the intercept
term averages over the predicted value for the non-existent groups of Lima
bilingual speakers and Valladolid bilingual speakers, losing the
interpretation as the corrected mean (insofar as there can be a corrected
mean in this type of imbalanced data). With the nauf package, we can
instead code non-Cuzco speakers' observations as NA for the
ling factor (i.e. execute
plosives$ling[plosives$dialect != "Cuzco"] <- NA). These NA
values are allowed to pass into the regression's model matrix, and are then
set to 0, effectively creating the following contrasts:
| dialect | ling | dialectCuzco | dialectLima | lingBilingual |
| Cuzco | Bilingual | 1 | 0 | 1 |
| Cuzco | Monolingual | 1 | 0 | -1 |
| Lima | NA | 0 | 1 | 0 |
| Valladolid | NA | -1 | -1 | 0 |
Because sum contrasts are used, a value of 0 for a dummy variable
averages over the effect of the factor, and the coefficient
lingBilingual only affects the predicted value for observations where
dialect = Cuzco. In a regression fit with these contrasts, the
coefficient dialectLima represents what it should, namely the
difference between the intercept and the mean of the Lima dialect, and the
intercept is again the corrected mean. The lingBilingual coefficient
is now the difference between Cuzco bilingual speakers and the corrected mean
of the Cuzco dialect, which is (Intercept) + dialectCuzco.
These nauf contrasts thus allow us to model all of the data in a
single model without sacrificing the interpretability of the results. In
sociolinguistics, this method is called slashing due to the use of a
forward slash in GoldVarb to indicate that a factor is not applicable.
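The effect of setting NA rows to 0 can be reproduced outside the package. The following is a minimal base R sketch, not the nauf implementation: the helpers named_sum and code_rows are hypothetical names introduced here for illustration, and the sum contrasts are built with contr.sum from the stats package.

```r
# Illustrative only: named_sum() and code_rows() are hypothetical helpers,
# not functions from the nauf package.

# Sum contrasts whose columns are named after the non-reference levels.
named_sum <- function(f) {
  lv <- levels(f)
  cm <- contr.sum(length(lv))
  dimnames(cm) <- list(lv, lv[-length(lv)])
  cm
}

# One row of contrast codes per observation; NA observations get all zeros.
code_rows <- function(f, prefix) {
  cm <- named_sum(f)
  out <- matrix(0, nrow = length(f), ncol = ncol(cm),
                dimnames = list(NULL, paste0(prefix, colnames(cm))))
  ok <- !is.na(f)
  out[ok, ] <- cm[as.integer(f)[ok], , drop = FALSE]
  out
}

# One row per group of speakers, with ling set to NA outside of Cuzco.
dialect <- factor(c("Cuzco", "Cuzco", "Lima", "Valladolid"))
ling <- factor(c("Bilingual", "Monolingual", NA, NA))

X <- cbind(code_rows(dialect, "dialect"), code_rows(ling, "ling"))
print(X)
```

In the printed matrix, lingBilingual is nonzero only in the two Cuzco rows, so that coefficient can only move the predicted values for Cuzco observations.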
This same methodology can be applied to other parts of the
plosives dataset where a factor's interpretation is the same
for all observations, but is only contrastive within a subset of the data due
to the sampling scheme. The age and ed factors (speaker age
group and education level, respectively) are factors which can apply to
speakers regardless of their dialect, but in the dataset they are only
contrastive within the Cuzco dialect; all the Lima and Valladolid speakers
are 40 years old or younger with a university education (in the case of
Valladolid, the data come from an already-existing corpus; and in the case of
Lima, the data were collected as part of the same dataset as the Cuzco data,
but as a smaller control group). These factors can be treated just as the
ling factor by setting them to NA for observations from Lima
and Valladolid speakers. Similarly, there is no read speech data for the
Valladolid speakers, and so spont could be coded as NA for
observations from Valladolid speakers.
Using NA values can also allow the inclusion of a random effects
structure which only applies to a subset of the data. The
plosives dataset has data from both read (spont = FALSE;
only Cuzco and Lima) and spontaneous (spont = TRUE; all three
dialects) speech. For the read speech, there are exactly repeated measures
on 54 items, as indicated by the item factor. For the
spontaneous speech, there are not exactly repeated measures, and so in this
subset, item is coded as NA. In a regression fit using
nauf_lmer, nauf_glmer, or nauf_glmer.nb with item
as a grouping factor, the random effects model matrix is created for the read
speech just as it normally is, and for spontaneous speech observations all of
the columns are set to 0 so that the item effects only affect
the fitted values for read speech observations. In this way, the noise
introduced by the read speech items can be accounted for while still
including all of the data in one model, and the same random effects for
speaker can apply to all observations (both read and spontaneous),
which will lead to a more accurate estimation of the fixed, speaker, and item
effects since more information is available than if the read and spontaneous
speech were analyzed in separate models.
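As a rough illustration of the random effects treatment described above, the base R sketch below builds an indicator matrix for a grouping factor by hand and leaves the rows for NA observations at 0. It is an assumption-laden toy (two items, five observations), not the code nauf or lme4 actually run.

```r
# Illustrative only: a hand-built indicator matrix, not nauf's internal code.
item <- factor(c("i01", "i02", NA, "i01", NA))  # NA marks spontaneous speech

Z <- matrix(0, nrow = length(item), ncol = nlevels(item),
            dimnames = list(NULL, levels(item)))
read <- !is.na(item)

# Put a 1 in the column of each read speech observation's item.
Z[cbind(which(read), as.integer(item)[read])] <- 1
print(Z)
```

Rows 3 and 5 stay all zero, so no item effect contributes to the fitted values for those observations.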
There are two situations in which unordered factors will need more than one set
of contrasts: (1) when an unordered factor with NA values interacts
with another unordered factor, and some levels are collinear with NA;
and (2) when an unordered factor is included as a slope for a random effects
grouping factor that has NA values, but only a subset of the levels
for the slope factor occur when the grouping factor is not NA. As an
example of an interaction requiring new contrasts, consider the interaction
dialect * spont (that is, suppose we are interested in whether the
effect of spont is different for Cuzco and Lima). We code
spont as NA when dialect = Valladolid, as mentioned
above. This gives the following contrasts for the main effects:
| dialect | spont | dialectCuzco | dialectLima | spontTRUE |
| Cuzco | TRUE | 1 | 0 | 1 |
| Cuzco | FALSE | 1 | 0 | -1 |
| Lima | TRUE | 0 | 1 | 1 |
| Lima | FALSE | 0 | 1 | -1 |
| Valladolid | NA | -1 | -1 | 0 |
If we simply multiply these dialect and spont main effect
contrasts together to obtain the contrasts for the interaction (which is what
is done in the default model.matrix method), we get the
following contrasts:
| dialect | spont | dialectCuzco:spontTRUE | dialectLima:spontTRUE |
| Cuzco | TRUE | 1 | 0 |
| Cuzco | FALSE | -1 | 0 |
| Lima | TRUE | 0 | 1 |
| Lima | FALSE | 0 | -1 |
| Valladolid | NA | 0 | 0 |
However, these contrasts introduce an unnecessary parameter to the model
which causes collinearity with the main effects since
spontTRUE = dialectCuzco:spontTRUE + dialectLima:spontTRUE in all
cases. The functions in the nauf package automatically recognize when
this occurs, and create a second set of contrasts for dialect in which
the Valladolid level is treated as if it were NA (through an
additional call to named_contr_sum):
| dialect | dialect.c2.Cuzco |
| Cuzco | 1 |
| Lima | -1 |
| Valladolid | 0 |
This second set of dialect contrasts is only used when it needs to be.
That is, in this case, these contrasts would be used in the creation of the
model matrix columns for the interaction term dialect:spont,
but not in the creation of the model matrix columns for the main effect terms
dialect and spont, and when the second set of contrasts is
used, .c2. will appear between the name of the factor and the level so
it can be easily identified:
| dialect | spont | dialectCuzco | dialectLima | spontTRUE | dialect.c2.Cuzco:spontTRUE |
| Cuzco | TRUE | 1 | 0 | 1 | 1 |
| Cuzco | FALSE | 1 | 0 | -1 | -1 |
| Lima | TRUE | 0 | 1 | 1 | -1 |
| Lima | FALSE | 0 | 1 | -1 | 1 |
| Valladolid | NA | -1 | -1 | 0 | 0 |
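The collinearity argument can be checked numerically. The sketch below types in the contrast codes by hand for the five dialect-by-spont cells (Valladolid with spont = NA coded as 0) and compares the rank of the two candidate design matrices with qr from base R. The object names are made up for the illustration, and nothing here calls the nauf package.

```r
# Illustrative only: contrast codes entered by hand, one row per cell.
# Rows: Cuzco/TRUE, Cuzco/FALSE, Lima/TRUE, Lima/FALSE, Valladolid/NA
dialectCuzco     <- c(1,  1,  0,  0, -1)
dialectLima      <- c(0,  0,  1,  1, -1)
spontTRUE        <- c(1, -1,  1, -1,  0)
dialect_c2_Cuzco <- c(1,  1, -1, -1,  0)  # Valladolid treated as NA

# Interaction built by multiplying the main effect contrasts together.
naive <- cbind(intercept = 1, dialectCuzco, dialectLima, spontTRUE,
               cuzco_spont = dialectCuzco * spontTRUE,
               lima_spont = dialectLima * spontTRUE)

# Interaction built from the second set of dialect contrasts.
fixed <- cbind(intercept = 1, dialectCuzco, dialectLima, spontTRUE,
               c2_cuzco_spont = dialect_c2_Cuzco * spontTRUE)

# The main effect column is the sum of the two naive interaction columns.
all(spontTRUE == naive[, "cuzco_spont"] + naive[, "lima_spont"])

# Compare the number of columns with the numerical rank.
c(columns = ncol(naive), rank = qr(naive)$rank)
c(columns = ncol(fixed), rank = qr(fixed)$rank)
```

The naive matrix has one more column than its rank, while the matrix built with the second set of contrasts has full column rank.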
Turning now to an example of when a random slope requires new contrasts,
consider a random item slope for dialect. Because
dialect = Valladolid only when item is NA, using the
main effect contrasts for dialect for the item slope would
result in collinearity with the item intercept in the random effects
model matrix:
| dialect | item | i01:(Intercept) | i01:dialectCuzco | i01:dialectLima |
| Cuzco | i01 | 1 | 1 | 0 |
| Cuzco | i02 | 0 | 0 | 0 |
| Cuzco | NA | 0 | 0 | 0 |
| Lima | i01 | 1 | 0 | 1 |
| Lima | i02 | 0 | 0 | 0 |
| Lima | NA | 0 | 0 | 0 |
| Valladolid | NA | 0 | 0 | 0 |
This table shows the random effects model matrix for item i01 for all
possible scenarios, with the rows corresponding to (in order): a Cuzco
speaker producing the read speech plosive in item i01, a Cuzco speaker
producing a read speech plosive in another item, a Cuzco speaker
producing a spontaneous speech plosive, a Lima speaker producing the read
speech plosive in item i01, a Lima speaker producing a read speech
plosive in another item, a Lima speaker producing a spontaneous speech
plosive, and a Valladolid speaker producing a spontaneous speech plosive.
With the main effect contrasts for dialect,
i01:(Intercept) = i01:dialectCuzco + i01:dialectLima in all cases,
causing collinearity. Because this collinearity exists for all read speech
item random effects model matrices, the model is unidentifiable. The
functions in the nauf package automatically detect that this is the
case, and remedy the situation by creating a new set of contrasts used for
the item slope for dialect:
| dialect | item | i01:(Intercept) | i01:dialect.c2.Cuzco |
| Cuzco | i01 | 1 | 1 |
| Cuzco | i02 | 0 | 0 |
| Cuzco | NA | 0 | 0 |
| Lima | i01 | 1 | -1 |
| Lima | i02 | 0 | 0 |
| Lima | NA | 0 | 0 |
| Valladolid | NA | 0 | 0 |
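The same kind of rank check applies to the random slope. In this hand-coded base R sketch (hypothetical object names, no nauf calls), the columns for item i01 are formed by multiplying an indicator for i01 by the dialect contrast codes, over the seven scenarios listed above.

```r
# Illustrative only: columns for item i01 entered by hand.
# Rows: Cuzco/i01, Cuzco/i02, Cuzco/NA, Lima/i01, Lima/i02, Lima/NA, Valladolid/NA
in_i01           <- c(1, 0, 0,  1,  0,  0,  0)
dialectCuzco     <- c(1, 1, 1,  0,  0,  0, -1)
dialectLima      <- c(0, 0, 0,  1,  1,  1, -1)
dialect_c2_Cuzco <- c(1, 1, 1, -1, -1, -1,  0)

# Slope columns built from the main effect contrasts.
main_slope <- cbind(intercept = in_i01,
                    cuzco = in_i01 * dialectCuzco,
                    lima = in_i01 * dialectLima)

# Slope column built from the second set of contrasts.
c2_slope <- cbind(intercept = in_i01,
                  c2_cuzco = in_i01 * dialect_c2_Cuzco)

c(columns = ncol(main_slope), rank = qr(main_slope)$rank)
c(columns = ncol(c2_slope), rank = qr(c2_slope)$rank)
```

With the main effect contrasts the three columns span only two dimensions, whereas the two columns built from the second set of contrasts are linearly independent.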
If we were to, say, fit the model
intdiff ~ dialect * spont + (1 + dialect | item), then nauf would
additionally recognize that the same set of altered contrasts for
dialect is required in the fixed effects interaction term
dialect:spont and the item slope for dialect, and both
would be labeled with .c2.. In other (rare) cases, more than two sets
of contrasts may be required for a factor, in which case they would have
.c3., .c4. and so on.
In this way, users only need to code unordered factors as NA in the
subsets of the data where they are not contrastive, and nauf handles
the rest. Having described in detail what nauf contrasts are, we now
return to the nauf_contrasts function. The function can be used on
objects of any nauf model, a nauf.terms object, or a
model frame made by nauf_model.frame. It returns a named list
with a matrix for each
unordered factor in object which contains all contrasts associated with the
factor. For the model intdiff ~ dialect * spont + (1 + dialect | item),
the result would be a list with elements dialect and spont that
contain the following matrices (see the 'Examples' section for code to
generate this list):
| dialect | Cuzco | Lima | .c2.Cuzco |
| Cuzco | 1 | 0 | 1 |
| Lima | 0 | 1 | -1 |
| Valladolid | -1 | -1 | 0 |

| spont | TRUE |
| TRUE | 1 |
| FALSE | -1 |
The default is for the list of contrasts to only contain information about
unordered factors. If inc_ordered = TRUE, then the contrast matrices
for any ordered factors in object are also included.
nauf_model.frame, nauf_model.matrix,
nauf_glFormula, nauf_glm, and
nauf_glmer.
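library(nauf)
# Code spont as NA where it is not contrastive (no read speech for Valladolid)
dat <- plosives
dat$spont[dat$dialect == "Valladolid"] <- NA
# Contrasts from a model frame, including the .c2. set for dialect
mf <- nauf_model.frame(intdiff ~ dialect * spont + (1 + dialect | item), dat)
nauf_contrasts(mf)
# Same model frame built with ncs_scale = 0.5, to compare the returned contrasts
mf <- nauf_model.frame(intdiff ~ dialect * spont + (1 + dialect | item),
  dat, ncs_scale = 0.5)
nauf_contrasts(mf)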