make.mark.model(data, ddl, parameters = list(),
title = "", model.name = NULL, initial = NULL,
call = NULL, default.fixed = TRUE, options = NULL,
profile.int = FALSE, chat = NULL, simplify = TRUE,
input.links = NULL, parm.specific = FALSE,
mlogit0 = FALSE)
process.data
make.design.data
mark
)
(internal use)output
and results
. See mark
for a detailed description of the list contents.mark
to create
the model but it can be called directly to create but not
run the model. All of the arguments have default values
except for the first 2 that specify the processed data
list (data
) and the design data list (ddl
).
If only these 2 arguments are specified default models
are used for the parameters. For example, following with
the example from process.data
and
make.design.data
, the default model can be
created with:
mymodel=make.mark.model(proc.example.data,ddl)
The call to make.mark.model
creates a model object
but does not do the analysis. The function returns a
list that includes several fields including a design
matrix and the MARK input file that will be used with
MARK.EXE
to run the analysis from function
run.mark.model
. The following shows the
names of the list elements in mymodel:
names(mymodel) [1] "data" "model" "title"
"model.name" "links" [6] "mixtures" "call" "parameters"
"input" "number.of.groups" [11] "group.labels" "nocc"
"begin.time" "covariates" "fixed" [16] "design.matrix"
"pims" "design.data" "strata.labels" "mlogit.list" [21]
"simplify"
The list is defined to be a mark object which is done by
assigning a class vector to the list. The classes for an
R object can be viewed with the class function as shown
below:
class(mymodel) [1] "mark" "CJS" Each
MARK model has 2 class values. The first identifies it
as a mark object and the second identifies the type of
mark analysis, which is the default "CJS" (recaptures
only) in this case. The use of the class feature has
advantages in using generic functions and identifying
types of objects. An object of class mark
is
defined in more detail in function mark
.
To fit non-trivial models it is necessary to understand
the remaining calling arguments of make.mark.model
and R formula notation. The model formulae are specified
with the calling argument parameters
. It uses a
similar list structure as the parameters
argument
in make.design.data
. It expects to get a
list with elements named to match the parameters in the
particular analysis (e.g., Phi and p in CJS) and each
list element is a list, so it is a list of lists). For
each parameter, the possible list elements are
formula, link, fixed, component, component.name,
remove.intercept
. In addition, for closed capture models
and robust design model, the element share
is
included in the list for p (capture probabilities) and
GammaDoublePrime (respectively) to indicate whether the
model is shared (share=TRUE) or not-shared (the default)
(share=FALSE) with c (recapture probabilities) and
GammaPrime respectively.
formula
specifies the model for the parameter
using R formula notation. An R formula is denoted with a
~
followed by variables in an equation format
possibly using the +
, *
, and :
operators. For example, ~sex+age
is an additive
model with the main effects of sex
and age
.
Whereas, ~sex*age
includes the main effects and
the interaction and it is equivalent to the formula
specified as ~sex+age+sex:age
where sex:age
is the interaction term. The model ~age+sex:age
is slightly different in that the main effect for
sex
is dropped which means that intercept of the
age
effect is common among the sexes but the age
pattern could vary between the sexes. The model
~sex*Age
which is equivalent to ~sex + Age +
sex:Age
has only 4 parameters and specifies a linear
trend with age and both the intercept and slope vary by
sex. One additional operator that can be useful is
I()
which allows computation of variables on the
fly. For example, the addition of the Agesq variable in
the design data (as described above) can be avoided by
using the notation ~Age + I(Age^2)
which specifies
use of a linear and quadratic effect for age. Note that
specifying the model ~age + I(age^2)
would be
incorrect and would create an error because age
is
a factor variable whereas Age
is not.
As an example, consider developing a model in which Phi
varies by age and p follows a linear trend over time.
This model could be specified and run as follows:
p.Time=list(formula=~Time)
Phi.age=list(formula=~age)
Model.p.Time.Phi.age=make.mark.model(proc.example.data,ddl,
parameters=list(Phi=Phi.age,p=p.Time))
Model.p.Time.Phi.age=run.mark.model(Model.p.Time.Phi.age)
The first 2 commands define the p and Phi models that are
used in the parameter
list in the call to
make.mark.model
. This is a good approach for
defining models because it clearly documents the models,
the definitions can then be used in many calls to
make.mark.model
and it will allow a variety of
models to be developed quickly by creating different
combinations of the parameter models. Using the notation
above with the period separating the parameter name and
the description (eg., p.Time) gives the further advantage
that all possible models can be developed quickly with
the functions create.model.list
and
mark.wrapper
.
Model formula can use any field in the design data and
any individual covariates defined in data
. The
restrictions on individual covariates that was present in
versions before 1.3 have now been removed. You can now
use interactions of individual covariates with all design
data covariates and products of individual covariates.
You can specify interactions of individual covariates and
factor variables in the design data with the formula
notation. For example, ~region*x1
describes a
model in which the slope of x1
varies by
region
. Also, ~time*x1
describes a model in
which the slope for x1
varied by time; however,
there would be only one value of the covariate per animal
so this is not a time varying covariate model. Models
with time varying covariates are now more easily
described with the improvements in version 1.3 but there
are restrictions on how the time varying individual
covariates are named. An example would be a trap
dependence model in which capture probability on occasion
i+1 depends on whether they were captured on occasion i.
If there are n occasions in a CJS model, the 0/1 (not
caught/caught) for occasions 1 to n-1 would be n-1
individual covariates to describe recapture probability
for occasions 2 to n. For times 2 to n, a design data
field could be defined such that the variable timex is 1
if time==x and 0 otherwise. The time varying covariates
must be named with a time suffix on the base name of the
covariate. In this example they would be named as
x2,. . .,xn
and the model could be specified as
~time + x
for time variation and a constant trap
effect or as ~time + time:x
for a trap effect that
varied by time. If in the process.data
call, the argument begin.time
was set to the year
1990, then the variables would have to be named
x1991,x1992,... because the first recapture occasion
would be 1991. Note that the times are different for
different parameters. For example, survival is labeled
based on the beginning of the time interval which would
be 1990 so the variables should be named appropriately
for the parameter model in which they will be used.
In previous versions to handle time varying covariates
with a constant effect, it was necessary to use the
component feature of the parameter specification to be
able to append one or more arbitrary columns to the
design matrix. That is no longer required for individual
covariates and the component feature was removed in
v2.0.8.
There are three other elements of the parameter list that
can be useful on occasion. The first is link
which specifies the link function for transforming
between the beta and real parameters. The possible
values are "logit", "log", "identity" and "mlogit(*)"
where * is a numeric identifier. The "sin" link is not
allowed because all models are specified using a design
matrix. The typical default values are assigned to each
parameter (eg "logit" for probabilities, "log" for N, and
"mlogit" for pent in POPAN), so in most cases it will not
be necessary to specify a link function.
The second is fixed
which allows real parameters
to be set at fixed values. The values for fixed
can be either a single value or a list with 5 alternate
forms for ease in specifying the fixed parameters.
Specifying fixed=value
will set all parameters of
that type to the specified value. For example,
Phi=list(fixed=1)
will set all Phi to 1. This can
be useful for situations like specifying F in the
Burnham/Barker models to all have the same value of 1.
Fixed values can also be specified as a list in which
values are specified for certain indices, times, ages,
cohorts, and groups. The first 3 will be the most
useful. The first list format is the most general and
flexible but it requires an understanding of the PIM
structure and index numbers for the real parameters. For
example,
Phi=list(formula=~time,
fixed=list(index=c(1,4,7),value=1))
specifies Phi varying by time, but the real parameters
1,4,7 are set to 1. The value
field is either a
single constant or its length must match the number of
indices. For example,
Phi=list(formula=~time,
fixed=list(index=c(1,4,7),value=c(1,0,1)))
sets real parameters 1 and 7 to 1 and real parameter 4 to
0. Technically, the index/value format for fixed is not
wedded to the parameter type (i.e., values for p can be
assigned within Phi list), but for the sake of clarity
they should be restricted to fixing real parameters
associated with the particular parameter type. The
time
and age
formats for fixed will
probably be the most useful. The format
fixed=list(time=x, value=y) will set all real parameters
(of that type) for time x to value y. For example,
p=list(formula=~time,fixed=list(time=1986,value=1))
sets up time varying capture probability but all values
of p for 1986 are set to 1. This can be quite useful to
set all values to 0 in years with no sampling (e.g.,
fixed=list(time=c(1982,1984,1986),
value=0)). The age
, cohort
and
group
formats work in a similar fashion. It is
important to recognize that the value you specify for
time
, age
, cohort
and group
must match the values in the design data list. This is
another reason to add binned fields for age, time etc
with add.design.data
after creating the
default design data with make.design.data
.
Also note that the values for time
and
cohort
are affected by the begin.time
argument specified in process.data
. Had I
not specified begin.time=1980
, to set p in the
last occasion (1986), the specification would be
p=list(formula=~time,fixed=list(time=7,value=1))
because begin.time defaults to 1. The advantage of the
time-, age-, and cohort- formats over the index-format is
that it will work regardless of the group definition
which can easily be changed by changing the groups
argument in process.data
. The index-format
will be dependent on the group structure because the
indexing of the PIMS will most likely change with changes
in the group structure.
Parameters can also be fixed at default values by
deleting the specific rows of the design data. See
make.design.data
and material below. The
default value for fixing parameters for deleted design
data can be changed with the default=value
in the
parameter list.
The final useful element of the parameter list is the
remove.intercept
argument. It is set to TRUE to
forcefully remove the intercept. In R notation this can
be done by specifiying the formula notation ~-1+... but
in formula with nested interactions of factor variables
and additive factor variables the -1 notation will not
remove the intercept. It will simply adjust the column
definitions but will keep the same number of columns and
the model will be overparameterized. The problem occurs
with nested factor variables like tostratum within
stratum for multistrata designs (see
mstrata
). As shown in that example, you
can build a formula -1+stratum:tostratum to have
transitions that are stratum-specific. If however you
also want to add a sex effect and you specify
-1+sex+stratum:tostratum it will add 2 columns for sex
labelled M and F when in fact you only want to add one
column because the intercept is already contained within
the stratum:tostratum term. The argument
remove.intercept will forcefully remove the intercept but
it needs to be able to find a column with all 1's. For
example,
Psi=list(formula=~sex+stratum:tostratum,remove.intercept=TRUE)
will work but
Psi=list(formula=~-1+sex+stratum:tostratum,remove.intercept=TRUE)
will not work. Also, the -1 notation should be used when
there is not an added factor variable because
Psi=list(formula=~stratum:tostratum,remove.intercept=TRUE)
will not work because while the stratum:tostratum
effectively includes an intercept it is equivalent to
using an identity matrix and is not specified as
treatment contrast with one of the columns as all 1's.
The argument simplify determines whether the pims are
simplified such that only indices for unique and fixed
real parameters are used. For example, with an all
different PIM structure with CJS with K occasions there
are K*(K-1) real parameters for Phi and p. However, if
you use simplify=TRUE with the default model of
Phi(.)p(.), the pims are re-indexed to be 1 for all the
Phi's and 2 for all the p's because there are only 2
unique real parameters for that model. Using simplify
can speed analysis markedly and probably should always be
used. This was left as an argument only to test that the
simplification was working and produced the same
likelihood and real parameter estimates with and without
simplification. It only adjust the rows of the design
matrix and not the columns. There are some restrictions
for simplification. Real parameters that are given a
fixed value are maintained in the design matrix although
it does simplify amongst the fixed parameters. For
example, if there are 50 real parameters all fixed to a
value of 1 and 30 all fixed to a value of 0, they are
reduced to 2 real parameters fixed to 1 and 0. Also, real
parameters such as Psi in Multistrata and pent in POPAN
that use multinomial logits are not simplified because
they must maintain the structure created by the
multinomial logit link. All other parameters in those
models are simplified. The only downside of
simplification is that the labels for real parameters in
the MARK output are unreliable because there is no longer
a single label for the real parameter because it may
represent many different real parameters in the
all-different PIM structure. This is not a problem with
the labels in R because the real parameter estimates are
translated back to the all-different PIM structure with
the proper labels.
The argument default.fixed
is related to deletion
of design data (see make.design.data
). If
design data are deleted and default.fixed=T
the
missing real parameters are fixed at a reasonable default
to represent structural "zeros". For example, p is set
to 0, Phi is set to 1, pent is set to 0, etc. For some
parameters there are no reasonable values (e.g., N in
POPAN), so not all parameters will have defaults. If a
parameter does not have a default or if
default.fixed=F
then the row in the design matrix
for that parameter is all zeroes and its real value will
depend on the link function. For example, with "logit"
link the real parameter value will be 0.5 and for the log
link it will be 1. As long as the inverse link is
defined for 0 it will not matter in those cases in which
the real parameter is not used because it represents data
that do not exist. For example, in a "CJS" model if
initial captures (releases) only occur every other
occasion, but recaptures (resightings) occurred every
occasion, then every other cohort (row) in the PIM would
have no data. Those rows (cohorts) could be deleted from
the design data and it would not matter if the real
parameter was fixed. However, for the case of a
Jolly-Seber type model (eg POPAN or Pradel models) in
which the likelihood includes a probability for the
leading zeroes and first 1 in a capture history (a
likelihood component for the first capture of unmaked
animals), and groups represent cohorts that enter through
time, you must fix the real parameters for the unused
portion of the PIM (ie for occasions prior to time of
birth for the cohort) such that the estimated probability
of observing the structural 0 is 1. This is easily done
by setting the pent (probability of entry) to 0 or by
setting the probability of capture to 0 or both. In that
case if default.fixed=F
, the probabilities for all
those parameters would be incorrectly set to 0.5 for p
and something non-zero but not predetermined for pent
because of the multinomial logit. Now it may be possible
that the model would correctly estimate these as 0 if the
real parameters were kept in the design, but we know what
those values are in that case and they need not be
estimated. If it is acceptable to set
default.fixed=F
, the functions such as
summary.mark
recognize the non-estimated
real parameters and they are not shown in the summaries
because in essence they do not exist. If
default.fixed=T
the parameters are displayed with
their fixed value and for
summary.mark(mymodel,se=TRUE)
, they are listed as
"Fixed".
Missing design data does have implications for specifying
formula but only when interactions are specified. With
missing design data various factors may not be fully
crossed. For example, with 2 factors A and B, each with
2 levels, the data are fully crossed if there are data
with values A1&B1, A1&B2, A2&B1 and A2&B2. If data exist
for each of the 4 combinations then you can described the
interaction model as ~A*B and it will estimate 4 values.
However, if data are missing for one of more of the 4
cells then the "interaction" formula should be specified
as ~-1+A:B or ~-1+B:A or ~-1+A estimate all of the parameters represented by the
combinations with data. An example of this could be a
marking program with multiple sites which resighted at
all occasions but which only marked at sites on
alternating occasions. In that case time is nested
within site and time-site interaction models would be
specified as ~-1+time:site.
The argument title
supplies a character string
that is used to label the output. The argument
model.name
is a descriptor for the model used in
the analysis. The code constructs a model name from the
formula specified in the call (e.g.,
Phi(~1)p(~time)
) but on occasion the name may be
too long or verbose, so it can be over-ridden with the
model.name
argument.
The final argument initial
can be used to provide
initial estimates for the beta parameters. It is either
1) a single starting value for each parameter, 2) an
unnamed vector of values (one for each parameter), 3)
named vector of values, or 4) the name of mark
object that has already been run. For cases 3 and 4, the
code only uses appropriate initial beta estimates in
which the column names of the design matrix (for model)
or vector names match the column names in the design
matrix of the model to be run. Any remaining beta
parameters without an initial value specified are
assigned a 0 initial value. If case 4 is used the models
must have the same number of rows in the design matrix
and thus presumably the same structure. As long as the
vector elements are named (#3), the length of the
initial
vector no longer needs to match the number
of parameters in the new model as long as the elements
are named. The names can be retrieved either from the
column names of the design matrix or from
rownames(x$results$beta)
where x
is the
name of the mark
object.
options
is a character string that is tacked onto
the Proc Estimate
statement for the MARK .inp
file. It can be used to request options such as
NoStandDM (to not standardize the design matrix) or
SIMANNEAL (to request use of the simulated annealing
optimization method) or any existing or new options that
can be set on the estimate proc.process.data
,make.design.data
,
run.mark.model
mark
data(dipper)
#
# Process data
#
dipper.processed=process.data(dipper,groups=("sex"))
#
# Create default design data
#
dipper.ddl=make.design.data(dipper.processed)
#
# Add Flood covariates for Phi and p that have different values
#
dipper.ddl$Phi$Flood=0
dipper.ddl$Phi$Flood[dipper.ddl$Phi$time==2 | dipper.ddl$Phi$time==3]=1
dipper.ddl$p$Flood=0
dipper.ddl$p$Flood[dipper.ddl$p$time==3]=1
#
# Define range of models for Phi
#
Phidot=list(formula=~1)
Phitime=list(formula=~time)
PhiFlood=list(formula=~Flood)
#
# Define range of models for p
#
pdot=list(formula=~1)
ptime=list(formula=~time)
#
# Make assortment of models
#
dipper.phidot.pdot=make.mark.model(dipper.processed,dipper.ddl,
parameters=list(Phi=Phidot,p=pdot))
dipper.phidot.ptime=make.mark.model(dipper.processed,dipper.ddl,
parameters=list(Phi=Phidot,p=ptime))
dipper.phiFlood.pdot=make.mark.model(dipper.processed,dipper.ddl,
parameters=list(Phi=PhiFlood, p=pdot))
Run the code above in your browser using DataLab