This function is an R-function style clone of Sakamoto's CATDAP-02 program
for categorical data analysis. CATDAP-02 can be used to search for the
subset of explanatory variables that carries the most information about a
specified response variable. Continuous variables can also be used as
explanatory variables. In that case CATDAP-02 searches for an optimal
categorization of the continuous values.
The basic statistic adopted is obtained by applying the AIC statistic to
the models.
Let \(E\) denote the response variable and \(F\) a candidate
explanatory variable, and denote their cell frequencies by
\(n_E(i)\) \((i \in E)\) and
\(n_F(j)\) \((j \in F)\). The cross frequency is denoted by
\(n_{E,F}(i,j)\) \((i \in E, j \in F)\). To measure
the strength of dependence of a specific set of response variables \(E\) on
the explanatory variable \(F\), we use the following statistic:
$$AIC(E;F) = -2\sum_{i \in E, j \in F} n_{E,F}(i,j)\ \ln\{n_{E,F}(i,j)/n_F(j)\} + 2C_F(C_E-1),\ \ (1)$$
where \(C_E\) and \(C_F\) denote the total number of categories of the
corresponding sets of variables, respectively.
The best subset of explanatory variables is selected by searching for the
\(F\) that gives the minimum \(AIC(E;F)\).
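As an illustration of formula (1), the following minimal Python sketch evaluates \(AIC(E;F)\) for a small cross table. It is not the package's implementation: the function name `aic_e_given_f`, the list-of-rows table layout (rows indexed by \(i \in E\), columns by \(j \in F\)), and the convention that empty cells contribute zero to the sum are all assumptions made for this example.

```python
import math

def aic_e_given_f(table):
    """Evaluate formula (1) for a cross table with table[i][j] = n_{E,F}(i, j).

    Hypothetical helper for illustration; rows are categories of E,
    columns are categories of F.
    """
    c_e = len(table)       # C_E: number of categories of E
    c_f = len(table[0])    # C_F: number of categories of F
    # Column totals n_F(j)
    n_f = [sum(table[i][j] for i in range(c_e)) for j in range(c_f)]
    log_lik = 0.0
    for i in range(c_e):
        for j in range(c_f):
            n_ij = table[i][j]
            if n_ij > 0:   # assumption: an empty cell contributes 0 to the sum
                log_lik += n_ij * math.log(n_ij / n_f[j])
    return -2.0 * log_lik + 2 * c_f * (c_e - 1)

# A 2 x 2 table in which E clearly depends on F
table = [[30, 10],
         [10, 30]]
print(aic_e_given_f(table))
```

Comparing the value returned for different candidate tables (one per candidate \(F\)) and keeping the smallest mirrors the selection rule described above.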
When \(F=\phi\), that is, when no explanatory variable is used, formula (1) reduces to
$$AIC(E;\phi) = -2\sum_{i \in E} n_E(i)\ \ln\{n_E(i)/n\} + 2(C_E-1).$$
Here it is assumed that \(C_\phi=1\) and \(n_\phi(1)=n\).
Sakamoto's original CATDAP outputs \(AIC(E;F) - AIC(E;\phi)\) as the AIC
value instead of \(AIC(E;F)\). With this convention, a positive AIC value
indicates that the variable \(F\) is judged to be useless as an explanatory
variable for \(E\).
On the other hand, this convention makes it impossible to compare the goodness
of the CATDAP model with that of other models, logit models for example.
For the convenience of users, the present "R version CATDAP" provides not
only \(AIC = AIC(E;F) - AIC(E;\phi)\) but also \(AIC(E;\phi)\). The
latter value is given as base_AIC in the output.
Users can recover \(AIC(E;F)\) by adding AIC and base_AIC.
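The relation between the reported AIC, base_AIC and \(AIC(E;F)\) can be checked numerically. The Python sketch below is only an illustration under stated assumptions: the helper names `aic_full` and `aic_null`, the list-of-rows table layout, and the treatment of empty cells as contributing zero are not taken from the package.

```python
import math

def aic_full(table):
    """AIC(E;F), formula (1); table[i][j] = n_{E,F}(i, j). Hypothetical helper."""
    c_e, c_f = len(table), len(table[0])
    n_f = [sum(row[j] for row in table) for j in range(c_f)]   # n_F(j)
    log_lik = sum(
        table[i][j] * math.log(table[i][j] / n_f[j])
        for i in range(c_e) for j in range(c_f)
        if table[i][j] > 0   # assumption: empty cells contribute 0
    )
    return -2.0 * log_lik + 2 * c_f * (c_e - 1)

def aic_null(table):
    """AIC(E;phi): the same statistic with no explanatory variable."""
    n_e = [sum(row) for row in table]   # n_E(i)
    n = sum(n_e)                        # total count n
    log_lik = sum(x * math.log(x / n) for x in n_e if x > 0)
    return -2.0 * log_lik + 2 * (len(table) - 1)

dependent = [[30, 10], [10, 30]]     # E depends on F
independent = [[20, 20], [20, 20]]   # E does not depend on F

for table in (dependent, independent):
    base_aic = aic_null(table)            # corresponds to base_AIC
    aic = aic_full(table) - base_aic      # the difference reported as AIC
    recovered = aic + base_aic            # equals AIC(E;F) again
    print(aic, base_aic, recovered)
```

For the first table the difference is negative, so \(F\) is judged informative; for the second it is positive, so \(F\) is judged useless as an explanatory variable.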
missingmark
enables missing value handling.
When a positive value, say \(1000\), is set here, any value \(x\)
greater than or equal to \(1000\) is treated as a missing value. If
\(1000 \le x < 2000\), \(x\) is treated as a missing
value of the 1st type. If \(2000 \le x < 3000\), \(x\)
is treated as a missing value of the 2nd type, and so on. In general,
any \(x\) satisfying \(1000k \le x < 1000(k+1)\) is
treated as a missing value of the \(k\)-th type. Users are referred to the
reference for the technical details of the missing value handling procedure.
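The classification rule for the example value \(1000\) can be written as a short Python sketch. The function name `missing_type` and the use of `0` to mean "not missing" are assumptions made for this illustration. The sketch covers only the \(1000\) example from the text and says nothing about how the package processes missing values internally.

```python
MARK = 1000  # the example value used in the text

def missing_type(x):
    """Return k if x is a k-th type missing value, or 0 if x is an ordinary value.

    Hypothetical helper: implements 1000k <= x < 1000(k+1) for k >= 1.
    """
    if x < MARK:
        return 0                 # ordinary (non-missing) observation
    return int(x // MARK)        # k such that MARK*k <= x < MARK*(k+1)

for x in (12.5, 999.9, 1000, 1999, 2000, 3500):
    print(x, missing_type(x))
```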
For continuous variables, we assume that
\(b_1, b_2, \dots, b_{m+1}\) are the boundary values
of \(m\) bins. The output value ranges \(r_i\) \((1 \le i \le m)\) are
defined as follows:
$$r_i = \left[ \; b_i,\; b_{i+1}\; \right. ) \;\; \mathrm{for} \;1 \le i < m,$$
$$r_m = \left[ \; b_m,\; b_{m+1}\; \right] .$$
Specifically, for a continuous response variable \(V\),
$$r_i = \left[ \; x_{min} + (i-1)s,\; x_{min} + is \; \right. ) \;\; \mathrm{for} \;1 \le i < m,$$
$$r_m = \left[ \; x_{min} + (m-1)s,\; x_{max} \; \right] ,$$
where \(x_{min}\) and \(x_{max}\) are the minimum and the
maximum of the variable \(V\), respectively, and
\(s = (x_{max} - x_{min}) / m\).
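The equal-width ranges for a continuous response variable can be sketched in Python as follows. The names `equal_width_bounds` and `bin_index` are hypothetical and the code is an illustration of the interval definitions only, not the package's routine. Note that every range is closed on the left and open on the right, except the last one, which also includes \(x_{max}\).

```python
def equal_width_bounds(values, m):
    """Boundary values b_1, ..., b_{m+1} of m equal-width bins over values."""
    x_min, x_max = min(values), max(values)
    s = (x_max - x_min) / m                       # bin width s
    # b_i = x_min + (i-1)s for i = 1..m, and b_{m+1} = x_max exactly
    return [x_min + i * s for i in range(m)] + [x_max]

def bin_index(x, bounds):
    """Return the 1-based index i of the range r_i that contains x."""
    m = len(bounds) - 1
    for i in range(1, m):                         # r_i = [b_i, b_{i+1}) for i < m
        if bounds[i - 1] <= x < bounds[i]:
            return i
    if bounds[m - 1] <= x <= bounds[m]:           # r_m = [b_m, b_{m+1}] is closed
        return m
    raise ValueError("x lies outside [x_min, x_max]")

values = [0.0, 2.5, 5.0, 7.5, 10.0]
bounds = equal_width_bounds(values, 4)
print(bounds)
print([bin_index(v, bounds) for v in values])
```

A value equal to an interior boundary falls into the range that starts at that boundary, while the maximum falls into the last range.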