# detectseparation v0.1

0

0th

Percentile

## Detect and Check for Separation and Infinite Maximum Likelihood Estimates

Provides pre-fit and post-fit methods for detecting separation and infinite maximum likelihood estimates in generalized linear models with categorical responses. The pre-fit methods apply on binomial-response generalized liner models such as logit, probit and cloglog regression, and can be directly supplied as fitting methods to the glm() function. They solve the linear programming problems for the detection of separation developed in Konis (2007, <https://ora.ox.ac.uk/objects/uuid:8f9ee0d0-d78e-4101-9ab4-f9cbceed2a2a>) using 'ROI' <https://cran.r-project.org/package=ROI> or 'lpSolveAPI' <https://cran.r-project.org/package=lpSolveAPI>. The post-fit methods apply to models with categorical responses, including binomial-response generalized linear models and multinomial-response models, such as baseline category logits and adjacent category logits models; for example, the models implemented in the 'brglm2' <https://cran.r-project.org/package=brglm2> package. The post-fit methods successively refit the model with increasing number of iteratively reweighted least squares iterations, and monitor the ratio of the estimated standard error for each parameter to what it has been in the first iteration. According to the results in Lesaffre & Albert (1989, <https://www.jstor.org/stable/2345845>), divergence of those ratios indicates data separation.

# detectseparation

detectseparation provides pre-fit and post-fit methods for the detection of separation and of infinite maximum likelihood estimates in binomial response generalized linear models.

The key methods are detect_separation and check_infinite_estimates and this vignettes describes their use.

## Installation

You can install the released version of detectseparation from CRAN with:

install.packages("detectseparation")


And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("ikosmidis/detectseparation")


## Detecting and checking for Infinite maximum likelihood estimates

Heinze and Schemper (2002) used a logistic regression model to analyze data from a study on endometrial cancer (see, Agresti 2015, Section 5.7 or ?endometrial for more details on the data set). Below, we refit the model in Heinze and Schemper (2002) in order to demonstrate the functionality that detectseparation provides.

library("detectseparation")
data("endometrial", package = "detectseparation")
endo_glm <- glm(HG ~ NV + PI + EH, family = binomial(), data = endometrial)
theta_mle <- coef(endo_glm)
summary(endo_glm)
#>
#> Call:
#> glm(formula = HG ~ NV + PI + EH, family = binomial(), data = endometrial)
#>
#> Deviance Residuals:
#>      Min        1Q    Median        3Q       Max
#> -1.50137  -0.64108  -0.29432   0.00016   2.72777
#>
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    4.30452    1.63730   2.629 0.008563 **
#> NV            18.18556 1715.75089   0.011 0.991543
#> PI            -0.04218    0.04433  -0.952 0.341333
#> EH            -2.90261    0.84555  -3.433 0.000597 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#>     Null deviance: 104.903  on 78  degrees of freedom
#> Residual deviance:  55.393  on 75  degrees of freedom
#> AIC: 63.393
#>
#> Number of Fisher Scoring iterations: 17


The maximum likelihood (ML) estimate of the parameter for NV is actually infinite. The reported, apparently finite value is merely due to false convergence of the iterative estimation procedure. The same is true for the estimated standard error, and, hence the value r round(coef(summary(endo_glm))["NV", "z value"], 3) for the z-statistic cannot be trusted for inference on the size of the effect for NV.

### detect_separation

detect_separation is pre-fit method, in the sense that it does not need to estimate the model to detect separation and/or identify infinite estimates. For example

endo_sep <- glm(HG ~ NV + PI + EH, data = endometrial,
family = binomial("logit"),
method = "detect_separation")
endo_sep
#> Implementation: ROI | Solver: lpsolve
#> Separation: TRUE
#> Existence of maximum likelihood estimates
#> (Intercept)          NV          PI          EH
#>           0         Inf           0           0
#> 0: finite value, Inf: infinity, -Inf: -infinity


So, the actual maximum likelihood estimates are

coef(endo_glm) + coef(endo_sep)
#> (Intercept)          NV          PI          EH
#>   4.3045178         Inf  -0.0421834  -2.9026056


and the estimated standard errors are

coef(summary(endo_glm))[, "Std. Error"] + abs(coef(endo_sep))
#> (Intercept)          NV          PI          EH
#>  1.63729861         Inf  0.04433196  0.84555156


### check_infinite_estimates

Lesaffre and Albert (1989, Section 4) describe a procedure that can hint on the occurrence of infinite estimates. In particular, the model is successively refitted, by increasing the maximum number of allowed iteratively re-weighted least squares iterations at east step. The estimated asymptotic standard errors from each step are, then, divided to the corresponding ones from the first fit. If the sequence of ratios diverges, then the maximum likelihood estimate of the corresponding parameter is minus or plus infinity. The following code chunk applies this process to endo_glm.

(inf_check <- check_infinite_estimates(endo_glm))
#>       (Intercept)           NV       PI       EH
#>  [1,]    1.000000 1.000000e+00 1.000000 1.000000
#>  [2,]    1.424352 2.092407e+00 1.466885 1.672979
#>  [3,]    1.590802 8.822303e+00 1.648003 1.863563
#>  [4,]    1.592818 6.494231e+01 1.652508 1.864476
#>  [5,]    1.592855 7.911035e+02 1.652591 1.864492
#>  [6,]    1.592855 1.588973e+04 1.652592 1.864493
#>  [7,]    1.592855 5.298760e+05 1.652592 1.864493
#>  [8,]    1.592855 2.332822e+07 1.652592 1.864493
#>  [9,]    1.592855 2.332822e+07 1.652592 1.864493
#> [10,]    1.592855 2.332822e+07 1.652592 1.864493
#> [11,]    1.592855 2.332822e+07 1.652592 1.864493
#> [12,]    1.592855 2.332822e+07 1.652592 1.864493
#> [13,]    1.592855 2.332822e+07 1.652592 1.864493
#> [14,]    1.592855 2.332822e+07 1.652592 1.864493
#> [15,]    1.592855 2.332822e+07 1.652592 1.864493
#> [16,]    1.592855 2.332822e+07 1.652592 1.864493
#> [17,]    1.592855 2.332822e+07 1.652592 1.864493
#> [18,]    1.592855 2.332822e+07 1.652592 1.864493
#> [19,]    1.592855 2.332822e+07 1.652592 1.864493
#> [20,]    1.592855 2.332822e+07 1.652592 1.864493
#> attr(,"class")
#> [1] "inf_check"
plot(inf_check)


# References

Agresti, A. 2015. Foundations of Linear and Generalized Linear Models. Wiley Series in Probability and Statistics. Wiley.

Heinze, G., and M. Schemper. 2002. “A Solution to the Problem of Separation in Logistic Regression.” Statistics in Medicine 21: 2409–19.

Lesaffre, E., and A. Albert. 1989. “Partial Separation in Logistic Discrimination.” Journal of the Royal Statistical Society. Series B (Methodological) 51 (1): 109–16. http://www.jstor.org/stable/2345845.

## Functions in detectseparation

 Name Description detectseparation detectseparation: Methods for Detecting and Checking for Separation and Infinite Maximum Likelihood Estimates endometrial Histology grade and risk factors for 79 cases of endometrial cancer check_infinite_estimates Generic method for checking for infinite estimates check_infinite_estimates.glm A simple diagnostic of whether the maximum likelihood estimates are infinite detect_separation_control Auxiliary function for the glm interface when method is detect_separation. detect_separation Method for glm that tests for data separation and finds which parameters have infinite maximum likelihood estimates in generalized linear models with binomial responses lizards Habitat preferences of lizards No Results!