orm: Ordinal Regression Model

Description

Fits ordinal cumulative probability models for continuous or ordinal response variables, efficiently allowing for a large number of intercepts by capitalizing on the information matrix being sparse. Five different distribution functions are implemented, with the default being the logistic (i.e., the proportional odds model). The ordinal cumulative probability models are stated in terms of exceedance probabilities ($Prob[Y \ge y | X]$) so that as with OLS larger predicted values are associated with larger Y. This is important to note for the asymmetric distributions given by the log-log and complementary log-log families, for which negating the linear predictor does not result in $Prob[Y < y | X]$. The family argument is defined in orm.fit. The model assumes that the inverse of the assumed cumulative distribution function, when applied to one minus the true cumulative distribution function and plotted on the $y$-axis (with the original $y$ on the $x$-axis) yields parallel curves (though not necessarily linear). This can be checked by plotting the inverse cumulative probability function of one minus the empirical distribution function, stratified by X, and assessing parallelism. Note that parametric regression models make the much stronger assumption of linearity of such inverse functions.

For the print method, format of output is controlled by the user previously running options(prType="lang") where lang is "plain" (the default), "latex", or "html".

Quantile.orm creates an R function that computes an estimate of a given quantile for a given value of the linear predictor (using the first intercept). It uses approximate linear interpolation on the original response variable scale (the interpolation is exactly linear of there is only one unique value of the linear predictor given as the second argument).

Usage

orm(formula, data, subset, na.action=na.delete, method="orm.fit", model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE,  penalty=0, penalty.matrix, tol=1e-7, eps=0.005,  var.penalty=c('simple','sandwich'), scale=FALSE, ...)
"print"(x, digits=4, coefs=TRUE, intercepts=x$non.slopes < 10, title, ...)
"Quantile"(object, codes=FALSE, ...)

Arguments

formula

a formula object. An offset term can be included. The offset causes fitting of a model such as $logit(Y=1) = X\beta + W$, where $W$ is the offset variable having no estimated coefficient. The response variable can be any data type; orm converts it in alphabetic or numeric order to a factor variable and recodes it 1,2,... internally.

data

data frame to use. Default is the current frame.

subset

logical expression or vector of subscripts defining a subset of observations to analyze

na.action

function to handle NAs in the data. Default is na.delete, which deletes any observation having response or predictor missing, while preserving the attributes of the predictors and maintaining frequencies of deletions due to each variable in the model. This is usually specified using options(na.action="na.delete").

method

name of fitting function. Only allowable choice at present is orm.fit.

model

causes the model frame to be returned in the fit object

causes the expanded design matrix (with missings excluded) to be returned under the name x. For print, an object created by orm.

causes the response variable (with missings excluded) to be returned under the name y.

linear.predictors

causes the predicted X beta (with missings excluded) to be returned under the name linear.predictors. The first intercept is used.

se.fit

causes the standard errors of the fitted values (on the linear predictor scale) to be returned under the name se.fit. The middle intercept is used.

penalty

see lrm

penalty.matrix

see lrm

tol

singularity criterion (see orm.fit)

eps

difference in $-2 log$ likelihood for declaring convergence

var.penalty

see lrm

scale

set to TRUE to subtract column means and divide by column standard deviations of the design matrix before fitting, and to back-solve for the un-normalized covariance matrix and regression coefficients. This can sometimes make the model converge for very large sample sizes where for example spline or polynomial component variables create scaling problems leading to loss of precision when accumulating sums of squares and crossproducts.

...

arguments that are passed to orm.fit, or from print, to prModFit. Ignored for Quantile. One of the most important arguments is family.

digits

number of significant digits to use

coefs

specify coefs=FALSE to suppress printing the table of model coefficients, standard errors, etc. Specify coefs=n to print only the first n regression coefficients in the model.

intercepts

By default, intercepts are only printed if there are fewer than 10 of them. Otherwise this is controlled by specifying intercepts=FALSE or TRUE.

title

a character string title to be passed to prModFit. Default is constructed from the name of the distribution family.

object

an object created by orm

codes

if TRUE, uses the integer codes $1,2,\ldots,k$ for the $k$-level response in computing the predicted quantile

Value

The returned fit object of orm contains the following components in addition to the ones mentioned under the optional arguments.

References

Sall J: A monotone regression smoother based on ordinal cumulative logistic regression, 1991. Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Applied Statistics 41:191--201, 1992.

Verweij PJM, Van Houwelingen JC: Penalized likelihood in Cox regression. Stat in Med 13:2427--2436, 1994.

Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942--951, 1992.

Shao J: Linear model selection by cross-validation. JASA 88:486--494, 1993.

Verweij PJM, Van Houwelingen JC: Crossvalidation in survival analysis. Stat in Med 12:2305--2314, 1993.

Harrell FE: Model uncertainty, penalization, and parsimony. Available from http://biostat.mc.vanderbilt.edu/FHHandouts.

Examples

Run this code

set.seed(1)
n <- 100
y <- round(runif(n), 2)
x1 <- sample(c(-1,0,1), n, TRUE)
x2 <- sample(c(-1,0,1), n, TRUE)
f <- lrm(y ~ x1 + x2, eps=1e-5)
g <- orm(y ~ x1 + x2, eps=1e-5)
max(abs(coef(g) - coef(f)))
w <- vcov(g, intercepts='all') / vcov(f) - 1
max(abs(w))

set.seed(1)
n <- 300
x1 <- c(rep(0,150), rep(1,150))
y <- rnorm(n) + 3*x1
g <- orm(y ~ x1)
g
k <- coef(g)
i <- num.intercepts(g)
h <- orm(y ~ x1, family=probit)
ll <- orm(y ~ x1, family=loglog)
cll <- orm(y ~ x1, family=cloglog)
cau <- orm(y ~ x1, family=cauchit)
x <- 1:i
z <- list(logistic=list(x=x, y=coef(g)[1:i]),
          probit  =list(x=x, y=coef(h)[1:i]),
          loglog  =list(x=x, y=coef(ll)[1:i]),
          cloglog =list(x=x, y=coef(cll)[1:i]))
labcurve(z, pl=TRUE, col=1:4, ylab='Intercept')

tapply(y, x1, mean)
m <- Mean(g)
m(w <- k[1] + k['x1']*c(0,1))
mh <- Mean(h)
wh <- coef(h)[1] + coef(h)['x1']*c(0,1)
mh(wh)

qu <- Quantile(g)
# Compare model estimated and empirical quantiles
cq <- function(y) {
   cat(qu(.1, w), tapply(y, x1, quantile, probs=.1), '\n')
   cat(qu(.5, w), tapply(y, x1, quantile, probs=.5), '\n')
   cat(qu(.9, w), tapply(y, x1, quantile, probs=.9), '\n')
   }
cq(y)

# Try on log-normal model
g <- orm(exp(y) ~ x1)
g
k <- coef(g)
plot(k[1:i])
m <- Mean(g)
m(w <- k[1] + k['x1']*c(0,1))
tapply(exp(y), x1, mean)

qu <- Quantile(g)
cq(exp(y))

# Compare predicted mean with ols for a continuous x
set.seed(3)
n <- 200
x1 <- rnorm(n)
y <- x1 + rnorm(n)
dd <- datadist(x1); options(datadist='dd')
f <- ols(y ~ x1)
g <- orm(y ~ x1, family=probit)
h <- orm(y ~ x1, family=logistic)
w <- orm(y ~ x1, family=cloglog)
mg <- Mean(g); mh <- Mean(h); mw <- Mean(w)
r <- rbind(ols      = Predict(f, conf.int=FALSE),
           probit   = Predict(g, conf.int=FALSE, fun=mg),
           logistic = Predict(h, conf.int=FALSE, fun=mh),
           cloglog  = Predict(w, conf.int=FALSE, fun=mw))
plot(r, groups='.set.')

# Compare predicted 0.8 quantile with quantile regression
qu <- Quantile(g)
qu80 <- function(lp) qu(.8, lp)
f <- Rq(y ~ x1, tau=.8)
r <- rbind(probit   = Predict(g, conf.int=FALSE, fun=qu80),
           quantreg = Predict(f, conf.int=FALSE))
plot(r, groups='.set.')

# Verify transformation invariance of ordinal regression
ga <- orm(exp(y) ~ x1, family=probit)
qua <- Quantile(ga)
qua80 <- function(lp) log(qua(.8, lp))
r <- rbind(logprobit = Predict(ga, conf.int=FALSE, fun=qua80),
           probit    = Predict(g,  conf.int=FALSE, fun=qu80))
plot(r, groups='.set.')

# Try the same with quantile regression.  Need to transform x1
fa <- Rq(exp(y) ~ rcs(x1,5), tau=.8)
r <- rbind(qr    = Predict(f, conf.int=FALSE),
           logqr = Predict(fa, conf.int=FALSE, fun=log))
plot(r, groups='.set.')
options(datadist=NULL)
## Not run: 
# ## Simulate power and type I error for orm logistic and probit regression
# ## for likelihood ratio, Wald, and score chi-square tests, and compare
# ## with t-test
# require(rms)
# set.seed(5)
# nsim <- 2000
# r <- NULL
# for(beta in c(0, .4)) {
#   for(n in c(10, 50, 300)) {
#     cat('beta=', beta, '  n=', n, '\n\n')
#     plogistic <- pprobit <- plogistics <- pprobits <- plogisticw <-
#       pprobitw <- ptt <- numeric(nsim)
#     x <- c(rep(0, n/2), rep(1, n/2))
#     pb <- setPb(nsim, every=25, label=paste('beta=', beta, '  n=', n))
#     for(j in 1:nsim) {
#       pb(j)
#       y <- beta*x + rnorm(n)
#       tt <- t.test(y ~ x)
#       ptt[j] <- tt$p.value
#       f <- orm(y ~ x)
#       plogistic[j]  <- f$stats['P']
#       plogistics[j] <- f$stats['Score P']
#       plogisticw[j] <- 1 - pchisq(coef(f)['x']^2 / vcov(f)[2,2], 1)
#       f <- orm(y ~ x, family=probit)
#       pprobit[j]  <- f$stats['P']
#       pprobits[j] <- f$stats['Score P']
#       pprobitw[j] <- 1 - pchisq(coef(f)['x']^2 / vcov(f)[2,2], 1)
#     }
#     if(beta == 0) plot(ecdf(plogistic))
#     r <- rbind(r, data.frame(beta         = beta, n=n,
#                              ttest        = mean(ptt        < 0.05),
#                              logisticlr   = mean(plogistic  < 0.05),
#                              logisticscore= mean(plogistics < 0.05),
#                              logisticwald = mean(plogisticw < 0.05),
#                              probit       = mean(pprobit    < 0.05),
#                              probitscore  = mean(pprobits   < 0.05),
#                              probitwald   = mean(pprobitw   < 0.05)))
#   }
# }
# print(r)
# #  beta   n  ttest logisticlr logisticscore logisticwald probit probitscore probitwald
# #1  0.0  10 0.0435     0.1060        0.0655        0.043 0.0920      0.0920     0.0820
# #2  0.0  50 0.0515     0.0635        0.0615        0.060 0.0620      0.0620     0.0620
# #3  0.0 300 0.0595     0.0595        0.0590        0.059 0.0605      0.0605     0.0605
# #4  0.4  10 0.0755     0.1595        0.1070        0.074 0.1430      0.1430     0.1285
# #5  0.4  50 0.2950     0.2960        0.2935        0.288 0.3120      0.3120     0.3120
# #6  0.4 300 0.9240     0.9215        0.9205        0.920 0.9230      0.9230     0.9230
# ## End(Not run)

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

References

See Also

Examples