harsm: Friendly interface to softmax regression under Harville model.

Description

A user friendly interface to the softmax regression under the Harville model.

Usage

harsm(formula, data, group = NULL, weights = NULL,
  na.action = na.omit)
# S3 method for harsm
vcov(object, ...)
# S3 method for harsm
print(x, ...)

Arguments

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.

group

the string name of the group variable in the data, or a bare character with the group name. The group indices need not be integers, but that is more efficient. They need not be sorted.

weights

an optional vector of weights, or the string or bare name of the weights in the data for use in the fitting process. The weights are attached to the outcomes, not the participant. Set to NULL for none.

na.action

How to deal with missing values in the outcomes, groups, weights, etc.

object

an object of class harsm.

...

additional arguments to be passed to the low level regression fitting functions (see below).

logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned.

Value

An object of class harsm, but also of maxLik with the fit.

Details

Performs a softmax regression by groups, via Maximum Likelihood Estimation, under the Harville model. We fit \(\beta\) where odds are \(\eta = x^{\top}\beta\) for independent variables \(x\). The probability of taking first place is then \(\mu=c\exp{\eta}\), where the \(c\) is chosen so the \(\mu\) sum to one. Under the Harville model, conditional on the first place finisher having been observed, the probability model for second (and successive) places with the probabilities of the remaining participants renormalized.

The print method of the harsm object includes a display of the R-squared. This measures the improvement in squared errors of the expected rank from the model over the null model which posits that all odds are equal. When the formula includes an offset, a ‘delta R-squared’ is also output. This is the improvement in predicted ranks over the model based on the offset term. Note that the expected ranks are only easy to produce under the Harville model; under the Henery model, the summary R-squared statistics are not produced. Note that this computation uses weighted sums of squares, as the weights contribute to the likelihood term. However, the square sum computation does not take into account the standard error of the rank, and so unlike in linear regression, the softmax regression does not always give positive R-squareds, and the statistic is otherwise hard to interpret.

Examples

Run this code

# NOT RUN {
nfeat <- 5
set.seed(1234)
g <- ceiling(seq(0.1,1000,by=0.1))
X <- matrix(rnorm(length(g) * nfeat),ncol=nfeat)
beta <- rnorm(nfeat)
eta <- X %*% beta
y <- rsm(eta,g)
# now the pretty frontend
data <- cbind(data.frame(outcome=y,race=g),as.data.frame(X))

fmla <- outcome ~ V1 + V2 + V3 + V4 + V5
fitm <- harsm(fmla,data,group=race)

eta0 <- rowMeans(X)
data <- cbind(data.frame(outcome=y,race=g,eta0=eta0),as.data.frame(X))
fmla <- outcome ~ offset(eta0) + V1 + V2 + V3 + V4 + V5
fitm <- harsm(fmla,data,group=race)

# with weights
data <- cbind(data.frame(outcome=y,race=g,eta0=eta0),as.data.frame(X))
data$wts <- runif(nrow(data),min=1,max=2)
fmla <- outcome ~ offset(eta0) + V1 + V2 + V3 + V4 + V5
fitm <- harsm(fmla,data,group=race,weights=wts)

# softmax on the Best Picture data
data(best_picture)
df <- best_picture
df$place <- ifelse(df$winner,1,2)
df$weight <- ifelse(df$winner,1,0)

fmla <- place ~ nominated_for_BestDirector + nominated_for_BestActor + Drama 

harsm(fmla,data=df,group=year,weights=weight) 

# }
# NOT RUN {
# test against logistic regression
if (require(dplyr)) {
nevent <- 10000
set.seed(1234)
adf <- data_frame(eventnum=floor(seq(1,nevent + 0.7,by=0.5))) %>%
  mutate(x=rnorm(n()),
         program_num=rep(c(1,2),nevent),
         intercept=as.numeric(program_num==1),
         eta=1.5 * x + 0.3 * intercept,
         place=ohenery::rsm(eta,g=eventnum))

# Harville model
modh <- harsm(place ~ intercept + x,data=adf,group=eventnum)

# the collapsed data.frame for glm
ddf <- adf %>%
  arrange(eventnum,program_num) %>%
  group_by(eventnum) %>%
    summarize(resu=as.numeric(first(place)==1),
              delx=first(x) - last(x),
              deli=first(intercept) - last(intercept)) %>%
  ungroup()

# glm logistic fit
modg <- glm(resu ~ delx + 1,data=ddf,family=binomial(link='logit'))

all.equal(as.numeric(coef(modh)),as.numeric(coef(modg)),tolerance=1e-4)
all.equal(as.numeric(vcov(modh)),as.numeric(vcov(modg)),tolerance=1e-4)
}

# }
# NOT RUN {
# }