id_estimate: Estimate an `idealstan` model

Description

This function will take a pre-processed idealdata vote/score dataframe and run one of the available IRT/latent space ideal point models on the data using Stan's MCMC engine.

Usage

id_estimate(idealdata = NULL, model_type = 2, inflate_zero = FALSE,
  vary_ideal_pts = "none", use_subset = FALSE, sample_it = FALSE,
  subset_group = NULL, subset_person = NULL, sample_size = 20,
  nchains = 4, niters = 2000, use_vb = FALSE,
  restrict_ind_high = NULL, id_diff = 4, id_diff_high = 2,
  restrict_ind_low = NULL, fixtype = "vb_full", prior_fit = NULL,
  warmup = floor(niters/2), ncores = 4, use_groups = FALSE,
  discrim_reg_sd = 1, discrim_miss_sd = 1, person_sd = 1,
  time_sd = 0.1, sample_stationary = FALSE, ar_sd = 2,
  diff_reg_sd = 1, diff_miss_sd = 1, restrict_sd = 0.01,
  restrict_mean = NULL, restrict_var = NULL,
  restrict_mean_val = NULL, restrict_mean_ind = NULL,
  restrict_var_high = 0.1, ...)

Arguments

idealdata

An object produced by id_make containing a score/vote matrix for use in estimation and plotting.

model_type

An integer reflecting the kind of model to be estimated. See below.

inflate_zero

If the outcome is distributed as Poisson (count/unbounded integer), setting this to TRUE will fit a traditional zero-inflated model. To use correctly, the value for zero must be passed as the miss_val option to id_make before running a model so that zeroes are coded as missing data.

vary_ideal_pts

Default 'none'. If 'random_walk' or 'AR1', a time-varying ideal point model will be fit with either a random-walk process or an AR1 process. See documentation for more info.

use_subset

Whether a subset of the legislators/persons should be used instead of the full response matrix

sample_it

Whether or not to use a random subsample of the response matrix. Useful for testing.

subset_group

If person/legislative data was included in the id_make function, then you can subset by any value in the $group column of that data if use_subset is TRUE.

subset_person

A list of character values of names of persons/legislators to use to subset if use_subset is TRUE and person/legislative data was included in the id_make function with the required $person.names column

sample_size

If sample_it is TRUE, this value reflects how many legislators/persons will be sampled from the response matrix

nchains

The number of chains to use in Stan's sampler. Minimum is one. See stan for more info.

niters

The number of iterations to run Stan's sampler. Shouldn't be set much lower than 500. See stan for more info.

use_vb

Whether or not to use Stan's variational Bayesian inference engine instead of full Bayesian inference. Pros: it's much faster. Cons: it's not quite as accurate. See vb for more info.

restrict_ind_high

If fixtype is not "vb_full", the particular indices of legislators/persons or bills/items to constrain high.

id_diff

The fixed difference between the high/low person/legislator ideal points used to identify the model. Set at 4 as a standard value but can be changed to any arbitrary number without affecting model results besides re-scaling.

id_diff_high

The fixed intercept of the high ideal point used to constrain the model.

restrict_ind_low

If fixtype is not "vb_full", the particular indices of legislators/persons or bills/items to constrain low. (Note: not used if values are pinned).

fixtype

Sets the kind of identification used on the model. One of 'vb_full' (identification provided exclusively by running a variational identification model with no prior info), 'vb_partial' (two indices of ideal points to fix are provided, but the values to fix are determined by the identification model), 'constrain' (two indices of ideal points to fix are provided; only sufficient for the model if restrict_var is FALSE), and 'prior_fit' (a previously identified idealstan fit is passed to the prior_fit option and used as the basis for identification). See the Identification section for more information.

prior_fit

If a previous idealstan model was fit with the same data, then the same identification constraints can be recycled from the prior fit if the idealstan object is passed to this option. Note that this means that all identification options, like restrict_var, will also be the same.

warmup

The number of iterations to use to calibrate Stan's sampler on a given model. Shouldn't be less than 100. See stan for more info.

ncores

The number of cores in your computer to use for parallel processing in the Stan engine. See stan for more info.

use_groups

If TRUE, group parameters from the person/legis data given in id_make will be estimated instead of individual parameters.

discrim_reg_sd

Set the prior standard deviation of the bimodal prior for the discrimination parameters for the non-inflated model.

discrim_miss_sd

Set the prior standard deviation of the bimodal prior for the discrimination parameters for the inflated model.

person_sd

Set the prior standard deviation for the legislators (persons) parameters

time_sd

The precision (inverse variance) of the over-time component of the person/legislator parameters. A higher value will allow for less over-time variation (useful if estimates bounce too much). Default is 0.1.

sample_stationary

If TRUE, the AR(1) coefficients in a time-varying model will be sampled from an unconstrained space and then mapped back to a stationary space. Setting this to TRUE is slower but will work better when there is limited information to identify a model. If used, the ar_sd parameter should be increased to 5 to allow for wider sampling in the unconstrained space.

ar_sd

If an AR(1) model is used, this defines the prior scale of the Normal distribution. A lower number can help identify the model when there are few time points.

diff_reg_sd

Set the prior standard deviation for the bill (item) intercepts for the non-inflated model.

diff_miss_sd

Set the prior standard deviation for the bill (item) intercepts for the inflated model.

restrict_sd

Set the prior standard deviation for constrained parameters

restrict_mean

Whether or not to restrict the over-time mean of an ideal point (additional identification measure when standard fixes don't work). TRUE by default for random-walk models.

restrict_var

Whether to cap the variance of random-walk time series models (the cap is given by restrict_var_high). If left blank (the default), it is set to TRUE for random-walk models and FALSE for AR(1) models. It can be set to TRUE for AR(1) models if identification is still a challenge, although this is probably overkill.

restrict_mean_val

For random-walk models, the mean of a time-series ideal point to constrain. Should not be set a priori (leave blank) unless you are absolutely sure. Otherwise it is set by the identification model.

restrict_mean_ind

For random-walk models, the ID of the person/group whose over-time mean to constrain. Should be left blank (will be set by identification model) unless you are really sure.

restrict_var_high

The upper limit for the variance parameter (if restrict_var=TRUE & model is a random-walk time-series). If left blank, either defaults to 0.1 or is set by identification model.

...

Additional parameters passed on to Stan's sampling engine. See stan for more information.

Value

A fitted idealstan object that contains posterior samples of all parameters either via full Bayesian inference or a variational approximation if use_vb is set to TRUE. This object can then be passed to the plotting functions for further analysis.

Identification

Identifying IRT models is challenging, and ideal point models are still more challenging because the discrimination parameters are not constrained. As a result, more care must be taken to obtain estimates that are the same regardless of starting values. The parameter fixtype enables you to change the type of identification used.

The default, 'vb_full', does not require any further information from you in order for the model to be fit. In this version of identification, an unidentified model is run using variational Bayesian inference (see vb). The function will then select two persons/legislators that end up on either end of the ideal point spectrum, and pin their ideal points to those specific values. This is sufficient to identify all of the static models and also the AR(1) time-varying models. For random-walk time-varying models, identification is more difficult (see vignette). Setting the option restrict_mean to TRUE will implement additional identification constraints on random-walk models.

A particularly convenient option for fixtype is 'vb_partial'. In this case, the user should pass the IDs (as a character vector) of the persons to constrain high (restrict_ind_high) and low (restrict_ind_low). A model will then be fit to find the likely positions of these parameters, which will then be used to fit an identified model. In this way, the user can achieve a certain shape of the ideal point distribution without needing to choose specific values ahead of time to pin parameters to.

If a prior model has been estimated with the same data, the user can re-use those identification settings by passing the fitted idealstan object to the prior_fit option.
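The identification problem described above can be illustrated in base R without fitting any model. This is a minimal sketch, assuming the generic 2-PL form P(y = 1) = plogis(discrim * ideal - diff); the function name prob_yes and this parameterization are illustrative assumptions, not idealstan's internal code. It shows that flipping signs or rescaling leaves the predicted probabilities unchanged, which is why two persons are pinned to fixed values.

```r
# Assumed (hypothetical) 2-PL ideal point form:
# P(y = 1) = plogis(discrim * ideal - diff)
prob_yes <- function(ideal, discrim, diff) {
  # rows are persons, columns are items
  linear <- outer(ideal, discrim) -
    matrix(diff, nrow = length(ideal), ncol = length(discrim), byrow = TRUE)
  plogis(linear)
}

set.seed(42)
ideal   <- rnorm(5)  # person ideal points
discrim <- rnorm(3)  # item discriminations (unconstrained sign)
diff    <- rnorm(3)  # item intercepts

p_original <- prob_yes(ideal, discrim, diff)

# Reflection: flip the sign of ideal points and discriminations together
p_flipped <- prob_yes(-ideal, -discrim, diff)

# Rescaling: stretch ideal points by a, shrink discriminations by a
a <- 2
p_rescaled <- prob_yes(a * ideal, discrim / a, diff)

# Both alternative parameter sets give the same probabilities
all.equal(p_original, p_flipped)
all.equal(p_original, p_rescaled)
```

Because the data cannot distinguish these parameter sets, pinning one person high and one person low removes the sign and scale ambiguity at the same time.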

Details

To run an IRT ideal point model, you must first pre-process your data using the id_make function. Be sure to specify the correct options for the kind of model you are going to run: if you want to run an unbounded outcome (i.e. Poisson or continuous), the data needs to be processed differently. Any hierarchical covariates at the person or item level also need to be specified in id_make. If they are specified in id_make, then all subsequent models fit by this function will include these covariates.

Note that for static ideal point models, the covariates are only defined for those persons who are not being used as constraints.

As of this version of idealstan, the following model types are available. Simply pass the number of the model in the list to the model_type option to fit the model.

1. IRT 2-PL (binary response) ideal point model, no missing-data inflation
2. IRT 2-PL (binary response) ideal point model with missing-data inflation
3. Ordinal IRT (rating scale) ideal point model, no missing-data inflation
4. Ordinal IRT (rating scale) ideal point model with missing-data inflation
5. Ordinal IRT (graded response) ideal point model, no missing-data inflation
6. Ordinal IRT (graded response) ideal point model with missing-data inflation
7. Poisson IRT (Wordfish) ideal point model, no missing-data inflation
8. Poisson IRT (Wordfish) ideal point model with missing-data inflation
9. Unbounded (Gaussian) IRT ideal point model, no missing-data inflation
10. Unbounded (Gaussian) IRT ideal point model with missing-data inflation
11. Positive-unbounded (Log-normal) IRT ideal point model, no missing-data inflation
12. Positive-unbounded (Log-normal) IRT ideal point model with missing-data inflation
13. Latent Space (binary response) ideal point model, no missing-data inflation
14. Latent Space (binary response) ideal point model with missing-data inflation

In addition, each of these models can have time-varying ideal point (person) parameters if a column of dates is fed to the id_make function. If the option vary_ideal_pts is set to 'random_walk', id_estimate will estimate a random-walk ideal point model where ideal points move in a random direction. If vary_ideal_pts is set to 'AR1', a stationary ideal point model is estimated where ideal points fluctuate around a long-term mean. In general, the stationary model is preferred when the time series is of short absolute duration (such as days or hours), while the random-walk model is preferable when the time series is of very long duration and there are no natural limits to the ideal points. Please see the package vignette and associated paper for more detail about these time-varying models.
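The difference between the two time processes can be seen by simulating one person's ideal point path in base R. This is a sketch only: the function ar1_path, the long-run mean mu, and the coefficient phi are assumed names for a textbook AR(1) process, and the exact parameterization used inside idealstan may differ.

```r
set.seed(1)
n_time <- 50
shocks <- rnorm(n_time, mean = 0, sd = 0.1)  # period-to-period innovations

# Random walk: each ideal point is the previous one plus a shock,
# so the path can drift without limit
random_walk <- cumsum(shocks)

# AR(1): the ideal point is pulled back toward a long-term mean mu
# (hypothetical helper; |phi| < 1 gives a stationary path)
ar1_path <- function(shocks, mu = 0, phi = 0.5) {
  x <- numeric(length(shocks))
  x[1] <- mu + shocks[1]
  for (t in seq_along(shocks)[-1]) {
    x[t] <- mu + phi * (x[t - 1] - mu) + shocks[t]
  }
  x
}

ar1 <- ar1_path(shocks, mu = 0, phi = 0.5)

# Compare how far each path wanders from its starting region
range(random_walk)
range(ar1)
```

Setting phi to 1 with mu at 0 in this sketch reproduces the random walk, which is one way to see the random walk as the limiting, non-stationary case.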

The inflation model used to account for missing data assumes that missingness is a function of the persons' (legislators') ideal points. In other words, the model will take into account whether people with high or low ideal points tend to have more or less missing data on a specific item/bill. Missing data is whatever was passed as miss_val to the id_make function. If there isn't any relationship between missing data and ideal points, then the model assumes that the missingness is ignorable conditional on each item, but it will still adjust the results to reflect these ignorable (random) missing values. The inflation is designed to be general enough to handle a wide array of potential situations where strategic social choices make missing data important to take into account.
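One way to picture this is a two-stage (hurdle-style) calculation for a binary outcome, sketched below in base R. The function outcome_probs and its functional form are assumptions made for illustration; the argument names simply echo the prior options discrim_reg_sd, discrim_miss_sd, diff_reg_sd and diff_miss_sd, and the package's actual likelihood is described in the vignette and paper.

```r
# Hypothetical hurdle-style sketch, not idealstan's internal likelihood
outcome_probs <- function(ideal, discrim_reg, diff_reg,
                          discrim_miss, diff_miss) {
  # Stage 1: probability the response is missing, as a function of ideal point
  p_miss <- plogis(discrim_miss * ideal - diff_miss)
  # Stage 2: probability of a "yes" given that a response was recorded
  p_yes <- plogis(discrim_reg * ideal - diff_reg)
  c(missing = p_miss,
    yes     = (1 - p_miss) * p_yes,
    no      = (1 - p_miss) * (1 - p_yes))
}

# Missingness related to ideal points: discrim_miss is non-zero
probs_low  <- outcome_probs(ideal = -1, discrim_reg = 1, diff_reg = 0,
                            discrim_miss = 1.5, diff_miss = 0.5)
probs_high <- outcome_probs(ideal = 1, discrim_reg = 1, diff_reg = 0,
                            discrim_miss = 1.5, diff_miss = 0.5)

# Missingness unrelated to ideal points: discrim_miss is zero, so the
# missing probability depends only on the item
probs_ignorable <- outcome_probs(ideal = 1, discrim_reg = 1, diff_reg = 0,
                                 discrim_miss = 0, diff_miss = 0.5)

rbind(probs_low, probs_high, probs_ignorable)
```

In this sketch a zero missing-data discrimination corresponds to the ignorable case mentioned above, where every person has the same chance of being missing on a given item.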

The missing data is assumed to be any possible value of the outcome. The well-known zero-inflated Poisson model is a special case where missing values are known to be all zeroes. To fit a zero-inflated Poisson model, set inflate_zero to TRUE and make sure to pass the value for zero as miss_val in the id_make function. This will only work for outcomes that are distributed as Poisson variables (i.e., unbounded integers or counts).
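For readers unfamiliar with the zero-inflated Poisson, its probability mass function can be written in a few lines of base R. This is a generic textbook sketch; the function zip_pmf and the argument p_zero (the probability of a structural zero) are illustrative names, and in an ideal point model that probability would itself depend on the person and item parameters.

```r
# Generic zero-inflated Poisson pmf (hypothetical helper for illustration)
zip_pmf <- function(k, lambda, p_zero) {
  ifelse(k == 0,
         p_zero + (1 - p_zero) * dpois(0, lambda),  # structural or Poisson zero
         (1 - p_zero) * dpois(k, lambda))           # positive counts
}

counts <- 0:10
data.frame(count   = counts,
           poisson = dpois(counts, lambda = 3),
           zip     = zip_pmf(counts, lambda = 3, p_zero = 0.3))
```

The zero-inflated version puts extra mass on zero and scales down every positive count by the same factor, which is why zeroes must be flagged separately (through miss_val) before fitting.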

To leave missing data out of the model, simply choose a version of the model in the list above that is non-inflated.

Models can be either fit on the person/legislator IDs or on group-level IDs (as specified to the id_make function). If group-level parameters should be fit, set use_groups to TRUE.

References

Clinton, J., Jackman, S., & Rivers, D. (2004). The Statistical Analysis of Roll Call Data. The American Political Science Review, 98(2), 355-370. doi:10.1017/S0003055404001194
Bafumi, J., Gelman, A., Park, D., & Kaplan, N. (2005). Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation. Political Analysis, 13(2), 171-187. doi:10.1093/pan/mpi010
Kubinec, R. "Generalized Ideal Point Models for Time-Varying and Missing-Data Inference". Working Paper.

Examples


# NOT RUN {
# First we can simulate data for an IRT 2-PL model that is inflated for missing data
library(ggplot2)
library(dplyr)

# This code will take at least a few minutes to run 
# }
# NOT RUN {
bin_irt_2pl_abs_sim <- id_sim_gen(model_type='binary',inflate=TRUE)

# Now we can put that directly into the id_estimate function 
# to get full Bayesian posterior estimates
# We will constrain the highest and lowest persons 
# for identification purposes based on the true simulated values

bin_irt_2pl_abs_est <- id_estimate(bin_irt_2pl_abs_sim,
                       model_type=2,
                       # index of the person with the highest true ideal point
                       restrict_ind_high = 
                       sort(bin_irt_2pl_abs_sim@simul_data$true_person,
                       decreasing=TRUE,
                       index.return=TRUE)$ix[1],
                       # index of the person with the lowest true ideal point
                       restrict_ind_low = 
                       sort(bin_irt_2pl_abs_sim@simul_data$true_person,
                       decreasing=FALSE,
                       index.return=TRUE)$ix[1],
                       fixtype='vb_partial',
                       ncores=2,
                       nchains=2)
                                   
# We can now see how well the model recovered the true parameters

id_sim_coverage(bin_irt_2pl_abs_est) %>% 
         bind_rows(.id='Parameter') %>% 
         ggplot(aes(y=avg,x=Parameter)) +
           stat_summary(fun.args=list(mult=1.96)) + 
           theme_minimal()
 
# }
# NOT RUN {
# In most cases, we will use pre-existing data 
# and we will need to use the id_make function first
# We will use the full rollcall voting data 
# from the 114th Senate as a rollcall object

data('senate114')

# Running this model will take at least a few minutes, even with 
# variational inference (use_vb=T) turned on
# }
# NOT RUN {
to_idealstan <- id_make(score_data = senate114,
                        outcome = 'cast_code',
                        person_id = 'bioname',
                        item_id = 'rollnumber',
                        group_id = 'party_code',
                        time_id = 'date',
                        high_val = 'Yes',
                        low_val = 'No',
                        miss_val = 'Absent')

# Pass the object created by id_make above
sen_est <- id_estimate(to_idealstan,
                       model_type = 2,
                       use_vb = TRUE,
                       fixtype = 'vb_partial',
                       restrict_ind_high = "BARRASSO, John A.",
                       restrict_ind_low = "WARREN, Elizabeth")

# After running the model, we can plot 
# the results of the person/legislator ideal points

id_plot_legis(sen_est)
# }
