outlier_detection: Outlier detection algorithms

Description

outlier_detection provides different types of outlier detection algorithms depending on the arguments provided. The decision whether to classify an observation as an outlier or not is based on its standardised residual in comparison to some user-specified reference distribution.
The algorithms differ mainly in two ways. First, they can differ in the choice of initial estimator, i.e. the estimator on which the first classification of outliers is based. Second, the algorithm can either be iterated a fixed number of times or until the difference in coefficient estimates between the most recent model and the previous one is smaller than some user-specified convergence criterion. The difference is measured by the L2 norm.
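
The iteration described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: lm() stands in for the 2SLS fit so that the example needs no packages, the cutoff is assumed to be two-sided under the normal reference distribution, no bias correction is applied, and the data, criterion and max_iter values are made up.

```r
# Sketch only: lm() replaces the 2SLS estimator; all values are hypothetical.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[1:5] <- y[1:5] + 10                 # contaminate five observations
dat <- data.frame(y = y, x = x)

sign_level <- 0.05
cutoff <- qnorm(1 - sign_level / 2)   # assumed two-sided normal cutoff
criterion <- 1e-8                     # hypothetical convergence criterion
max_iter <- 50                        # hypothetical iteration cap

sel <- rep(TRUE, n)                   # start from the full sample
fit <- lm(y ~ x, data = dat[sel, ])
for (iter in seq_len(max_iter)) {
  res <- dat$y - predict(fit, newdata = dat)   # residuals of all observations
  sigma <- sqrt(mean(res[sel]^2))              # not adjusted for degrees of freedom
  sel <- abs(res / sigma) <= cutoff            # keep observations not classified as outliers
  new_fit <- lm(y ~ x, data = dat[sel, ])
  # L2 norm of the change in coefficient estimates
  difference <- sqrt(sum((coef(new_fit) - coef(fit))^2))
  fit <- new_fit
  if (difference < criterion) break
}
```

The loop stops either when the coefficient change falls below the criterion or when the iteration cap is hit, mirroring the two stopping rules described above.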

Usage

outlier_detection(
  data,
  formula,
  ref_dist = c("normal"),
  sign_level,
  initial_est = c("robustified", "saturated", "user", "iis"),
  user_model = NULL,
  iterations = 1,
  convergence_criterion = NULL,
  max_iter = NULL,
  shuffle = FALSE,
  shuffle_seed = NULL,
  split = 0.5,
  verbose = FALSE,
  iis_args = NULL
)
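
A call might look like the following sketch. It uses only arguments from the signature above. The built-in mtcars data and the choice of regressors and instrument are purely illustrative, and the package name robust2sls is an assumption inferred from the class of the returned object. The call is guarded so that the snippet still runs when the package is not installed.

```r
# Illustrative formula: cyl exogenous, disp treated as endogenous, wt as outside instrument.
f <- mpg ~ cyl + disp | cyl + wt

# Assumption: the function is exported by a package called robust2sls.
if (requireNamespace("robust2sls", quietly = TRUE)) {
  result <- robust2sls::outlier_detection(
    data = mtcars,
    formula = f,
    ref_dist = "normal",
    sign_level = 0.05,
    initial_est = "robustified",
    iterations = 3
  )
  print(result$cons$iterations)  # requested vs. actual number of iterations
}
```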

Value

outlier_detection returns an object of class "robust2sls", which is a list with the following components:

$cons: A list which stores high-level information about the function call and some results. $call is the captured function call, $formula the formula argument, $data the original data set, $reference the chosen reference distribution to classify outliers, $sign_level the significance level, $psi the probability that an observation is not classified as an outlier under the null hypothesis of no outliers, $cutoff the cutoff used to classify outliers if their standardised residuals are larger than that value, $bias_corr a bias correction factor to account for potential false positives (observations classified as outliers even though they are not). There are three further elements that are lists themselves.

$initial stores settings about the initial estimator: $estimator is the type of the initial estimator (e.g. robustified or saturated), $split how the sample is split (NULL if argument not used), $shuffle whether the sample is shuffled before splitting (NULL if argument not used), $shuffle_seed the value of the random seed (NULL if argument not used).

$convergence stores information about the convergence of the outlier-detection algorithm: $criterion is the user-specified convergence criterion (NULL if argument not used), $difference is the L2 norm of the difference between the last coefficient estimates and the previous ones (NULL if argument not used or only initial estimator calculated), $converged is a logical value indicating whether the algorithm has converged, i.e. whether the difference is smaller than the convergence criterion (NULL if argument not used), and $max_iter is the maximum number of iterations set by the user (NULL if argument not used or not set).

$iterations contains information about the user-specified iterations argument ($setting) and the actual number of iterations that were done ($actual). The actual number can be lower if the algorithm converged before the user-specified number of iterations was reached.
$model: A list storing the model objects of class ivreg for each iteration. Each model is stored under $m0, $m1, ...
$res: A list storing the residuals of all observations for each iteration. Residuals of observations where any of the y, x, or z variables used in the 2SLS model are missing are set to NA. Each vector is stored under $m0, $m1, ...
$stdres: A list storing the standardised residuals of all observations for each iteration. Standardised residuals of observations where any of the y, x, or z variables used in the 2SLS model are missing are set to NA. Standardisation is done by dividing by sigma, which is not adjusted for degrees of freedom. Each vector is stored under $m0, $m1, ...
$sel: A list of logical vectors storing whether an observation is included in the estimation or not. Observations are excluded (FALSE) if they either have missing values in any of the x, y, or z variables needed in the model or when they are classified as outliers based on the model. Each vector is stored under $m0, $m1, ...
$type: A list of integer vectors indicating whether an observation has any missing values in x, y, or z (-1), whether it is classified as an outlier (0) or not (1). Each vector is stored under $m0, $m1, ...
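
The coding used in $type and $sel can be illustrated with a small base R sketch. The standardised residuals below are made up, and the two-sided comparison against a normal cutoff is an assumption about how the classification is applied.

```r
# Hypothetical standardised residuals; NA marks an observation with missing y, x, or z.
stdres <- c(0.3, -2.7, NA, 1.1, 3.4, -0.8)
cutoff <- qnorm(1 - 0.05 / 2)  # assumed two-sided cutoff for sign_level = 0.05

# -1 = missing values, 0 = classified as outlier, 1 = not an outlier
type <- ifelse(is.na(stdres), -1L, ifelse(abs(stdres) > cutoff, 0L, 1L))

# Only observations that are neither missing nor outliers enter the estimation.
sel <- type == 1L

print(type)
print(sel)
```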

Arguments

data: A dataframe.
formula: A formula for the ivreg function, i.e. in the format y ~ x1 + x2 | x1 + z2 where y is the dependent variable, x1 are the exogenous regressors, x2 the endogenous regressors, and z2 the outside instruments.
ref_dist: A character vector that specifies the reference distribution against which observations are classified as outliers. "normal" refers to the normal distribution.
sign_level: A numeric value between 0 and 1 that determines the cutoff in the reference distribution against which observations are judged as outliers or not.
initial_est: A character vector that specifies the initial estimator for the outlier detection algorithm. "robustified" means that the full sample 2SLS is used as initial estimator. "saturated" splits the sample into two parts and estimates a 2SLS on each subsample. The coefficients of one subsample are used to calculate residuals and determine outliers in the other subsample. "user" allows the user to specify a model based on which observations are classified as outliers. "iis" applies impulse indicator saturation (IIS) as implemented in ivisat. See section "Warning" for more information and conditions.
user_model: A model object of class ivreg. Only required if argument initial_est is set to "user", otherwise NULL.
iterations: Either an integer >= 0 that specifies how often the outlier detection algorithm is iterated, or the character vector "convergence". In the former case, the value 0 means that only outlier classification based on the initial estimator is done. In the latter, the algorithm is iterated until it converges, i.e. until the difference in coefficient estimates between the most recent model and the previous one is smaller than some user-specified convergence criterion.
convergence_criterion: A numeric value or NULL. The algorithm stops as soon as the difference in coefficient estimates between the most recent model and the previous one is smaller than convergence_criterion. The difference is measured by the L2 norm. If the argument is set to a numeric value but iterations is an integer > 0, then the algorithm stops either when it has converged or when the specified number of iterations is reached.
max_iter: A numeric value >= 1 or NULL. If iterations = "convergence" is chosen, then the algorithm is stopped after at most max_iter iterations. If also a convergence_criterion is chosen then the algorithm stops when either the criterion is fulfilled or the maximum number of iterations is reached.
shuffle: A logical value or NULL. Only used if initial_est == "saturated". If TRUE then the sample is shuffled before creating the subsamples.
shuffle_seed: An integer value that will set the seed for shuffling the sample or NULL. Only used if initial_est == "saturated" and shuffle == TRUE.
split: A numeric value strictly between 0 and 1 that determines in which proportions the sample will be split.
verbose: A logical value whether progress during estimation should be reported.
iis_args: A list with named entries corresponding to the arguments for iis_init (t.pval, do.pet, normality.JarqueB, turbo, overid, weak). Can be NULL if initial_est != "iis".
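
The roles of split, shuffle and shuffle_seed for initial_est == "saturated" can be sketched in base R as follows. As before, lm() is only a stand-in for the 2SLS fit, the data are simulated, and rounding the first subsample size down with floor() is an assumption rather than documented behaviour.

```r
# Sketch only: simulated data and lm() in place of the 2SLS estimator.
set.seed(2)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

split <- 0.5
shuffle <- TRUE
shuffle_seed <- 42

idx <- seq_len(n)
if (shuffle) {
  set.seed(shuffle_seed)   # makes the shuffle reproducible
  idx <- sample(idx)
}
n1 <- floor(n * split)     # assumed rounding rule for the first subsample
part1 <- idx[seq_len(n1)]
part2 <- idx[-seq_len(n1)]

# Estimate on each subsample separately.
fit1 <- lm(y ~ x, data = dat[part1, ])
fit2 <- lm(y ~ x, data = dat[part2, ])

# Residuals in one subsample use the coefficients from the other subsample.
res <- numeric(n)
res[part1] <- dat$y[part1] - predict(fit2, newdata = dat[part1, ])
res[part2] <- dat$y[part2] - predict(fit1, newdata = dat[part2, ])
```

Because each observation is judged by a model that was estimated without it, an outlier cannot mask itself by pulling its own fit towards it.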

Warning

See Jiao (2019), as well as a forthcoming working paper, for the conditions that the initial estimator should satisfy when using initial_est == "user" (e.g. it has to be Op(1)). IIS is a generalisation of Saturated 2SLS with multiple block search, but no asymptotic theory exists for IIS.