outlier_detection provides different types of outlier detection
algorithms depending on the arguments provided. The decision whether to
classify an observation as an outlier or not is based on its standardised
residual in comparison to some user-specified reference distribution.
The algorithms differ mainly in two ways. First, they can differ in the choice
of initial estimator, i.e. the estimator on which the first
classification as outliers is based. Second, the algorithm can either be
iterated a fixed number of times or until the difference in coefficient
estimates between the most recent model and the previous one is smaller than
some user-specified convergence criterion. The difference is measured by
the L2 norm.
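The classification step described above can be sketched in base R. This is a minimal illustration rather than the package's implementation: it assumes a two-sided rule under the normal reference distribution (so the cutoff is the 1 - sign_level / 2 quantile), and the residuals and sigma are made-up values.

```r
# Illustrative values only; none of these come from the package.
sign_level <- 0.05
res <- c(0.3, -1.2, 4.1, 0.8, -3.5)  # hypothetical residuals
sigma <- 1.5                         # hypothetical residual scale

# Assumption: two-sided cutoff under the normal reference distribution.
cutoff <- qnorm(1 - sign_level / 2)

# Standardise the residuals and compare their absolute size to the cutoff.
stdres <- res / sigma
is_outlier <- abs(stdres) > cutoff
print(is_outlier)
```

Under these assumptions, a smaller sign_level gives a larger cutoff, so fewer observations are flagged.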
outlier_detection(
  data,
  formula,
  ref_dist = c("normal"),
  sign_level,
  initial_est = c("robustified", "saturated", "user", "iis"),
  user_model = NULL,
  iterations = 1,
  convergence_criterion = NULL,
  max_iter = NULL,
  shuffle = FALSE,
  shuffle_seed = NULL,
  split = 0.5,
  verbose = FALSE,
  iis_args = NULL
)
outlier_detection returns an object of class
"robust2sls", which is a list with the following components:
$cons
A list which stores high-level information about the
function call and some results. $call is the captured function call,
$formula the formula argument, $data the original data set,
$reference the chosen reference distribution to classify outliers,
$sign_level the significance level, $psi the probability that
an observation is not classified as an outlier under the null hypothesis
of no outliers, $cutoff the cutoff used to classify outliers if
their standardised residuals are larger than that value, $bias_corr
a bias correction factor to account for potential false positives
(observations classified as outliers even though they are not). There are
three further elements that are lists themselves.
$initial stores settings about the initial estimator:
$estimator is the type of the initial estimator (e.g. robustified or
saturated), $split how the sample is split (NULL if argument
not used), $shuffle whether the sample is shuffled before splitting
(NULL if argument not used), $shuffle_seed the value of the
random seed (NULL if argument not used).
$convergence stores information about the convergence of the
outlier-detection algorithm:
$criterion is the user-specified convergence criterion (NULL
if argument not used), $difference is the L2 norm between the last
coefficient estimates and the previous ones (NULL if argument not
used or only initial estimator calculated). $converged is a logical
value indicating whether the algorithm has converged, i.e. whether the
difference is smaller than the convergence criterion (NULL if
argument not used). $max_iter is the maximum number of iterations set by the
user (NULL if argument not used or not set).
$iterations contains information about the user-specified iterations
argument ($setting) and the actual number of iterations that were
done ($actual). The actual number can be lower if the algorithm
converged before the user-specified number of iterations was
reached.
$model
A list storing the model objects of class
ivreg for each iteration. Each model is stored under
$m0, $m1, ...
$res
A list storing the residuals of all observations for
each iteration. Residuals of observations where any of the y, x, or z
variables used in the 2SLS model are missing are set to NA. Each vector is
stored under $m0, $m1, ...
$stdres
A list storing the standardised residuals of all
observations for each iteration. Standardised residuals of observations
where any of the y, x, or z variables used in the 2SLS model are missing
are set to NA. Standardisation is done by dividing by sigma, which is not
adjusted for degrees of freedom. Each vector is stored under $m0,
$m1, ...
$sel
A list of logical vectors storing whether an observation
is included in the estimation or not. Observations are excluded (FALSE) if
they either have missing values in any of the x, y, or z variables needed
in the model or when they are classified as outliers based on the model.
Each vector is stored under $m0, $m1, ...
$type
A list of integer vectors indicating whether an
observation has any missing values in x, y, or z (-1), whether it is
classified as an outlier (0) or not (1). Each vector is
stored under $m0, $m1, ...
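The coding of $type and its relation to $sel can be illustrated with a small base-R sketch. The standardised residuals and the cutoff are hypothetical, and the comparison on absolute values is an assumption; the code mirrors the description above and is not the package's own code.

```r
# Hypothetical standardised residuals; NA marks an observation with a
# missing value in y, x, or z.
stdres <- c(0.4, NA, -2.8, 1.1, 3.2)
cutoff <- 1.96  # assumed cutoff, for illustration only

# -1 = missing values, 0 = classified as outlier, 1 = not an outlier
type <- ifelse(is.na(stdres), -1L, ifelse(abs(stdres) > cutoff, 0L, 1L))

# An observation is included in the estimation only if it is complete
# and not classified as an outlier.
sel <- type == 1L

print(type)
print(sel)
```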
data
A dataframe.
formula
A formula for the ivreg function, i.e. in
the format y ~ x1 + x2 | x1 + z2 where y is the dependent
variable, x1 are the exogenous regressors, x2 the endogenous
regressors, and z2 the outside instruments.
ref_dist
A character vector that specifies the reference distribution
against which observations are classified as outliers. "normal" refers
to the normal distribution.
sign_level
A numeric value between 0 and 1 that determines the cutoff in the reference distribution against which observations are judged as outliers or not.
initial_est
A character vector that specifies the initial estimator
for the outlier detection algorithm. "robustified" means that the
full sample 2SLS is used as initial estimator. "saturated" splits
the sample into two parts and estimates a 2SLS on each subsample. The
coefficients of one subsample are used to calculate residuals and determine
outliers in the other subsample. "user" allows the user to specify a
model based on which observations are classified as outliers. "iis"
applies impulse indicator saturation (IIS) as implemented in
ivisat. See section "Warning" for more information
and conditions.
user_model
A model object of class ivreg. Only
required if argument initial_est is set to "user", otherwise
NULL.
iterations
Either an integer >= 0 that specifies how often the outlier
detection algorithm is iterated, or the character vector
"convergence". In the former case, the value 0 means that only
outlier classification based on the initial estimator is done. In the latter,
the algorithm is iterated until it converges, i.e. until the difference in
coefficient estimates between the most recent model and the previous one is
smaller than some user-specified convergence criterion.
convergence_criterion
A numeric value or NULL. The algorithm stops as
soon as the difference in coefficient estimates between the most recent model
and the previous one is smaller than convergence_criterion. The
difference is measured by the L2 norm. If the argument is set to a numeric
value but iterations is an integer > 0, then the algorithm stops either
when it has converged or when iterations is reached.
max_iter
A numeric value >= 1 or NULL. If
iterations = "convergence" is chosen, then the algorithm is stopped
after at most max_iter iterations. If a
convergence_criterion is also chosen, then the algorithm stops when either
the criterion is fulfilled or the maximum number of iterations is reached.
shuffle
A logical value or NULL. Only used if
initial_est == "saturated". If TRUE then the sample is shuffled
before creating the subsamples.
shuffle_seed
An integer value that will set the seed for shuffling the
sample or NULL. Only used if initial_est == "saturated" and
shuffle == TRUE.
split
A numeric value strictly between 0 and 1 that determines in which proportions the sample will be split.
verbose
A logical value indicating whether progress during estimation should be reported.
iis_args
A list with named entries corresponding to the arguments for
iis_init (t.pval, do.pet,
normality.JarqueB, turbo, overid, weak). Can be
NULL if initial_est != "iis".
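How iterations, convergence_criterion, and max_iter interact can be sketched as a stopping rule in base R. The helpers l2_diff and should_stop are hypothetical names introduced for this illustration; they follow the argument descriptions above and are not functions of the package.

```r
# L2 norm of the change in coefficient estimates between two iterations.
l2_diff <- function(beta_new, beta_old) {
  sqrt(sum((beta_new - beta_old)^2))
}

# Hypothetical helper: should the algorithm stop after iteration `iter`,
# given the latest L2 difference `diff`?
should_stop <- function(iter, diff, iterations,
                        convergence_criterion = NULL, max_iter = NULL) {
  converged <- !is.null(convergence_criterion) && !is.na(diff) &&
    diff < convergence_criterion
  if (identical(iterations, "convergence")) {
    # Iterate until convergence, but at most max_iter times if it is set.
    hit_max <- !is.null(max_iter) && iter >= max_iter
    return(converged || hit_max)
  }
  # Fixed number of iterations, with optional early stop on convergence.
  converged || iter >= iterations
}

print(l2_diff(c(4, 6), c(1, 2)))
print(should_stop(iter = 2, diff = 0.05, iterations = 5,
                  convergence_criterion = 0.1))
```

In this sketch, a fixed iterations value combined with a numeric convergence_criterion stops at whichever condition is met first, as described for convergence_criterion above.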
Warning
Check Jiao (2019)
(as well as a forthcoming working paper) for conditions that the
initial estimator should satisfy when
using initial_est == "user" (e.g. they have to be Op(1)).
IIS is a generalisation of Saturated 2SLS with
multiple block search, but no asymptotic theory exists for IIS.
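A call using only the arguments documented above might look as follows. This is a hedged sketch: the choice of the mtcars data and of this particular formula is an illustrative assumption, and the code is skipped when the robust2sls package is not installed.

```r
fit <- NULL
if (requireNamespace("robust2sls", quietly = TRUE)) {
  fit <- robust2sls::outlier_detection(
    data = datasets::mtcars,
    formula = mpg ~ cyl + disp | cyl + wt,  # illustrative model only
    ref_dist = "normal",
    sign_level = 0.05,
    initial_est = "robustified",
    iterations = 3L
  )
  # Count observations classified as outliers (type 0) in the last
  # stored iteration.
  last_type <- fit$type[[length(fit$type)]]
  print(sum(last_type == 0, na.rm = TRUE))
}
```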