outlier_detection
provides different types of outlier detection
algorithms depending on the arguments provided. The decision whether to
classify an observation as an outlier is based on its standardised
residual in comparison to some user-specified reference distribution.
The algorithms differ mainly in two ways. First, they can differ in the
choice of initial estimator, i.e. the estimator on which the first
classification as outliers is based. Second, the algorithm can either be
iterated a fixed number of times or until the difference in coefficient
estimates between the most recent model and the previous one is smaller than
some user-specified convergence criterion. The difference is measured by
the L2 norm.
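The convergence check described above can be sketched in base R. This is a minimal illustration only: the helper name l2_diff and all numbers are made up, and it is not taken from the package source.

```r
# Hypothetical helper: L2 norm of the change in coefficient estimates
# between two consecutive iterations.
l2_diff <- function(beta_new, beta_old) {
  sqrt(sum((beta_new - beta_old)^2))
}

beta_prev <- c(1.00, 2.00, -0.50)  # made-up estimates, previous model
beta_curr <- c(1.03, 1.96, -0.50)  # made-up estimates, most recent model

difference <- l2_diff(beta_curr, beta_prev)
convergence_criterion <- 0.1       # illustrative tolerance

# The algorithm would stop once the difference falls below the criterion.
converged <- difference < convergence_criterion
```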
outlier_detection(
data,
formula,
ref_dist = c("normal"),
sign_level,
initial_est = c("robustified", "saturated", "user", "iis"),
user_model = NULL,
iterations = 1,
convergence_criterion = NULL,
max_iter = NULL,
shuffle = FALSE,
shuffle_seed = NULL,
split = 0.5,
verbose = FALSE,
iis_args = NULL
)
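A call might look as follows. This is a sketch under assumptions: it presumes the function is exported by an installed package called robust2sls (suggested by the class name of the return value), and it uses the built-in mtcars data with an arbitrary choice of regressors and instrument purely for illustration.

```r
# Illustrative formula: cyl is treated as exogenous, disp as endogenous,
# and wt as the outside instrument (an arbitrary choice for this sketch).
example_formula <- mpg ~ cyl + disp | cyl + wt

# Guarded so that the snippet is harmless if the package is not installed.
if (requireNamespace("robust2sls", quietly = TRUE)) {
  fit <- robust2sls::outlier_detection(
    data = mtcars,
    formula = example_formula,
    ref_dist = "normal",
    sign_level = 0.05,
    initial_est = "robustified",
    iterations = 3
  )
  class(fit)
}
```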
outlier_detection
returns an object of class
"robust2sls"
, which is a list with the following components:
$cons
A list which stores high-level information about the
function call and some results. $call
is the captured function call,
$formula
the formula argument, $data
the original data set,
$reference
the chosen reference distribution to classify outliers,
$sign_level
the significance level, $psi
the probability that
an observation is not classified as an outlier under the null hypothesis
of no outliers, $cutoff
the cutoff: observations are classified as outliers if
their standardised residuals are larger than that value, $bias_corr
a bias correction factor to account for potential false positives
(observations classified as outliers even though they are not). There are
three further elements that are lists themselves.
$initial
stores settings about the initial estimator:
$estimator
is the type of the initial estimator (e.g. robustified or
saturated), $split
how the sample is split (NULL
if argument
not used), $shuffle
whether the sample is shuffled before splitting
(NULL
if argument not used), $shuffle_seed
the value of the
random seed (NULL
if argument not used).
$convergence
stores information about the convergence of the
outlier-detection algorithm:
$criterion
is the user-specified convergence criterion (NULL
if argument not used), $difference
is the L2 norm of the difference between the last
coefficient estimates and the previous ones (NULL
if argument not
used or only initial estimator calculated). $converged
is a logical
value indicating whether the algorithm has converged, i.e. whether the
difference is smaller than the convergence criterion (NULL
if
argument not used). $max_iter
is the maximum number of iterations set by the
user (NULL
if argument not used or not set).
$iterations
contains information about the user-specified iterations
argument ($setting
) and the actual number of iterations that were
done ($actual
). The actual number can be lower if the algorithm
had already converged before the user-specified number of iterations was
reached.
$model
A list storing the model objects of class
ivreg for each iteration. Each model is stored under
$m0
, $m1
, ...
$res
A list storing the residuals of all observations for
each iteration. Residuals of observations where any of the y, x, or z
variables used in the 2SLS model are missing are set to NA. Each vector is
stored under $m0
, $m1
, ...
$stdres
A list storing the standardised residuals of all
observations for each iteration. Standardised residuals of observations
where any of the y, x, or z variables used in the 2SLS model are missing
are set to NA. Standardisation is done by dividing by sigma, which is not
adjusted for degrees of freedom. Each vector is stored under $m0
,
$m1
, ...
$sel
A list of logical vectors storing whether an observation
is included in the estimation or not. Observations are excluded (FALSE) if
they either have missing values in any of the x, y, or z variables needed
in the model or when they are classified as outliers based on the model.
Each vector is stored under $m0
, $m1
, ...
$type
A list of integer vectors coding each observation as -1 (missing
values in any of x, y, or z), 0 (classified as an outlier), or 1 (not
classified as an outlier). Each vector is stored under $m0, $m1, ...
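How $psi, $cutoff, $type and $sel relate to one another can be sketched in base R for the normal reference distribution. This is an illustration under stated assumptions, not the package's code: it assumes a two-sided cutoff that is compared with the absolute standardised residual, and the residuals are made up.

```r
sign_level <- 0.05
psi <- 1 - sign_level                       # P(not classified as outlier)
cutoff <- stats::qnorm(1 - sign_level / 2)  # assumption: two-sided cutoff

# Made-up standardised residuals; NA marks a missing y, x, or z value.
stdres <- c(0.3, -2.5, NA, 1.1, 3.2)

# -1 = missing, 0 = outlier, 1 = not an outlier.
# Assumption: the comparison uses the absolute standardised residual.
type <- ifelse(is.na(stdres), -1L,
               ifelse(abs(stdres) > cutoff, 0L, 1L))

# Only non-missing, non-outlying observations enter the next estimation.
sel <- type == 1L
```

With these inputs the second and fifth observations exceed the cutoff and the third is missing, so only the first and fourth would be kept.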
data
A data frame.
formula
A formula for the ivreg function, i.e. in the format
y ~ x1 + x2 | x1 + z2, where y is the dependent variable, x1 are the
exogenous regressors, x2 the endogenous regressors, and z2 the outside
instruments.
ref_dist
A character vector that specifies the reference distribution
against which observations are classified as outliers. "normal" refers
to the normal distribution.
sign_level
A numeric value between 0 and 1 that determines the cutoff in the
reference distribution against which observations are judged as outliers
or not.
initial_est
A character vector that specifies the initial estimator
for the outlier detection algorithm. "robustified" means that the
full sample 2SLS is used as initial estimator. "saturated" splits
the sample into two parts and estimates a 2SLS on each subsample. The
coefficients of one subsample are used to calculate residuals and determine
outliers in the other subsample. "user" allows the user to specify a
model based on which observations are classified as outliers. "iis"
applies impulse indicator saturation (IIS) as implemented in ivisat.
See section "Warning" for more information and conditions.
user_model
A model object of class ivreg. Only required if argument initial_est
is set to "user", otherwise NULL.
iterations
Either an integer >= 0 that specifies how often the outlier
detection algorithm is iterated, or the character vector "convergence".
In the former case, the value 0 means that only outlier classification
based on the initial estimator is done. In the latter, the algorithm is
iterated until it converges, i.e. until the difference in coefficient
estimates between the most recent model and the previous one is smaller
than some user-specified convergence criterion.
convergence_criterion
A numeric value or NULL. The algorithm stops as
soon as the difference in coefficient estimates between the most recent model
and the previous one is smaller than convergence_criterion. The
difference is measured by the L2 norm. If the argument is set to a numeric
value and iterations is an integer > 0, then the algorithm stops either
when it has converged or when iterations is reached.
max_iter
A numeric value >= 1 or NULL. If iterations = "convergence" is chosen,
then the algorithm is stopped after at most max_iter iterations. If a
convergence_criterion is also chosen, then the algorithm stops when either
the criterion is fulfilled or the maximum number of iterations is reached.
shuffle
A logical value or NULL. Only used if initial_est == "saturated".
If TRUE, then the sample is shuffled before creating the subsamples.
shuffle_seed
An integer value that will set the seed for shuffling the
sample, or NULL. Only used if initial_est == "saturated" and
shuffle == TRUE.
split
A numeric value strictly between 0 and 1 that determines in which
proportions the sample will be split.
verbose
A logical value indicating whether progress during estimation should be
reported.
iis_args
A list with named entries corresponding to the arguments for
iis_init (t.pval, do.pet, normality.JarqueB, turbo, overid, weak).
Can be NULL if initial_est != "iis".
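The interplay of the shuffle, shuffle_seed and split arguments for the "saturated" initial estimator can be sketched in base R. This is a rough illustration only: the sample size, the seed and in particular the rounding rule for the subsample size are assumptions and may differ from what the package does.

```r
n <- 10             # made-up sample size
split <- 0.5        # proportion of observations in the first subsample
shuffle <- TRUE
shuffle_seed <- 42  # illustrative seed

idx <- seq_len(n)
if (shuffle) {
  set.seed(shuffle_seed)  # makes the shuffling reproducible
  idx <- sample(idx)      # random permutation of the row indices
}

n1 <- floor(n * split)    # assumption: rounding rule chosen for illustration
sub1 <- idx[seq_len(n) <= n1]  # first subsample
sub2 <- idx[seq_len(n) > n1]   # second subsample

# Coefficients estimated on sub1 would then be used to classify outliers
# in sub2, and vice versa.
```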
Warning
Check Jiao (2019) (as well as a forthcoming working paper) for the
conditions that the initial estimator should satisfy when using
initial_est == "user" (e.g. they have to be Op(1)).
IIS is a generalisation of Saturated 2SLS with multiple block search,
but no asymptotic theory exists for IIS.