detect_outliers: Detects unreliable outliers in univariate time series data based on model-based clustering

Description

This function applies finite mixture modelling to compute the probability that each observation in a univariate time series is an outlier. Using the Mclust function of the mclust package, the data are assigned to G clusters, one of which is modelled as an outlier cluster. The clustering is based on features that are constructed to distinguish normal from outlying observations. In addition to the outlier probability of each observation, the function can report the specific cause in terms of the responsible feature or feature combination.

Usage

detect_outliers(
  data,
  S,
  proba = 0.5,
  share = NULL,
  repetitions = 10,
  decomp = T,
  PComp = F,
  detection.parameter = 1,
  out.par = 2,
  max.cluster = 9,
  G = NULL,
  modelName = "VVV",
  feat.inf = F,
  ext.val = 1,
  ...
)

Arguments

data

a one-dimensional matrix or data frame without missing data; each row is an observation.

S

a vector with a numeric value for each seasonality present in data.

proba

denotes the probability threshold at or above which an observation is considered outlying. Ranges from 0 to 1; by default set to 0.5. Lowering proba increases the number of detected outliers.

share

controls the size of the subsample used for estimation, as a fraction of the data (ranging from 0 to 1). By default set to pmin(2*round(length(data)^(sqrt(2)/2)), length(data))/length(data). Together with the repetitions parameter, it controls the robustness and computational time of the method.
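To get a feeling for the default, the formula above can be evaluated for a given series length. The following is a minimal sketch; the helper name default_share is hypothetical and not part of the package.

```r
# Hypothetical helper (not exported by the package): evaluates the default
# 'share' formula quoted above for a series of length n.
default_share <- function(n) {
  pmin(2 * round(n^(sqrt(2) / 2)), n) / n
}

n <- 3001                      # length of the series used in the Examples
default_share(n)               # fraction of the data used per repetition
round(default_share(n) * n)    # corresponding subsample size
```

For short series the pmin() term caps the subsample at the full data, so the share becomes 1.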

repetitions

denotes the number of times the clustering is repeated. By default set to 10. Controls the robustness and computational time of the method.

decomp

logical value indicating whether a seasonal decomposition of the original time series is performed as a preprocessing step before feature modelling. By default set to TRUE.

PComp

allows to use the principal components of the modelled feature matrix. By default set to FALSE.

detection.parameter

denotes a parameter that regulates the detection sensitivity. By default set to 1. It is assumed that the outlier cluster follows a (multivariate) Gaussian distribution parameterized by the sample mean and an inflated sample covariance matrix of the feature space. The covariance matrix is inflated by detection.parameter * (2 * log(length(data)))^2. With increasing values, only the more extreme outliers are detected.
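The size of the inflation factor can be checked directly. The sketch below assumes that "inflated by" means the sample covariance is multiplied by the factor; the helper name inflation_factor and the toy feature matrix are hypothetical.

```r
# Hypothetical helper: the inflation factor stated above.
inflation_factor <- function(n, detection.parameter = 1) {
  detection.parameter * (2 * log(n))^2
}

set.seed(1)
feat <- matrix(rnorm(200), ncol = 2)   # toy feature matrix, 100 observations
n <- nrow(feat)

# Assumption: the outlier cluster uses the sample mean and the sample
# covariance multiplied by the factor.
outlier.mean <- colMeans(feat)
outlier.cov  <- cov(feat) * inflation_factor(n, detection.parameter = 1)

inflation_factor(n)
```

Because the factor grows with log(n) squared, the assumed outlier cluster becomes more diffuse for longer series.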

out.par

controls the number of artificially produced outliers that allow the outlier cluster to form. By default out.par is set to 2. Increasing it corresponds to assuming a larger share of outliers in the data. A priori it is assumed that out.par * ceiling(sqrt(nrow(data.original))) observations are outlying.
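As a quick check of the a priori assumption, the expression above can be evaluated for a given number of observations. The helper name assumed_outliers is hypothetical.

```r
# Hypothetical helper: number of observations assumed a priori to be outlying,
# following the expression given above.
assumed_outliers <- function(n, out.par = 2) {
  out.par * ceiling(sqrt(n))
}

assumed_outliers(3001)               # 110 for the series used in the Examples
assumed_outliers(3001, out.par = 4)  # doubling out.par doubles the count
```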

max.cluster

a single numeric value controlling the maximum number of allowed clusters. By default set to 9.

G

denotes the number of clusters, limited by the max.cluster parameter. By default G is set to NULL, in which case the optimal number is determined automatically based on the BIC.

modelName

denotes the geometric features of the covariance matrix, e.g. "EII", "VII", "EEI", "EVI", "VEI", "VVI", etc. By default modelName is set to "VVV". The help file for mclustModelNames describes the available models. The choice of modelName influences both the fit to the data and the computational time.
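To illustrate what G and the covariance model mean on the mclust side, here is a minimal sketch of a plain Mclust fit on toy data. It is not the internal call made by detect_outliers; the toy matrix and the candidate range 1:3 are assumptions. Note that Mclust itself names the argument modelNames.

```r
# Sketch only: a direct mclust fit on toy two-dimensional data.
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)  # attach the package so Mclust() can find mclustBIC()
  set.seed(1)
  toy <- rbind(matrix(rnorm(200), ncol = 2),             # bulk of the data
               matrix(rnorm(20, mean = 6), ncol = 2))    # small distant group
  # Candidate numbers of clusters 1 to 3; the best one is selected by BIC.
  fit <- Mclust(toy, G = 1:3, modelNames = "VVV", verbose = FALSE)
  fit$G          # number of clusters selected by BIC
  head(fit$z)    # per-observation cluster membership probabilities
}
```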

feat.inf

logical value indicating whether influential features/ feature combinations should be computed. By default set to FALSE.

ext.val

denotes the number of observations on each side of an identified outlier that should also be treated as outlying data. By default set to 1.
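The effect of ext.val can be pictured as widening each raw outlier position by a symmetric window. The following sketch is one way to express this; the helper name extend_positions is hypothetical and the clipping at the series boundaries is an assumption.

```r
# Hypothetical helper: extend raw outlier positions by ext.val neighbours on
# each side and keep only positions inside the series of length n.
extend_positions <- function(pos.raw, ext.val = 1, n) {
  offsets <- seq(-ext.val, ext.val)
  pos <- unique(sort(as.vector(outer(pos.raw, offsets, "+"))))
  pos[pos >= 1 & pos <= n]
}

extend_positions(c(5, 20), ext.val = 1, n = 20)
# returns 4 5 6 19 20
```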

...

additional arguments for the Mclust function.

Value

a list containing the following elements:

data

numeric vector containing the original data.

outlier.pos

a vector indicating the position of each outlier and the corresponding neighbourhood controlled by ext.val.

outlier.pos.raw

a vector indicating the position of each outlier.

outlier.probs

a vector containing all probabilities for each observation being outlying data.

Repetitions

provides a list for each repetition containing the estimated model, the outlier cluster, the probabilities of each observation belonging to the estimated clusters, the outlier positions, the influence of each feature/ feature combination on the identified outlying data, the corresponding probabilities after shifting each considered outlier to the feature mean, and the subset of the extended feature matrix used for estimation (including artificially introduced outliers).

features

the feature matrix. Each column is a feature.

inf.feature.combinations

a list containing the features/ feature combinations that caused the assignment to the outlier cluster.

feature.inf.tab

a matrix containing all possible feature combinations.

an object of class "princomp" containing the principal component analysis of the feature matrix.

Details

Outliers are detected by model-based clustering with parameterized finite Gaussian mixture models. The clusters are estimated with the Mclust function: models are fitted by the EM algorithm, initialized by hierarchical model-based agglomerative clustering, and the optimal model is selected according to the BIC. The following features, derived from the input data, are used in the clustering process:

org.series: denotes the scaled and potentially decomposed original time series.
seasonality: denotes deterministic seasonalities based on S.
gradient: denotes the sum of the two-sided gradient of org.series.
abs.gradient: denotes the sum of the absolute two-sided gradient of org.series.
rel.gradient: denotes the sum of the two-sided absolute gradient of org.series, signed according to the left-sided gradient, in relation to the rolling mean absolute deviation based on the most relevant seasonality in S.
abs.seas.grad: denotes the sum of the absolute two-sided seasonal gradient of org.series based on the seasonalities S.

In case PComp = TRUE, the features correspond to the principal components of the introduced feature space.
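The gradient features can be illustrated on a toy series. The sketch below is one possible reading of "two-sided gradient", namely the sum of the differences to the left and right neighbours; this interpretation, and the toy data, are assumptions and may differ from the package's internal definition.

```r
# Toy series with a single spike at position 4, scaled like org.series.
x <- as.numeric(scale(c(1, 2, 3, 10, 3, 2, 1)))

# Assumption: the left and right gradients are differences to the neighbours.
left  <- c(NA, diff(x))     # x[t] - x[t-1], undefined at the first point
right <- c(-diff(x), NA)    # x[t] - x[t+1], undefined at the last point

gradient     <- left + right            # signed two-sided gradient
abs.gradient <- abs(left) + abs(right)  # absolute two-sided gradient

which.max(abs.gradient)     # position of the spike
```

On a smooth trend the signed version largely cancels, while an isolated spike produces a large value in both features.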

Examples

# NOT RUN {
set.seed(1)
id <- 14000:17000
# Replace missing values
modelmd <- model_missing_data(data = GBload[id, -1], tau = 0.5,
                              S = c(48, 336), indices.to.fix = seq_len(nrow(GBload[id, ])),
                              consider.as.missing = 0, min.val = 0)
# Impute missing values
data.imputed <- impute_modelled_data(modelmd)

#Detect outliers
system.time(
  o.ident <- detect_outliers(data = data.imputed, S = c(48, 336))
)

# Plot of identified outliers in time series
outlier.vector <- rep(FALSE, length(data.imputed))
outlier.vector[o.ident$outlier.pos] <- TRUE
plot(data.imputed, type = "o", col = 1 + 1 * outlier.vector,
     pch = 1 + 18 * outlier.vector)

# Table of identified raw outliers and their probabilities of being outlying data
df <- data.frame(o.ident$outlier.pos.raw,
                 unlist(o.ident$outlier.probs)[o.ident$outlier.pos.raw])
colnames(df) <- c("Outlier position", "Probability of being outlying data")
df

# Plot of feature matrix
plot.ts(o.ident$features, type = "o",
        col = 1 + outlier.vector,
        pch = 1 + 1 * outlier.vector)

# Table of outliers and the corresponding features/ feature combinations
# that caused the assignment to the outlier cluster
# Detect outliers with feat.inf = TRUE
set.seed(1)
system.time(
  o.ident <- detect_outliers(data = data.imputed, S = c(48, 336), feat.inf = TRUE)
)
feature.imp <- unlist(lapply(o.ident$inf.feature.combinations,
                             function(x) paste(o.ident$feature.inf.tab[x], collapse = " | ")))

df <- data.frame(o.ident$outlier.pos.raw,
                 o.ident$outlier.probs[o.ident$outlier.pos.raw],
                 feature.imp[as.numeric(names(feature.imp)) %in% o.ident$outlier.pos.raw])
colnames(df) <- c("Outlier position", "Probability being outlying data", "Responsible features")
View(df)
# }