ph_anomaly: Detect anomalies.

Description

The ph_anomaly function detects and removes anomalies with an autoencoder (default) or an extended isolation forest. Because it is general purpose, it can be applied to a variety of data types. Hyperparameters (e.g., activation, hidden, input_dropout_ratio) can be supplied as lists or vectors through hyper_params (see parameter details) to search for the optimal hyperparameter combination. When method = "ae", the autoencoder with the lowest reconstruction error is selected as the best model.

Usage

ph_anomaly(
  df,
  ids_col,
  class_col,
  method = "ae",
  scale = FALSE,
  center = NULL,
  sd = NULL,
  max_mem_size = "15g",
  port = 54321,
  train_seed = 123,
  hyper_params = list(),
  search = "random",
  tune_length = 100
)

Value

A list containing the following components:

`df`	The data frame with anomalies removed.

`model`	The best model from the grid search used to detect anomalies.

`anom_score`	A data frame of predicted anomaly scores.

Arguments

df

A data.frame containing a column of ids, a column of classes, and an arbitrary number of predictors.

ids_col

A character value for the name of the ids column.

class_col

A character value for the name of the class column.

method

A character value for the anomaly detection method: "ae" (autoencoder; default) or "iso" (extended isolation forest).

scale

A logical value for whether to scale the data: FALSE (default). Scaling is recommended if method = "ae" and if the user intends to train other models.

center

Either a logical value or numeric-alike vector of length equal to the number of columns of data to scale in df, where ‘numeric-alike’ means that as.numeric(.) will be applied successfully if is.numeric(.) is not true: NULL (default). If scale = TRUE, this is set to TRUE and is used to subtract the mean.

sd

Either a logical value or a numeric-alike vector of length equal to the number of columns of data to scale in df: NULL (default). If scale = TRUE, this is set to TRUE and is used to divide by the standard deviation.
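As a rough illustration of what centering and scaling do, the sketch below applies base R's scale() to a small made-up data frame. It assumes that center and sd play the same roles here as the center and scale arguments of base::scale(); that correspondence is an assumption, not something this page confirms.

```r
# Toy predictors (hypothetical values, not taken from ph_crocs).
x <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

# Subtract each column's mean and divide by its standard deviation.
z <- scale(x, center = TRUE, scale = TRUE)

# After scaling, every column has mean 0 and standard deviation 1.
col_means <- colMeans(z)
col_sds <- apply(z, 2, sd)
```

Putting predictors on a common scale keeps variables with large numeric ranges from dominating the reconstruction error of an autoencoder.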

max_mem_size

A character value for the maximum memory allocated to the H2O session: "15g" (default).

port

A numeric value for the port number of the H2O server: 54321 (default).

train_seed

A numeric value to control the randomness of creating resamples: 123 (default).

hyper_params

A list of hyperparameters to perform a grid search.

If method = "ae", the "default" list is: list(missing_values_handling = "Skip", activation = c("Rectifier", "Tanh"), hidden = list(5, 25, 50, 100, 250, 500, nrow(df_h2o)), input_dropout_ratio = c(0, 0.1, 0.2, 0.3), rate = c(0, 0.01, 0.005, 0.001))
If method = "iso", the "default" list is: list(ntrees = c(50, 100, 150, 200), sample_size = c(64, 128, 256, 512))
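A smaller search space can be supplied by passing a custom list. The sketch below builds a reduced list for method = "iso" using the same names as the default list above and counts the combinations it implies. The run_h2o flag is a hypothetical guard added only for illustration, so that no H2O session starts unless it is switched on; the call itself assumes the package providing ph_anomaly and ph_crocs is attached.

```r
# Reduced search space for the extended isolation forest
# (values chosen for illustration only).
iso_params <- list(ntrees = c(50, 100), sample_size = c(64, 128))

# Number of hyperparameter combinations implied by the list.
n_combos <- nrow(expand.grid(iso_params))

# Hypothetical guard: set to TRUE to start H2O and fit the models.
run_h2o <- FALSE
if (run_h2o) {
  data(ph_crocs)
  rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                        class_col = "Species", method = "iso",
                        hyper_params = iso_params)
  head(rm_outs$df)          # data frame with anomalies removed
  head(rm_outs$anom_score)  # predicted anomaly scores
}
```

Counting the combinations before fitting gives a sense of how long the search may take, since each combination corresponds to one candidate model.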

search

A character value for the hyperparameter search strategy: "random" (default), "grid".

tune_length

A numeric value (integer) for either the maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid"): 100 (default).
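Conceptually, a random search tries at most tune_length combinations drawn from the full set of possibilities. The base R sketch below mimics that idea with the default "iso" list; it is only an illustration of the cap and is not how H2O implements its search internally. The tune_length value of 5 is chosen arbitrarily for the example.

```r
# Default search space for method = "iso", as listed under hyper_params.
iso_params <- list(ntrees = c(50, 100, 150, 200),
                   sample_size = c(64, 128, 256, 512))
full_grid <- expand.grid(iso_params)  # every possible combination

# Cap on the number of combinations to try (illustrative value).
tune_length <- 5

# Draw combinations without replacement, never more than the grid holds.
set.seed(123)
n_draw <- min(tune_length, nrow(full_grid))
sampled <- full_grid[sample(nrow(full_grid), n_draw), ]
```

When tune_length exceeds the number of available combinations, as with the default of 100 and the 16 combinations here, the cap has no effect in this sketch and every combination is drawn.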

Examples

## Import data.
data(ph_crocs)
# \donttest{
## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Alternatively, remove anomalies with extended isolation forest. Notice
## that port is defined, because running H2O sessions one after another
## can return connection errors.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "iso",
                      port = 50001)
# }
