Function used to compute scores of nominal outlyingness for datasets consisting of nominal features. The computation uses the score of costa_novel_2025, defined as follows for an observation \(\boldsymbol{x}_i\): $$s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq \boldsymbol{x}_{i}: \\ \text{supp}(d) \notin (\sigma_d, n], \\ \lvert d \rvert \leq \mathrm{MAXLEN}}} \frac{\sigma_d}{\text{supp}(d) \times \lvert d \rvert^r}, \quad r> 0, \ i=1,\dots,n,$$ for highly infrequent itemsets, and: $$s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq \boldsymbol{x}_{i}: \\ \text{supp}(d) \notin [0, \sigma_d), \\ \lvert d \rvert \leq \mathrm{MAXLEN}}} \frac{\text{supp}(d)}{\sigma_d \times \left( \mathrm{MAXLEN} - \lvert d \rvert + 1 \right)^r}, \quad r> 0, \ i=1,\dots,n,$$ for highly frequent itemsets. In the above, \(\text{supp}(d)\) is the support of itemset \(d\), \(\sigma_d\) is the maximum support threshold (infrequent case) or the minimum support threshold (frequent case) for \(d\), \(\mathrm{MAXLEN}\) is the maximum length of itemsets considered, and \(r\) is an exponent chosen by the user.
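To make the two formulas concrete, the following is a minimal sketch of how the score of a single observation could be evaluated once the itemsets it contains have been enumerated. The helpers `score_infrequent` and `score_frequent` are hypothetical names used for illustration only and are not part of the package; the supports, thresholds and lengths are made-up toy values, and the package's actual implementation may differ.

```r
# Hypothetical helpers (not part of the package): evaluate the score of ONE
# observation, given for each itemset d contained in it:
#   supp  - support of d in the data
#   sigma - support threshold for d
#   len   - number of items in d
score_infrequent <- function(supp, sigma, len, r = 2, maxlen = max(len)) {
  stopifnot(r > 0)
  # supp(d) not in (sigma_d, n]  <=>  supp(d) <= sigma_d
  keep <- supp <= sigma & len <= maxlen
  sum(sigma[keep] / (supp[keep] * len[keep]^r))
}

score_frequent <- function(supp, sigma, len, r = 2, maxlen = max(len)) {
  stopifnot(r > 0)
  # supp(d) not in [0, sigma_d)  <=>  supp(d) >= sigma_d
  keep <- supp >= sigma & len <= maxlen
  sum(supp[keep] / (sigma[keep] * (maxlen - len[keep] + 1)^r))
}

# Toy values (assumed): two itemsets of length 1 and one of length 2
supp <- c(2, 40, 1)
len  <- c(1, 1, 2)

# Infrequent case: itemsets 1 and 3 fall at or below their maximum thresholds
s_infreq <- score_infrequent(supp, sigma = c(5, 5, 3), len = len, r = 2)

# Frequent case: only itemset 2 reaches its minimum threshold
s_freq <- score_frequent(supp, sigma = c(30, 30, 20), len = len, r = 2, maxlen = 2)
```

In the infrequent case, longer itemsets are down-weighted by \(\lvert d \rvert^r\); in the frequent case the weighting is reversed, so shorter itemsets are down-weighted by \((\mathrm{MAXLEN} - \lvert d \rvert + 1)^r\).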
sono(
  data,
  probs,
  alpha = 0.01,
  r = 2,
  MAXLEN = 0,
  frequent = FALSE,
  verbose = TRUE
)
A list with 4 elements. The first element is the value of MAXLEN. The second element corresponds to a data frame with 2 columns; one for the observation numbers and one with the final score of outlyingness. The third and fourth elements are the matrix of variable contributions and the nominal outlyingness depths vector, respectively.
data: Dataset; must be of class data.frame and consist of factor variables only.
probs: List of probability vectors, one for each variable. Each element of the list must include as many probabilities as the number of levels of the corresponding variable in the dataset.
alpha: Significance level for the simultaneous Multinomial confidence intervals that determine the frequency thresholds for itemsets of different lengths, used for outlier detection for discrete features. Must be a positive real, at most equal to 0.50. A greater value leads to a much more conservative algorithm. Default value is 0.01.
r: Exponent term in the computation of scores. Must be a non-negative number. The greater its value, the less itemsets of greater length contribute to the overall score. It is suggested that this is not much larger than 3. Default value is 2.
MAXLEN: Maximum itemset length to be considered. The default value of 0 calculates MAXLEN according to a criterion on the sparsity caused by the total number of combinations that can be encountered as itemsets of greater length are taken into account. Otherwise, MAXLEN can take any value from 1 up to the total number of discrete variables included in the dataset. If the user-given MAXLEN is larger than the estimated value, MAXLEN will default to the latter and a warning message will be displayed, so that redundant computations are avoided.
frequent: Logical determining whether highly frequent or highly infrequent itemsets are considered as outliers. Defaults to FALSE, treating highly infrequent itemsets as outlying.
verbose: Logical determining whether progress messages are printed. Defaults to TRUE; set to FALSE to suppress them.
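The requirements on `data` and `probs` can be checked before calling the function. The sketch below is only an illustration under stated assumptions: the toy data frame is made up, uniform probabilities are used purely as an example, and it is assumed (not confirmed here) that each probability vector follows the order of the factor levels.

```r
# Toy data (assumed): a data.frame consisting of factor variables only
dt <- data.frame(
  V1 = factor(c("a", "b", "a", "b")),
  V2 = factor(c("x", "y", "z", "x"))
)

# Check the requirement on `data`: data.frame with factor columns only
stopifnot(is.data.frame(dt), all(vapply(dt, is.factor, logical(1))))

# Build one probability vector per variable; uniform over the levels here,
# but any vector of the right length summing to 1 could be supplied instead
probs <- unname(lapply(dt, function(v) rep(1 / nlevels(v), nlevels(v))))

# Check the requirement on `probs`: as many probabilities as levels,
# and each vector sums to 1
stopifnot(
  lengths(probs) == vapply(dt, nlevels, integer(1)),
  vapply(probs, function(p) isTRUE(all.equal(sum(p), 1)), logical(1))
)
```

The resulting `probs` is an unnamed list with one element per column of `dt`, the same shape as the list passed in the example.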
costa_novel_2025
# Simulate 100 observations of two nominal variables (2 and 3 levels)
set.seed(1)  # seed added for reproducibility; any value works
dt <- data.frame(
  V1 = factor(sample(1:2, 100, replace = TRUE, prob = c(0.5, 0.5))),
  V2 = factor(sample(1:3, 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
)

# Score the observations against equiprobable levels for both variables,
# treating highly infrequent itemsets as outlying
sono(data = dt,
     probs = list(c(0.5, 0.5), c(1/3, 1/3, 1/3)),
     alpha = 0.01,
     r = 2,
     MAXLEN = 0,
     frequent = FALSE)