CASMI.selectFeatures: CASMI Feature Selection

Description

Selects features that are associated with an outcome while taking into account a sample coverage penalty and feature redundancy. It automatically determines the number of features to be selected, and the chosen features are ranked. (Synonyms for "feature" in this document: "independent variable," "factor," and "predictor.")
For additional information, please refer to the publication: Shi, J., Zhang, J. and Ge, Y. (2019), "CASMI—An Entropic Feature Selection Method in Turing’s Perspective" <doi:10.3390/e21121179>

Usage

CASMI.selectFeatures(
  data,
  NA.handle = "stepwise",
  alpha = 0.05,
  alpha.ind = 0.1,
  intermediate.steps = FALSE,
  kappa.star.cap = 1,
  feature.num.cap = ncol(data)
)

Value

`CASMI.selectFeatures()` returns the following components:

`Outcome`: Name of the outcome variable (last column) in the input dataset.
`Conf.Level`: Confidence level used for the results.
`KappaStar`: The estimated `kappa*` of all selected features. A larger `kappa*` indicates that the selected features have a stronger association with the outcome.
`KappaStarCI`: The confidence interval of `kappa*` for all selected features.
`Results`: A results data frame in which the selected features are ranked. The remaining items below describe its columns.
`Var.Idx`: Column index of the selected feature.
`n`: Number of observations used in the analysis.
`cml.kappa*`: The estimated cumulative `kappa*` score when this particular feature was added to the list. That is, the `kappa*` score of all currently selected features.
`SMIz`: The Standardized Mutual Information (SMI) (using the z-estimator) between this particular feature and the outcome.
`SMIz.Low`: Lower bound of the confidence interval for `SMIz`.
`SMIz.Upr`: Upper bound of the confidence interval for `SMIz`.
`p.MIz`: P-value of the mutual information test of independence (based on the z-estimator) between this particular feature and the outcome.
`Var.Name`: Column name of the selected feature.

Arguments

data: data frame with variables as columns and observations as rows. The data MUST include at least one feature (a.k.a. independent variable, predictor, factor) and exactly one outcome variable (Y). The outcome variable MUST BE THE LAST COLUMN. Both the features and the outcome MUST be categorical or discrete. If variables are not naturally discrete, you may preprocess them using the `autoBin.binary()` function in the same package.
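
As a rough illustration of the layout described for `data`, the sketch below prepares a small data frame in base R so that every column is categorical and the outcome is the last column. The data frame `raw`, its column names, and the age cut points are hypothetical. Base R's `cut()` is used here only as a generic stand-in for discretization; it is not the package's `autoBin.binary()` function.

```r
# Hypothetical raw data: the outcome ("response") is not the last column,
# and one feature ("age") is continuous.
raw <- data.frame(
  response = c("yes", "no", "yes", "no", "yes", "no"),
  color    = c("red", "blue", "red", "green", "blue", "green"),
  age      = c(23, 35, 47, 51, 62, 29),
  stringsAsFactors = FALSE
)

# Discretize the continuous feature with base R's cut()
# (arbitrary, illustrative cut points).
raw$age <- cut(raw$age, breaks = c(0, 30, 50, Inf),
               labels = c("young", "middle", "older"))

# Move the outcome to the last column, keeping the feature order.
outcome  <- "response"
prepared <- raw[, c(setdiff(names(raw), outcome), outcome)]

# Store every column as a factor so that all variables are categorical.
prepared[] <- lapply(prepared, factor)
str(prepared)
```

A data frame shaped like `prepared` matches the layout that the `data` argument asks for.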
NA.handle: options for handling NA values in the data. There are three options: `NA.handle = "stepwise"` (default), `NA.handle = "na.omit"`, and `NA.handle = "NA as a category"`. (1) `NA.handle = "stepwise"` excludes NA rows only when a particular variable is being used in a sub-step. For example, suppose we have data (Feature1: A, NA, B; Feature2: C, D, E; Feature3: F, G, H; Outcome: O, P, Q); the second observation will be excluded only when a particular step includes Feature1, but will not be excluded when a step is analyzing only Feature2, Feature3, and the Outcome. This option is designed to take advantage of the maximum possible number of observations. (2) `NA.handle = "na.omit"` excludes observations with any NA values at the beginning of the analysis. (3) `NA.handle = "NA as a category"` treats the NA value as a new category. This is designed to be used when NA values in the data have a consistent meaning instead of being missing values. For example, in survey data asking for comments, each NA value might consistently mean "no opinion."
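
The difference between the three `NA.handle` options can be sketched in base R using the small example from the description above. This only mimics how many observations each option leaves available; it is not the package's internal implementation, and the helper `rows_used()` is a hypothetical name introduced for illustration.

```r
# The three-observation example from the text.
toy <- data.frame(
  Feature1 = c("A", NA, "B"),
  Feature2 = c("C", "D", "E"),
  Feature3 = c("F", "G", "H"),
  Outcome  = c("O", "P", "Q"),
  stringsAsFactors = FALSE
)

# "na.omit": rows with any NA are dropped once, before the analysis.
n_na_omit <- nrow(na.omit(toy))

# "stepwise": a row is dropped only if a variable used in the current
# step is NA. rows_used() is a hypothetical helper that counts usable rows.
rows_used <- function(df, vars) {
  sum(complete.cases(df[, vars, drop = FALSE]))
}
n_step_with_f1    <- rows_used(toy, c("Feature1", "Outcome"))
n_step_without_f1 <- rows_used(toy, c("Feature2", "Feature3", "Outcome"))

# "NA as a category": NA becomes its own level, so no row is dropped.
as_category <- toy
as_category$Feature1 <- addNA(factor(as_category$Feature1))

n_na_omit                     # 2 rows remain
n_step_with_f1                # 2 rows usable when Feature1 is involved
n_step_without_f1             # 3 rows usable when Feature1 is not involved
levels(as_category$Feature1)  # "A", "B", and NA as a level
```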
alpha: level of significance for the confidence intervals in final results. By default, `alpha = 0.05`.
alpha.ind: level of significance for the mutual information test of independence in step 1 of the feature selection (for an initial screening). The smaller the `alpha.ind`, the fewer features are sent to step 2 (<doi:10.3390/e21121179>). By default, `alpha.ind = 0.1`.
intermediate.steps: setting for outputting intermediate steps while awaiting the final results. There are two possible settings: `intermediate.steps = TRUE` or `intermediate.steps = FALSE`. By default, `intermediate.steps = FALSE`.
kappa.star.cap: a threshold of `kappa*` for halting the feature selection process. The program will automatically terminate at the first feature whose cumulative `kappa*` value exceeds the `kappa.star.cap` threshold. By default, `kappa.star.cap = 1.0`, which is the maximum possible value. A lower value may result in fewer final features but reduced computing time.
feature.num.cap: the maximum number of features to be selected. By default, `feature.num.cap = ncol(data)`. A lower value may result in fewer final features but less computing time.
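
A minimal sketch of a call that sets every argument explicitly is shown below. The toy data frame `toy` and the chosen argument values are illustrative assumptions, not recommendations. The call is guarded so that it runs only if the package is installed, which is assumed here to be named `CASMI`. The structure of the returned object is inspected with `str()` rather than assumed.

```r
# Hypothetical toy data: y depends on x1 and is independent of x2.
set.seed(1)
n  <- 150
x1 <- sample(c("A", "B", "C"), n, replace = TRUE)
x2 <- sample(c("U", "V"), n, replace = TRUE)
# y copies x1 about 80% of the time and is random otherwise.
y  <- ifelse(runif(n) < 0.8, x1, sample(c("A", "B", "C"), n, replace = TRUE))
toy <- data.frame(x1, x2, y)  # the outcome is the last column

# Run only if the package (assumed to be named "CASMI") is available.
if (requireNamespace("CASMI", quietly = TRUE)) {
  res <- CASMI::CASMI.selectFeatures(
    toy,
    NA.handle          = "na.omit",  # drop rows with any NA up front
    alpha              = 0.01,       # 99% confidence intervals
    alpha.ind          = 0.05,       # stricter initial screening
    intermediate.steps = TRUE,       # output progress while running
    kappa.star.cap     = 0.9,        # stop once cumulative kappa* exceeds 0.9
    feature.num.cap    = 1           # select at most one feature
  )
  str(res)  # inspect the returned components
}
```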

Examples


# ---- Generate a toy dataset for usage examples: "data" ----
set.seed(123)
n <- 200
x1 <- sample(c("A", "B", "C", "D"), size = n, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.4))
x2 <- sample(c("W", "X", "Y", "Z"), size = n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))
x3 <- sample(c("E", "F", "G", "H", "I"), size = n,
             replace = TRUE, prob = c(0.2, 0.3, 0.2, 0.2, 0.1))
x4 <- sample(c("A", "B", "C", "D"), size = n, replace = TRUE)
x5 <- sample(c("L", "M", "N"), size = n, replace = TRUE)
x6 <- sample(c("E", "F", "G", "H", "I"), size = n, replace = TRUE)

# Generate y variable dependent on x1 to x3
x1_num <- as.numeric(factor(x1, levels = c("A", "B", "C", "D")))
x2_num <- as.numeric(factor(x2, levels = c("W", "X", "Y", "Z")))
x3_num <- as.numeric(factor(x3, levels = c("E", "F", "G", "H", "I")))
# Calculate y with added noise
y_numeric <- 3 * x1_num + 2 * x2_num - 2 * x3_num + rnorm(n, mean = 0, sd = 2)
# Discretize y into categories
y <- cut(y_numeric, breaks = 10, labels = paste0("Category", 1:10))

# Combine into a dataframe
data <- data.frame(x1, x2, x3, x4, x5, x6, y)

# The outcome of the toy dataset is dependent on x1, x2, and x3
# but is independent of x4, x5, and x6.
head(data)


# ---- Usage Examples ----

## Select features and provide relevant results:
CASMI.selectFeatures(data)

## Adjust 'feature.num.cap' for including fewer features:
## (Note: A lower 'feature.num.cap' value may result in fewer
## final features but less computing time.)
CASMI.selectFeatures(data, feature.num.cap = 2)
