categorical.igate: igate function for categorical target variables

Description

This function performs an initial Guided Analysis for parameter testing and controlband extraction (iGATE) for a categorical target variable on a dataset and returns those parameters found to be influential.

Usage

categorical.igate(df, versus = 8, target, best.cat, worst.cat,
  test = "w", ssv = NULL, outlier_removal_ssv = TRUE)

Arguments

df

Data frame to be analysed.

versus

How many Best of the Best (BOB) and Worst of the Worst (WOW) observations do we collect? By default, we collect 8 of each.

target

Target variable to be analysed. Must be categorical. Use igate for a continuous target.

best.cat

The best category. The versus BOB observations are selected randomly from this category.

worst.cat

The worst category. The versus WOW observations are selected randomly from this category.

test

Statistical hypothesis test used to determine influential process parameters. Choose between the Wilcoxon rank test ("w", default) and Student's t-test ("t").

ssv

A vector of suspected sources of variation. These are the variables in df which we believe might have an influence on the target variable and will be tested. If no list of ssv is provided, the test will be performed on all numeric variables.

outlier_removal_ssv

Logical. Should outlier removal be performed for each ssv (default: TRUE)?
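
To illustrate the default for ssv described above, here is a minimal sketch of how the set of numeric candidate variables could be derived in base R. It assumes that "all numeric variables" means every numeric column of df other than the target; the name default_ssv is hypothetical and the package's own selection may differ in detail.

```r
# Assumption: the default candidates are all numeric columns except the target.
df <- mtcars
df$cyl <- as.factor(df$cyl)   # categorical target, as in the Examples section
target <- "cyl"

# Flag numeric columns, then drop the target from the candidate list
is_num <- vapply(df, is.numeric, logical(1))
default_ssv <- setdiff(names(df)[is_num], target)
print(default_ssv)
```

Passing an explicit ssv vector instead restricts the analysis to the variables we actually suspect.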

Value

A data frame with the following columns:

`Causes`	Those `ssv` that have been found to be influential to the `target` variable.
`Count`	The value returned by the counting method.
`p.value`	The p-value of the hypothesis test performed, i.e. either of the Wilcoxon rank test (in case `test = "w"`) or the t-test (if `test = "t"`).
`good_lower_bound`	The lower bound for this `Cause` for good quality.
`good_upper_bound`	The upper bound for this `Cause` for good quality.
`bad_lower_bound`	The lower bound for this `Cause` for bad quality.
`bad_upper_bound`	The upper bound for this `Cause` for bad quality.
`na_removed`	How many missing values were in the data set for this `Cause`?
`ties_best_cat`	How many observations fall into the best category?
`ties_worst_cat`	How many observations fall into the worst category?

Details

We collect the Best of the Best and the Worst of the Worst dynamically, depending on the current ssv. That means that for each ssv we first remove all observations with missing values for that ssv from df. Then, based on the remaining observations, we randomly select versus observations from the best category (“Best of the Best”, short BOB) and versus observations from the worst category (“Worst of the Worst”, short WOW). By default, we select 8 of each. Next, we compare BOB and WOW using the counting method and the specified hypothesis test. If the distributions of the ssv in BOB and WOW are significantly different, the current ssv has been identified as influential to the target variable. An ssv is considered influential if the counting method returns a count of at least 6 and/or the hypothesis test returns a p-value of less than 0.05.

For the next ssv we again start with the entire dataset df, remove all observations with missing values for that new ssv and then select our new BOB and WOW. In particular, for each ssv we might select different observations. This dynamic selection is necessary because, in case of an incomplete data set, selecting the same BOB and WOW for all ssv might leave us with many missing values for particular ssv. In that case the hypothesis test loses statistical power because it is applied to a smaller sample or, worse, might fail altogether if the sample size gets too small.
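
The per-ssv selection described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: select_bob_wow is a hypothetical helper, the counting method is not reproduced, and only the p-value criterion (Wilcoxon rank test, as for test = "w") is checked.

```r
set.seed(1)  # the selection is random; the seed is fixed only for reproducibility

df <- mtcars
df$cyl <- as.factor(df$cyl)

# Hypothetical helper (not part of igate): drop rows with a missing ssv,
# then draw `versus` observations from the best and from the worst category.
select_bob_wow <- function(df, ssv, target, best.cat, worst.cat, versus = 8) {
  complete <- df[!is.na(df[[ssv]]), , drop = FALSE]
  best  <- complete[[ssv]][complete[[target]] == best.cat]
  worst <- complete[[ssv]][complete[[target]] == worst.cat]
  stopifnot(length(best) >= versus, length(worst) >= versus)
  list(
    bob = best[sample.int(length(best), versus)],
    wow = worst[sample.int(length(worst), versus)]
  )
}

# A fresh BOB/WOW selection is made for every ssv
sketch <- do.call(rbind, lapply(c("mpg", "hp"), function(ssv) {
  sel <- select_bob_wow(df, ssv, "cyl", best.cat = "8", worst.cat = "4")
  # exact = FALSE uses the normal approximation, which tolerates tied values
  p <- wilcox.test(sel$bob, sel$wow, exact = FALSE)$p.value
  data.frame(ssv = ssv, p.value = p, influential = p < 0.05)
}))
print(sketch)
```

Sampling indices with sample.int rather than the values themselves sidesteps the special behaviour of sample() when it is handed a single number.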

For those ssv determined to be significant, control bands are extracted. The rationale is: if the value for an ssv is in the interval [good_lower_bound, good_upper_bound], the target is likely to be good. If it is in the interval [bad_lower_bound, bad_upper_bound], the target is likely to be bad.
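
As a minimal sketch of how the control bands might be applied to new measurements, the hypothetical helper classify_by_band below checks which closed interval a value falls into. The helper, its labels and the bounds in the usage line are made up for illustration and are not output of categorical.igate; treating values inside both or neither interval as "undetermined" is likewise an assumption.

```r
# Hypothetical helper (not part of igate): label values by control band.
classify_by_band <- function(x, good_lower, good_upper, bad_lower, bad_upper) {
  in_good <- x >= good_lower & x <= good_upper
  in_bad  <- x >= bad_lower  & x <= bad_upper
  ifelse(in_good & !in_bad, "likely good",
         ifelse(in_bad & !in_good, "likely bad", "undetermined"))
}

# Made-up bounds, for illustration only
band_labels <- classify_by_band(c(12, 25, 40),
                                good_lower = 10, good_upper = 20,
                                bad_lower = 22, bad_upper = 30)
print(band_labels)
```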

Furthermore, some summary statistics are provided: na_removed tells us how many observations have been removed for a particular ssv because of missing values. When selecting the versus BOB/WOW, the selection is done randomly from within the best/worst category, i.e. the versus BOB/WOW are not uniquely determined. The randomness in the selection is quantified by ties_best_cat and ties_worst_cat, which give the size of the best and worst category respectively.
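
The summary statistics can be approximated by hand as in the sketch below. The missing values are injected purely for illustration, and the sketch assumes that the category sizes are counted after the rows with a missing ssv have been removed; the text above does not state this explicitly, so the package may count differently.

```r
df <- mtcars
df$cyl <- as.factor(df$cyl)
df$hp[c(1, 5)] <- NA          # inject two missing values for illustration

ssv <- "hp"

# Number of observations dropped because the ssv is missing
na_removed <- sum(is.na(df[[ssv]]))

# Assumption: category sizes are counted on the remaining observations
complete <- df[!is.na(df[[ssv]]), , drop = FALSE]
ties_best_cat  <- sum(complete$cyl == "8")
ties_worst_cat <- sum(complete$cyl == "4")

print(c(na_removed = na_removed,
        ties_best_cat = ties_best_cat,
        ties_worst_cat = ties_worst_cat))
```

The larger these category sizes are relative to versus, the more the selected BOB and WOW can differ between runs.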

Examples

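# NOT RUN {
library(igate)

# Treat the number of cylinders as a categorical target
df <- mtcars
df$cyl <- as.factor(df$cyl)

# Use 8-cylinder cars as the best and 4-cylinder cars as the worst category
categorical.igate(df, target = "cyl", best.cat = "8", worst.cat = "4")
# }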
