igate: igate function for continuous target variables

Description

This function performs an initial Guided Analysis for parameter testing and controlband extraction (iGATE) on a dataset and returns those parameters found to be influential.

Usage

igate(df, versus = 8, target, test = "w", ssv = NULL,
  outlier_removal_target = TRUE, outlier_removal_ssv = TRUE,
  good_end = "low", savePlots = FALSE, image_directory = tempdir())

Arguments

df

Data frame to be analysed.

versus

How many Best of the Best and Worst of the Worst do we collect? By default, we will collect 8 of each.

target

Target variable to be analysed. Must be continuous. Use categorical.igate for a categorical target.

test

Statistical hypothesis test to be used to determine influential process parameters. Choose between Wilcoxon Rank test ("w", default) and Student's t-test ("t").

ssv

A vector of suspected sources of variation. These are the variables in df which we believe might have an influence on the target variable and will be tested. If no list of ssv is provided, the test will be performed on all numeric variables.

outlier_removal_target

Logical. Should outliers (with respect to the target variable) be removed from df (default: TRUE)? Important: This only makes sense if no prior outlier removal has been performed on df, i.e. df still contains all the data. Otherwise the outlier threshold will be calculated incorrectly.

outlier_removal_ssv

Logical. Should outlier removal be performed for each ssv (default: TRUE)?

good_end

Are low (default) or high values of target variable good? This is needed to determine the control bands.

savePlots

Logical, only relevant if outlier_removal_target is TRUE. If savePlots == FALSE (the default) the boxplot of the target variable will be output to the standard output device for plots, usually the console. If TRUE, the boxplot will additionally be saved to image_directory as a png file.

image_directory

Directory to which plots should be saved. This is only used if savePlots = TRUE and defaults to the temporary directory of the current R session, i.e. tempdir(). To save plots to the current working directory set savePlots = TRUE and image_directory = getwd().

Value

A data frame with the following columns

`Causes`	Those ssv that have been found to be influential to the target variable.
`Count`	The value returned by the counting method.
`p.value`	The p-value of the hypothesis test performed, i.e. either of the Wilcoxon rank test (in case `test = "w"`) or the t-test (if `test = "t"`).
`good_lower_bound`	The lower bound for this `Cause` for good quality.
`good_upper_bound`	The upper bound for this `Cause` for good quality.
`bad_lower_bound`	The lower bound for this `Cause` for bad quality.
`bad_upper_bound`	The upper bound for this `Cause` for bad quality.
`na_removed`	How many missing values were in the data set for this `Cause`?
`ties_lower_end`	Number of tied observations at lower end of `target` when selecting the `versus` BOB/WOW.
`competition_lower_end`	For how many positions are the `ties_lower_end` tied observations competing?
`ties_upper_end`	Number of tied observations at upper end of `target` when selecting the `versus` BOB/WOW.
`competition_upper_end`	For how many positions are the `ties_upper_end` tied observations competing?

Details

We collect the Best of the Best and the Worst of the Worst dynamically, depending on the current ssv. That means, for each ssv we first remove all the observations with missing values for that ssv from df. Then, based on the remaining observations, we select versus observations with the best values for the target variable ("Best of the Best", short BOB) and versus observations with the worst values for the target variable ("Worst of the Worst", short WOW). By default, we select 8 of each. Next, we compare BOB and WOW using the counting method and the specified hypothesis test. If the distributions of the ssv in BOB and WOW are significantly different, the current ssv has been identified as influential to the target variable. An ssv is considered influential if the counting method returns a count of at least 6 and/or the hypothesis test returns a p-value of less than 0.05.

For the next ssv we again start with the entire dataset df, remove all the observations with missing values for that new ssv and then select our new BOB and WOW. In particular, for each ssv we might select different observations. This dynamic selection is necessary because, for an incomplete data set, selecting the same BOB and WOW for all the ssv could leave many missing values for particular ssv. In that case the hypothesis test loses statistical power, because it is used on a smaller sample, or worse, might fail altogether if the sample size gets too small.
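The per-ssv selection described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: it uses base R's wilcox.test as a stand-in for the hypothesis test, ignores the counting method and outlier removal, and breaks ties by row order instead of randomly. The variable names (complete, sorted, bob, wow) are made up for this sketch.

```r
# Illustrative sketch only, not the igate implementation.
df     <- iris
target <- "Sepal.Length"
ssv    <- "Petal.Length"
versus <- 8

# Step 1: drop observations with a missing value for the current ssv
complete <- df[!is.na(df[[ssv]]), ]

# Step 2: sort by target; low values are treated as good (good_end = "low")
sorted <- complete[order(complete[[target]]), ]
bob <- head(sorted, versus)  # Best of the Best
wow <- tail(sorted, versus)  # Worst of the Worst

# Step 3: compare the ssv distributions in the two groups.
# exact = FALSE requests the normal approximation, since the ssv has ties.
res <- wilcox.test(bob[[ssv]], wow[[ssv]], exact = FALSE)

# Step 4: apply the p-value criterion from the text (count criterion omitted)
influential <- res$p.value < 0.05
```

Repeating these steps for every ssv, each time starting again from the full df, gives the dynamic selection: the rows in bob and wow may differ from one ssv to the next.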

For those ssv determined to be significant, control bands are extracted. The rationale is: If the value for an ssv is in the interval [good_lower_bound,good_upper_bound] the target is likely to be good. If it is in the interval [bad_lower_bound,bad_upper_bound], the target is likely to be bad.
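As a rough illustration of how the control bands might be read, the sketch below checks a new ssv value against one result row. The bounds in bands are invented for the example and are not output of igate; classify_value is a hypothetical helper, not part of the package.

```r
# Hypothetical result row; the numbers are made up for illustration.
bands <- data.frame(
  Causes           = "Petal.Length",
  good_lower_bound = 1.0,
  good_upper_bound = 1.5,
  bad_lower_bound  = 6.1,
  bad_upper_bound  = 6.9
)

# Hypothetical helper: report which control band (if any) contains x
classify_value <- function(x, b) {
  if (x >= b$good_lower_bound && x <= b$good_upper_bound) {
    "likely good"
  } else if (x >= b$bad_lower_bound && x <= b$bad_upper_bound) {
    "likely bad"
  } else {
    "outside both bands"
  }
}

classify_value(1.2, bands)
classify_value(6.5, bands)
classify_value(3.0, bands)
```

Values that fall outside both bands carry no statement about the target under this reading.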

Furthermore, some summary statistics are provided: When selecting the versus BOB/WOW, tied values for target can mean that the versus BOB/WOW are not uniquely determined. In that case we randomly select from the tied observations to give us exactly versus observations per group. ties_lower_end, competition_lower_end, ties_upper_end and competition_upper_end quantify this randomness. How to interpret these values: lower end refers to the group whose target values are low and upper end to the one whose target values are high. For example, if a low value for target is good, lower end refers to the BOB and upper end to the WOW. We determine the versus BOB/WOW at the lower end via

lower_end <- df[min_rank(df$target)<=versus,]

If there are tied observations, nrow(lower_end) can be larger than versus. In ties_lower_end we record how many observations in lower_end$target have the highest value and in competition_lower_end we record for how many places they are competing, i.e. competing_for_lower <- versus - (nrow(lower_end) - ties_lower_end). The values for ties_upper_end and competition_upper_end are determined analogously.
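A small worked example of the tie bookkeeping for the lower end, using made-up data. It is a sketch under one assumption: base R's rank(ties.method = "min") is used in place of min_rank, which gives the same ranks when there are no missing values.

```r
# Toy data: three observations share the target value 3
df <- data.frame(target = c(1, 2, 3, 3, 3, 4, 5, 6, 7, 8))
versus <- 4

# Minimum ranks: tied observations all receive the smallest rank in their group
r <- rank(df$target, ties.method = "min")
lower_end <- df[r <= versus, , drop = FALSE]

# How many observations in lower_end share the highest target value?
ties_lower_end <- sum(lower_end$target == max(lower_end$target))

# For how many of the versus places are those tied observations competing?
competition_lower_end <- versus - (nrow(lower_end) - ties_lower_end)
```

Here lower_end has 5 rows (target values 1, 2, 3, 3, 3), so ties_lower_end is 3 and competition_lower_end is 4 - (5 - 3) = 2: three tied observations compete for the two remaining places.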

Examples


# NOT RUN {
igate(iris, target = "Sepal.Length")

# }
