DepSearch: Test pairwise variable independence

Description

This high-level function accepts a data set, stop criteria, and split functions for continuous variables. It then applies a chi-square test for independence to bins that are either generated by recursively binning the ranks of continuous variables or implied by the combinations of levels of categorical variables.

Usage

DepSearch(
  data,
  stopCriteria,
  catCon = uniRIntSplit,
  conCon = rIntSplit,
  ptype = c("simple", "conservative", "gamma", "fitted"),
  dropPoints = FALSE
)

Value

A `DepSearch` object, with slots `data`, `types`, `pairs`, `binnings`, `residuals`, `statistics`, `K`, `logps`, and `pvalues`, that stores the results of using recursive binning with the specified splitting logic to test independence on a data set. `data` gives the name of the data object in the global environment which was split, `types` is a character vector giving the data types of each pair, `pairs` is a character vector of the variable names of each pair, `binnings` is a list of lists where each list is the binning fit to the corresponding pair by the recursive binning algorithm, `residuals` is a list of numeric vectors giving the residual for each bin of each pairwise binning, `statistics` is a numeric vector giving the chi-squared statistic for each binning, `K` is a numeric vector giving the number of bins in each binning, `logps` gives the natural logarithm of each statistic's p-value, and finally `pvalues` is a numeric vector of p-values for `statistics` based on the specified p-value computation, which defaults to 'simple'. Internally, the p-values are computed on the log scale to better distinguish between strongly dependent pairs, and the `pvalues` returned are computed by calling `exp(logps)`. All returned values are ordered by increasing `logps`.
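
To illustrate why working on the log scale helps, here is a minimal base R sketch. It does not call the package; the statistic values and degrees of freedom are made up for illustration. Two very large chi-square statistics both underflow to a p-value of zero on the natural scale, yet remain distinct and sortable on the log scale.

```r
## Illustrative values only (assumptions, not output of DepSearch)
stats <- c(pairA = 1600, pairB = 2000)
df <- 10

## natural-scale p-values underflow to exactly zero for both pairs
pvals <- pchisq(stats, df = df, lower.tail = FALSE)

## log-scale p-values stay finite and can still be ranked
logps <- pchisq(stats, df = df, lower.tail = FALSE, log.p = TRUE)

## order by increasing log p-value, the ordering described above
ordered <- names(stats)[order(logps)]
```

Here `exp(logps)` reproduces the underflowed `pvals`, which is why a ranking of strongly dependent pairs is better read from `logps` than from `pvalues`.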

Arguments

data: `data.frame` or object coercible to a `data.frame`
stopCriteria: output of `makeCriteria` providing criteria used to stop binning to be passed to binning functions
catCon: splitting function to apply to pairs of one categorical and one continuous variable
conCon: splitting function to apply to pairs of continuous variables
ptype: one of 'simple', 'conservative', 'gamma', or 'fitted': the type of p-values to compute for continuous pairs and pairs of mixed type. 'Conservative' assumes a chi-square distribution for the statistic with highly conservative degrees of freedom based on continuous uniform margins that do not account for the constraints introduced by the ranks. 'Simple' assumes a chi-square distribution but uses contingency-table inspired degrees of freedom, which can be slightly anti-conservative in the case of continuous pairs but work well for continuous/categorical comparisons. 'Gamma' assumes a gamma distribution on the resulting statistics with parameters determined by empirical investigation. 'Fitted' mixes the gamma and chi-square approaches by applying 'gamma' to continuous-categorical comparisons and a least-squares fitted version of the simple approximation to continuous-continuous comparisons, with parameters determined by empirical study. For all categorical-categorical comparisons the contingency table degrees of freedom are used in a chi-square distribution.
dropPoints: logical; should returned bins contain points?

Author

Chris Salahub

Details

`DepSearch` is a wrapper function which organizes and executes pairwise binning to test independence between all variable pairs in `data`. While splitting logic of any sort is supported for continuous margins through the use of the `catCon` and `conCon` arguments, the default settings apply random recursive binning, which proceeds for a single pair in three basic steps.

First, the types of the two variables in the pair are identified and rank transformations are applied. If one or both are continuous, the continuous variables are transformed to their ranks. Categorical, logical, and ordinal variables are not transformed.
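
The first step can be sketched in base R as follows. This is an illustration rather than the package's code: the helper name `rankContinuous` is hypothetical, "continuous" is assumed to mean `is.numeric`, and breaking ties at random is an assumption about tie handling.

```r
## Hypothetical helper: rank-transform numeric columns, leave others untouched
rankContinuous <- function(df) {
  isCon <- vapply(df, is.numeric, logical(1))   # flag continuous columns
  ## assumption: ties are broken at random so ranks are a permutation of 1..n
  df[isCon] <- lapply(df[isCon], rank, ties.method = "random")
  df
}

set.seed(1)
ranked <- rankContinuous(iris)
```

In `ranked`, the four measurement columns hold ranks from 1 to 150, while the factor `Species` is unchanged.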

Second, the ranks of the continuous margins are partitioned by edges added at random positions recursively. For the case of dual continuous variables, the edge at each recursive step is added on a randomly selected margin. If one variable is not continuous, then only the continuous margin is recursively split.
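
A rough sketch of this second step for two continuous variables is given below. It is not the package's `rIntSplit` logic: the function `randomBin`, its stop criteria (`maxDepth`, `minN`), and the uniform choice of integer edge are all assumptions made for illustration.

```r
## Hypothetical sketch: recursively split a rectangular bin of ranked points
## at random integer edges, choosing the margin to split at random.
randomBin <- function(x, y, xlim, ylim, depth = 0, maxDepth = 4, minN = 10) {
  bin <- list(xlim = xlim, ylim = ylim, n = length(x))
  ## assumed stop criteria: depth limit reached or too few points
  if (depth >= maxDepth || length(x) < minN) return(list(bin))
  splitX <- runif(1) < 0.5                       # pick a margin at random
  lim <- if (splitX) xlim else ylim
  if (diff(lim) < 2) return(list(bin))           # no interior integer edge
  edge <- lim[1] + sample.int(diff(lim) - 1, 1)  # random interior edge
  below <- (if (splitX) x else y) <= edge
  if (splitX) {
    lower <- randomBin(x[below], y[below], c(xlim[1], edge), ylim,
                       depth + 1, maxDepth, minN)
    upper <- randomBin(x[!below], y[!below], c(edge, xlim[2]), ylim,
                       depth + 1, maxDepth, minN)
  } else {
    lower <- randomBin(x[below], y[below], xlim, c(ylim[1], edge),
                       depth + 1, maxDepth, minN)
    upper <- randomBin(x[!below], y[!below], xlim, c(edge, ylim[2]),
                       depth + 1, maxDepth, minN)
  }
  c(lower, upper)                                # flat list of leaf bins
}

set.seed(42)
n <- 200
x <- rank(runif(n))
y <- rank(runif(n))
bins <- randomBin(x, y, xlim = c(0, n), ylim = c(0, n))
counts <- vapply(bins, function(b) b$n, integer(1))
areas <- vapply(bins, function(b) diff(b$xlim) * diff(b$ylim), numeric(1))
```

The leaf bins partition the rank square, so `counts` sums to `n` and `areas` sums to `n^2`. For a mixed pair, always splitting the same margin would mimic the rule that only the continuous margin is split.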

Finally, the resulting partition is evaluated using a chi-square test. For non-continuous variables, this is the classic contingency table test. For continuous variables, expected counts for each cell of the partition are determined based on the area of the cell. The degrees of freedom for the case of a continuous margin are motivated by the contingency table case and verified by empirical investigations. Alternatively, several other options are provided to allow a user to select the degrees of freedom approximation they prefer.
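
As a small worked illustration of the area-based expected counts, the base R sketch below evaluates a fixed 2 x 2 partition of the rank square. The edges (40 and 70), the simulated data, and the use of one degree of freedom are assumptions chosen so the result can be compared with the classic contingency table test; they are not the package's defaults.

```r
set.seed(7)
n <- 100
u <- rnorm(n)
x <- rank(u)                 # ranks of a continuous variable
y <- rank(u + rnorm(n))      # ranks of a dependent continuous variable

## a fixed partition of the rank square (0, n] x (0, n] into four bins
bins <- data.frame(xmin = c(0, 40, 0, 40), xmax = c(40, 100, 40, 100),
                   ymin = c(0, 0, 70, 70), ymax = c(70, 70, 100, 100))

## observed count of points falling in each bin
obs <- mapply(function(x0, x1, y0, y1) sum(x > x0 & x <= x1 & y > y0 & y <= y1),
              bins$xmin, bins$xmax, bins$ymin, bins$ymax)

## expected count under independence is proportional to bin area
expd <- n * (bins$xmax - bins$xmin) * (bins$ymax - bins$ymin) / n^2

resid <- (obs - expd) / sqrt(expd)   # Pearson residual for each bin
stat <- sum(resid^2)                 # chi-square statistic
## assumption: (2 - 1) * (2 - 1) = 1 degree of freedom, as for a 2 x 2 table
logp <- pchisq(stat, df = 1, lower.tail = FALSE, log.p = TRUE)
```

Because ranks fix the marginal counts, the area-based expected counts here equal the usual row-total times column-total over `n`, so `stat` matches the uncorrected statistic from `chisq.test(table(x <= 40, y <= 70), correct = FALSE)`.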

This procedure produces a p-value for every pairwise test, placing all pairwise measures on a comparable scale. By placing edges randomly, the method avoids any systematic bias against particular patterns while remaining powerful in detecting functional and non-functional dependencies of any type.

The output of `DepSearch` is a list, the first element of which is a list of lists, each of which records the details of the binning of a particular pair of variables.

Examples

## load the iris data set
data(iris)
## evaluate dependence in the iris data
iris_binnings <- DepSearch(iris)
## plot top departure displays
plot(iris_binnings)
## summarize results
summary(iris_binnings)
