visstat: Automated Visualization of Statistical Hypothesis Testing

Description

visstat() provides automated visualization and selection of a statistical hypothesis test between a response and a feature variable in a given data.frame named dataframe, selecting a test that is appropriate under the data's type, distribution, sample size, and the specified conf.level. The data in dataframe must be structured column-wise, where varsample and varfactor are character strings corresponding to the column names of the response and feature variables, respectively. The automatically generated output figures illustrate the selected statistical hypothesis test, display the main test statistics, and include assumption checks and post hoc comparisons when applicable. The primary test results are returned as a list object.

Usage

visstat(
  dataframe,
  varsample,
  varfactor,
  conf.level = 0.95,
  numbers = TRUE,
  minpercent = 0.05,
  graphicsoutput = NULL,
  plotName = NULL,
  plotDirectory = getwd()
)

Value

A list containing the statistics of the automatically selected test whose assumptions are met. The list is returned invisibly; to inspect its values, assign the return value of visstat() to a variable.

Arguments

dataframe: data.frame containing at least two columns. Data must be ordered column-wise.
varsample: column name of the dependent variable (response) in dataframe, datatype character. varsample must be one entry of names(dataframe).
varfactor: column name of the independent variable (feature) in dataframe, datatype character. varfactor must be one entry of names(dataframe).
conf.level: confidence level used for the tests; the significance level is 1 - conf.level. Default is 0.95.
numbers: a logical indicating whether to show numbers in mosaic count plots.
minpercent: number between 0 and 1 indicating minimal fraction of total count data of a category to be displayed in mosaic count plots.
graphicsoutput: saves plot(s) of type "png", "jpg", "tiff" or "bmp" in the directory specified in plotDirectory (the examples below also pass "pdf"). If graphicsoutput = NULL (the default), no plots are saved.
plotName: graphical output is stored following the naming convention "plotName.graphicsoutput" in plotDirectory. Without specifying this parameter, plotName is automatically generated following the convention "statisticalTestName_varsample_varfactor".
plotDirectory: directory in which the generated plots are stored. Default is the current working directory.

Details

Decision logic (for more details, please refer to the package's vignette).

Throughout, data of class numeric or integer are referred to as numerical, while data of class factor are referred to as categorical. The significance level $α$ is defined as one minus the confidence level given by the argument conf.level. Assumptions of normality and homoscedasticity are considered met when the corresponding test yields a p-value greater than $α$ = 1 - conf.level.
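The rule for accepting an assumption can be sketched in a few lines of base R. The helper name assumption_met() is hypothetical and not part of the package; it only illustrates the comparison of a p-value with $α$ = 1 - conf.level described above.

```r
# Hypothetical helper (not part of the package): an assumption counts as met
# when the p-value of its test exceeds alpha = 1 - conf.level.
assumption_met <- function(p_value, conf.level = 0.95) {
  alpha <- 1 - conf.level
  p_value > alpha
}

# Illustration: Shapiro-Wilk normality check on simulated data
set.seed(1)
x <- rnorm(25)
p_norm <- shapiro.test(x)$p.value
assumption_met(p_norm, conf.level = 0.95)
```

Raising conf.level lowers $α$, so an assumption is rejected less readily.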

The choice of statistical tests performed by the function visstat() depends on whether the data are numerical or categorical, the number of levels in the categorical variable, the distribution of the data, and the chosen conf.level.

The function prioritizes interpretable visual output and tests that remain valid under their assumptions, following the decision logic below:

(1) When the response is numerical and the predictor is categorical, tests of central tendency are performed. If the categorical predictor has two levels:

- Welch's t-test (t.test()) is used if both groups have more than 30 observations (Lumley et al. (2002) <doi:10.1146/annurev.publheath.23.100901.140546>).
- For smaller samples, normality is assessed using shapiro.test(). If both groups return p-values greater than $α$, Welch's t-test is applied; otherwise, the Wilcoxon rank-sum test (wilcox.test()) is used.
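The two-level branch can be approximated with base R alone. The function choose_two_sample_test() below is a hypothetical sketch written for illustration; it is not the package's internal code and produces no plots.

```r
# Hypothetical sketch of the two-level branch (not the package's own code).
choose_two_sample_test <- function(x, g, conf.level = 0.95) {
  alpha <- 1 - conf.level
  groups <- split(x, droplevels(as.factor(g)))
  stopifnot(length(groups) == 2)

  # Both groups larger than 30: Welch's t-test without a normality check
  if (all(lengths(groups) > 30)) {
    return(t.test(groups[[1]], groups[[2]], conf.level = conf.level))
  }

  # Smaller samples: Shapiro-Wilk test in each group
  p_norm <- vapply(groups, function(v) shapiro.test(v)$p.value, numeric(1))
  if (all(p_norm > alpha)) {
    t.test(groups[[1]], groups[[2]], conf.level = conf.level)  # Welch by default
  } else {
    wilcox.test(groups[[1]], groups[[2]])
  }
}

res <- choose_two_sample_test(mtcars$mpg, mtcars$am)
res$method
```

t.test() uses var.equal = FALSE by default, which is what makes it Welch's variant.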

For predictors with more than two levels:

- An ANOVA model (aov()) is initially fitted.
- Residual normality is tested with shapiro.test() and ad.test(). If $p > α$ for either test, normality is assumed.
- Homogeneity of variance is tested with bartlett.test():
  - If $p > α$, use ANOVA with TukeyHSD().
  - If $p \leq α$, use oneway.test() with TukeyHSD().
- If residuals are not normal, use kruskal.test() with pairwise.wilcox.test().
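The branch for more than two levels can be sketched in the same way. choose_multi_level_test() is a hypothetical name. To stay within base R, the sketch checks residual normality with shapiro.test() only and leaves out ad.test(), which comes from the nortest package, so it is stricter than the rule stated above.

```r
# Hypothetical sketch of the multi-level branch (not the package's own code).
# Simplification: only shapiro.test() is used for residual normality.
choose_multi_level_test <- function(y, g, conf.level = 0.95) {
  alpha <- 1 - conf.level
  d <- data.frame(y = y, g = droplevels(as.factor(g)))
  fit <- aov(y ~ g, data = d)

  normal <- shapiro.test(residuals(fit))$p.value > alpha
  if (!normal) {
    # Residuals not normal: rank-based test and pairwise comparisons
    return(list(
      test = kruskal.test(y ~ g, data = d),
      posthoc = pairwise.wilcox.test(d$y, d$g)
    ))
  }

  # Residuals normal: choose between ANOVA and Welch's one-way test
  equal_var <- bartlett.test(y ~ g, data = d)$p.value > alpha
  main <- if (equal_var) summary(fit) else oneway.test(y ~ g, data = d)
  list(test = main, posthoc = TukeyHSD(fit, conf.level = conf.level))
}

res_multi <- choose_multi_level_test(npk$yield, npk$block)
names(res_multi)
```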

(2) When both the response and predictor are numerical, a linear model (lm()) is fitted, with residual diagnostics and a confidence band plot.
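As a rough illustration of this case, the following base R sketch fits a linear model to the trees data and draws a pointwise confidence band with predict(). The variable names fit, new_x and band are chosen here for illustration, and the plot is a simplified stand-in for the package's own graphics.

```r
# Sketch: linear model with a 99% confidence band for the regression line
fit <- lm(Girth ~ Height, data = trees)

new_x <- data.frame(
  Height = seq(min(trees$Height), max(trees$Height), length.out = 50)
)
band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.99)

plot(Girth ~ Height, data = trees)
lines(new_x$Height, band[, "fit"])            # fitted regression line
lines(new_x$Height, band[, "lwr"], lty = 2)   # lower confidence limit
lines(new_x$Height, band[, "upr"], lty = 2)   # upper confidence limit

# One possible residual diagnostic: normality of the residuals
shapiro.test(residuals(fit))
```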

(3) When both variables are categorical, visstat() uses chisq.test() or fisher.test() depending on expected counts, following Cochran's rule (Cochran (1954) <doi:10.2307/3001666>).
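The choice between the two tests can be sketched as follows. The thresholds are an assumption: they follow a common reading of Cochran's rule (no expected count below 1 and at least 80% of expected counts of 5 or more) and may differ from what the package implements. The name choose_count_test() is hypothetical.

```r
# Hypothetical sketch (not the package's own code); thresholds are an assumed
# reading of Cochran's rule.
choose_count_test <- function(tab) {
  # chisq.test() may warn about small expected counts; only the expected
  # counts are needed at this point
  expected <- suppressWarnings(chisq.test(tab)$expected)
  cochran_ok <- all(expected >= 1) && mean(expected >= 5) >= 0.8
  if (cochran_ok) chisq.test(tab) else fisher.test(tab)
}

# 2x2 table of male students: hair Black/Brown by eye Hazel/Green
tab <- HairEyeColor[1:2, 3:4, "Male"]
choose_count_test(tab)$method
```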

Implemented main tests: t.test(), wilcox.test(), aov(), oneway.test(), lm(), kruskal.test(), fisher.test(), chisq.test().

Implemented tests for assumptions:

Normality: shapiro.test() and ad.test().
Homoscedasticity: bartlett.test().

Implemented post hoc tests:

TukeyHSD() for aov() and oneway.test().
pairwise.wilcox.test() for kruskal.test().

Examples

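## Welch Two Sample t-test (calling t.test())
## am is stored as numeric in mtcars; convert it to a factor so that it is
## treated as categorical
mtcars$am <- as.factor(mtcars$am)
visstat(mtcars, "mpg", "am")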

## Wilcoxon rank sum test (calling wilcox.test())
grades_gender <- data.frame(
  Sex = as.factor(c(rep("Girl", 20), rep("Boy", 20))),
  Grade = c(
    19.3, 18.1, 15.2, 18.3, 7.9, 6.2, 19.4,
    20.3, 9.3, 11.3, 18.2, 17.5, 10.2, 20.1, 13.3, 17.2, 15.1, 16.2, 17.3,
    16.5, 5.1, 15.3, 17.1, 14.8, 15.4, 14.4, 7.5, 15.5, 6.0, 17.4,
    7.3, 14.3, 13.5, 8.0, 19.5, 13.4, 17.9, 17.7, 16.4, 15.6
  )
)
visstat(grades_gender, "Grade", "Sex")

## One-way analysis of means (calling oneway.test())
anova_npk <- visstat(npk, "yield", "block")
anova_npk # prints summary of tests

## Kruskal-Wallis rank sum test (calling kruskal.test())
visstat(iris, "Petal.Width", "Species")
visstat(InsectSprays, "count", "spray")

## Linear regression
visstat(trees, "Girth", "Height", conf.level = 0.99)

## Pearson's Chi-squared test and mosaic plot with Pearson residuals
### Transform array to data.frame
HairEyeColorDataFrame <- counts_to_cases(as.data.frame(HairEyeColor))
visstat(HairEyeColorDataFrame, "Hair", "Eye")

## 2x2 contingency tables with Fisher's exact test and mosaic plot
## with Pearson residuals
HairEyeColorMaleFisher <- HairEyeColor[, , 1]
### Slice out a 2x2 contingency table
blackBrownHazelGreen <- HairEyeColorMaleFisher[1:2, 3:4]
blackBrownHazelGreen <- counts_to_cases(as.data.frame(blackBrownHazelGreen))
fisher_stats <- visstat(blackBrownHazelGreen, "Hair", "Eye")
fisher_stats # print out summary statistics



## Saving the graphical output in directory plotDirectory
## A) saving graphical output of type "png" in temporary directory tempdir()
##    with default naming convention:
visstat(blackBrownHazelGreen, "Hair", "Eye",
  graphicsoutput = "png",
  plotDirectory = tempdir()
)

## remove graphical output from plotDirectory
file.remove(file.path(tempdir(), "chi_squared_or_fisher_Hair_Eye.png"))
file.remove(file.path(tempdir(), "mosaic_complete_Hair_Eye.png"))

## B) Specifying pdf as output type:
visstat(iris, "Petal.Width", "Species",
  graphicsoutput = "pdf",
  plotDirectory = tempdir()
)

## remove graphical output from plotDirectory
file.remove(file.path(tempdir(), "kruskal_Petal_Width_Species.pdf"))

## C) Specifying plotName overrides the default naming convention
visstat(iris, "Petal.Width", "Species",
  graphicsoutput = "pdf",
  plotName = "kruskal_iris", plotDirectory = tempdir()
)
## remove graphical output from plotDirectory
file.remove(file.path(tempdir(), "kruskal_iris.pdf"))
