point_interval: Point and interval estimates for tidy sample data

Description

Translates samples in a (possibly grouped) data frame into a point and interval estimate (or set of point and interval estimates, if there are multiple groups in a grouped data frame).

Usage

point_interval(.data, ..., .prob = 0.95, .point = median, .interval = qi,
  .broom = TRUE)
# S3 method for default
point_interval(.data, ..., .prob = 0.95, .point = median,
  .interval = qi, .broom = TRUE)
# S3 method for numeric
point_interval(.data, ..., .prob = 0.95, .point = median,
  .interval = qi, .broom = FALSE)
point_intervalh(...)
qi(x, .prob = 0.95)
hdi(x, .prob = 0.95)
mean_qi(.data, ..., .prob = 0.95)
mean_qih(...)
median_qi(.data, ..., .prob = 0.95)
median_qih(...)
mode_qi(.data, ..., .prob = 0.95)
mode_qih(...)
mean_hdi(.data, ..., .prob = 0.95)
mean_hdih(...)
median_hdi(.data, ..., .prob = 0.95)
median_hdih(...)
mode_hdi(.data, ..., .prob = 0.95)
mode_hdih(...)

Arguments

.data

Data frame (or grouped data frame as returned by group_by) that contains samples to summarize.

...

Bare column names or expressions that, when evaluated in the context of .data, represent samples to summarise. If this is empty, then by default all columns that are not group columns or start with "." (e.g. ".chain" or ".iteration") will be summarised.

.prob

Vector of probabilities to use for generating intervals. If multiple probabilities are provided, multiple rows per group are generated, each with a different probability interval (and value of the corresponding .prob column).

.point

Point estimate function, which takes a vector and returns a single value, e.g. mean, median, or Mode.

.interval

Interval estimate function, which takes a vector and a probability (.prob) and returns a two-element vector representing the lower and upper bound of an interval, e.g. qi or hdi.

.broom

When TRUE and only a single column / vector is to be summarised, use the name conf.low for the lower end of the interval and conf.high for the upper end for consistency with tidy in the broom package. If .data is a vector and this is TRUE, this will also set the column name of the point estimate to estimate.

x

Vector to summarise (for interval functions: qi and hdi).
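As a rough illustration of the contracts that .point and .interval describe, the sketch below defines two user-written functions in base R. The names trimmed_mean and normal_interval are hypothetical and are not part of the package; the normal approximation is just one possible interval rule, chosen here for brevity.

```r
# Hypothetical point estimate: vector in, single value out.
trimmed_mean <- function(x) mean(x, trim = 0.1)

# Hypothetical interval estimate: vector and probability in, c(lower, upper) out.
# Uses a normal approximation, so it only makes sense for roughly symmetric samples.
normal_interval <- function(x, .prob = 0.95) {
  z <- qnorm(1 - (1 - .prob) / 2)
  c(mean(x) - z * sd(x), mean(x) + z * sd(x))
}

set.seed(123)
samples <- rnorm(1000, mean = 2)
trimmed_mean(samples)
normal_interval(samples, .prob = 0.8)
```

Assuming the signature shown under Usage, functions of this shape could then be supplied as .point = trimmed_mean and .interval = normal_interval.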

Details

If .data is a data frame, then ... is a list of bare names of columns (or expressions derived from columns) of .data, on which the point and interval estimates are derived. Column expressions are processed using the tidy evaluation framework (see eval_tidy).

For a column named x, the resulting data frame will have a column named x containing its point estimate. If there is a single column to be summarized and .broom is TRUE, the output will also contain columns conf.low (the lower end of the interval) and conf.high (the upper end of the interval). Otherwise, for every summarized column x, the output will contain x.low (the lower end of the interval) and x.high (the upper end of the interval). Finally, the output will have a .prob column containing the probability for the interval on each output row.

If .data includes groups (see e.g. group_by), the points and intervals are calculated within the groups.
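The grouped behaviour can be pictured with a base R sketch that computes a median and a 95% quantile interval separately within each group. This only imitates the output layout described above (x, conf.low, conf.high, .prob) and is not the package's implementation; the data and object names are made up for the example.

```r
set.seed(42)
samples <- data.frame(
  group = rep(c("a", "b"), each = 1000),
  x = c(rnorm(1000), rnorm(1000, mean = 2, sd = 2))
)

.prob <- 0.95
tails <- c((1 - .prob) / 2, 1 - (1 - .prob) / 2)

# One summary row per group: point estimate plus lower and upper bounds.
summary_rows <- lapply(split(samples$x, samples$group), function(x) {
  data.frame(
    x = median(x),
    conf.low = quantile(x, tails[1], names = FALSE),
    conf.high = quantile(x, tails[2], names = FALSE),
    .prob = .prob
  )
})

by_group <- cbind(group = names(summary_rows), do.call(rbind, summary_rows))
rownames(by_group) <- NULL
by_group
```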

If .data is a vector, ... is ignored and the result is a data frame with one row per value of .prob and four columns: y (the point estimate), ymin (the lower end of the interval), ymax (the upper end of the interval), and .prob (the probability corresponding to the interval). This behavior allows point_interval and its derived functions (like median_qi, mean_qi, mode_hdi, etc) to be easily used to plot intervals in ggplot using methods like geom_eye, geom_eyeh, or stat_summary.
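To make the vector case concrete, here is a minimal base R sketch that builds a data frame of the same shape: one row per probability, with columns y, ymin, ymax, and .prob. The function name vector_summary is hypothetical, and the median and quantile interval are fixed choices for the illustration rather than the package's code.

```r
# Hypothetical helper: summarise a numeric vector, one row per probability.
vector_summary <- function(x, .prob = 0.95) {
  rows <- lapply(.prob, function(p) {
    bounds <- quantile(x, c((1 - p) / 2, 1 - (1 - p) / 2), names = FALSE)
    data.frame(y = median(x), ymin = bounds[1], ymax = bounds[2], .prob = p)
  })
  do.call(rbind, rows)
}

set.seed(123)
result <- vector_summary(rnorm(1000), .prob = c(0.5, 0.8, 0.95))
result
```

The point estimate is repeated on every row, while the interval widens as the probability increases.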

The functions ending in h (e.g., point_intervalh, median_qih) behave identically to the function without the h, except that when passed a vector, they return a data frame with x/xmin/xmax instead of y/ymin/ymax. This allows them to be used as values of the fun.data = argument of stat_summaryh. Note: these functions are not necessary if you use the point_interval argument of stats and geoms in the tidybayes package (e.g. stat_pointintervalh, geom_halfeyeh, etc), as these automatically adjust the function output to match their required aesthetics.
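The difference between the two families amounts to a renaming of the output columns. A small base R sketch of that renaming follows; to_horizontal is a hypothetical helper written for this illustration, not a function from the package.

```r
set.seed(123)
samples <- rnorm(1000)
bounds <- quantile(samples, c(0.025, 0.975), names = FALSE)

# Layout used for vertical intervals: y / ymin / ymax.
vertical <- data.frame(
  y = median(samples), ymin = bounds[1], ymax = bounds[2], .prob = 0.95
)

# Hypothetical helper: switch to the x / xmin / xmax layout.
to_horizontal <- function(df) {
  renames <- c(y = "x", ymin = "xmin", ymax = "xmax")
  hit <- names(df) %in% names(renames)
  names(df)[hit] <- unname(renames[names(df)[hit]])
  df
}

horizontal <- to_horizontal(vertical)
names(horizontal)
```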

median_qi, mode_hdi, etc are short forms for point_interval(..., .point = median, .interval = qi), etc.

qi yields the quantile interval (also known as the percentile interval or equi-tailed interval) as a 1x2 matrix.
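A quantile interval of this kind can be sketched in base R with stats::quantile. The function qi_sketch below is a hypothetical stand-in that returns a 1x2 matrix; it is meant to show the idea, not to reproduce the package's qi exactly.

```r
# Hypothetical stand-in: equi-tailed interval returned as a 1x2 matrix.
qi_sketch <- function(x, .prob = 0.95) {
  lower <- (1 - .prob) / 2
  matrix(quantile(x, c(lower, 1 - lower), names = FALSE), nrow = 1)
}

set.seed(123)
samples <- rnorm(10000)
interval <- qi_sketch(samples, .prob = 0.9)
dim(interval)

# Share of samples inside the interval; should be close to 0.9.
coverage <- mean(samples >= interval[1] & samples <= interval[2])
coverage
```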

hdi yields the highest-density interval(s) (also known as the highest posterior density interval). Note: If the distribution is multimodal, hdi may return multiple intervals for each estimate (these will be spread over rows). Internally it uses hdi from the HDInterval package.
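For intuition, a highest-density interval of a unimodal sample can be approximated as the shortest interval containing the requested share of the samples. The base R sketch below does this with a sliding window over the sorted values. The function shortest_interval is hypothetical, always returns a single interval, and therefore does not reproduce the multiple-interval behaviour for multimodal distributions noted above.

```r
# Hypothetical helper: shortest single interval holding at least .prob of the samples.
shortest_interval <- function(x, .prob = 0.95) {
  sorted <- sort(x)
  n <- length(sorted)
  window <- ceiling(.prob * n)            # samples each candidate interval must contain
  starts <- seq_len(n - window + 1)       # every possible left end point
  widths <- sorted[starts + window - 1] - sorted[starts]
  best <- which.min(widths)
  c(sorted[best], sorted[best + window - 1])
}

set.seed(123)
skewed <- rexp(10000)

hd <- shortest_interval(skewed, .prob = 0.9)
q <- quantile(skewed, c(0.05, 0.95), names = FALSE)
hd
q
```

On a skewed sample like this one, the shortest interval sits closer to the mode and is narrower than the equi-tailed quantile interval of the same probability.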

Examples

# NOT RUN {
library(dplyr)
library(ggplot2)
library(tidybayes)  # provides median_qi, mode_hdi, mode_hdih and geom_halfeyeh

set.seed(123)

rnorm(1000) %>%
  median_qi()

data.frame(x = rnorm(1000)) %>%
  median_qi(x, .prob = c(.50, .80, .95))

data.frame(
    x = rnorm(1000),
    y = rnorm(1000, mean = 2, sd = 2)
  ) %>%
  median_qi(x, y)

data.frame(
    x = rnorm(1000),
    group = "a"
  ) %>%
  rbind(data.frame(
    x = rnorm(1000, mean = 2, sd = 2),
    group = "b")
  ) %>%
  group_by(group) %>%
  median_qi(.prob = c(.50, .80, .95))

multimodal_samples = data.frame(
    x = c(rnorm(5000, 0, 1), rnorm(2500, 4, 1))
  )

multimodal_samples %>%
  mode_hdi(.prob = c(.66, .95))

multimodal_samples %>%
  ggplot(aes(x = x, y = 0)) +
  geom_halfeyeh(fun.data = mode_hdih, .prob = c(.66, .95))

# }
