samp.dist: Animated and/or snapshot representations of a statistic's sampling distribution

Description

This help page describes a series of asbio functions for depicting sampling distributions. The function samp.dist samples from a parent distribution without replacement, with sample size = s.size, R times. At each iteration the statistic requested in stat is calculated, so a distribution of R statistic estimates is created. The function samp.dist shows this distribution as an animated (anim = TRUE) or non-animated (anim = FALSE) density histogram. Sampling distributions for up to four different statistics, utilizing up to two different parent distributions, are possible using samp.dist. Sampling distributions can be combined in various ways by specifying a function in func (see below). The function samp.dist.n was designed to show (with animation) how sampling distributions vary with sample size, and is still under development. The function samp.dist.snap creates snapshots, i.e. simultaneous views of a sampling distribution at particular sample sizes. The function dirty.dist can be used to create contaminated parent distributions.
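The resampling scheme described above can be sketched in base R without asbio. This is a minimal illustration, not the package's implementation. The names parent_pop, s_size, n_rep, and stat_est are hypothetical, and a finite vector is assumed as the parent so that sampling without replacement is meaningful.

```r
# Minimal base-R sketch of a sampling distribution (not asbio's internals).
set.seed(1)                                # for reproducibility
parent_pop <- rexp(10000)                  # hypothetical finite parent population
s_size <- 10                               # sample size per iteration
n_rep  <- 1000                             # number of iterations (R in samp.dist)

# At each iteration, draw a sample without replacement and compute the statistic
stat_est <- replicate(n_rep, mean(sample(parent_pop, s_size, replace = FALSE)))

# Non-animated density histogram of the n_rep statistic estimates
hist(stat_est, breaks = 50, freq = FALSE, xlab = expression(bar(x)),
     main = "Sampling distribution of the mean")
```

Swapping mean for another summary function (for example median or var) gives the corresponding sampling distribution under the same scheme.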

Usage

samp.dist(parent = NULL, parent2 = NULL, biv.parent = NULL, s.size = 1, s.size2
 = NULL, R = 1000, nbreaks = 50, stat = mean, stat2 = NULL, stat3 = NULL, stat4 
 = NULL, xlab = expression(bar(x)), func = NULL, show.n = TRUE, show.SE = FALSE, 
 anim = TRUE, interval = 0.01, col.anim = "rainbow", digits = 3, ...)

samp.dist.snap(parent = NULL, parent2 = NULL, biv.parent = NULL, stat = mean, 
stat2 = NULL, stat3 = NULL, stat4 = NULL, s.size = c(1, 3, 6, 10, 20, 50), 
s.size2 = NULL, R = 1000, func = NULL, xlab = expression(bar(x)), 
show.SE = TRUE, fits = NULL, show.fits = TRUE, xlim = NULL, ylim = NULL, ...)

samp.dist.method.tck()

samp.dist.tck(statc = "mean")

samp.dist.snap.tck1(statc = "mean")

samp.dist.snap.tck2(statc = "mean")

dirty.dist(s.size, parent = expression(rnorm(1)), 
cont = expression(rnorm(1, mean = 10)), prop.cont = 0.1)

samp.dist.n(parent, R = 500, n.seq = seq(1, 30), stat = mean, xlab = expression(bar(x)), 
    nbreaks = 50, func = NULL, show.n = TRUE, 
    show.SE = FALSE, est.density = TRUE, col.density = 4, lwd.density = 2, 
    est.ylim = TRUE, ylim = NULL, anim = TRUE, interval = 0.5, 
    col.anim = NULL, digits = 3, ...)

Arguments

parent

A vector or vector-generating function describing the parental distribution. Any collection of values can be used. When using random value generators for parental distributions, for CPU efficiency (and accuracy) one should use parent = expre

parent2

An optional second parental distribution (see parent above), useful for the construction of sampling distributions of test statistics. When using random value generators use parent2 = expression(rpdf(s.size2, ...)).

biv.parent

A bivariate (two column) distribution.

s.size

An integer defining sample size (or a vector of integers in the case of samp.dist.snap) to be taken at each of R iterations from the parental distribution.

s.size2

An optional integer defining a second sample size if a second statistic is to be calculated. Again, this will be a vector of integers in the case of samp.dist.snap.

R

The number of samples to be taken from the parent distribution(s).

nbreaks

Number of breaks in the histogram.

stat

The statistic whose sampling distribution is to be represented. Will work for any summary statistic that only requires a call to data; e.g. mean, var,

stat2

An optional second statistic. Useful for conceptualizing sampling distributions of test statistics. Calculated from sampling parent2.

stat3

An optional third statistic. The sampling distribution is created from the same sample data used for stat.

stat4

An optional fourth statistic. The sampling distribution is created from the same sample data used for stat2.

xlab

X-axis label.

func

An optional function used to manipulate a sampling distribution or to combine the sampling distributions of two or more statistics. The function must contain the following arguments (although they needn't all be used in the function): s.di

show.n

A logical command; TRUE indicates that the sample size for parent will be displayed.

show.SE

A logical command; TRUE indicates that the bootstrap standard error for the statistic will be displayed.

anim

A logical command indicating whether or not animation should be used.

interval

Animation speed. Decreasing interval increases speed.

col.anim

Color to be used in animation. Three changing color palettes are available: rainbow, gray, and heat.colors. Alternatively, a "fixed" color can be used.

digits

The number of digits to be displayed in the bootstrap standard error.

fits

Fitted distributions for samp.dist.snap. A function with two arguments: s.size and s.size2.

show.fits

Logical indicating whether or not fits should be shown (fits will not be shown if no fitting function is specified, regardless of whether this is TRUE or FALSE).

xlim

A two-element numeric vector defining the lower and upper limits of the X-axis.

ylim

A two-element numeric vector defining the lower and upper limits of the Y-axis.

statc

Presets for certain statistics. Currently one of

"custom", "mean", 
  "median", "trimmed mean", "Winsorized mean", "Huber estimator", "H-L estimator", 
  "sd", "var", "IQR", "MAD", "(n-1)S^2/sigma^2", "F*", "t* (1 sample)", "t* (2 sample)", "Pear

cont

A distribution representing a source of contamination in the parent population. Used by function dirty.dist.

prop.cont

The proportion of the parent distribution that is contaminated by cont.

n.seq

A range of sample sizes for samp.dist.n

est.density

A logical command for samp.dist.n. If TRUE, a density line is plotted over the histogram. Only used if fix.n = TRUE.

col.density

The color of the density line for samp.dist.n. See est.density above.

lwd.density

The width of the density line for samp.dist.n. See est.density above.

est.ylim

Logical. If TRUE, sensible Y-axis limits are estimated for the animation in samp.dist.n. Consistent Y-axis limits make animations easier to visualize. Only used if fix.n = TRUE.

...

Additional arguments from plot.histogram.
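The idea behind the cont and prop.cont arguments described above can be sketched in base R. The helper contaminated_sample is hypothetical and is not the dirty.dist implementation. It assumes that each observation is drawn from the contaminating distribution with probability prop_cont and from the parent otherwise, which may differ from how the package mixes the two.

```r
# Hypothetical helper (not asbio's dirty.dist): each observation comes from the
# contaminating distribution with probability prop_cont, otherwise from the parent.
contaminated_sample <- function(n, prop_cont = 0.1) {
  from_cont <- runif(n) < prop_cont                    # TRUE marks a contaminated draw
  ifelse(from_cont, rnorm(n, mean = 10), rnorm(n))     # N(10, 1) vs. N(0, 1)
}

set.seed(7)
x <- contaminated_sample(10000)
mean(x)      # expected value is 0.9 * 0 + 0.1 * 10 = 1
hist(x, breaks = 50, freq = FALSE, main = "Contaminated normal parent")
```

A parent of this kind is useful for comparing the sampling distributions of robust and non-robust location estimators, such as the median and the mean.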

Value

Returns a representation of a statistic's sampling distribution in the form of a histogram.

Details

Sampling distributions of individual statistics can be created with samp.dist, or the function can be used in more sophisticated ways, e.g. to create sampling distributions of ratios of statistics such as t* and F* (see examples below). Animation is provided for pedagogical clarity. To calculate bivariate statistics, specify the parent distribution with biv.parent and the statistic with func (see below).

Two general uses of the function samp.dist are possible.

1) One can demonstrate the accumulation of statistics for a single sample size using animation. This is useful because as more and more statistics are acquired, the frequentist paradigm associated with sampling distributions becomes better represented (i.e. the number of estimates is closer to infinity). This is elucidated by allowing the default fix.n = TRUE. Animation will be provided with the default anim = TRUE. Up to two parent distributions, up to two sample sizes, and up to four distinct statistics (i.e. four distinct sampling distributions, representing four distinct estimators) can be used. The arguments stat and stat3 will be drawn from parent, while stat2 and stat4 will be drawn from parent2. These distributions can be manipulated and combined in an infinite number of ways with an auxiliary function called in the argument func (see examples below). This allows depiction of sampling distributions made up of multiple estimators, e.g. test statistics.

2) One can provide simultaneous snapshots of a sampling distribution at particular sample sizes with the function samp.dist.snap.

Loading the package tcltk allows use of the functions samp.dist.tck, samp.dist.method.tck, samp.dist.snap.tck1 and samp.dist.snap.tck2, which provide interactive GUIs that run samp.dist.
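As a further illustration of combining estimators through func, the sketch below builds an unpooled (Welch-type) t statistic from two sample means and two sample variances. The argument names mirror those of the t.star function in the Examples section of this page. The name welch.star is hypothetical, and the samp.dist call is guarded so that it only runs when asbio is installed.

```r
# Hypothetical helper: unpooled (Welch-type) t statistic.
# s.dist1, s.dist2: sample means from parent and parent2
# s.dist3, s.dist4: sample variances from parent and parent2
welch.star <- function(s.dist1, s.dist2, s.dist3, s.dist4, s.size = 6, s.size2 = 6) {
  (s.dist1 - s.dist2) / sqrt(s.dist3 / s.size + s.dist4 / s.size2)
}

# Only attempt the plot if asbio is available
if (requireNamespace("asbio", quietly = TRUE)) {
  asbio::samp.dist(parent = expression(rnorm(s.size)),
                   parent2 = expression(rnorm(s.size2, sd = 3)),
                   s.size = 6, s.size2 = 6, R = 1000,
                   stat = mean, stat2 = mean, stat3 = var, stat4 = var,
                   xlab = "t* (Welch)", func = welch.star, anim = FALSE)
}
```

Because the helper is vectorized over the four sampling distributions, it returns one combined statistic per iteration.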

Examples

##Central limit theorem
#Snapshots of four sample sizes.
samp.dist.snap(parent=expression(rexp(s.size)), s.size = c(1,5,10,50), R = 1000)

##sample mean animation
samp.dist(parent=expression(rexp(s.size)), col.anim="heat.colors", interval=.3)

##Distribution of t-statistics from a pooled variance t-test under valid and invalid assumptions
#valid
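t.star <- function(s.dist1, s.dist2, s.dist3, s.dist4, s.size = 6, s.size2 = 6) {
  # Pooled variance estimate from the two sample variances (s.dist3, s.dist4)
  MSE <- (((s.size - 1) * s.dist3) + ((s.size2 - 1) * s.dist4)) / (s.size + s.size2 - 2)
  # Pooled-variance t statistic for the difference in sample means
  (s.dist1 - s.dist2) / (sqrt(MSE) * sqrt((1 / s.size) + (1 / s.size2)))
}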

samp.dist(parent = expression(rnorm(s.size)), parent2 = 
expression(rnorm(s.size2)), s.size=6, s.size2 = 6, R=1000, stat = mean, 
stat2 = mean, stat3 = var, stat4 = var, xlab = "t*", func = t.star)

curve(dt(x, 10), from = -6, to = 6, add = TRUE, lwd = 2)
legend("topleft", lwd = 2, col = 1, legend = "t(10)")

#invalid; same population means (null true) but different variances and other distributional 
#characteristics.
samp.dist(parent = expression(runif(s.size, min = 0, max = 2)), parent2 = 
expression(rexp(s.size2)), s.size=6, s.size2 = 6, R = 1000, stat = mean, 
stat2 = mean, stat3 = var, stat4 = var, xlab = "t*", func = t.star)

curve(dt(x, 10),from = -6, to = 6,add = TRUE, lwd = 2)
legend("topleft", lwd = 2, col = 1, legend = "t(10)")

## Pearson's r
library(mvtnorm)
BVN <- function(s.size) rmvnorm(s.size, c(0, 0), sigma = matrix(ncol = 2, 
nrow = 2, data = c(1, 0, 0, 1)))
samp.dist(biv.parent = expression(BVN(s.size)), s.size = 20, func = cor, xlab = "r")
                                                  
#Interactive GUI, requires package 'tcltk'
samp.dist.tck("S^2")
samp.dist.snap.tck1("Huber estimator")
samp.dist.snap.tck2("F*")
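
The central limit theorem snapshots at the top of this section can be approximated in base R when asbio is unavailable. This is a rough sketch rather than a reimplementation of samp.dist.snap. The names n_rep, s_sizes, means_by_n, se_sim, and se_theory are hypothetical, and the standard error is assumed to be the standard deviation of the simulated statistics.

```r
set.seed(123)
n_rep   <- 1000
s_sizes <- c(1, 5, 10, 50)

# One vector of n_rep sample means per sample size
means_by_n <- lapply(s_sizes, function(n) replicate(n_rep, mean(rexp(n))))

# Simulated standard error vs. the theoretical value 1 / sqrt(n) for rexp(rate = 1)
se_sim    <- sapply(means_by_n, sd)
se_theory <- 1 / sqrt(s_sizes)

# 2 x 2 grid of density histograms, one panel per sample size
old_par <- par(mfrow = c(2, 2))
for (i in seq_along(s_sizes)) {
  hist(means_by_n[[i]], breaks = 30, freq = FALSE, xlab = expression(bar(x)),
       main = paste("n =", s_sizes[i]))
}
par(old_par)
```

The panels should become narrower and more symmetric as the sample size grows, and se_sim should track se_theory.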
