estimatePDF: Nonparametric Density Estimation

Description

Estimates the probability density function for a data sample.

Usage

estimatePDF(sample, pdfLength = NULL, estimationPoints = NULL, 
lowerBound = NULL, upperBound = NULL, target = 70, lagrangeMin = 1, 
lagrangeMax = 200, debug = 0, outlierCutoff = 7, smooth = TRUE)

Value

failedSolution: TRUE if the calculated PDF is not considered an acceptable estimate of the data according to the scoring function.
threshold: represents the quality of the solution returned. Values of 40 to 70 indicate high confidence in the estimate. Values less than 5 are considered to be of poor quality. For more information on scoring see the referenced publication.
x: estimated range of density data
pdf: estimated probability density function
cdf: estimated cumulative distribution function
sqr: scaled quantile residual. Provides a sample-size invariant measure of the fluctuations in the estimate.
sqrSize: length of the returned scaled quantile residual. In most cases, this is the size of the input sample. Exceptions are if outliers are detected and/or if the failedSolution flag is true.
lagrange: values of the Lagrange multipliers. Can be used to reproduce the expansions for an analytical solution.
r: inverse of cdf for the sample.
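
As a minimal sketch of how the returned fields might be inspected, the snippet below checks failedSolution before using the estimate, then integrates pdf over x with the trapezoid rule as a sanity check. It assumes the function is loaded from a package named PDFEstimator and that the result can be indexed with `$` like a list; neither is stated on this page, and the variable `area` is introduced only for illustration.

```r
# Assumption: estimatePDF() is provided by the PDFEstimator package.
library(PDFEstimator)

set.seed(42)
sample <- rnorm(1000, 0, 1)
dist <- estimatePDF(sample)

# Check the quality flag before trusting the estimate.
if (dist$failedSolution) {
  warning("estimate was not accepted by the scoring function")
} else {
  cat("threshold score:", dist$threshold, "\n")
}

# Trapezoid-rule area under the estimated density (illustrative check only).
area <- sum(diff(dist$x) * (head(dist$pdf, -1) + tail(dist$pdf, -1)) / 2)
cat("approximate area under pdf:", area, "\n")
```

A result flagged by failedSolution can still be plotted, but the threshold description above suggests treating it with caution.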

Arguments

sample: the data sample from which to calculate the density estimate. If the sample has more than 1 column, the multivariate estimation function, estimatePDFmv(), is called instead.
pdfLength: the desired length of the estimate returned. Default value is calculated based on sample length. Overriding this calculation can increase or decrease the resolution of the estimate.
estimationPoints: a vector containing the points to estimate. If not specified, this is calculated automatically to span the entire sample data.
lowerBound: the lower bound of the PDF, if known. Default value is calculated based on the range of the data sample.
upperBound: the upper bound of the PDF, if known. Default value is calculated based on the range of the data sample.
target: a value from 1 to 100 representing the desired confidence percentage for the estimate score. The default of 70% represents the most likely score based on empirical simulations. A lower value may smooth estimates. A higher value tends to overfit to the sample and is not recommended.
lagrangeMin: minimum number of Lagrange multipliers
lagrangeMax: maximum number of Lagrange multipliers
debug: prints verbose output to the console when enabled. Default is 0.
outlierCutoff: outliers are automatically detected and removed. A point is treated as an outlier if it is less than Q1 - outlierCutoff * IQR or greater than Q3 + outlierCutoff * IQR, where Q1, Q3, and IQR are the first quartile, third quartile, and interquartile range of the sample, respectively. Setting outlierCutoff = 0 turns off outlier detection.
smooth: if TRUE, minimizes noise in the estimate, particularly in areas of low data density
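
The outlier rule described for outlierCutoff can be sketched in base R as follows. This only illustrates the stated formula and is not the package's internal code; the names `lowerFence`, `upperFence`, `isOutlier`, and `kept` are hypothetical, and the quartile method used by `quantile()` here may differ from the one the package uses.

```r
# Illustration of the fence formula, using base R only.
set.seed(1)
sample <- c(rnorm(1000, 0, 1), 50)   # one extreme point appended
outlierCutoff <- 7

q <- quantile(sample, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
iqr <- q[2] - q[1]                                   # interquartile range

lowerFence <- q[1] - outlierCutoff * iqr
upperFence <- q[2] + outlierCutoff * iqr

# Points outside the fences are the ones that would be dropped.
isOutlier <- sample < lowerFence | sample > upperFence
kept <- sample[!isOutlier]
```

With the default cutoff of 7, only very extreme points fall outside the fences; here the appended value of 50 is flagged while the normal draws are kept.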

Author

Jenny Farmer, Donald Jacobs

Details

estimatePDF is a nonparametric density estimator based on the maximum-entropy method. It estimates a probability density function (PDF) for random data, using an iterative scoring function to determine the best fit without overfitting to the sample.

References

Farmer, J. and D. Jacobs (2018). "High throughput nonparametric probability density estimation." PLoS One 13(5): e0196937.

Examples


# Estimate the density of 1000 points drawn from a standard normal distribution, using default parameters

sampleSize = 1000
sample = rnorm(sampleSize, 0, 1)
dist = estimatePDF(sample)
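
When a bound of the distribution is known, it can be passed explicitly instead of being inferred from the sample range. The sketch below does this for exponentially distributed data, which cannot be negative. It assumes the function is loaded from a package named PDFEstimator, which this page does not state; the choice of an exponential sample is only for illustration.

```r
# Assumption: estimatePDF() is provided by the PDFEstimator package.
library(PDFEstimator)

# Exponential data is bounded below by zero, so supply the known lower bound.
set.seed(7)
sample <- rexp(1000, rate = 1)
dist <- estimatePDF(sample, lowerBound = 0)

# Compare the estimate with the true density at the estimated points.
trueDensity <- dexp(dist$x, rate = 1)
plot(dist$x, dist$pdf, type = "l", xlab = "x", ylab = "density")
lines(dist$x, trueDensity, lty = 2)
```

The upperBound argument is left at its default here because the exponential distribution has no upper limit.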
