dat2bernqua: Observed Data to Quantiles through Bernstein or Kantorovich Polynomials

Description

The empirical quantile function can be smoothed (Hernández-Maldonado{Hernandez-Maldonado} and others, 2012, p. 114) through the Kantorovich polynomial (Muñoz-Pérez{Munoz-Perez} and Fernández-Palacín{Fernandez-Palacin}, 1987) for the sample order statistics $x_{k:n}$ for a sample of size $n$ by $$\tilde{X}_n(F) = \frac{1}{2}\sum_{k=0}^n (x_{k:n} + x_{(k+1):n}) {n \choose k} F^k (1-F)^{n-k}\mbox{,}$$ where $F$ is nonexceedance probability, and $(n\:k)$ are the binomial coefficients from the choose() function of R, and special situation for $k=0$ and $k=n$ are described within the Note section. The form for the Bernstein polynomial is $$\tilde{X}_n(F) = \sum_{k=0}^{n+1} (x_{k:n}) {n+1 \choose k} F^k (1-F)^{n+1-k}\mbox{.}$$ There are subtle differences between the two and this function supports each.

Turnbull and Ghosh (2014) consider through the direction of a referee and recommendation of $p=0.05$ by that referee (and credit to ideas by de Carvalho [2012]) that the support of the probability density function for the Turnbull and Ghosh (2014) study of Bernstein polynomials can be computed letting $\alpha = (1 - p)^{-2} - 1$ by $$\biggl(x_{1:n} - (x_{2:n} - x_{1:n})/\alpha,\: x_{n:n} + (x_{n:n} - x_{n-1:n})/\alpha\biggr)\mbox{,}$$ for the minimum and maximum, respectively. Evidently, the original support considered by Turnbull and Ghosh (2014) was $$\biggl(x_{1:n} - \lambda_2\sqrt{\pi/n},\: x_{n:n} + \lambda_2\sqrt{\pi/n}\biggr)\mbox{,}$$ for the minimum and maximum, respectively and where the standard deviation is estimated in the function using the 2nd L-moment as $s = \lambda\sqrt{\pi}$.

The $p$ is referred to by this author as the p-factor this value has great influence in the estimated support of the distribution and therefore distal-tail estimation or performance is sensitive to the value for $p$. General exploratory analysis suggests that the $p$ can be optimized based on information external or internal to the data for shape restrained smoothing. For example, an analyst might have external information as to the expected L-skew of the phenomenon being studied or could use the sample L-skew of the data (internal information) for shape restraint (see pfactor.bernstein).

An alternative formula for smoothing is by Cheng (1995) as is $$\tilde{X}^{\mathrm{Cheng}}_n(F) = \sum_{k=1}^n x_{k:n}\:{n - 1 \choose k-1}\: F^{k-1}(1-F)^{n-k}\mbox{.}$$

Usage

dat2bernqua(f, x, bern.control=NULL,
                  poly.type=c("Bernstein", "Kantorovich", "Cheng"),
                  bound.type=c("none", "sd", "Carv", "either"),
                  fix.lower=NULL, fix.upper=NULL, p=0.05,
                  listem=FALSE)

Arguments

A vector of nonexceedance probabilities $F$;.

A vector of data values.

bern.control

A list that holds poly.type, bound.type, fix.lower, and fix.upper. And this list will supersede the respective values provided as separate arguments.

poly.type

The Bernstein or Kantorovich polynomial will be used. The two are quite closely related. Or the formula by Cheng (1995) will be used and bounds.type, fix.lower, fix.upper, and p are not applicable.

bound.type

Triggers to the not involve alternative supports ("none") then the minimum and maximum are used unless already provided by the fix.lower or fix.upper, the support based "sd" on the standard deviation, th

fix.lower

For $k = 0$, either the known lower bounds is used if provided as non NULL or the observed minimum of the data. If the minimum of the data is less than the fix.lower, a warning is triggered and fix.lower is set to th

fix.upper

For $k = n$, either the known upper bounds is used if provided as non NULL or the observed maximum of the data; If the maximum of the data is less than the fix.upper, a warning is triggered and fix.upper is set to th

A small probability value to serve as the $p$ in the "Carv" support computation. The default is recommended as mentioned above. The program will return NA if $10^{-6} < p \ge (1-10^{-6})$ is not met. The value p is t

listem

A logical controlling whether (1) a vector of $\tilde{X}_n(F)$ is returned or (2) a list containing $\tilde{X}_n(F)$, the f, original sample size $n$ of the data, the de Carvalho probability p (whether actually used internally or

Value

An R vector is returned.

encoding

utf8

References

Asquith, W.H., 201X. Empirical quantile function smoothing and distal-tail estimation by Bernstein and Kantorovich polynomials incorporating regional L-moment information. [in review].

Cheng, C., 1995, The Berstein polynomial estimator of a smooth quantile function: Statistics and Probability Letters, v. 24, pp. 321--330.

de Carvalho, M., 2012, A generalization of the Solis-Wets method: Journal of Statistical Planning and Inference, v. 142, no. 3, pp. 633--644.

Hernández-Maldonado{Hernandez-Maldonado}, V., Díaz-Viera{Diaz-Viera}, M., and Erdely, A., 2012, A joint stochastic simulation method using the Bernstein copula as a flexible tool for modeling nonlinear dependence structures between petrophysical properties: Journal of Petroleum Science and Engineering, v. 90--91, pp. 112--123.

Muñoz-Pérez{Munoz-Perez}, J., and Fernández-Palacín{Fernandez-Palacin}, A., 1987, Estimating the quantile function by Bernstein polynomials: Computational Statistics and Data Analysis, v. 5, pp. 391--397.

Turnbull, B.C., and Ghosh, S.K., 2014, Unimodal density estimation using Bernstein polynomials: Computational Statistics and Data Analysis, v. 72, pp. 13--29.

Examples

Run this code

# Compute smoothed extremes, quartiles, and median
# The smoothing seems to extend to F=0 and F=1.
set.seed(1); X <- exp(rnorm(20)); F <- c(0,.25,.50,.75,1)
dat2bernqua(F, X, bound.type="none", listem=TRUE)$x
dat2bernqua(F, X, bound.type="Carv", listem=TRUE)$x
dat2bernqua(F, X, bound.type="sd", listem=TRUE)$x
dat2bernqua(F, X, bound.type="either", listem=TRUE)$x
dat2bernqua(F, X, bound.type="sd", fix.lower=0, listem=TRUE)$x
# Notice that the lower extreme between the last two calls
# changes from a negative to a positive number when the lower
# bounds is "known" to be zero.

X <- exp(rnorm(20)); F <- seq(0.001, 0.999, by=.001)
dat2bernqua(0.9, X, poly.type="Bernstein",   listem=TRUE)$x
dat2bernqua(0.9, X, poly.type="Kantorovich", listem=TRUE)$x
dat2bernqua(0.9, X, poly.type="Cheng",       listem=TRUE)$x
plot(pp(X), sort(X), log="y", xlim=range(F))
lines(F, dat2bernqua(F, X, poly.type="Bernstein"), col=2)   # red
lines(F, dat2bernqua(F, X, poly.type="Kantorovich"), col=3) # green
lines(F, dat2bernqua(F, X, poly.type="Cheng"), col=4)       # blue

X <- exp(rnorm(20)); F <- nonexceeds()
plot(pp(X), sort(X))
lines(F, dat2bernqua(F,X, bound.type="sd", poly.type="Bernstein"))
lines(F, dat2bernqua(F,X, bound.type="sd", poly.type="Kantorovich"), col=2)

X <- rnorm(25); F <- nonexceeds()
Q <- dat2bernqua(F, X) # the Bernstein estimates
plot(F,  dat2bernqua(F, X, bound.type="Carv"), type="l")
lines(F, dat2bernqua(F, X, bound.type="sd"), col=2)
lines(F, dat2bernqua(F, X, bound.type="none"), col=3)
points(pp(X), sort(X), pch=16, cex=.75, col=4)

set.seed(13)
par <- parkap(vec2lmom(c(1,.5,.4,.2)))
F <- seq(0.001,0.999,by=.001)
X <- sort(rlmomco(100, par))
pp <- pp(X)
pdf("junk.pdf")
plot(qnorm(pp(X)), dat2bernqua(pp, X), col=4, pch=1,
     ylim=c(0,qlmomco(0.9999, par)))
lines(qnorm(F), dat2bernqua(F, sort(X)), col=4)
lines(qnorm(F), qlmomco(F, par), col=2)
sampar <- parkap(lmoms(X))
sampar2 <- parkap(lmoms(dat2bernqua(pp, X)))
lines(qnorm(pp(F)), qlmomco(F, sampar), col=1)
lines(qnorm(pp(F)), qlmomco(F, sampar2), col=4, lty=2)
points(qnorm(pp(X)), X, col=1, pch=16)

plot(qnorm(pp(X)), dat2bernqua(pp, X, altsupport=TRUE), col=4, pch=1,
     ylim=c(0,qlmomco(0.9999, par)))
lines(qnorm(F), dat2bernqua(F, sort(X), altsupport=TRUE), col=4)
lines(qnorm(F), qlmomco(F, par), col=2)
sampar <- parkap(lmoms(X))
sampar2 <- parkap(lmoms(dat2bernqua(pp, X, altsupport=TRUE)))
lines(qnorm(pp(F)), qlmomco(F, sampar), col=1)
lines(qnorm(pp(F)), qlmomco(F, sampar2), col=4, lty=2)
points(qnorm(pp(X)), X, col=1, pch=16)
dev.off()

Run the code above in your browser using DataLab