approxQuantile: Calculates the approximate quantiles of a numerical column of a SparkDataFrame

Description

Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
The result of this algorithm has the following deterministic bound:
If the SparkDataFrame has N elements and if we request the quantile at probability p up to
error err, then the algorithm will return a sample x from the SparkDataFrame so that the
*exact* rank of x is close to (p * N). More precisely,
floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
This method implements a variation of the Greenwald-Khanna algorithm (with some speed
optimizations). The algorithm was first presented in "Space-efficient Online Computation of
Quantile Summaries" (http://dx.doi.org/10.1145/375663.375670) by Greenwald and Khanna.
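
The rank bound stated above can be checked by hand. The following is a minimal base R sketch of that arithmetic only; `rank_bounds` and `rank_within_bound` are hypothetical helper names introduced for illustration and are not part of SparkR.

```r
# Rank window implied by the documented bound:
#   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)
# rank_bounds is a hypothetical helper, not a SparkR function.
rank_bounds <- function(n, p, err) {
  c(lower = floor((p - err) * n), upper = ceiling((p + err) * n))
}

# TRUE when an element's exact rank falls inside the window.
rank_within_bound <- function(rank, n, p, err) {
  b <- rank_bounds(n, p, err)
  rank >= b[["lower"]] && rank <= b[["upper"]]
}

# p and err are chosen to be exactly representable in binary floating point.
b <- rank_bounds(n = 1000, p = 0.5, err = 0.0625)
print(b)

in_window <- rank_within_bound(500, n = 1000, p = 0.5, err = 0.0625)
out_of_window <- rank_within_bound(600, n = 1000, p = 0.5, err = 0.0625)
```

With N = 1000, p = 0.5 and err = 0.0625, the window covers ranks 437 through 563, so an element of exact rank 500 would satisfy the bound and one of rank 600 would not.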

Usage

# S4 method for SparkDataFrame,character,numeric,numeric
approxQuantile(x, col, probabilities, relativeError)

Arguments

x

A SparkDataFrame.

col

The name of the numerical column.

probabilities

A numeric vector of quantile probabilities. Each number must belong to [0, 1].
For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.

relativeError

The relative target precision to achieve (>= 0). If set to zero,
the exact quantiles are computed, which could be very expensive.
Note that values greater than 1 are accepted but give the same result as 1.

Value

The approximate quantiles at the given probabilities.
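
A call might look like the sketch below. It assumes a local Spark installation that SparkR can start; the `local[1]` master, the column name `value`, and the sample data are illustrative choices, not taken from this page.

```r
library(SparkR)

# Start a local session (assumption: Spark is installed and discoverable).
sparkR.session(master = "local[1]")

# A single numeric column holding 1, 2, ..., 100.
df <- createDataFrame(data.frame(value = as.numeric(1:100)))

# relativeError = 0 requests exact quantiles, which is affordable for 100 rows.
quantiles <- unlist(approxQuantile(df, "value", c(0, 0.5, 1), 0))
print(quantiles)

sparkR.session.stop()
```

Here the first and last results are the column minimum and maximum, since probabilities 0 and 1 were requested. On large data a small positive relativeError, such as 0.01, trades accuracy for speed.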