SparkR (version 2.1.2)

approxQuantile: Calculates the approximate quantiles of a numerical column of a SparkDataFrame


Calculates the approximate quantiles of a numerical column of a SparkDataFrame. The result of this algorithm has the following deterministic bound: if the SparkDataFrame has N elements and we request the quantile at probability p up to error err, then the algorithm returns a sample x from the SparkDataFrame whose *exact* rank is close to (p * N). More precisely, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in "Space-efficient Online Computation of Quantile Summaries" by Greenwald and Khanna.
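To make the rank bound above concrete, here is a minimal sketch in base R (no Spark session needed). The helpers rank_bounds and within_bounds are hypothetical names introduced only for illustration; they are not part of SparkR and simply evaluate the inequality stated in the description.

```r
# Hypothetical helper (not part of SparkR): the interval of exact ranks
# permitted by floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
rank_bounds <- function(n, p, err) {
  stopifnot(n >= 1, p >= 0, p <= 1, err >= 0)
  c(lower = floor((p - err) * n), upper = ceiling((p + err) * n))
}

# Hypothetical helper: TRUE if a given exact rank satisfies the bound.
within_bounds <- function(rank, n, p, err) {
  b <- rank_bounds(n, p, err)
  rank >= b[["lower"]] && rank <= b[["upper"]]
}

# Median of 1000 rows with a deliberately large error of 0.125.
bounds <- rank_bounds(n = 1000, p = 0.5, err = 0.125)
print(bounds)
```

With N = 1000, p = 0.5 and err = 0.125, any returned value whose exact rank lies between 375 and 625 satisfies the bound. With err = 0 the interval collapses to the single rank p * N, rounded down and up.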


# S4 method for SparkDataFrame,character,numeric,numeric
approxQuantile(x, col,
  probabilities, relativeError)



x
A SparkDataFrame.

col
The name of the numerical column.

probabilities
A numeric vector of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.

relativeError
The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
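As a local analogy for how the probabilities are interpreted, the sketch below uses base R's stats::quantile on an ordinary numeric vector rather than a SparkDataFrame. This is an assumption-laden illustration, not SparkR's implementation: type = 1 is chosen because it returns an actual element of the data, which loosely mirrors how approxQuantile returns a sample from the column. The vector values is made-up example data.

```r
# Made-up sample data; in SparkR this would be a column of a SparkDataFrame.
values <- c(3, 1, 4, 1, 5, 9, 2, 6, 5)

# Probabilities must lie in [0, 1]: 0 is the minimum, 0.5 the median,
# and 1 the maximum.
probs <- c(0, 0.5, 1)

# type = 1 (inverse empirical CDF) always returns an observed value.
# This is base R, used only as an analogy for exact quantiles.
q <- quantile(values, probs = probs, type = 1, names = FALSE)
print(q)

# The first and last results coincide with min() and max().
stopifnot(q[1] == min(values), q[3] == max(values))
```

The exact computation here is cheap because the data are local and small; on a distributed SparkDataFrame, requesting exact quantiles (relativeError of zero) can require far more work, which is why a small positive relativeError is often a reasonable trade-off.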


The approximate quantiles at the given probabilities.

See Also

Other stat functions: corr, cov, crosstab, freqItems, sampleBy


Examples
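# Load a SparkDataFrame from a JSON file (placeholder path).
df <- read.json("/path/to/file.json")
# Median and 80th percentile of column "key". A relativeError of 0.0
# requests exact quantiles, which could be very expensive.
quantiles <- approxQuantile(df, "key", c(0.5, 0.8), 0.0)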
