SparkR (version 3.1.2)

column_aggregate_functions: Aggregate functions for Column operations

Description

Aggregate functions defined for Column.

Usage

approx_count_distinct(x, ...)

approxCountDistinct(x, ...)

collect_list(x)

collect_set(x)

countDistinct(x, ...)

grouping_bit(x)

grouping_id(x, ...)

kurtosis(x)

n_distinct(x, ...)

percentile_approx(x, percentage, ...)

sd(x, na.rm = FALSE)

skewness(x)

stddev(x)

stddev_pop(x)

stddev_samp(x)

sumDistinct(x)

var(x, y = NULL, na.rm = FALSE, use)

variance(x)

var_pop(x)

var_samp(x)

# S4 method for Column approx_count_distinct(x, rsd = 0.05)

# S4 method for Column approxCountDistinct(x, rsd = 0.05)

# S4 method for Column kurtosis(x)

# S4 method for Column max(x)

# S4 method for Column mean(x)

# S4 method for Column min(x)

# S4 method for characterOrColumn,numericOrColumn percentile_approx(x, percentage, accuracy = 10000)

# S4 method for Column sd(x)

# S4 method for Column skewness(x)

# S4 method for Column stddev(x)

# S4 method for Column stddev_pop(x)

# S4 method for Column stddev_samp(x)

# S4 method for Column sum(x)

# S4 method for Column sumDistinct(x)

# S4 method for Column var(x)

# S4 method for Column variance(x)

# S4 method for Column var_pop(x)

# S4 method for Column var_samp(x)

# S4 method for Column approx_count_distinct(x, rsd = 0.05)

# S4 method for Column approxCountDistinct(x, rsd = 0.05)

# S4 method for Column countDistinct(x, ...)

# S4 method for Column n_distinct(x, ...)

# S4 method for Column collect_list(x)

# S4 method for Column collect_set(x)

# S4 method for Column grouping_bit(x)

# S4 method for Column grouping_id(x, ...)

Arguments

x

Column to compute on.

...

additional argument(s). For example, it could be used to pass additional Columns.

percentage

Numeric percentage at which percentile should be computed All values should be between 0 and 1. If length equals to 1 resulting column is of type double, otherwise, array type of double.

y, na.rm, use

currently not used.

rsd

maximum relative standard deviation allowed (default = 0.05).

accuracy

A positive numeric literal (default: 10000) which controls approximation accuracy at the cost of memory. Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation.

Details

approx_count_distinct: Returns the approximate number of distinct items in a group.

approxCountDistinct: Returns the approximate number of distinct items in a group.

kurtosis: Returns the kurtosis of the values in a group.

max: Returns the maximum value of the expression in a group.

mean: Returns the average of the values in a group. Alias for avg.

min: Returns the minimum value of the expression in a group.

percentile_approx Returns the approximate percentile of the numeric column col which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.

sd: Alias for stddev_samp.

skewness: Returns the skewness of the values in a group.

stddev: Alias for std_dev.

stddev_pop: Returns the population standard deviation of the expression in a group.

stddev_samp: Returns the unbiased sample standard deviation of the expression in a group.

sum: Returns the sum of all values in the expression.

sumDistinct: Returns the sum of distinct values in the expression.

var: Alias for var_samp.

var_pop: Returns the population variance of the values in a group.

var_samp: Returns the unbiased variance of the values in a group.

countDistinct: Returns the number of distinct items in a group.

n_distinct: Returns the number of distinct items in a group.

collect_list: Creates a list of objects with duplicates. Note: the function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

collect_set: Creates a list of objects with duplicate elements eliminated. Note: the function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

grouping_bit: Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set. Same as GROUPING in SQL and grouping function in Scala.

grouping_id: Returns the level of grouping. Equals to grouping_bit(c1) * 2^(n - 1) + grouping_bit(c2) * 2^(n - 2) + ... + grouping_bit(cn) .

See Also

Other aggregate functions: avg(), corr(), count(), cov(), first(), last()

Examples

Run this code
# NOT RUN {
# Dataframe used throughout this doc
df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
# }
# NOT RUN {
# }
# NOT RUN {
head(select(df, approx_count_distinct(df$gear)))
head(select(df, approx_count_distinct(df$gear, 0.02)))
head(select(df, countDistinct(df$gear, df$cyl)))
head(select(df, n_distinct(df$gear)))
head(distinct(select(df, "gear")))
# }
# NOT RUN {
# }
# NOT RUN {
head(select(df, mean(df$mpg), sd(df$mpg), skewness(df$mpg), kurtosis(df$mpg)))
# }
# NOT RUN {
# }
# NOT RUN {
head(select(df, avg(df$mpg), mean(df$mpg), sum(df$mpg), min(df$wt), max(df$qsec)))

# metrics by num of cylinders
tmp <- agg(groupBy(df, "cyl"), avg(df$mpg), avg(df$hp), avg(df$wt), avg(df$qsec))
head(orderBy(tmp, "cyl"))

# car with the max mpg
mpg_max <- as.numeric(collect(agg(df, max(df$mpg))))
head(where(df, df$mpg == mpg_max))
# }
# NOT RUN {
# }
# NOT RUN {
head(select(df, sd(df$mpg), stddev(df$mpg), stddev_pop(df$wt), stddev_samp(df$qsec)))
# }
# NOT RUN {
# }
# NOT RUN {
head(select(df, sumDistinct(df$gear)))
head(distinct(select(df, "gear")))
# }
# NOT RUN {
# }
# NOT RUN {
head(agg(df, var(df$mpg), variance(df$mpg), var_pop(df$mpg), var_samp(df$mpg)))
# }
# NOT RUN {
# }
# NOT RUN {
df2 = df[df$mpg > 20, ]
collect(select(df2, collect_list(df2$gear)))
collect(select(df2, collect_set(df2$gear)))
# }
# NOT RUN {
# }
# NOT RUN {
# With cube
agg(
  cube(df, "cyl", "gear", "am"),
  mean(df$mpg),
  grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
)

# With rollup
agg(
  rollup(df, "cyl", "gear", "am"),
  mean(df$mpg),
  grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
)
# }
# NOT RUN {
# }
# NOT RUN {
# With cube
agg(
  cube(df, "cyl", "gear", "am"),
  mean(df$mpg),
  grouping_id(df$cyl, df$gear, df$am)
)

# With rollup
agg(
  rollup(df, "cyl", "gear", "am"),
  mean(df$mpg),
  grouping_id(df$cyl, df$gear, df$am)
)
# }

Run the code above in your browser using DataLab