fvar
and fsd
are generic functions that compute the (column-wise) variance and standard deviation of x
, (optionally) grouped by g
and/or frequency-weighted by w
. The TRA
argument can further be used to transform x
using its (grouped, weighted) variance/sd.
fvar(x, …)
fsd(x, …)# S3 method for default
fvar(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = TRUE, stable.algo = TRUE, …)
# S3 method for default
fsd(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = TRUE, stable.algo = TRUE, …)
# S3 method for matrix
fvar(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = TRUE, drop = TRUE, stable.algo = TRUE, …)
# S3 method for matrix
fsd(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = TRUE, drop = TRUE, stable.algo = TRUE, …)
# S3 method for data.frame
fvar(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = TRUE, drop = TRUE, stable.algo = TRUE, …)
# S3 method for data.frame
fsd(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = TRUE, drop = TRUE, stable.algo = TRUE, …)
# S3 method for grouped_df
fvar(x, w = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE,
stable.algo = TRUE, …)
# S3 method for grouped_df
fsd(x, w = NULL, TRA = NULL, na.rm = TRUE,
use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE,
stable.algo = TRUE, …)
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').
a numeric vector of (non-negative) weights, may contain missing values.
an integer or quoted operator indicating the transformation to perform:
1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA
.
logical. Skip missing values in x
. Defaults to TRUE
and implemented at very little computational cost. If na.rm = FALSE
a NA
is returned when encountered.
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's.
matrix and data.frame method: Logical. TRUE
drops dimensions and returns an atomic vector if g = NULL
and TRA = NULL
.
grouped_df method: Logical. FALSE
removes grouping variables after computation.
grouped_df method: Logical. Retain summed weighting variable after computation (if contained in grouped_df
).
logical. TRUE
(default) use Welford's numerically stable online algorithm. FALSE
implements a faster but numerically unstable one-pass method. See Details.
arguments to be passed to or from other methods.
fvar
returns the variance of x
, grouped by g
, or (if TRA
is used) x
transformed by its variance, grouped by g
. fsd
computes the standard deviation of x
in like manor.
Welford's online algorithm used by default to compute the variance is well described here (the section Weighted incremental algorithm also shows how the weighted variance is obtained by this algorithm).
If stable.algo = FALSE
, the variance is computed in one-pass as (sum(x^2)-n*mean(x)^2)/(n-1)
, where sum(x^2)
is the sum of squares from which the expected sum of squares n*mean(x)^2
is subtracted, normalized by n-1
(Bessel's correction). This is numerically unstable if sum(x^2)
and n*mean(x)^2
are large numbers very close together, which will be the case for large n
, large x
-values and small variances (catastrophic cancellation occurs, leading to a loss of numeric precision). Numeric precision is however still maximized through the internal use of long doubles in C++, and the fast algorithm can be up to 4-times faster compared to Welford's method.
The weighted variance is computed with frequency weights as (sum(x^2*w)-sum(w)*weighted.mean(x,w)^2)/(sum(w)-1)
. If na.rm = TRUE
, missing values will be removed from both x
and w
i.e. utilizing only x[complete.cases(x,w)]
and w[complete.cases(x,w)]
.
Missing-value removal as controlled by the na.rm
argument is done very efficiently by simply skipping the values (thus setting na.rm = FALSE
on data with no missing values doesn't give extra speed). Large performance gains can nevertheless be achieved in the presence of missing values if na.rm = FALSE
, since then the corresponding computation is terminated once a NA
is encountered and NA
is returned.
This all seamlessly generalizes to grouped computations, which are performed in a single pass (without splitting the data) and therefore extremely fast.
When applied to data frames with groups or drop = FALSE
, fvar/fsd
preserves all column attributes (such as variable labels) but does not distinguish between classed and unclassed object (thus applying fvar/fsd
to a factor column will give a 'malformed factor' error). The attributes of the data frame itself are also preserved.
Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics. 4 (3): 419-420. doi:10.2307/1266577.
# NOT RUN {
## default vector method
fvar(mtcars$mpg) # Simple variance (all examples also hold for fvar!)
fsd(mtcars$mpg) # Simple standard deviation
fsd(mtcars$mpg, w = mtcars$hp) # Weighted sd: Weighted by hp
fsd(mtcars$mpg, TRA = "/") # Simple transformation: scaling (See also ?fscale)
fsd(mtcars$mpg, mtcars$cyl) # Grouped sd
fsd(mtcars$mpg, mtcars$cyl, mtcars$hp) # Grouped weighted sd
fsd(mtcars$mpg, mtcars$cyl, TRA = "/") # Scaling by group
fsd(mtcars$mpg, mtcars$cyl, mtcars$hp, "/") # Group-scaling using weighted group sds
## data.frame method
fsd(iris) # This works, although 'Species' is a factor variable
fsd(mtcars, drop = FALSE) # This works, all columns are numeric variables
fsd(iris[-5], iris[5]) # By Species: iris[5] is still a list, and thus passed to GRP()
fsd(iris[-5], iris[[5]]) # Same thing much faster: fsd recognizes 'Species' is a factor
head(fsd(iris[-5], iris[[5]], TRA = "/")) # Data scaled by species (see also fscale)
## matrix method
m <- qM(mtcars)
fsd(m)
fsd(m, mtcars$cyl) # etc..
# }
# NOT RUN {
<!-- % No code relying on suggested package -->
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars %>% group_by(cyl,vs,am) %>% fsd()
mtcars %>% group_by(cyl,vs,am) %>% fsd(keep.group_vars = FALSE) # Remove grouping columns
mtcars %>% group_by(cyl,vs,am) %>% fsd(hp) # Weighted by hp
mtcars %>% group_by(cyl,vs,am) %>% fsd(hp, "/") # Weighted scaling transformation
# }
Run the code above in your browser using DataLab