skim: Skim a data frame, getting useful summary statistics

Description

skim() is an alternative to summary(), quickly providing a broad overview of a data frame. It handles data of all types, dispatching a different set of summary functions based on the types of columns in the data frame.

Usage

skim(.data, ...)
skim_tee(.data, ...)

Arguments

.data

A tibble, or an object that can be coerced into a tibble.

...

Additional options, normally used to list individual unquoted column names.

Value

A skim_df object, which can be treated like a tibble in most instances.

Customizing skim

skim() is an intentionally simple function with few arguments, like summary(). Nonetheless, this package provides two broad ways to customize its behavior: skim_with() changes the functions that are called to produce summary statistics, and skim_format() changes how the output is displayed.
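
As a rough sketch of these two approaches, the following assumes the skimr 1.x interface, in which skim_with() and skim_format() change package-wide settings (other releases may behave differently). The statistic names med and iqr are illustrative labels, not package defaults.

```r
library(skimr)

# Assumption: skimr 1.x, where skim_with() modifies global settings.
# Replace (append = FALSE) the default numeric summaries with two statistics.
skim_with(numeric = list(med = stats::median, iqr = stats::IQR), append = FALSE)

# Pass a digits option to the formatting of numeric values.
skim_format(numeric = list(digits = 3))

custom <- skim(iris, Sepal.Length)
print(custom)

# Restore the package defaults so later calls are unaffected.
skim_with_defaults()
skim_format_defaults()
```

Resetting with skim_with_defaults() and skim_format_defaults() at the end keeps the customization from leaking into later skim() calls in the same session.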

Unicode rendering

If the rendered examples show unencoded values such as <U+2587>, you will need to change your locale to allow proper rendering. Please review the Using Skimr vignette for more information (vignette("Using_skimr", package = "skimr")).

Details

Each call produces a skim_df, which is fundamentally a tibble with a special print method. Instead of showing the result in a long format, skim prints a wide version of your data with formatting applied to each column. Printing does not change the structure of the skim_df, which remains a long tibble.

If you just want to see the printed output, call skim_tee() instead. This function returns the original data frame.
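
A minimal sketch of that behavior, using the built-in chickwts data set; the variable name returned is only illustrative:

```r
library(skimr)

# skim_tee() prints the summary as a side effect...
returned <- skim_tee(chickwts)

# ...and hands back its input, so it can sit in the middle of a pipeline.
identical(returned, chickwts)
```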

If you want to work with a data frame that resembles the printed output, call skim_to_wide(). For a named list of data frames, one per column type, call skim_to_list(). Note that all of the columns in the data frames produced by these functions are character. The intent is that you will be processing the printed result further, not the original data.
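
The sketch below illustrates the character-column caveat. It assumes skimr 1.x, where both helpers are available, and that the numeric summary includes a mean column, as in the default settings.

```r
library(skimr)

# Assumption: skimr 1.x, where skim_to_wide() and skim_to_list() exist.
wide <- skim_to_wide(iris)

# Inspect the class of each column; the statistics are stored as text.
vapply(wide, function(col) class(col)[1], character(1))

# One data frame per column type, named by type.
by_type <- skim_to_list(iris)
names(by_type)

# Convert a statistic back to numeric before computing on it.
as.numeric(by_type$numeric$mean)
```

If you need the unformatted numbers, working from the long skim_df returned by skim() avoids the round trip through character values.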

skim() is designed to operate in pipes and to generally play nicely with other tidyverse functions. This means that you can use tidyselect helpers within skim to select or drop specific columns for summary. You can also further work with a skim_df using dplyr functions in a pipeline.

Examples

# NOT RUN {
skim(iris)

# Use tidyselect
skim(iris, Species)
skim(iris, starts_with("Sepal"))

# Skim also works groupwise
dplyr::group_by(iris, Species) %>% skim()

# Skim pipelines; now we work with the tall format
skim(iris) %>% as.data.frame()
skim(iris) %>% dplyr::filter(type == "factor")

# Which column has the greatest mean value?
skim(iris) %>%
  dplyr::filter(stat == "mean") %>%
  dplyr::arrange(dplyr::desc(value))

# Use skim_tee to view the skim results and
# continue using the original data.
chickwts %>% skim_tee() %>% dplyr::filter(feed == "sunflower")
# }
