naniar v0.4.2

Data Structures, Summaries, and Visualisations for Missing Data

Missing values are ubiquitous in data and need to be explored and handled in the initial stages of analysis. 'naniar' provides data structures and functions that facilitate the plotting of missing values and examination of imputations. This allows missing data dependencies to be explored with minimal deviation from the common work patterns of 'ggplot2' and tidy data.

Readme

naniar


naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data. It does this by providing:

  • Shadow matrices, a tidy data structure for missing data:
    • bind_shadow() and nabular()
  • Shorthand summaries for missing data:
    • n_miss() and n_complete()
    • pct_miss() and pct_complete()
  • Numerical summaries of missing data in variables and cases:
    • miss_var_summary() and miss_var_table()
    • miss_case_summary() and miss_case_table()
  • Visualisation for missing data:
    • geom_miss_point()
    • gg_miss_var()
    • gg_miss_case()
    • gg_miss_fct()

For more details on the workflow and theory underpinning naniar, read the vignette Getting started with naniar.

For a short primer on the data visualisation available in naniar, read the vignette Gallery of Missing Data Visualisations.

Installation

You can install naniar from CRAN:

install.packages("naniar")

Or you can install the development version from GitHub using remotes:

# install.packages("remotes")
remotes::install_github("njtierney/naniar")

A short overview of naniar

Visualising missing data might sound a little strange: how do you visualise something that is not there? One approach comes from ggobi and manet, which replace NA values with values 10% lower than the minimum value in that variable. naniar provides this visualisation through the geom_miss_point() ggplot2 geom, which we illustrate by exploring the relationship between Ozone and Solar radiation in the airquality dataset.

library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()
#> Warning: Removed 42 rows containing missing values (geom_point).

ggplot2 does not display these missing values: it removes the affected rows and warns about them.

We can instead use geom_miss_point() to display the missing data:


library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()

geom_miss_point() has shifted the missing values to now be 10% below the minimum value. The missing values are a different colour so that missingness becomes pre-attentive. As it is a ggplot2 geom, it supports faceting and other ggplot2 features.


p1 <-
ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_miss_point() + 
  facet_wrap(~Month, ncol = 2) + 
  theme(legend.position = "bottom")

p1
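The shift that geom_miss_point() applies can be sketched in base R. This is a rough illustration, not naniar's implementation: shift_missing_below() and its prop_below argument are hypothetical names, and reading "10% below the minimum" as 10% of the observed range is an assumption.

```r
# Hypothetical helper (not part of naniar): replace missing values with a
# value placed a fixed fraction of the observed range below the minimum.
shift_missing_below <- function(x, prop_below = 0.1) {
  observed_range <- range(x, na.rm = TRUE)
  shifted_value <- observed_range[1] - prop_below * diff(observed_range)
  ifelse(is.na(x), shifted_value, x)
}

ozone_shifted <- shift_missing_below(airquality$Ozone)

# Originally missing Ozone values now sit below every observed value
ozone_min <- min(airquality$Ozone, na.rm = TRUE)
all(ozone_shifted[is.na(airquality$Ozone)] < ozone_min)
```

Because the shifted values fall outside the observed range, they can be drawn on the same axis without overlapping real observations.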

Data Structures

naniar provides a data structure for working with missing data, the shadow matrix (Swayne and Buja, 1998). The shadow matrix has the same dimensions as the data and consists of binary indicators of missingness, where missing is represented as “NA” and not missing is represented as “!NA”. Variable names are kept the same, with the suffix “_NA” added.


head(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

as_shadow(airquality)
#> # A tibble: 153 x 6
#>    Ozone_NA Solar.R_NA Wind_NA Temp_NA Month_NA Day_NA
#>    <fct>    <fct>      <fct>   <fct>   <fct>    <fct> 
#>  1 !NA      !NA        !NA     !NA     !NA      !NA   
#>  2 !NA      !NA        !NA     !NA     !NA      !NA   
#>  3 !NA      !NA        !NA     !NA     !NA      !NA   
#>  4 !NA      !NA        !NA     !NA     !NA      !NA   
#>  5 NA       NA         !NA     !NA     !NA      !NA   
#>  6 !NA      NA         !NA     !NA     !NA      !NA   
#>  7 !NA      !NA        !NA     !NA     !NA      !NA   
#>  8 !NA      !NA        !NA     !NA     !NA      !NA   
#>  9 !NA      !NA        !NA     !NA     !NA      !NA   
#> 10 NA       !NA        !NA     !NA     !NA      !NA   
#> # … with 143 more rows

Binding the shadow data to the data helps you keep better track of the missing values. This format is called “nabular”, a portmanteau of NA and tabular. You can bind the shadow to the data using bind_shadow() or nabular():

bind_shadow(airquality)
#> # A tibble: 153 x 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>  
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA    
#>  2    36     118   8      72     5     2 !NA      !NA        !NA    
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA    
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA    
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA    
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA    
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA    
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA    
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA    
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA    
#> # … with 143 more rows, and 3 more variables: Temp_NA <fct>,
#> #   Month_NA <fct>, Day_NA <fct>
nabular(airquality)
#> # A tibble: 153 x 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>  
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA    
#>  2    36     118   8      72     5     2 !NA      !NA        !NA    
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA    
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA    
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA    
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA    
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA    
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA    
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA    
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA    
#> # … with 143 more rows, and 3 more variables: Temp_NA <fct>,
#> #   Month_NA <fct>, Day_NA <fct>
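To make the structure concrete, the shadow matrix and its bound form can be approximated in base R. This is only a sketch under the naming rules described above; make_shadow() is a hypothetical helper and does not reproduce the classes or attributes that naniar attaches.

```r
# Hypothetical helper (not naniar's implementation): one factor column per
# variable, with levels "!NA" (present) and "NA" (missing).
make_shadow <- function(data) {
  shadow <- lapply(data, function(column) {
    factor(ifelse(is.na(column), "NA", "!NA"), levels = c("!NA", "NA"))
  })
  shadow <- as.data.frame(shadow)
  names(shadow) <- paste0(names(data), "_NA")
  shadow
}

shadow <- make_shadow(airquality)

# Column-bind the shadow to the data to get a wide, "nabular"-style table
airquality_nabular <- cbind(airquality, shadow)
dim(airquality_nabular)
```

Keeping the indicators next to the values means any later filter or reorder of rows carries the missingness information along with it.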

Using the nabular format helps you manage where missing values are in your dataset and makes it easy to create visualisations that are split by missingness:


airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             fill = Ozone_NA)) + 
  geom_density(alpha = 0.5)
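A numeric companion to this plot is to summarise Temp separately for rows where Ozone is present and rows where it is missing. The sketch below uses only base R; the labels mirror the shadow levels, and the choice of the mean as the summary is an assumption made for illustration.

```r
# Label each row by whether Ozone is missing, using the shadow-style levels
ozone_status <- factor(is.na(airquality$Ozone),
                       levels = c(FALSE, TRUE),
                       labels = c("!NA", "NA"))

# Mean temperature within each missingness group
temp_by_ozone_status <- tapply(airquality$Temp, ozone_status, mean)
temp_by_ozone_status
```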

And even visualise imputations:


airquality %>%
  bind_shadow() %>%
  simputation::impute_lm(Ozone ~ Temp + Solar.R) %>%
  ggplot(aes(x = Solar.R,
             y = Ozone,
             colour = Ozone_NA)) + 
  geom_point()
#> Warning: Removed 7 rows containing missing values (geom_point).
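The idea behind this pipeline can be sketched without extra packages: fit a linear model on the observed rows, then fill in missing Ozone values with its predictions. This is an illustration in base R, not a description of how simputation::impute_lm() works internally.

```r
# Fit on complete rows (lm() drops rows with missing values by default)
fit <- lm(Ozone ~ Temp + Solar.R, data = airquality)

imputed <- airquality
was_missing <- is.na(imputed$Ozone)

# Predict Ozone where it was missing; rows that also lack Solar.R
# cannot be predicted and stay NA
imputed$Ozone[was_missing] <- predict(fit, newdata = imputed[was_missing, ])

sum(was_missing)            # Ozone values missing before imputation
sum(is.na(imputed$Ozone))   # Ozone values still missing afterwards
```

Keeping was_missing around plays the same role as the Ozone_NA shadow column: it records which values are imputed so they can be coloured differently in a plot.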

Or create an upset plot of the combinations of missingness across cases, using the gg_miss_upset() function:


gg_miss_upset(airquality)
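The combinations that an upset plot displays can also be tabulated directly. The following base-R sketch counts how many rows share each pattern of missing variables; the "complete" label for rows without missing values is an assumption made for readability.

```r
missing_flags <- is.na(airquality)

# Describe each row by the set of variables that are missing in it
pattern <- apply(missing_flags, 1, function(row_flags) {
  if (!any(row_flags)) return("complete")
  paste(colnames(missing_flags)[row_flags], collapse = " + ")
})

# Number of rows per missingness pattern, most common first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```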

naniar does this while following consistent principles that are easy to read, thanks to the tools of the tidyverse.

naniar also provides handy visualisations for each variable:


gg_miss_var(airquality)

Or the number of missings in a given variable at a repeating span:

gg_miss_span(pedestrian,
             var = hourly_counts,
             span_every = 1500)
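The counting behind a span plot can be approximated in base R. The sketch below uses airquality rather than pedestrian so that it needs no extra packages, and the span length of 30 rows is an arbitrary choice for illustration.

```r
span_length <- 30  # arbitrary span size chosen for this sketch

# Assign consecutive rows to spans: rows 1-30 are span 1, 31-60 span 2, ...
span_id <- (seq_len(nrow(airquality)) - 1) %/% span_length + 1

# Number of missing Ozone values within each span
ozone_missing_by_span <- tapply(is.na(airquality$Ozone), span_id, sum)
ozone_missing_by_span
```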

You can read about all of the visualisations in naniar in the vignette Gallery of missing data visualisations using naniar.

naniar also provides handy helpers for calculating the number, proportion, and percentage of missing and complete observations:

n_miss(airquality)
#> [1] 44
n_complete(airquality)
#> [1] 874
prop_miss(airquality)
#> [1] 0.04793028
prop_complete(airquality)
#> [1] 0.9520697
pct_miss(airquality)
#> [1] 4.793028
pct_complete(airquality)
#> [1] 95.20697
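For comparison, the same quantities can be computed in base R from the logical matrix returned by is.na(). This is a sketch of the arithmetic, not naniar's code.

```r
missing_flags <- is.na(airquality)

n_missing    <- sum(missing_flags)    # count of missing values
n_present    <- sum(!missing_flags)   # count of complete values
prop_missing <- mean(missing_flags)   # share of all cells that are missing
pct_missing  <- 100 * prop_missing    # the same share as a percentage

c(n_missing = n_missing, n_present = n_present,
  prop_missing = prop_missing, pct_missing = pct_missing)
```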

Numerical summaries for missing data

naniar provides numerical summaries of missing data that follow a consistent naming rule: they all begin with miss_. Summaries that focus on variables, or on a single selected variable, start with miss_var_, and summaries for cases (the rows of the data, in their original collected order) start with miss_case_. All of these functions that return dataframes also work with dplyr’s group_by().

For example, we can look at the number and percent of missings in each variable and case with miss_var_summary() and miss_case_summary(), which both return output ordered by the number of missing values.


miss_var_summary(airquality)
#> # A tibble: 6 x 3
#>   variable n_miss pct_miss
#>   <chr>     <int>    <dbl>
#> 1 Ozone        37    24.2 
#> 2 Solar.R       7     4.58
#> 3 Wind          0     0   
#> 4 Temp          0     0   
#> 5 Month         0     0   
#> 6 Day           0     0
miss_case_summary(airquality)
#> # A tibble: 153 x 3
#>     case n_miss pct_miss
#>    <int>  <int>    <dbl>
#>  1     5      2     33.3
#>  2    27      2     33.3
#>  3     6      1     16.7
#>  4    10      1     16.7
#>  5    11      1     16.7
#>  6    25      1     16.7
#>  7    26      1     16.7
#>  8    32      1     16.7
#>  9    33      1     16.7
#> 10    34      1     16.7
#> # … with 143 more rows
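The numbers in these summaries come from column-wise and row-wise counts of missing values. A base-R sketch that produces similar tables follows; it returns plain data frames rather than tibbles and is not naniar's implementation.

```r
missing_flags <- is.na(airquality)

# Per-variable summary: count and percentage of missing values
var_summary <- data.frame(
  variable = colnames(missing_flags),
  n_miss   = colSums(missing_flags),
  pct_miss = 100 * colMeans(missing_flags),
  row.names = NULL
)
var_summary <- var_summary[order(-var_summary$n_miss), ]

# Per-case summary: count and percentage of missing values in each row
case_summary <- data.frame(
  case     = seq_len(nrow(missing_flags)),
  n_miss   = rowSums(missing_flags),
  pct_miss = 100 * rowMeans(missing_flags)
)
case_summary <- case_summary[order(-case_summary$n_miss), ]

head(var_summary)
head(case_summary)
```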

You can also use group_by() to work out the number of missings in each variable within each level of a grouping variable.


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
airquality %>%
  group_by(Month) %>%
  miss_var_summary()
#> # A tibble: 25 x 4
#>    Month variable n_miss pct_miss
#>    <int> <chr>     <int>    <dbl>
#>  1     5 Ozone         5     16.1
#>  2     5 Solar.R       4     12.9
#>  3     5 Wind          0      0  
#>  4     5 Temp          0      0  
#>  5     5 Day           0      0  
#>  6     6 Ozone        21     70  
#>  7     6 Solar.R       0      0  
#>  8     6 Wind          0      0  
#>  9     6 Temp          0      0  
#> 10     6 Day           0      0  
#> # … with 15 more rows
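A grouped count of this kind can be sketched in base R by splitting the data on the grouping variable and counting missing values within each piece. Unlike the output above, this sketch returns a wide matrix and keeps the grouping variable Month as a row.

```r
# One column per month, one row per variable, cells hold missing-value counts
missing_by_month <- sapply(
  split(airquality, airquality$Month),
  function(month_data) colSums(is.na(month_data))
)

missing_by_month
```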

You can read more about all of these functions in the vignette “Getting Started with naniar”.

Contributions

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Future Work

  • Extend the geom_miss_* family to include categorical variables and bivariate plots (scatterplots, density overlays)
  • SQL translation for databases
  • Big Data tools (sparklyr, sparklingwater)
  • Work well with other imputation engines / processes
  • Provide tools for assessing goodness of fit for classical approaches of MCAR, MAR, and MNAR (graphical inference from nullabor package)

Acknowledgements

Firstly, thanks to Di Cook for giving the initial inspiration for the package and laying down the rich theory and literature that the work in naniar is built upon. Naming credit (once again!) goes to Miles McBain. Among various other things, Miles also worked out how to overload the missing data and make it work as a geom. Thanks also to Colin Fay for helping me understand tidy evaluation and for features such as replace_to_na, miss_*_cumsum, and more.

A note on the name

naniar was previously named ggmissing and initially provided a ggplot geom and some other visualisations. ggmissing was changed to naniar to reflect the fact that this package is going to be bigger in scope, and is not just related to ggplot2. Specifically, the package is designed to provide a suite of tools for visualising missing values and imputations, and for manipulating and summarising missing data.

…But why naniar?

Well, I think it is useful to think of missing values in data being like this other dimension, perhaps like C.S. Lewis’s Narnia - a different world, hidden away. You go inside, and sometimes it seems like you’ve spent no time in there but time has passed very quickly, or the opposite. Also, NAniar = na in r, and if you so desire, naniar may sound like “noneoya” in an nz/aussie accent. Full credit to @MilesMcbain for the name, and @Hadley for the rearranged spelling.

Functions in naniar

Name Description
cast_shadow Add a shadow column to a dataset
add_label_missings Add a column describing if there are any missings in the dataset
gg_miss_upset Plot the pattern of missingness using an upset plot.
geom_miss_point geom_miss_point
gg_miss_var Plot the number of missings for each variable
miss-complete-var-pct Percentage of variables containing missings or complete values
miss_var_run Find the number of missing and complete values in a single run
miss_var_span Summarise the number of missings for a given repeating span on a variable
gather_shadow Long form representation of a shadow matrix
miss-complete-case-prop Proportion of cases that contain missing or complete values.
n_complete_row Return a vector of the number of complete values in each row
as_shadow_upset Convert data into shadow format for doing an upset plot
prop_miss Return the proportion of missing values
bind_shadow Bind a shadow dataframe to original data
shadow_long Reshape shadow data into a long format
prop_miss_row Return a vector of the proportion of missing values in each row
unbinders Unbind (remove) shadow from data, and vice versa
shadow_shift Shift missing values to facilitate missing data exploration/visualisation
n_miss Return the number of missing values
update_shadow Expand all shadow levels
any-na Identify if there are any missing or complete values
any_row_miss Helper function to determine whether there are any missings
which_na Which elements contain missings?
impute_median Impute the median value into a vector with missing values
cast_shadow_shift Add a shadow and a shadow_shift column to a dataset
impute_mean Impute the mean value into a vector with missing values
is_shade Detect if this is a shade
cast_shadow_shift_label Add a shadow column and a shadow shifted column to a dataset
impute_below_if Scoped variants of impute_below
common_na_numbers Common number values for NA
gg_miss_fct Plot the number of missings for each variable, broken down by a factor
as_shadow Create shadows
all_row_complete Helper function to determine whether all rows are complete
gg_miss_span Plot the number of missings in a given repeating span
is_shadow Test if input is or are shadow variables
label_miss_1d Label a missing from one column
gg_miss_case_cumsum Plot of cumulative sum of missing for cases
as_shadow.data.frame Create shadow data
gg_miss_case Plot the number of missings per case (row)
miss-complete-var-prop Proportion of variables containing missings or complete values
all_row_miss Helper function to determine whether all rows are missing
miss_var_summary Summarise the missingness in each variable
miss_case_summary Summarise the missingness in each case
group_by_fun Group By Helper
new_shade Create a new shade factor
new_nabular Create a new nabular format
n-var-case-miss The number of variables or cases with missing values
n_complete Return the number of complete values
recode_shadow Add special missing values to the shadow matrix
miss_var_table Tabulate the missings in the variables
common_na_strings Common string values for NA
impute_below_all Impute data with values shifted 10% below range.
impute_below Impute data with values shifted 10% below range.
label_miss_2d label_miss_2d
label_missings Is there a missing value in the row of a dataframe?
gg_miss_which Plot which variables contain a missing value
gg_miss_var_cumsum Plot of cumulative sum of missing value for each variable
label_shadow Label shadow values as missing or not missing
draw_key Key drawing functions
reexports Objects exported from other packages
impute_below_at Scoped variants of impute_below
n_miss_row Return a vector of the number of missing values in each row
miss_case_table Tabulate missings in cases.
test_if_null Test if the input is NULL
nabular Convert data into nabular form by binding shade to it
new_shadow Create a new shadow
miss-complete-case-pct Percentage of cases that contain missing or complete values.
prop-miss-complete-case Proportion of cases that contain missing or complete values.
oceanbuoys West Pacific Tropical Atmosphere Ocean Data, 1993 & 1997.
miss_prop_summary Proportions of missings in data, variables, and cases.
GeomMissPoint naniar-ggproto
pct-miss-complete-case Percentage of cases that contain missing or complete values.
test_if_shadow Test if input is a shadow
naniar naniar
prop-miss-complete-var Proportion of variables containing missings or complete values
pct-miss-complete-var Percentage of variables containing missings or complete values
pct_complete Return the percent of complete values
replace_with_na_if Replace values with NA based on some condition, for variables that meet some predicate
shade Create new levels of missing
miss_scan_count Search and present different kinds of missing values
what_levels check the levels of many things
riskfactors The Behavioral Risk Factor Surveillance System (BRFSS) Survey Data, 2009.
pct_miss Return the percent of missing values
replace_to_na Replace values with missings
replace_with_na Replace values with missings
miss_summary Collate summary measures from naniar into one tibble
where Split a call into two components with a useful verb name
replace_with_na_all Replace all values with NA where a certain condition is met
miss_var_which Which variables contain missing values?
n-var-case-complete The number of variables with complete values
replace_with_na_at Replace specified variables with NA where a certain condition is met
shadow_expand_relevel Expand and relevel a shadow column with a new suffix
pedestrian Pedestrian count information around Melbourne for 2016
shadow_shift.numeric Shift (impute) numeric values for graphical exploration
plotly_helpers Plotly helpers (Convert a geom to a "basic" geom.)
prop_complete Return the proportion of complete values
prop_complete_row Return a vector of the proportion of complete values in each row
stat_miss_point stat_miss_point
scoped-impute_mean Scoped variants of impute_mean
scoped-impute_median Scoped variants of impute_median
test_if_dataframe Test if input is a data.frame
test_if_missing Test if the input is Missing
where_na Which rows and cols contain missings?
which_are_shade Which variables are shades?
add_miss_cluster Add a column that tells us which "missingness cluster" a row belongs to
add_any_miss Add a column describing presence of any missing values
add_shadow Add a shadow column to dataframe
add_span_counter Add a counter variable for a span of dataframe
add_prop_miss Add column containing proportion of missing data values
add_label_shadow Add a column describing whether there is a shadow
add_n_miss Add column containing number of missing data values
add_shadow_shift Add a shadow shifted column to a dataset
all-is-miss-complete Identify if all values are missing or complete

Vignettes of naniar

Name
exploring-imputed-values.Rmd
getting-started-w-naniar.Rmd
missingness-data-structures.png
naniar-visualisation.Rmd
replace-with-na.Rmd
special-missing-values.Rmd

Details

Type Package
License MIT + file LICENSE
LazyData TRUE
ByteCompile TRUE
VignetteBuilder knitr
Collate 'add-cols.R' 'add-n-prop-miss.R' 'cast-shadows.R' 'data-common-na-numbers.R' 'data-common-na-strings.R' 'data-oceanbuoys.R' 'data-pedestrian.R' 'data-riskfactors.R' 'legend-draw.R' 'geom-miss-point.R' 'geom2plotly.R' 'gg-miss-case-cumsum.R' 'gg-miss-case.R' 'gg-miss-fct.R' 'gg-miss-span.R' 'gg-miss-upset.R' 'gg-miss-var-cumsum.R' 'gg-miss-var.R' 'gg-miss-which.R' 'helpers.R' 'impute-median.R' 'impute_below.R' 'impute_mean.R' 'label-miss.R' 'miss-complete-x-pct-prop.R' 'miss-prop-pct-summary.R' 'miss-scan-count.R' 'miss-x-cumsum.R' 'miss-x-run.R' 'miss-x-span.R' 'miss-x-summary.R' 'miss-x-table.R' 'n-prop-miss-complete-rows.R' 'n-prop-miss-complete.R' 'n-var-miss.R' 'nabular.R' 'naniar-ggproto.R' 'naniar-package.R' 'prop-pct-var-case-miss-complete.R' 'replace-to-na.R' 'replace-with-na.R' 'scoped-replace-with-na.R' 'shade.R' 'shadow-recode.R' 'shadow-shifters.R' 'shadow-verifiers.R' 'shadows.R' 'stat-miss-point.R' 'utils.R' 'where-na.R'
URL https://github.com/njtierney/naniar
BugReports https://github.com/njtierney/naniar/issues
Encoding UTF-8
RoxygenNote 6.1.1
Language en-US
NeedsCompilation no
Packaged 2019-02-15 02:56:55 UTC; ntie0001
Repository CRAN
Date/Publication 2019-02-15 14:30:03 UTC
