
datawizard: Easy Data Wrangling

Hockety pockety wockety wack, prepare this data forth and back :sparkles:

datawizard is a lightweight package to easily manipulate, clean, transform, and prepare your data for analysis.

Installation

Type         Source       Command
Release      CRAN         install.packages("datawizard")
Development  r-universe   install.packages("datawizard", repos = "https://easystats.r-universe.dev")
Development  GitHub       remotes::install_github("easystats/datawizard")
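
Once installed, the package is loaded like any other R package:

library(datawizard)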

Citation

To cite the package, run the following command:

citation("datawizard")

To cite datawizard in publications use:

  Makowski, Lüdecke, Patil, Ben-Shachar, & Wiernik (2021). datawizard:
  Easy Data Wrangling. CRAN. Available from
  https://easystats.github.io/datawizard/

A BibTeX entry for LaTeX users is

  @Article{,
    title = {datawizard: Easy Data Wrangling},
    author = {Dominique Makowski and Daniel Lüdecke and Indrajeet Patil and Mattan S. Ben-Shachar and Brenton M. Wiernik},
    journal = {CRAN},
    year = {2021},
    note = {R package},
    url = {https://easystats.github.io/datawizard/},
  }

Features

Data wrangling

Select, filter and remove variables

The package provides helpers to filter rows meeting certain conditions…

data_match(mtcars, data.frame(vs = 0, am = 1))
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

… or logical expressions:

data_filter(mtcars, vs == 0 & am == 1)
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Finding columns in a data frame, or retrieving the data of selected columns, can be achieved using find_columns() or get_columns():

# find column names matching a pattern
find_columns(iris, starts_with("Sepal"))
#> [1] "Sepal.Length" "Sepal.Width"

# return data columns matching a pattern
get_columns(iris, starts_with("Sepal")) |> head()
#>   Sepal.Length Sepal.Width
#> 1          5.1         3.5
#> 2          4.9         3.0
#> 3          4.7         3.2
#> 4          4.6         3.1
#> 5          5.0         3.6
#> 6          5.4         3.9

It is also possible to extract one or more variables:

# single variable
data_extract(mtcars, "gear")
#>  [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4

# more variables
head(data_extract(iris, ends_with("Width")))
#>   Sepal.Width Petal.Width
#> 1         3.5         0.2
#> 2         3.0         0.2
#> 3         3.2         0.2
#> 4         3.1         0.2
#> 5         3.6         0.2
#> 6         3.9         0.4

Due to the consistent API, removing variables is just as simple:

head(data_remove(iris, starts_with("Sepal")))
#>   Petal.Length Petal.Width Species
#> 1          1.4         0.2  setosa
#> 2          1.4         0.2  setosa
#> 3          1.3         0.2  setosa
#> 4          1.5         0.2  setosa
#> 5          1.4         0.2  setosa
#> 6          1.7         0.4  setosa

Reorder or rename

head(data_relocate(iris, select = "Species", before = "Sepal.Length"))
#>   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1  setosa          5.1         3.5          1.4         0.2
#> 2  setosa          4.9         3.0          1.4         0.2
#> 3  setosa          4.7         3.2          1.3         0.2
#> 4  setosa          4.6         3.1          1.5         0.2
#> 5  setosa          5.0         3.6          1.4         0.2
#> 6  setosa          5.4         3.9          1.7         0.4
head(data_rename(iris, c("Sepal.Length", "Sepal.Width"), c("length", "width")))
#>   length width Petal.Length Petal.Width Species
#> 1    5.1   3.5          1.4         0.2  setosa
#> 2    4.9   3.0          1.4         0.2  setosa
#> 3    4.7   3.2          1.3         0.2  setosa
#> 4    4.6   3.1          1.5         0.2  setosa
#> 5    5.0   3.6          1.4         0.2  setosa
#> 6    5.4   3.9          1.7         0.4  setosa

Merge

x <- data.frame(a = 1:3, b = c("a", "b", "c"), c = 5:7, id = 1:3)
y <- data.frame(c = 6:8, d = c("f", "g", "h"), e = 100:102, id = 2:4)

x
#>   a b c id
#> 1 1 a 5  1
#> 2 2 b 6  2
#> 3 3 c 7  3
y
#>   c d   e id
#> 1 6 f 100  2
#> 2 7 g 101  3
#> 3 8 h 102  4

data_merge(x, y, join = "full")
#>    a    b c id    d   e
#> 3  1    a 5  1 <NA>  NA
#> 1  2    b 6  2    f 100
#> 2  3    c 7  3    g 101
#> 4 NA <NA> 8  4    h 102

data_merge(x, y, join = "left")
#>   a b c id    d   e
#> 3 1 a 5  1 <NA>  NA
#> 1 2 b 6  2    f 100
#> 2 3 c 7  3    g 101

data_merge(x, y, join = "right")
#>    a    b c id d   e
#> 1  2    b 6  2 f 100
#> 2  3    c 7  3 g 101
#> 3 NA <NA> 8  4 h 102

data_merge(x, y, join = "semi", by = "c")
#>   a b c id
#> 2 2 b 6  2
#> 3 3 c 7  3

data_merge(x, y, join = "anti", by = "c")
#>   a b c id
#> 1 1 a 5  1

data_merge(x, y, join = "inner")
#>   a b c id d   e
#> 1 2 b 6  2 f 100
#> 2 3 c 7  3 g 101

data_merge(x, y, join = "bind")
#>    a    b c id    d   e
#> 1  1    a 5  1 <NA>  NA
#> 2  2    b 6  2 <NA>  NA
#> 3  3    c 7  3 <NA>  NA
#> 4 NA <NA> 6  2    f 100
#> 5 NA <NA> 7  3    g 101
#> 6 NA <NA> 8  4    h 102

Reshape

A common data wrangling task is to reshape data.

Either to go from wide/Cartesian to long/tidy format

wide_data <- data.frame(replicate(5, rnorm(10)))

head(data_to_long(wide_data))
#>   Name       Value
#> 1   X1 -0.08281164
#> 2   X2 -1.12490028
#> 3   X3 -0.70632036
#> 4   X4 -0.70278946
#> 5   X5  0.07633326
#> 6   X1  1.93468099

or the other way

long_data <- data_to_long(wide_data, rows_to = "Row_ID") # Save row number

data_to_wide(long_data,
  colnames_from = "Name",
  values_from = "Value",
  rows_from = "Row_ID"
)
#>    Row_ID    Value_X1    Value_X2    Value_X3   Value_X4    Value_X5
#> 1       1 -0.08281164 -1.12490028 -0.70632036 -0.7027895  0.07633326
#> 2       2  1.93468099 -0.87430362  0.96687656  0.2998642 -0.23035595
#> 3       3 -2.05128979  0.04386162 -0.71016648  1.1494697  0.31746484
#> 4       4  0.27773897 -0.58397514 -0.05917365 -0.3016415 -1.59268440
#> 5       5 -1.52596060 -0.82329858 -0.23094342 -0.5473394 -0.18194062
#> 6       6 -0.26916362  0.11059280  0.69200045 -0.3854041  1.75614174
#> 7       7  1.23305388  0.36472778  1.35682290  0.2763720  0.11394932
#> 8       8  0.63360774  0.05370100  1.78872284  0.1518608 -0.29216508
#> 9       9  0.35271746  1.36867235  0.41071582 -0.4313808  1.75409316
#> 10     10 -0.56048248 -0.38045724 -2.18785470 -1.8705001  1.80958455

Empty rows and columns

tmp <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c(1, NA, 3, NA, 5),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5)
)

tmp
#>    a  b  c  d
#> 1  1  1 NA  1
#> 2  2 NA NA NA
#> 3  3  3 NA  3
#> 4 NA NA NA NA
#> 5  5  5 NA  5

# indices of empty columns or rows
empty_columns(tmp)
#> c 
#> 3
empty_rows(tmp)
#> [1] 4

# remove empty columns or rows
remove_empty_columns(tmp)
#>    a  b  d
#> 1  1  1  1
#> 2  2 NA NA
#> 3  3  3  3
#> 4 NA NA NA
#> 5  5  5  5
remove_empty_rows(tmp)
#>   a  b  c  d
#> 1 1  1 NA  1
#> 2 2 NA NA NA
#> 3 3  3 NA  3
#> 5 5  5 NA  5

# remove empty columns and rows
remove_empty(tmp)
#>   a  b  d
#> 1 1  1  1
#> 2 2 NA NA
#> 3 3  3  3
#> 5 5  5  5

Recode or cut dataframe

set.seed(123)
x <- sample(1:10, size = 50, replace = TRUE)

table(x)
#> x
#>  1  2  3  4  5  6  7  8  9 10 
#>  2  3  5  3  7  5  5  2 11  7

# cut into 3 groups, based on distribution (quantiles)
table(data_cut(x, split = "quantile", n_groups = 3))
#> 
#>  1  2  3 
#> 13 19 18
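
Other split methods are available as well. A minimal sketch of a median split into two groups, assuming split = "median" is a supported value:

# hedged sketch: split the vector at its median into two groups
# (split = "median" is an assumed option, not shown in the output above)
table(data_cut(x, split = "median"))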

Data Transformations

The package also contains multiple functions that help transform data.

Standardize

For example, to standardize (z-score) data:

# before
summary(swiss)
#>    Fertility      Agriculture     Examination      Education    
#>  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
#>  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
#>  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
#>  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
#>  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
#>  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
#>     Catholic       Infant.Mortality
#>  Min.   :  2.150   Min.   :10.80   
#>  1st Qu.:  5.195   1st Qu.:18.15   
#>  Median : 15.140   Median :20.00   
#>  Mean   : 41.144   Mean   :19.94   
#>  3rd Qu.: 93.125   3rd Qu.:21.70   
#>  Max.   :100.000   Max.   :26.60

# after
summary(standardize(swiss))
#>    Fertility         Agriculture       Examination         Education      
#>  Min.   :-2.81327   Min.   :-2.1778   Min.   :-1.69084   Min.   :-1.0378  
#>  1st Qu.:-0.43569   1st Qu.:-0.6499   1st Qu.:-0.56273   1st Qu.:-0.5178  
#>  Median : 0.02061   Median : 0.1515   Median :-0.06134   Median :-0.3098  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.66504   3rd Qu.: 0.7481   3rd Qu.: 0.69074   3rd Qu.: 0.1062  
#>  Max.   : 1.78978   Max.   : 1.7190   Max.   : 2.57094   Max.   : 4.3702  
#>     Catholic       Infant.Mortality  
#>  Min.   :-0.9350   Min.   :-3.13886  
#>  1st Qu.:-0.8620   1st Qu.:-0.61543  
#>  Median :-0.6235   Median : 0.01972  
#>  Mean   : 0.0000   Mean   : 0.00000  
#>  3rd Qu.: 1.2464   3rd Qu.: 0.60337  
#>  Max.   : 1.4113   Max.   : 2.28566
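
Standardization can also be based on the median and MAD rather than the mean and SD. A minimal sketch, assuming standardize() supports a robust argument:

# robust standardization: center at the median, scale by the MAD
# (robust = TRUE is an assumption about the available arguments)
head(standardize(swiss, robust = TRUE))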

Winsorize

To winsorize data:

# before
anscombe
#>    x1 x2 x3 x4    y1   y2    y3    y4
#> 1  10 10 10  8  8.04 9.14  7.46  6.58
#> 2   8  8  8  8  6.95 8.14  6.77  5.76
#> 3  13 13 13  8  7.58 8.74 12.74  7.71
#> 4   9  9  9  8  8.81 8.77  7.11  8.84
#> 5  11 11 11  8  8.33 9.26  7.81  8.47
#> 6  14 14 14  8  9.96 8.10  8.84  7.04
#> 7   6  6  6  8  7.24 6.13  6.08  5.25
#> 8   4  4  4 19  4.26 3.10  5.39 12.50
#> 9  12 12 12  8 10.84 9.13  8.15  5.56
#> 10  7  7  7  8  4.82 7.26  6.42  7.91
#> 11  5  5  5  8  5.68 4.74  5.73  6.89

# after
winsorize(anscombe)
#>       x1 x2 x3 x4   y1   y2   y3   y4
#>  [1,] 10 10 10  8 8.04 9.13 7.46 6.58
#>  [2,]  8  8  8  8 6.95 8.14 6.77 5.76
#>  [3,] 12 12 12  8 7.58 8.74 8.15 7.71
#>  [4,]  9  9  9  8 8.81 8.77 7.11 8.47
#>  [5,] 11 11 11  8 8.33 9.13 7.81 8.47
#>  [6,] 12 12 12  8 8.81 8.10 8.15 7.04
#>  [7,]  6  6  6  8 7.24 6.13 6.08 5.76
#>  [8,]  6  6  6  8 5.68 6.13 6.08 8.47
#>  [9,] 12 12 12  8 8.81 9.13 8.15 5.76
#> [10,]  7  7  7  8 5.68 7.26 6.42 7.91
#> [11,]  6  6  6  8 5.68 6.13 6.08 6.89
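
The proportion of values replaced at each tail can be tuned. A minimal sketch, assuming winsorize() exposes a threshold argument for the percentile cut-offs:

# replace the most extreme 10% of values at each tail
# (threshold = 0.1 is an assumed argument value)
winsorize(anscombe$y4, threshold = 0.1)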

Center

To grand-mean center data:

center(anscombe)
#>    x1 x2 x3 x4          y1         y2    y3         y4
#> 1   1  1  1 -1  0.53909091  1.6390909 -0.04 -0.9209091
#> 2  -1 -1 -1 -1 -0.55090909  0.6390909 -0.73 -1.7409091
#> 3   4  4  4 -1  0.07909091  1.2390909  5.24  0.2090909
#> 4   0  0  0 -1  1.30909091  1.2690909 -0.39  1.3390909
#> 5   2  2  2 -1  0.82909091  1.7590909  0.31  0.9690909
#> 6   5  5  5 -1  2.45909091  0.5990909  1.34 -0.4609091
#> 7  -3 -3 -3 -1 -0.26090909 -1.3709091 -1.42 -2.2509091
#> 8  -5 -5 -5 10 -3.24090909 -4.4009091 -2.11  4.9990909
#> 9   3  3  3 -1  3.33909091  1.6290909  0.65 -1.9409091
#> 10 -2 -2 -2 -1 -2.68090909 -0.2409091 -1.08  0.4090909
#> 11 -4 -4 -4 -1 -1.82090909 -2.7609091 -1.77 -0.6109091
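
Centering also works on a single vector, where it simply subtracts the mean:

# centering a vector subtracts its mean (here, 3)
center(c(1, 2, 3, 4, 5))
# expected values: -2, -1, 0, 1, 2 (possibly with additional attributes attached)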

Ranktransform

To rank-transform data:

# before
head(trees)
#>   Girth Height Volume
#> 1   8.3     70   10.3
#> 2   8.6     65   10.3
#> 3   8.8     63   10.2
#> 4  10.5     72   16.4
#> 5  10.7     81   18.8
#> 6  10.8     83   19.7

# after
head(ranktransform(trees))
#>   Girth Height Volume
#> 1     1    6.0    2.5
#> 2     2    3.0    2.5
#> 3     3    1.0    1.0
#> 4     4    8.5    5.0
#> 5     5   25.5    7.0
#> 6     6   28.0    9.0
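
A signed rank transformation (ranking the absolute values and restoring the sign) is also available; a minimal sketch, assuming the sign argument toggles it:

# signed rank: rank the absolute values, then re-apply the original sign
# (sign = TRUE is an assumed argument)
ranktransform(c(-3, 1, 2, -5), sign = TRUE)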

Rescale

To rescale a numeric variable to a new range:

change_scale(c(0, 1, 5, -5, -2))
#> [1]  50  60 100   0  30
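
The default target range is 0-100. A minimal sketch of rescaling to a custom range, assuming change_scale() accepts a to argument:

# rescale to the range -1 to 1 instead of the default 0 to 100
# (to = c(-1, 1) is an assumed argument value)
change_scale(c(0, 1, 5, -5, -2), to = c(-1, 1))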

Rotate or transpose

x <- mtcars[1:3, 1:4]

x
#>                mpg cyl disp  hp
#> Mazda RX4     21.0   6  160 110
#> Mazda RX4 Wag 21.0   6  160 110
#> Datsun 710    22.8   4  108  93

data_rotate(x)
#>      Mazda RX4 Mazda RX4 Wag Datsun 710
#> mpg         21            21       22.8
#> cyl          6             6        4.0
#> disp       160           160      108.0
#> hp         110           110       93.0

Data properties

datawizard provides a comprehensive descriptive summary for all variables in a data frame:

data(iris)
describe_distribution(iris)
#> Variable     | Mean |   SD |  IQR | Min | Max | Skewness | Kurtosis |   n | n_Missing
#> -------------------------------------------------------------------------------------
#> Sepal.Length |  5.8 | 0.83 | 1.30 | 4.3 | 7.9 |     0.31 |    -0.55 | 150 |         0
#> Sepal.Width  |  3.1 | 0.44 | 0.52 | 2.0 | 4.4 |     0.32 |     0.23 | 150 |         0
#> Petal.Length |  3.8 | 1.77 | 3.52 | 1.0 | 6.9 |    -0.27 |    -1.40 | 150 |         0
#> Petal.Width  |  1.2 | 0.76 | 1.50 | 0.1 | 2.5 |    -0.10 |    -1.34 | 150 |         0

Or even for just a single variable:

describe_distribution(mtcars$wt)
#> Mean |   SD | IQR | Min | Max | Skewness | Kurtosis |  n | n_Missing
#> --------------------------------------------------------------------
#> 3.2  | 0.98 | 1.2 | 1.5 | 5.4 |     0.47 |     0.42 | 32 |         0

There are also some additional data properties that can be computed using this package.

x <- (-10:10)^3 + rnorm(21, 0, 100)
smoothness(x, method = "diff")
#> [1] 1.791243
#> attr(,"class")
#> [1] "parameters_smoothness" "numeric"

Function design and pipe-workflow

The design of the {datawizard} functions follows a principle that makes it easy for users to understand and remember how the functions work:

  1. the first argument is the data
  2. the following arguments are the main arguments, related to the specific task of the function
  3. further arguments can be select-helpers, offered for convenience (so there is no need for interim calls to get_columns())

E.g., in data_filter(), the main arguments after the data argument filter the rows of a data frame. data_cut() recodes data into groups of values, so the main arguments following the data argument define the breaks for grouping. Most functions, however, in particular (but not limited to) those starting with data_*(), select columns from the provided data frame and thus also support select-helpers, as illustrated below.
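
To illustrate the pattern with functions already shown above: the data comes first, the main arguments follow, and select-helpers can be used wherever columns are chosen.

# main arguments after the data: a logical expression that filters rows
data_filter(mtcars, mpg > 30)

# select-helpers choose the columns the function operates on
data_remove(iris, starts_with("Sepal")) |> head()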

Most importantly, functions that accept a data frame usually take it as their first argument and also return a (modified) data frame. Thus, {datawizard} integrates smoothly into a “pipe-workflow”.

iris |> 
  # all rows where Species is "versicolor" or "virginica"
  data_filter(Species %in% c("versicolor", "virginica")) |> 
  # select only columns with "." in names (i.e. drop Species)
  get_columns(contains(".")) |> 
  # move columns that end with "Length" to start of data frame
  data_relocate(ends_with("Length")) |> 
  # remove fourth column
  data_remove(4) |> 
  head()
#>    Sepal.Length Petal.Length Sepal.Width
#> 51          7.0          4.7         3.2
#> 52          6.4          4.5         3.2
#> 53          6.9          4.9         3.1
#> 54          5.5          4.0         2.3
#> 55          6.5          4.6         2.8
#> 56          5.7          4.5         2.8

Contributing and Support

In case you want to file an issue or contribute to the package in another way, please follow this guide. For questions about the functionality, you may either contact us via email or file an issue.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Package details

Version: 0.4.0
License: GPL-3
Maintainer: Indrajeet Patil
Last Published: March 30th, 2022
Monthly Downloads: 142,772
Install: install.packages('datawizard')

Functions in datawizard (0.4.0)

center: Centering (Grand-Mean Centering)
data_cut: Recode (or "cut") data into groups of values
convert_data_to_numeric: Convert data to numeric
data_match: Return filtered data frame or row indices
convert_to_na: Convert non-missing values in a variable into missing values
adjust: Adjust data for the effect of other variable(s)
data_extract: Extract one or more columns or elements from an object
convert_na_to: Replace missing values in a variable or a data frame
data_merge: Merge (join) two data frames, or a list of data frames
data_partition: Partition data into a test and a training set
data_relocate: Relocate (reorder) columns of a data frame
data_rotate: Rotate a data frame
data_reverse: Reverse-Score Variables
rescale_weights: Rescale design weights for multilevel analysis
reshape_ci: Reshape CI between wide/long formats
reexports: Objects exported from other packages
ranktransform: (Signed) rank transformation
winsorize: Winsorize data
format_text: Convenient text formatting functionalities
find_columns: Find or get columns in a data frame based on search patterns
data_to_long: Reshape (pivot) data from wide to long
weighted_mean: Weighted Mean, Median, SD, and MAD
demean: Compute group-meaned and de-meaned variables
data_rescale: Rescale Variables to a New Range
smoothness: Quantify the smoothness of a vector
standardize: Standardization (Z-scoring)
data_addprefix: Rename columns and variable names
data_restoretype: Restore the type of columns according to a reference data frame
describe_distribution: Describe a distribution
efc: Sample dataset from the EFC Survey
remove_empty: Return or remove variables or observations that are completely missing
replace_nan_inf: Convert infinite or NaN values into NA
rownames_as_column: Tools for working with row names
to_numeric: Convert to Numeric (if possible)
nhanes_sample: Sample dataset from the National Health and Nutrition Examination Survey
normalize: Normalize numeric variable to 0-1 range
visualisation_recipe: Prepare objects for visualisation
skewness: Compute Skewness and (Excess) Kurtosis