Learn R Programming

tidyfst: Tidy Verbs for Fast Data Manipulation

Overview

tidyfst is a toolkit of tidy data manipulation verbs with data.table as the backend . Combining the merits of syntax elegance from dplyr and computing performance from data.table, tidyfst intends to provide users with state-of-the-art data manipulation tools with least pain. This package is an extension of data.table, while enjoying a tidy syntax, it also wraps combinations of efficient functions to facilitate frequently-used data operations. Also, tidyfst would introduce more tidy data verbs from other packages, including but not limited to tidyverse and data.table. If you are a dplyr user but have to use data.table for speedy computation, or data.table user looking for readable coding syntax, tidyfst is designed for you (and me of course). For further details and tutorials, see vignettes. Both Chinese and English tutorials could be found there.

Till now, tidyfst has an API that might even transcend its predecessors (e.g. select_dt could accept nearly anything for super column selection). Enjoy the efficient data operations in tidyfst !

PS: For extreme performance in tidy syntax, try tidyfst's mirror package tidyft.

Features

  • Receives any data.frame (tibble/data.table/data.frame) and returns a data.table.
  • Show the variable class of data.table as default.
  • Never use in place replacement (also known as modification by reference, which means the original variable would not be modified without notification).
  • Use suffix ("_dt") rather than prefix to increase the efficiency (especially when you have IDE with automatic code completion).
  • More flexible verbs (e.g. pairwise_count_dt) for big data manipulation.
  • Supporting data importing and parsing with fst, which saves both time and memory. Details see parse_fst/select_fst/filter_fst and import_fst/export_fst.
  • Low and stable dependency on mature packages (data.table, fst, stringr)

Installation

install.packages("tidyfst")

Example

library(tidyfst)

iris %>%
  mutate_dt(group = Species,sl = Sepal.Length,sw = Sepal.Width) %>%
  select_dt(group,sl,sw) %>%
  filter_dt(sl > 5) %>%
  arrange_dt(group,sl) %>%
  distinct_dt(sl,.keep_all = T) %>%
  summarise_dt(sw = max(sw),by = group)
#>         group  sw
#>        <fctr> <num>
#> 1:     setosa 4.4
#> 2: versicolor 3.4
#> 3:  virginica 3.8

iris %>%
  count_dt(Species) %>%
  add_prop()
#>       Species     n      prop prop_label
#>        <fctr> <int>     <num>     <char>
#> 1:     setosa    50 0.3333333      33.3%
#> 2: versicolor    50 0.3333333      33.3%
#> 3:  virginica    50 0.3333333      33.3%

iris[3:8,] %>%
  mutate_when(Petal.Width == .2,
              one = 1,Sepal.Length=2)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species one
#>          <num>       <num>        <num>       <num>  <fctr> <num>
#> 1:          2.0         3.2          1.3         0.2  setosa   1
#> 2:          2.0         3.1          1.5         0.2  setosa   1
#> 3:          2.0         3.6          1.4         0.2  setosa   1
#> 4:          5.4         3.9          1.7         0.4  setosa  NA
#> 5:          4.6         3.4          1.4         0.3  setosa  NA
#> 6:          2.0         3.4          1.5         0.2  setosa   1

Future plans

tidyfst will keep up with the updates of data.table , in the next step would introduce more new features to improve the performance and flexibility to facilitate fast data manipulation in tidy syntax.

Vignettes

Cheat sheet

Suggested citation

Huang et al., (2020). tidyfst: Tidy Verbs for Fast Data Manipulation. Journal of Open Source Software, 5(52), 2388, https://doi.org/10.21105/joss.02388

Related work

Acknowledgement

The author of maditr, Gregory Demin and the author of fst, Marcus Klik have helped me a lot in the development of this work. It is so lucky to have them (and many other selfless contributors) in the same open source community of R.

Copy Link

Version

Install

install.packages('tidyfst')

Monthly Downloads

2,181

Version

1.8.1

License

MIT + file LICENSE

Issues

Pull Requests

Stars

Forks

Maintainer

Tian-Yuan Huang

Last Published

September 16th, 2024

Functions in tidyfst (1.8.1)

nest_dt

Nest and unnest
nth

Extract the nth value from a vector
print_options

Set global printing method for data.table
pkg_load

Load or unload R package(s)
longer_dt

Pivot data from wide to long
sample_dt

Sample rows randomly from a table
pull_dt

Pull out a single variable
relocate_dt

Change column order
lead_dt

Fast lead/lag for vectors
reexports

Objects exported from other packages
drop_na_dt

Dump, replace and fill missing values in data.frame
rec

Recode number or strings
rn_col

Tools for working with row names
pairwise_count_dt

Count pairs of items within a group
mutate_when

Conditional update of columns in data.table
round0

Round a number and make it show zeros
mutate_dt

Mutate columns in data.frame
object_size

Nice printing of report the Space Allocated for an Object
select_dt

Select column from data.frame
uncount_dt

"Uncount" a data frame
unite_dt

Unite multiple columns into one by pasting strings together
rename_dt

Rename column in data.frame
replace_dt

Fast value replacement in data frame
slice_dt

Subset rows using their positions
t_dt

Efficient transpose of data.frame
separate_dt

Separate a character column into two columns using a regular expression separator
sql_join

Case insensitive table joining like SQL
summarise_dt

Summarise columns to single values
mat_df

Conversion between tidy table and named matrix
utf8_encoding

Use UTF-8 for character encoding in a data frame
wider_dt

Pivot data from long to wide
sys_time_print

Convenient print of time taken
intersect_dt

Set operations for data frames
col_max

Get the column name of the max/min number each row
percent

Add percentage to counts in data.frame
as_fst

Save a data.frame as a fst table
distinct_dt

Select distinct/unique rows in data.frame
count_dt

Count observations by group
arrange_dt

Arrange entries in data.frame
cummean

Cumulative mean
bind_rows_dt

Bind multiple data frames by row
in_dt

Short cut to data.table
complete_dt

Complete a data frame with missing combinations of data
impute_dt

Impute missing values with mean, median or mode
import_fst_chunked

Read a fst file by chunks
group_by_dt

Group by variable(s) and implement operations
join

Join tables
fst

Parse,inspect and extract data.table from fst file
dummy_dt

Fast creation of dummy variables
export_fst

Read and write fst files
group_dt

Data manipulation within groups
filter_dt

Filter entries in data.frame