Partitioned data frames for 'dplyr'
A dplyr backend that partitions a data frame across multiple nodes in a cluster (e.g. cores on your computer) to make common operations faster.
multidplyr is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with `partition()`, and then the data stays on each node until you explicitly retrieve it with `collect()`. This minimises the time spent moving data around and maximises parallel performance. The idea is inspired by partools by Norm Matloff and distributedR by the Vertica Analytics team.
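A minimal sketch of that workflow (the `flights` data from nycflights13 and the cluster size are illustrative, and the exact `partition()` signature is assumed from the function list below):

```r
library(dplyr)
library(multidplyr)
library(nycflights13)

# Create a cluster and split the data across its nodes by destination
cluster <- create_cluster(4)                       # 4 nodes; size is illustrative
by_dest <- partition(flights, dest, cluster = cluster)

# dplyr verbs run in parallel on each node; collect() brings results back
delays <- by_dest %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()
```

Because the grouped data stays on the nodes between verbs, only the (much smaller) summary is transferred back at the end.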
Due to the overhead associated with communicating between the nodes, you shouldn't expect to see much performance improvement on basic dplyr verbs with less than ~10 million observations. However, you'll see improvements sooner if you're doing more complex operations.
To learn more, read the vignette.
To install from GitHub:
```r
# install.packages("devtools")
devtools::install_github("hadley/multidplyr")
```
Functions in multidplyr
| Function | Description |
|---|---|
| `create_cluster` | Create a new cluster with sensible defaults. |
| `cluster_library` | Attach a library on each node. |
| `cluster_call` | Call a function on each node of a cluster. |
| `cluster_eval` | Evaluate arbitrary code on each node. |
| `reexports` | Objects exported from other packages. |
| `partition` | Partition data across a cluster. |
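The cluster helpers above can also be used directly to prepare nodes before partitioning data. A hedged sketch (call signatures are inferred from the descriptions, so treat them as assumptions):

```r
library(multidplyr)

# Spin up a small cluster and set up each node
cluster <- create_cluster(2)            # two nodes; size is illustrative
cluster_library(cluster, "dplyr")       # attach dplyr on every node
cluster_eval(cluster, Sys.getpid())     # evaluate code on each node, e.g. to
                                        # confirm they are separate processes
```

Attaching libraries up front matters because each node is a separate R process and does not inherit packages loaded in your interactive session.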