multiplyr: Data Manipulation with Parallelism and Shared Memory Matrices

Description

Provides a new form of data frame backed by shared memory matrices and a way to manipulate them. Upon creation these data frames are shared across multiple local nodes to allow for simple parallel processing. Run the following command for a more thorough explanation: vignette("basics")

Major differences from dplyr

In dplyr, summarise returns a single value, but here it returns N values, where N depends on how many nodes there are. Typically you should follow summarise with reduce, which runs locally and combines the per-node results.
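
The two-step pattern described above can be illustrated without the package. The following is a minimal sketch in base R, not multiplyr code: the "nodes" are simulated by splitting a vector into blocks, and the names n_nodes, chunks and partial are hypothetical. It shows why a per-node summary yields several values and why a local reduce step is needed to get one answer.

```r
# Sketch only: base R, not multiplyr. Nodes are simulated by splitting a vector.
x <- 1:100
n_nodes <- 4  # hypothetical node count

# Assign each element to a node in contiguous, evenly sized blocks.
node_id <- ceiling(seq_along(x) * n_nodes / length(x))
chunks <- split(x, node_id)

# "summarise" step: every node produces its own partial result (one row each).
partial <- data.frame(
  total = vapply(chunks, sum, numeric(1)),
  count = vapply(chunks, length, integer(1))
)

# "reduce" step: combine the per-node rows locally into a single answer.
overall_mean <- sum(partial$total) / sum(partial$count)
```

Note that the mean is reduced from per-node totals and counts rather than by averaging per-node means, which would only be correct when every node holds the same number of rows.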

Standard dplyr-like functions

arrange

Sort data

distinct

Select unique rows or unique combinations of variables

filter

Filter data

group_by

Group data

group_sizes

Return size of groups

groupwise

Use grouped data (also known as ungroup)

mutate

Change values of existing variables (and create new ones)

n_groups

Return number of groups

rename

Rename variables

rowwise

Use data as individual rows

select

Retain only specified variables

slice

Select rows by position

summarise

Summarise data

transmute

Change variables and drop all others

Parallel functions

partition_even

Partition data evenly amongst cluster nodes

partition_group

Partition data so that each group is wholly on a node

within_group

Execute code within a group

within_node

Execute code within a node
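
The difference between the two partitioning strategies can be sketched in base R. This is an illustration only, not multiplyr's implementation: the round-robin assignment of groups to nodes is an assumption made for the example, and the names even_node, group_node and n_nodes are hypothetical.

```r
# Sketch only: base R, not multiplyr's implementation.
g <- c("A", "A", "A", "B", "C", "C", "D", "D", "D", "D")  # group label per row
n_nodes <- 2  # hypothetical node count

# Even partitioning: contiguous blocks of rows, ignoring group boundaries.
even_node <- ceiling(seq_along(g) * n_nodes / length(g))

# Group partitioning: assign each whole group to a node (round-robin here,
# which is an assumption), then give every row the node of its group.
groups <- unique(g)
group_to_node <- setNames((seq_along(groups) - 1) %% n_nodes + 1, groups)
group_node <- unname(group_to_node[g])

# For each group, TRUE if all of its rows sit on a single node.
whole_even  <- tapply(even_node,  g, function(v) length(unique(v)) == 1)
whole_group <- tapply(group_node, g, function(v) length(unique(v)) == 1)
```

In this example the even split places the two rows of group C on different nodes, whereas the group-based split keeps every group intact at the cost of less balanced node sizes.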

Additional data frame functions

Multiplyr

Create new parallel data frame

define

Define new variables

nsa

No strings attached mode

reduce

Summarise locally only

regroup

Return to grouped data

undefine

Delete variables

Data manipulation adjuncts

between

Tests whether elements of a vector lie between two values (inclusively)

cumall

Cumulative all

cumany

Cumulative any

cummean

Cumulative mean

first

Returns first value in vector

last

Returns last value in vector

lag

Offset x backwards by n

lead

Offset x forwards by n

n

Number of items in current group

nth

Return the nth item from a vector
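
The behaviour of several adjuncts listed above can be sketched with base R stand-ins. These are not multiplyr's own implementations: the *_sketch names are hypothetical, inputs are assumed to contain no missing values, and the NA padding used for the lag and lead stand-ins is an assumption borrowed from dplyr's conventions.

```r
# Sketch only: base R stand-ins with hypothetical *_sketch names.
between_sketch <- function(x, left, right) x >= left & x <= right  # inclusive
cumall_sketch  <- function(x) cumsum(!x) == 0   # TRUE until the first FALSE
cumany_sketch  <- function(x) cumsum(x) > 0     # TRUE from the first TRUE onward
cummean_sketch <- function(x) cumsum(x) / seq_along(x)

# Assumption: shifted-out positions are padded with NA, as in dplyr.
# Both helpers expect 1 <= n <= length(x).
lag_sketch  <- function(x, n = 1) c(rep(NA, n), head(x, -n))
lead_sketch <- function(x, n = 1) c(tail(x, -n), rep(NA, n))

x <- c(2, 4, 6, 8)
in_range <- between_sketch(x, 4, 8)                   # FALSE TRUE TRUE TRUE
all_so_far <- cumall_sketch(c(TRUE, TRUE, FALSE, TRUE)) # TRUE TRUE FALSE FALSE
any_so_far <- cumany_sketch(c(FALSE, TRUE, FALSE))      # FALSE TRUE TRUE
running_mean <- cummean_sketch(x)                     # 2 3 4 5
lagged <- lag_sketch(x)                               # NA 2 4 6
led <- lead_sketch(x)                                 # 4 6 8 NA
```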