disk.frame

NOTICE

{disk.frame} has been soft-deprecated in favor of {arrow}. As of its 6.0.0 release, {arrow} handles larger-than-RAM data analysis quite well (see the release note). Hence, there is no strong reason to prefer {disk.frame} unless you have very specific feature needs.

For the above reason, I’ve decided to soft-deprecate {disk.frame} which means I will no longer actively develop new features for it but it will remain on CRAN in maintenance mode.

To help with the transition, I’ve created a function, disk.frame::disk.frame_to_parquet(df, outdir), that converts existing {disk.frame}s to the parquet format so you can use them with {arrow}.
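A minimal sketch of that transition, assuming {arrow} is installed. The output folder name is hypothetical, and creating it beforehand is a precaution rather than a documented requirement of disk.frame_to_parquet():

```r
library(disk.frame)
library(arrow)

setup_disk.frame(workers = 2)

# a small disk.frame to convert; any existing disk.frame would do
df <- as.disk.frame(mtcars)

# hypothetical output folder for the parquet files
outdir <- file.path(tempdir(), "mtcars_parquet")
dir.create(outdir, showWarnings = FALSE)

disk.frame_to_parquet(df, outdir)

# open the parquet files lazily with {arrow}, then filter and collect
four_cyl <- open_dataset(outdir) %>%
  filter(cyl == 4) %>%
  collect()
```

Nothing is read into RAM until collect() is called, so the filter runs before the data is loaded.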

I am working on a reincarnation of {disk.frame} in Julia, so {disk.frame} will live on!

Thank you for supporting {disk.frame}. I’ve learnt a lot along the way, but the time has come to move on!

Introduction

How do I manipulate tabular data that doesn’t fit into Random Access Memory (RAM)?

Use {disk.frame}!

In a nutshell, {disk.frame} makes use of two simple ideas:

  1. split up a larger-than-RAM dataset into chunks and store each chunk in a separate file inside a folder and
  2. provide a convenient API to manipulate these chunks
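The two ideas can be illustrated in base R alone. This is only a toy sketch: the folder name and the map_chunks() helper are hypothetical, and it stores chunks as .rds files, whereas {disk.frame} itself uses fst files and a much richer API.

```r
# Idea 1: split a data.frame into chunks, one file per chunk, inside a folder
chunk_dir <- file.path(tempdir(), "toy_chunks")   # hypothetical folder name
dir.create(chunk_dir, showWarnings = FALSE)

n_chunks <- 4
chunk_id <- rep(seq_len(n_chunks), length.out = nrow(mtcars))
pieces <- split(mtcars, chunk_id)
for (i in seq_along(pieces)) {
  saveRDS(pieces[[i]], file.path(chunk_dir, paste0(i, ".rds")))
}

# Idea 2: a small helper that applies a function to each chunk file in turn
map_chunks <- function(dir, fn) {
  files <- list.files(dir, pattern = "\\.rds$", full.names = TRUE)
  lapply(files, function(f) fn(readRDS(f)))
}

# each chunk is read, processed, and discarded before the next one is read
rows_per_chunk <- map_chunks(chunk_dir, nrow)
total_rows <- sum(unlist(rows_per_chunk))
```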

{disk.frame} performs a similar role to distributed systems such as Apache Spark, Python’s Dask, and Julia’s JuliaDB.jl, but for medium data: datasets that are too large for RAM but not quite large enough to qualify as big data.

Installation

You can install the released version of {disk.frame} from CRAN with:

install.packages("disk.frame")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("DiskFrame/disk.frame")

On some platforms, such as SageMaker, you may need to explicitly specify a repository like this:

install.packages("disk.frame", repos = "https://cran.rstudio.com")

Vignettes and articles

Please see these vignettes and articles about {disk.frame}

Common questions

a) What is {disk.frame} and why create it?

{disk.frame} is an R package that provides a framework for manipulating larger-than-RAM structured tabular data on disk efficiently. The reason one would want to manipulate data on disk is that it allows arbitrarily large datasets to be processed by R. In other words, we go from “R can only deal with data that fits in RAM” to “R can deal with any data that fits on disk”. See the next section.

b) How is it different to data.frame and data.table?

A data.frame in R is an in-memory data structure, which means that R must load the data in its entirety into RAM. A corollary of this is that only data that can fit into RAM can be processed using data.frames. This places significant restrictions on what R can process with minimal hassle.

In contrast, {disk.frame} provides a framework to store and manipulate data on the hard drive. It does this by loading only a small part of the data, called a chunk, into RAM, processing the chunk, writing out the results, and repeating with the next chunk. This chunking strategy is widely applied in other packages to enable processing large amounts of data in R; for example, see chunked, arkdb, and iotools.

Furthermore, there is a row-limit of 2^31 for data.frames in R; hence an alternate approach is needed to apply R to these large datasets. The chunking mechanism in {disk.frame} provides such an avenue to enable data manipulation beyond the 2^31 row limit.
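For reference, this limit is tied to R's 32-bit integers, whose largest value is 2^31 - 1. It can be checked in base R:

```r
# largest value a 32-bit R integer can hold
.Machine$integer.max
#> [1] 2147483647

# it equals 2^31 - 1
.Machine$integer.max == 2^31 - 1
#> [1] TRUE
```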

c) How is {disk.frame} different to previous “big” data solutions for R?

R has many packages that can deal with larger-than-RAM datasets, including ff and bigmemory. However, ff and bigmemory restrict the user to primitive data types such as double, which means they do not support character (string) and factor types. In contrast, {disk.frame} makes use of data.table::data.table and data.frame directly, so all data types are supported. Also, {disk.frame} strives to provide an API that is as similar to data.frame’s as possible. {disk.frame} supports many dplyr verbs for manipulating disk.frames.

Additionally, {disk.frame} supports parallel data operations using infrastructures provided by the excellent future package to take advantage of multi-core CPUs. Further, {disk.frame} uses state-of-the-art data storage techniques such as fast data compression, and random access to rows and columns provided by the fst package to provide superior data manipulation speeds.

d) How does {disk.frame} work?

{disk.frame} works by breaking large datasets into smaller individual chunks and storing the chunks in fst files inside a folder. Each chunk is a fst file containing a data.frame/data.table. One can reconstruct the original large dataset by loading all the chunks into RAM and row-binding them into one large data.frame. Of course, in practice this isn’t always possible, which is why the data are stored as smaller individual chunks.
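A small sketch of this layout, using a dataset that comfortably fits in RAM so the reconstruction is safe. The choice of mtcars and of four chunks is arbitrary:

```r
library(disk.frame)
library(data.table)

setup_disk.frame(workers = 2)

# store a small data.frame as a disk.frame split into (up to) 4 chunks
df <- as.disk.frame(mtcars, nchunks = 4)

# the chunks live as fst files inside the disk.frame's folder
chunk_files <- list.files(attr(df, "path"), pattern = "\\.fst$")
chunk_files

# load every chunk and row-bind them to recover the full dataset
chunks <- lapply(seq_len(nchunks(df)), function(i) get_chunk(df, i))
rebuilt <- rbindlist(chunks)
nrow(rebuilt) == nrow(mtcars)
```

For a genuinely larger-than-RAM dataset the last step would fail, so chunks are normally processed one at a time instead.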

{disk.frame} makes it easy to manipulate the underlying chunks by implementing dplyr functions/verbs and other convenient functions (e.g. cmap(a.disk.frame, fn, lazy = FALSE), which applies the function fn to each chunk of a.disk.frame in parallel), so that a {disk.frame} can be manipulated in a similar fashion to in-memory data.frames.
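As an illustration, here is a sketch that applies a user-defined function to every chunk with cmap() and gathers the per-chunk results with collect. The helper summarise_chunk() is hypothetical, and the sketch assumes the lazy default of cmap(), so nothing runs until collect is called:

```r
library(disk.frame)

setup_disk.frame(workers = 2)
mtcars.df <- as.disk.frame(mtcars, nchunks = 4)

# hypothetical user-defined function: one summary row per chunk
summarise_chunk <- function(chunk) {
  data.frame(
    n_rows    = nrow(chunk),
    n_thirsty = sum(chunk$mpg < 20)   # cars doing fewer than 20 miles per gallon
  )
}

# apply the function to each chunk, then bring the results into RAM
per_chunk <- mtcars.df %>%
  cmap(summarise_chunk) %>%
  collect

# combine the per-chunk rows into overall totals
totals <- colSums(per_chunk)
totals
```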

e) How is {disk.frame} different from Spark, Dask, and JuliaDB.jl?

Spark is primarily a distributed system that also works on a single machine. Dask is a Python package that is most similar to {disk.frame}, and JuliaDB.jl is a Julia package. All three can distribute work over a cluster of computers. However, {disk.frame} currently cannot distribute data processes over many computers, and is, therefore, single machine focused.

In R, one can access Spark via sparklyr, but that requires a Spark cluster to be set up. On the other hand, {disk.frame} requires zero setup apart from running install.packages("disk.frame") or devtools::install_github("DiskFrame/disk.frame").

Finally, Spark can only apply functions that are implemented for Spark, whereas {disk.frame} can use any function in R including user-defined functions.

Example usage

Set-up {disk.frame}

{disk.frame} works best if it can process multiple data chunks in parallel. The best way to set up {disk.frame} so that each CPU core runs a background worker is by using

setup_disk.frame()

# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)

setup_disk.frame() sets up background workers equal to the number of CPU cores; please note that, by default, hyper-threaded cores are counted as one, not two.

Alternatively, one may specify the number of workers using setup_disk.frame(workers = n).
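For example, the following sketch requests two workers and reads the count back through the {future} package, on the assumption that setup_disk.frame() registers its workers as the active {future} plan:

```r
library(disk.frame)

# use two background workers instead of one per CPU core
setup_disk.frame(workers = 2)

# {disk.frame} runs its workers through {future},
# so the active worker count can be read back from there
n_workers <- future::nbrOfWorkers()
n_workers
```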

Quick-start

suppressPackageStartupMessages(library(disk.frame))
library(nycflights13)

# this will set up disk.frame's parallel backend with number of workers equal to the number of CPU cores (hyper-threaded cores are counted as one, not two)
setup_disk.frame()
#> The number of workers available for disk.frame is 6
# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)

# convert the flights data.frame to a disk.frame
# optionally, you may specify an outdir; otherwise, a temporary folder is used
flights.df <- as.disk.frame(nycflights13::flights)
#> fstcore package v0.9.8
#> (OpenMP detected, using 12 threads)

Example: dplyr verbs

dplyr verbs

{disk.frame} aims to support as many dplyr verbs as possible. For example

flights.df %>% 
  filter(year == 2013) %>% 
  mutate(origin_dest = paste0(origin, dest)) %>% 
  head(2)
#>    year month day dep_time sched_dep_time dep_delay arr_time
#> 1: 2013     1   1      517            515         2      830
#> 2: 2013     1   1      533            529         4      850
#>    sched_arr_time arr_delay carrier flight tailnum origin dest
#> 1:            819        11      UA   1545  N14228    EWR  IAH
#> 2:            830        20      UA   1714  N24211    LGA  IAH
#>    air_time distance hour minute           time_hour origin_dest
#> 1:      227     1400    5     15 2013-01-01 05:00:00      EWRIAH
#> 2:      227     1416    5     29 2013-01-01 05:00:00      LGAIAH

Group-by

Starting from {disk.frame} v0.3.0, there is group_by support for a limited set of functions. For example:

result_from_disk.frame = iris %>% 
  as.disk.frame %>% 
  group_by(Species) %>% 
  summarize(
    mean(Petal.Length), 
    sumx = sum(Petal.Length/Sepal.Width), 
    sd(Sepal.Width/ Petal.Length), 
    var(Sepal.Width/ Sepal.Width), 
    l = length(Sepal.Width/ Sepal.Width + 2),
    max(Sepal.Width), 
    min(Sepal.Width), 
    median(Sepal.Width)
    ) %>% 
  collect

The results should be exactly the same as if applying the same group-by operations on a data.frame. If not, please report a bug.
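One way to check this yourself is to run the same summary on both the disk.frame and the original data.frame and compare the numeric columns. A sketch, with arbitrary column names mean_petal and max_width:

```r
library(disk.frame)

setup_disk.frame(workers = 2)

# group-by on a disk.frame
from_disk <- iris %>%
  as.disk.frame %>%
  group_by(Species) %>%
  summarize(mean_petal = mean(Petal.Length), max_width = max(Sepal.Width)) %>%
  collect %>%
  arrange(Species)

# the same group-by on the in-memory data.frame
from_memory <- iris %>%
  group_by(Species) %>%
  summarize(mean_petal = mean(Petal.Length), max_width = max(Sepal.Width)) %>%
  arrange(Species)

# compare the numeric columns; both comparisons should be TRUE
all.equal(from_disk$mean_petal, from_memory$mean_petal)
all.equal(from_disk$max_width, from_memory$max_width)
```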

List of supported group-by functions

If a function you like is missing, please make a feature request here. One limitation is that functions that depend on the order of a column’s values (such as median) can only be computed using estimation methods.

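| Function   | Exact/Estimate | Notes                                     |
|------------|----------------|-------------------------------------------|
| min        | Exact          |                                           |
| max        | Exact          |                                           |
| mean       | Exact          |                                           |
| sum        | Exact          |                                           |
| length     | Exact          |                                           |
| n          | Exact          |                                           |
| n_distinct | Exact          |                                           |
| sd         | Exact          |                                           |
| var        | Exact          | var(x) only; cor, cov support planned     |
| any        | Exact          |                                           |
| all        | Exact          |                                           |
| median     | Estimate       |                                           |
| quantile   | Estimate       | One quantile only                         |
| IQR        | Estimate       |                                           |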

Example: data.table syntax

library(data.table)
#> data.table 1.14.2 using 6 threads (see ?getDTthreads).  Latest news: r-datatable.com
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last

suppressWarnings(
  grp_by_stage1 <- 
    flights.df[
      keep = c("month", "distance"), # this analysis only requires "month" and "distance", so only load those
      month <= 6, 
      .(sum_dist = sum(distance)), 
      .(qtr = ifelse(month <= 3, "Q1", "Q2"))
      ]
)
#> data.table syntax for disk.frame may be moved to a separate package in the future

grp_by_stage1
#>    qtr sum_dist
#> 1:  Q1 27188805
#> 2:  Q1   953578
#> 3:  Q1 53201567
#> 4:  Q2  3383527
#> 5:  Q2 58476357
#> 6:  Q2 27397926

The result grp_by_stage1 is a data.table so we can finish off the two-stage aggregation using data.table syntax

grp_by_stage2 = grp_by_stage1[,.(sum_dist = sum(sum_dist)), qtr]

grp_by_stage2
#>    qtr sum_dist
#> 1:  Q1 81343950
#> 2:  Q2 89257810
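The same two-stage pattern can be written with dplyr verbs. This is a sketch that assumes chunk_group_by() and chunk_summarize() act as per-chunk versions of group_by() and summarize(), and that srckeep() limits which columns are loaded from disk:

```r
library(disk.frame)
library(nycflights13)

setup_disk.frame(workers = 2)
flights.df <- as.disk.frame(nycflights13::flights)

# stage 1: aggregate within each chunk, so a quarter may appear once per chunk
stage1 <- flights.df %>%
  srckeep(c("month", "distance")) %>%   # only load the columns needed
  filter(month <= 6) %>%
  mutate(qtr = ifelse(month <= 3, "Q1", "Q2")) %>%
  chunk_group_by(qtr) %>%
  chunk_summarize(sum_dist = sum(distance)) %>%
  collect

# stage 2: combine the per-chunk partial sums into one row per quarter
stage2 <- stage1 %>%
  group_by(qtr) %>%
  summarize(sum_dist = sum(sum_dist))

stage2
```

Two stages are needed because each chunk only sees its own rows; a sum of per-chunk sums gives the overall sum, which is why this works for sum but would not work directly for, say, a median.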

Basic info

To find out where the disk.frame is stored on disk:

# where is the disk.frame stored
attr(flights.df, "path")
#> [1] "C:\\Users\\RTX2080\\AppData\\Local\\Temp\\RtmpQNXBdM\\file3c7837c3338d.df"

A number of data.frame functions are implemented for disk.frame

# get first few rows
head(flights.df, 1)
#>    year month day dep_time sched_dep_time dep_delay arr_time
#> 1: 2013     1   1      517            515         2      830
#>    sched_arr_time arr_delay carrier flight tailnum origin dest
#> 1:            819        11      UA   1545  N14228    EWR  IAH
#>    air_time distance hour minute           time_hour
#> 1:      227     1400    5     15 2013-01-01 05:00:00
# get last few rows
tail(flights.df, 1)
#>    year month day dep_time sched_dep_time dep_delay arr_time
#> 1: 2013     9  30       NA            840        NA       NA
#>    sched_arr_time arr_delay carrier flight tailnum origin dest
#> 1:           1020        NA      MQ   3531  N839MQ    LGA  RDU
#>    air_time distance hour minute           time_hour
#> 1:       NA      431    8     40 2013-09-30 08:00:00
# number of rows
nrow(flights.df)
#> [1] 336776
# number of columns
ncol(flights.df)
#> [1] 19

Contributors

This project exists thanks to all the people who contribute.

Current Priorities

The work priorities at this stage are

  1. Bugs
  2. Urgent feature implementations that can improve an awful user-experience
  3. More vignettes covering every aspect of disk.frame
  4. Comprehensive Tests
  5. Comprehensive Documentation
  6. More features

Blogs and other resources

| Title | Language | Author | Date | Description |
|-------|----------|--------|------|-------------|
| 25 days of disk.frame | English | ZJ | 2019-12-01 | 25 tweets about {disk.frame} |
| https://www.researchgate.net/post/What-is-the-Maximum-size-of-data-that-is-supported-by-R-datamining | English | Knut Jägersberg | 2019-11-11 | Great answer on using disk.frame |
| {disk.frame} is epic | English | Bruno Rodriguez | 2019-09-03 | It’s about loading a 30G file into {disk.frame} |
| My top 10 R packages for data analytics | English | Jacky Poon | 2019-09-03 | {disk.frame} was number 3 |
| useR! 2019 presentation video | English | Dai ZJ | 2019-08-03 | |
| useR! 2019 presentation slides | English | Dai ZJ | 2019-08-03 | |
| Split-apply-combine for Maximum Likelihood Estimation of a linear model | English | Bruno Rodriguez | 2019-10-06 | {disk.frame} used in helping to create a maximum likelihood estimation program for linear models |
| Emma goes to useR! 2019 | English | Emma Vestesson | 2019-07-16 | The first mention of {disk.frame} in a blog post |
| 深入对比数据科学工具箱:Python3 和 R 之争(2020版) | Chinese | Harry Zhu | 2020-02-16 | Mentions disk.frame |

Interested in learning {disk.frame} in a structured course?

Please register your interest at:

https://leanpub.com/c/taminglarger-than-ramwithdiskframe

Open Collective

If you like {disk.frame} and want to speed up its development, or perhaps you have a feature request, please consider sponsoring {disk.frame} on Open Collective.

Backers

Thank you to all our backers!

Sponsor and back {disk.frame}

Support {disk.frame} development by becoming a sponsor. Your logo will show up here with a link to your website.

Contact me for consulting

Do you need help with machine learning and data science in R, Python, or Julia? I am available for Machine Learning/Data Science/R/Python/Julia consulting! Email me

Non-financial ways to contribute

Do you wish to give back to the open-source community in non-financial ways? Here are some ways you can contribute:

  • Write a blogpost about your {disk.frame} usage or experience. I would love to learn more about how {disk.frame} has helped you
  • Tweet or post on social media (e.g. LinkedIn) about {disk.frame} to help promote it
  • Bring attention to typos and grammatical errors by correcting them in a PR, or simply by raising an issue here
  • Star the {disk.frame} GitHub repo
  • Star any repo that {disk.frame} depends on e.g. {fst} and {future}

Related Repos

https://github.com/DiskFrame/disk.frame-fannie-mae-example
https://github.com/DiskFrame/disk.frame-vs
https://github.com/DiskFrame/disk.frame.ml

Download Counts & Build Status

Install

install.packages('disk.frame')

Monthly Downloads

742

Version

0.7.1

License

MIT + file LICENSE

Maintainer

Dai ZJ

Last Published

February 14th, 2022

Functions in disk.frame (0.7.1)

as.data.table.disk.frame

Convert disk.frame to data.table by collecting all chunks
bind_rows.disk.frame

Bind rows
cmap

Apply the same function to all chunks
add_chunk

Add a chunk to the disk.frame
as.data.frame.disk.frame

Convert disk.frame to data.frame by collecting all chunks
collect.disk.frame

Bring the disk.frame into R
chunk_summarize

#' @export #' @importFrom dplyr add_count #' @rdname dplyr_verbs add_count.disk.frame <- create_chunk_mapper(dplyr::add_count) #' @export #' @importFrom dplyr add_tally #' @rdname dplyr_verbs add_tally.disk.frame <- create_chunk_mapper(dplyr::add_tally)
cmap2

`cmap2` applies a function to two disk.frames
as.disk.frame

Make a data.frame into a disk.frame
colnames

Return the column names of the disk.frame
select.disk.frame

The dplyr verbs implemented for disk.frame
compute.disk.frame

Force computations. The results are stored in a folder.
dfglm

Fit generalized linear models (glm) with disk.frame
df_ram_size

Get the size of RAM in gigabytes
create_chunk_mapper

Create a function that applies to each chunk of a disk.frame
get_chunk

Obtain one chunk by chunk id
gen_datatable_synthetic

Generate synthetic dataset for testing
anti_join.disk.frame

Performs join/merge for disk.frames
make_glm_streaming_fn

A streaming function for speedglm
get_chunk_ids

Get the chunk IDs and file names
nchunks

Returns the number of chunks in a disk.frame
get_partition_paths

Get the partitioning structure of a folder
head.disk.frame

Head and tail of the disk.frame
disk.frame

Create a disk.frame from a folder
evalparseglue

Helper function to evalparse some `glue::glue` string
csv_to_disk.frame

Convert CSV file(s) to disk.frame format
is_disk.frame

Checks if a folder is a disk.frame
setup_disk.frame

Set up disk.frame environment
nrow

Number of rows or columns
shard

Shard a data.frame/data.table or disk.frame into chunks and save them into a disk.frame
zip_to_disk.frame

`zip_to_disk.frame` is used to read and convert every CSV file within the zip file to disk.frame format
partition_filter

Filter the dataset based on folder partitions
rechunk

Increase or decrease the number of chunks in the disk.frame
play

Play the recorded lazy operations
disk.frame_to_parquet

A function to convert a disk.frame to parquet format
recommend_nchunks

Recommend number of chunks based on input size
split_string_into_df

Turn a string of the form /partion1=val/partion2=val2 into data.frame
show_ceremony

Show the code to setup disk.frame
move_to

Move or copy a disk.frame to another location
purrr_as_mapper

Used to convert a function to purrr syntax if needed
merge.disk.frame

Merge function for disk.frames
delete

Delete a disk.frame
groups.disk.frame

The shard keys of the disk.frame
remove_chunk

Removes a chunk from the disk.frame
summarise.grouped_disk.frame

A function to parse the summarize function
rbindlist.disk.frame

rbindlist disk.frames together
find_globals_recursively

Find globals in an expression by searching through the chain
foverlaps.disk.frame

Apply data.table's foverlaps to the disk.frame
var_df.chunk_agg.disk.frame

One Stage function
print.disk.frame

Print disk.frame
overwrite_check

Check if the outdir exists or not
pull.disk.frame

Pull a column from table similar to `dplyr::pull`.
sample_frac.disk.frame

Sample n rows from a disk.frame
shardkey

Returns the shardkey (not implemented yet)
shardkey_equal

Compare two disk.frame shardkeys
tbl_vars.disk.frame

Column names for RStudio auto-complete
write_disk.frame

Write disk.frame to disk
srckeep

Keep only the variables from the input listed in selections
[.disk.frame

[ interface for disk.frame using fst backend