
s3fs

s3fs provides a file-system-like interface to Amazon Web Services S3 for R. It uses the paws SDK and R6 for its core design. The package was inspired by Python's s3fs; however, its API and implementation have been developed to follow R's fs.

Installation

You can install the released version of s3fs from CRAN with:

install.packages('s3fs')

R-universe installation:

# Enable repository from dyfanjones
options(repos = c(
  dyfanjones = 'https://dyfanjones.r-universe.dev',
  CRAN = 'https://cloud.r-project.org')
)

# Download and install s3fs in R
install.packages('s3fs')

GitHub installation

remotes::install_github("dyfanjones/s3fs")

Dependencies

  • paws: connection to AWS S3
  • R6: sets up the core class design
  • data.table: wrangles lists into data.frames
  • fs: file-system operations on local files
  • lgr: sets up logging
  • future: sets up async functionality
  • future.apply: sets up parallel looping

Comparison with fs

s3fs attempts to give the same interface as fs when handling files on AWS S3 from R.

  • Vectorization. All s3fs functions are vectorized, accepting multiple path inputs, similar to fs (see the sketch after this list).
  • Predictable.
    • Non-async functions return values that convey a path.
    • Async functions return a future of their non-async counterpart's value.
    • The only exception is s3_stream_in, which returns a list of raw objects.
  • Naming conventions. s3fs functions follow the fs naming conventions dir_*, file_* and path_*, but with the prefix s3_, i.e. s3_dir_*, s3_file_* and s3_path_*.
  • Explicit failure. As with fs, if a failure happens it is raised as an error rather than masked with a warning.
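
For example, because the functions are vectorized, a single call can create or delete several objects at once. A minimal sketch (the bucket name mybucket is a placeholder):

library(s3fs)

# One call creates all three objects and returns their paths
paths <- s3_file_create(s3_path("mybucket", "data", letters[1:3], ext = "csv"))

# The same vectorization applies to deletion
s3_file_delete(paths)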

Extra features:

  • Scalable. All s3fs functions are designed with the option to run in parallel through the use of future and future.apply.

For example, copying a large file from one location to another:

library(s3fs)
library(future)

plan("multisession")

s3_file_copy("s3://mybucket/multipart/large_file.csv", "s3://mybucket/new_location/large_file.csv")

s3fs copies large files (> 5 GB) using multipart uploads; future allows the parts to be transferred in parallel, speeding up the process.
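
The number of parallel workers is controlled through future's plan(). A minimal sketch (the worker count here is arbitrary):

library(future)

# Cap the multipart transfer at four parallel workers
plan(multisession, workers = 4)

# ... run s3fs transfers here ...

# Revert to sequential execution afterwards
plan(sequential)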

  • Async. s3fs uses future to provide a few key async functions, focused on operations that move large files between R and AWS S3.

For example, copying a large file from AWS S3 to the local machine:

library(s3fs)
library(future)

plan("multisession")

s3_file_copy_async("s3://mybucket/multipart/large_file.csv", "large_file.csv")
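
Because s3_file_copy_async() returns a future, the R session is free to do other work while the transfer runs in the background; future::value() blocks until it completes. A minimal sketch building on the example above:

library(s3fs)
library(future)

plan("multisession")

# Returns immediately; the transfer runs in a background session
f <- s3_file_copy_async("s3://mybucket/multipart/large_file.csv", "large_file.csv")

# ... other work ...

# Block until the copy finishes and collect the result
value(f)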

Usage

fs has a straightforward API with four core themes:

  • path_ for manipulating and constructing paths
  • file_ for files
  • dir_ for directories
  • link_ for links

s3fs follows these themes with the following:

  • s3_path_ for manipulating and constructing s3 uri paths
  • s3_file_ for s3 files
  • s3_dir_ for s3 directories

NOTE: link_ is currently not supported.
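
To illustrate the naming parallel, here is the same listing operation in fs and s3fs (a sketch; the local directory and bucket name are placeholders):

library(fs)
library(s3fs)

# fs: list files in a local directory
dir_ls("data")

# s3fs: the same verb, prefixed with s3_, applied to an S3 uri
s3_dir_ls("s3://mybucket/data")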

library(s3fs)

# Construct a path to a file with `s3_path()`
s3_path("foo", "bar", letters[1:3], ext = "txt")
#> [1] "s3://foo/bar/a.txt" "s3://foo/bar/b.txt" "s3://foo/bar/c.txt"

# list buckets
s3_dir_ls()
#> [1] "s3://MyBucket1"
#> [2] "s3://MyBucket2"                                        
#> [3] "s3://MyBucket3"               
#> [4] "s3://MyBucket4"                            
#> [5] "s3://MyBucket5"

# list files in bucket
s3_dir_ls("s3://MyBucket5")
#> [1] "s3://MyBucket5/iris.json"     "s3://MyBucket5/athena-query/"
#> [3] "s3://MyBucket5/data/"         "s3://MyBucket5/default/"     
#> [5] "s3://MyBucket5/iris/"         "s3://MyBucket5/made-up/"     
#> [7] "s3://MyBucket5/test_df/"

# create a new directory
tmp <- s3_dir_create(s3_file_temp(tmp_dir = "MyBucket5"))
tmp
#> [1] "s3://MyBucket5/filezwkcxx9q5562"

# create new files in that directory
s3_file_create(s3_path(tmp, "my-file.txt"))
#> [1] "s3://MyBucket5/filezwkcxx9q5562/my-file.txt"
s3_dir_ls(tmp)
#> [1] "s3://MyBucket5/filezwkcxx9q5562/my-file.txt"

# remove files from the directory
s3_file_delete(s3_path(tmp, "my-file.txt"))
s3_dir_ls(tmp)
#> character(0)

# remove the directory
s3_dir_delete(tmp)

Created on 2022-06-21 by the reprex package (v2.0.1)

Similar to fs, s3fs is designed to work well with the pipe.

library(s3fs)
paths <- s3_file_temp(tmp_dir = "MyBucket") |>
 s3_dir_create() |>
 s3_path(letters[1:5]) |>
 s3_file_create()
paths
#> [1] "s3://MyBucket/fileazqpwujaydqg/a"
#> [2] "s3://MyBucket/fileazqpwujaydqg/b"
#> [3] "s3://MyBucket/fileazqpwujaydqg/c"
#> [4] "s3://MyBucket/fileazqpwujaydqg/d"
#> [5] "s3://MyBucket/fileazqpwujaydqg/e"

paths |> s3_file_delete()
#> [1] "s3://MyBucket/fileazqpwujaydqg/a"
#> [2] "s3://MyBucket/fileazqpwujaydqg/b"
#> [3] "s3://MyBucket/fileazqpwujaydqg/c"
#> [4] "s3://MyBucket/fileazqpwujaydqg/d"
#> [5] "s3://MyBucket/fileazqpwujaydqg/e"

Created on 2022-06-22 by the reprex package (v2.0.1)

NOTE: all examples have been adapted from fs.

File systems that emulate S3

s3fs allows you to connect to file systems that provide an S3-compatible interface. For example, MinIO offers high-performance, S3-compatible object storage. You can connect to your MinIO server using s3fs::s3_file_system():

library(s3fs)

s3_file_system(
  aws_access_key_id = "minioadmin",  
  aws_secret_access_key = "minioadmin",
  endpoint = "http://localhost:9000"
)

s3_dir_ls()
#> [1] ""

s3_bucket_create("s3://testbucket")
#> [1] "s3://testbucket"

# refresh cache
s3_dir_ls(refresh = TRUE)
#> [1] "s3://testbucket"

s3_bucket_delete("s3://testbucket")
#> [1] "s3://testbucket"

# refresh cache
s3_dir_ls(refresh = TRUE)
#> [1] ""

Created on 2022-12-14 with reprex v2.0.2

NOTE: if you want to change from AWS S3 to MinIO in the same R session, you will need to set the parameter refresh = TRUE when calling s3_file_system() again. To hold multiple connections at once, use the R6 class S3FileSystem directly.
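
For example, separate connections to AWS S3 and a local MinIO server could be held in one session by instantiating S3FileSystem twice. A hedged sketch: the constructor arguments mirror s3_file_system(), and the method names are assumed to mirror the s3_* functions (see ?S3FileSystem for the exact interface):

library(s3fs)

# Default AWS credentials picked up from the environment
aws <- S3FileSystem$new()

# Explicit credentials and endpoint for a local MinIO server
minio <- S3FileSystem$new(
  aws_access_key_id = "minioadmin",
  aws_secret_access_key = "minioadmin",
  endpoint = "http://localhost:9000"
)

# Each object keeps its own credentials and cache (method names assumed)
aws$dir_ls()
minio$dir_ls()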

Feedback wanted

Please open a GitHub issue for any bugs or feature requests.


Version: 0.1.7

License: MIT + file LICENSE


Maintainer: Dyfan Jones

Last Published: August 29th, 2024

Functions in s3fs (0.1.7)

  • copy: Copy files and directories
  • copy_async: Copy files and directories
  • create: Create files and directories
  • delete: Delete files and directories
  • delete_async: Delete files and directories
  • download: Download files and directories
  • download_async: Download files and directories
  • exists: Check if file or directory exists
  • file_type: Functions to test for file types
  • info: Get files and directories information
  • path: Construct path for file or directory
  • path_manipulate: Manipulate s3 uri paths
  • permission: Change file permissions
  • reexports: Objects exported from other packages
  • s3_bucket_delete: Delete bucket
  • s3_dir_ls_url: Generate presigned url to list S3 directories
  • s3_dir_tree: Print contents of directories in a tree-like format
  • s3_file_move: Move or rename S3 files
  • s3_file_move_async: Move or rename S3 files
  • s3_file_system: Access AWS S3 as if it were a file system
  • s3_file_temp: Create name for temporary files
  • s3_file_url: Generate presigned url for S3 object
  • s3_file_version_info: Query file version metadata
  • s3_path_join: Construct AWS S3 path
  • s3_path_split: Split s3 path and uri
  • s3fs-package: s3fs: 'Amazon Web Service S3' File System
  • S3FileSystem: Access AWS S3 as if it were a file system
  • stream: Streams data from R to AWS S3
  • stream_async: Streams data from R to AWS S3
  • tag: Modifying file tags
  • touch: Change file modification time
  • upload: Upload file and directory
  • upload_async: Upload file and directory