S3buckets: List the contents of an Amazon S3 Bucket

Description

Lists some or all of the contents of an Amazon S3 Bucket.

Usage

listBucket(bucketName="1000genomes", wantDirs=TRUE,
        prefix="release/20101123/interim_phase1_release/", delimiter="/")
getBucketUrls(...)

Arguments

bucketName

The name of the S3 bucket whose contents to list.

wantDirs

Whether to return "directories" or just "flat files". If TRUE, items that are "directories" are returned with a trailing slash. See the examples.

prefix

The "directory" of the bucket to list. For the top level, pass an empty string (""). For any other level, make sure that prefix ends with a trailing slash (/).

delimiter

The character which delimits "directories" in this S3 bucket.

...

Additional arguments to be passed to getBucketUrls.

Value

listBucket returns a character vector containing the names of the items in the bucket. getBucketUrls returns a named character vector of S3 URLs (the names are chromosome names such as 'chr1').

Details

Amazon S3 is a file storage service. Files are stored in "buckets". S3 is not a filesystem and does not explicitly support directories, but it allows you to treat a bucket as though it has a directory structure. For example, if you have a bucket with an item in it called "foo/bar/baz.txt", there is an implicit "foo/" directory, with an implicit "bar/" directory under that.
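The implicit directory structure described above can be sketched in plain R. This is only an illustration of the idea, not the package's implementation: the helper list_level and the key names are hypothetical, the handling of wantDirs = FALSE is an assumed reading of that argument, and the real listBucket output may be formatted differently.

```r
# Hypothetical flat key names, as S3 would store them (no real directories).
keys <- c("foo/bar/baz.txt", "foo/bar/qux.txt", "foo/readme.txt", "top.txt")

# Illustrative helper (NOT part of the package): emulate a one-level,
# non-recursive listing for a given prefix and delimiter.
list_level <- function(keys, prefix = "", delimiter = "/", wantDirs = TRUE) {
  # Keep only the keys under the prefix, then strip the prefix.
  under <- keys[startsWith(keys, prefix)]
  rest <- substring(under, nchar(prefix) + 1)
  rest <- rest[nzchar(rest)]
  if (length(rest) == 0) return(character(0))

  # A key that still contains the delimiter lives in an implicit "directory".
  pos <- as.integer(regexpr(delimiter, rest, fixed = TRUE))
  is_dir <- pos > 0
  entry <- ifelse(is_dir, substring(rest, 1, pos + nchar(delimiter) - 1), rest)

  # Assumed behaviour: wantDirs = FALSE drops the implicit directories.
  if (!wantDirs) entry <- entry[!is_dir]
  if (length(entry) == 0) return(character(0))
  unique(paste0(prefix, entry))
}

top_level <- list_level(keys, prefix = "")
foo_level <- list_level(keys, prefix = "foo/")
files_only <- list_level(keys, prefix = "", wantDirs = FALSE)
```

Under these assumptions, the top level contains the implicit "foo/" directory and the file "top.txt", while listing with prefix "foo/" reveals "foo/bar/" and "foo/readme.txt" without descending any further.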

If a bucket contains many thousands of objects, listing its entire contents can take a while, so these functions do not attempt it. Instead, they list the subset of the bucket's contents under a given prefix. The listings are not recursive, but they provide the information you need to explore a bucket interactively, one level at a time.

getBucketUrls is designed specifically for use with the BiocCloud package and the files in the "1000genomes" bucket. In addition to the bare listing that listBucket provides, getBucketUrls returns a named list containing a subset of the bucket contents (those files whose names match the pattern "ALL.chr[0-9]{1,2}"), where the names are the chromosomes (e.g. "chr1"). The values in the list are fully qualified URLs rather than bare S3 key names. See the vignette for a use case that takes advantage of this.
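As a rough sketch of the filtering and naming step described above, the pattern can be applied to a set of key names with base R. The key names and the base URL below are assumptions made for illustration; they are not taken from the package, and the real function may build its URLs differently.

```r
# Hypothetical key names; the real bucket contents may differ.
keys <- c(
  "release/x/ALL.chr1.example.vcf.gz",
  "release/x/ALL.chr22.example.vcf.gz",
  "release/x/README.txt"
)

# The pattern quoted in the text above.
pattern <- "ALL.chr[0-9]{1,2}"

# Keep only the keys that match, and pull out the matched part of each.
hits <- keys[grepl(pattern, keys)]
matched <- regmatches(hits, regexpr(pattern, hits))

# Drop the leading "ALL." to obtain chromosome names such as "chr1".
chrom <- sub("ALL.", "", matched, fixed = TRUE)

# Assumed URL form for illustration only.
base_url <- "http://s3.amazonaws.com/1000genomes/"
urls <- setNames(paste0(base_url, hits), chrom)
```

Here urls is a named character vector; as.list(urls) would give the same information as a named list, which can then be indexed by chromosome, for example urls[["chr22"]].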

Examples

## list the top-level contents of the 1000genomes bucket:
listBucket(prefix="")

## It contains a pilot_data/ "directory", which we can examine:
listBucket(prefix="pilot_data/")

## Retrieve a named list of URLs suitable for parallel processing
getBucketUrls()
