BiocCloud (version 0.99.3)

S3buckets: List the contents of an Amazon S3 Bucket


Lists some or all of the contents of an Amazon S3 Bucket.


listBucket(bucketName="1000genomes", wantDirs=TRUE,
        prefix="release/20101123/interim_phase1_release/", delimiter="/")


bucketName: The name of the S3 bucket whose contents to list.
wantDirs: Whether to return "directories" or just "flat files". If TRUE, items which are "directories" will have a trailing slash. See examples.
prefix: The "directory" of the bucket to list. For the top level, pass an empty string (""). For other levels, be sure that prefix ends with a trailing slash (/).
delimiter: The character which delimits "directories" in this S3 bucket.
...: Additional arguments to be passed to getBucketUrls.


listBucket returns a character vector containing the names of the items in the bucket. getBucketUrls returns a named character vector of S3 URLs (the names are chromosome names such as 'chr1').


Amazon S3 is a file storage service. Files are stored in "buckets". S3 is not a filesystem and does not explicitly support directories, but it allows you to treat a bucket as though it has a directory structure. For example, if you have a bucket with an item in it called "foo/bar/baz.txt", there is an implicit "foo/" directory, with an implicit "bar/" directory under that.
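The implicit directory structure can be illustrated without touching S3 at all. The following is a minimal sketch in base R, assuming a plain character vector of key names; list_level is a hypothetical helper written for this illustration and is not part of BiocCloud, so the ordering and exact form of what listBucket returns may differ.

```r
# Hypothetical helper (not part of BiocCloud): emulate a one-level,
# prefix/delimiter listing over a plain character vector of S3 keys.
list_level <- function(keys, prefix = "", delimiter = "/", wantDirs = TRUE) {
  # Keep only the keys that live under the requested prefix.
  under <- keys[startsWith(keys, prefix)]
  # Strip the prefix so only the part below it remains.
  rest <- substring(under, nchar(prefix) + 1L)
  rest <- rest[nzchar(rest)]
  # Position of the first delimiter in each remainder (-1 if there is none).
  pos <- as.integer(regexpr(delimiter, rest, fixed = TRUE))
  is_dir <- pos > 0L
  # Remainders without a delimiter are "flat files" at this level.
  files <- rest[!is_dir]
  # Everything up to and including the first delimiter is an implicit "directory".
  dirs <- unique(substring(rest[is_dir], 1L, pos[is_dir] + nchar(delimiter) - 1L))
  items <- if (wantDirs) c(dirs, files) else files
  # paste0() would recycle a zero-length vector to "", so return early.
  if (length(items) == 0L) return(character(0))
  paste0(prefix, items)
}

keys <- c("README.txt", "foo/bar/baz.txt", "foo/bar/qux.txt", "foo/top.txt")
list_level(keys, prefix = "")      # "foo/" "README.txt"
list_level(keys, prefix = "foo/")  # "foo/bar/" "foo/top.txt"
```

Note that "foo/" and "foo/bar/" never appear as keys themselves; they are derived purely from the key names, which is what "implicit directory" means here.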

If a bucket has many thousands of objects in it, it can take a while to list the entire contents, and these functions do not do so. These functions support listing a subset of the bucket contents by a given prefix. These listings are not recursive, but they provide the information you need to interactively query the contents of a bucket.

getBucketUrls is designed specifically for use with the BiocCloud package and the files in the "1000genomes" bucket. In addition to the bare listing that listBucket provides, getBucketUrls returns a named list containing a subset of bucket contents (those files that contain the pattern "ALL.chr[0-9]{1,2}"), where the names are the chromosomes (e.g. "chr1"). The values in the list are fully qualified URLs instead of just S3 key names. See the vignette for a use case that takes advantage of this.
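The filtering and naming step described above can be sketched in base R as follows. This is an illustration only, not the package's implementation: urls_by_chromosome is a hypothetical helper, and the key names and base URL are placeholders invented for the example.

```r
# Hypothetical helper (not the package's implementation): keep the keys that
# match the chromosome pattern and return their URLs, named by chromosome.
urls_by_chromosome <- function(keys, base_url) {
  pattern <- "ALL.chr[0-9]{1,2}"
  hits <- keys[grepl(pattern, keys)]
  # paste0() would recycle a zero-length vector to "", so return early.
  if (length(hits) == 0L) return(list())
  # Extract the matched text, e.g. "ALL.chr1", then drop the 4-character "ALL." part.
  matched <- regmatches(hits, regexpr(pattern, hits))
  chrom <- substring(matched, 5L)
  # Turn bare key names into fully qualified URLs.
  urls <- paste0(base_url, hits)
  names(urls) <- chrom
  as.list(urls)
}

# Placeholder inputs, made up for illustration.
keys <- c("release/x/ALL.chr1.example.vcf.gz",
          "release/x/ALL.chr22.example.vcf.gz",
          "release/x/README")
base_url <- "https://example.org/bucket/"

res <- urls_by_chromosome(keys, base_url)
names(res)  # "chr1" "chr22"
```

Because each element carries its chromosome as a name, the result can be handed directly to lapply or a parallel equivalent, one URL per worker.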


## list the top-level contents of the 1000genomes bucket:

## It contains a pilot_data/ "directory", which we can examine:

## Retrieve a named list of URLs suitable for parallel processing
