AnnotationHub-objects: AnnotationHub objects and their related methods and functions

Description

Use AnnotationHub to interact with Bioconductor's AnnotationHub service. Query the instance to discover and use resources that are of interest, and then easily download and import the resource into R for immediate use.

Use AnnotationHub() to retrieve information about all records in the hub.

Discover records in a hub using mcols(), query(), subset(), [, and display().

Retrieve individual records using [[. On first use of a resource, the corresponding files or other hub resources are downloaded from the internet to a local cache. On this and all subsequent uses, the files are read quickly from the cache into the R session.

AnnotationHub records can be added (and sometimes removed) at any time. snapshotDate() restricts hub records to those available at the time of the snapshot; use possibleDates() to see possible snapshot dates.
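As a rough illustration of the snapshot accessors described above, the sketch below restricts a hub to one of its possible snapshot dates and compares record counts. The helper name at_snapshot is hypothetical, not part of the package, and the live part is guarded because it needs the AnnotationHub package and network access.

```r
# Hypothetical helper (not part of AnnotationHub): return a copy of `hub`
# restricted to the snapshot `date`, which should be one of possibleDates(hub).
at_snapshot <- function(hub, date) {
  snapshotDate(hub) <- date
  hub
}

# Guarded: requires the AnnotationHub package and an internet connection.
if (interactive() && requireNamespace("AnnotationHub", quietly = TRUE)) {
  library(AnnotationHub)
  ah <- AnnotationHub()
  dates <- possibleDates(ah)
  pinned <- at_snapshot(ah, dates[1])
  # Record counts under the current snapshot and the chosen one
  print(c(current = length(ah), pinned = length(pinned)))
}
```

Because the helper works on a copy, the original ah keeps its own snapshot date.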

The location of the local cache can be found (and updated) with getAnnotationHubCache and setAnnotationHubCache; removeCache removes all cache resources.

Arguments

Constructors

AnnotationHub(..., hub=getAnnotationHubOption("URL"), cache=getAnnotationHubOption("CACHE"), proxy=getAnnotationHubOption("PROXY")): Create an AnnotationHub instance, possibly updating the current database of records.

Accessors

In the code snippets below, x and object are AnnotationHub objects.

hubCache(x): Gets the file system location of the local AnnotationHub cache.

hubUrl(x): Gets the URL for the online hub.

length(x): Get the number of hub records.

names(x): Get the names (AnnotationHub unique identifiers, of the form AH12345) of the hub records.

fileName(x): Get the file path of the hub records as stored in the local cache (AnnotationHub files are stored as unique numbers, of the form 12345). NA is returned for those records which have not been cached.

mcols(x): Get the metadata columns describing each record. Columns include:

title: Record title, frequently the file name of the object.
dataprovider: Original provider of the resource, e.g., Ensembl, UCSC.
species: The species for which the record is most relevant, e.g., ‘Homo sapiens’.
taxonomyid: NCBI taxonomy identifier of the species.
genome: Genome build relevant to the record, e.g., hg19.
description: Textual description of the resource, frequently automatically generated from file path and other information available when the record was created.
tags: Single words added to the record to facilitate identification, e.g., TCGA, Roadmap.
rdataclass: The class of the R object used to represent the object when imported into R, e.g., GRanges, VCFFile.
sourceurl: Original URL of the resource.
sourcetype: Format of the original resource, e.g., BED file.

dbconn(x): Return an open connection to the underlying SQLite database.

dbfile(x): Return the full path to the underlying SQLite database.

.db_close(conn): Close the SQLite connection conn returned by dbconn(x).

Subsetting and related operations

In the code snippets below, x is an AnnotationHub object.

x$name: Convenient reference to individual metadata columns, e.g., x$species.

x[i]: Numerical, logical, or character vector (of AnnotationHub names) to subset the hub, e.g., x[x$species == "Homo sapiens"].

x[[i]]: Numerical or character scalar to retrieve (if necessary) and import the resource into R.

query(x, pattern, ignore.case=TRUE, pattern.op=`&`): Return an AnnotationHub subset containing only those elements whose metadata matches pattern. Each element of pattern is matched, as in grepl, against the as.character representation of each metadata column; the results for the individual elements are then combined with pattern.op (by default a logical `&`), e.g., query(x, c("Homo sapiens", "hg19", "GTF")).

pattern: A character vector of patterns to search (via grepl) for in any of the mcols() columns.
ignore.case: A logical(1) vector indicating whether the search should ignore case (TRUE) or not (FALSE).
pattern.op: Any function of two arguments, describing how matches across pattern elements are to be combined. The default `&` returns only records whose metadata columns match all elements of pattern.
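The matching rules described for query() can be sketched in base R. This is an illustration of the documented behaviour under stated assumptions, not the package's implementation: the data frame meta is an invented stand-in for mcols(x), and match_records is a hypothetical helper. A pattern element counts as matched when grepl finds it in any column, and pattern.op combines the per-element results.

```r
# Toy stand-in for mcols(x); rows and values are invented for illustration.
meta <- data.frame(
  title   = c("Homo_sapiens.GRCh37.75.gtf", "Mus_musculus.GRCm38.77.gtf",
              "hg19.cytoBand"),
  species = c("Homo sapiens", "Mus musculus", "Homo sapiens"),
  genome  = c("GRCh37", "GRCm38", "hg19"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the documented semantics of query().
match_records <- function(meta, pattern, ignore.case = TRUE,
                          pattern.op = `&`) {
  per_pattern <- lapply(pattern, function(p) {
    # TRUE for a record when pattern p is found in any of its columns
    hits <- lapply(meta, function(col) {
      grepl(p, as.character(col), ignore.case = ignore.case)
    })
    Reduce(`|`, hits)
  })
  # Combine the per-pattern results, by default requiring all to match
  Reduce(pattern.op, per_pattern)
}

all_match <- match_records(meta, c("homo sapiens", "gtf"))
any_match <- match_records(meta, c("homo sapiens", "gtf"), pattern.op = `|`)
```

With the default `&`, only the first toy record matches both patterns; with `|`, every record matches at least one of them.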

subset(x, subset): Return the subset of records containing only those elements whose metadata satisfies the expression in subset. The expression can reference columns of mcols(x), and should return a logical vector of length length(x). e.g.,

subset(x, species == "Homo sapiens" &
        genome == "GRCh38")

display(object): Open a web browser allowing for easy selection of hub records via interactive tabular display. Return value is the subset of hub records identified while navigating the display.

recordStatus(hub, record): Returns a data.frame of the record id and status. hub must be a Hub object and record must be a character(1). Can be used to discover why a resource was removed from the hub.

Cache and hub management

In the code snippets below, x is an AnnotationHub object.

snapshotDate(x) and snapshotDate(x) <- value: Gets or sets the date for the snapshot in use. value should be one of possibleDates().

possibleDates(x): Lists dates for snapshots that the hub could potentially use.

cache(x) and cache(x) <- NULL: Adds (downloads) all resources in x, or removes all local resources corresponding to the records in x from the cache. In this case, x would typically be a small subset of AnnotationHub resources.

hubUrl(x): Gets the URL for the online AnnotationHub.

hubCache(x): Gets the file system location of the local AnnotationHub cache.

removeCache(x): Removes local AnnotationHub database and all related resources. After calling this function, the user will have to download any AnnotationHub resources again.

getAnnotationHubOption(): TODO: Get cache options "CACHE", "URL", "MAXDOWNLOADS" ...

setAnnotationHubOption(): TODO: Set cache options "CACHE", "URL", "MAXDOWNLOADS" ...
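A minimal sketch of the cache operations listed above, assuming the AnnotationHub package is installed and the hub is reachable. The helper name cache_roundtrip is hypothetical, and the query is borrowed from the Examples section on the assumption that it still matches at least one record. Note that it downloads a real resource.

```r
# Hypothetical helper (not part of AnnotationHub): cache the resources of a
# small sub-hub, record their local paths, then remove them from the cache.
cache_roundtrip <- function(sub_hub) {
  before <- fileName(sub_hub)   # NA for records that are not yet cached
  cache(sub_hub)                # download the resources into the local cache
  after <- fileName(sub_hub)    # paths of the cached files
  cache(sub_hub) <- NULL        # remove those resources from the cache again
  list(before = before, after = after)
}

# Guarded: requires the AnnotationHub package and an internet connection.
if (interactive() && requireNamespace("AnnotationHub", quietly = TRUE)) {
  library(AnnotationHub)
  ah <- AnnotationHub()
  hits <- query(ah, c("GTF", "77", "Ensembl", "Homo sapiens"))
  if (length(hits) > 0) {
    print(cache_roundtrip(hits[1]))
  }
}
```

Passing a single-record sub-hub keeps the download small, in line with the note that x would typically be a small subset of resources.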

Coercion

In the code snippets below, x is an AnnotationHub object.

as.list(x): Coerce x to a list of hub instances, one entry per element. Primarily for internal use.

c(x, ...): Concatenate one or more sub-hubs. Sub-hubs must reference the same AnnotationHub instance. Duplicate entries are removed.

Examples

  ## create an AnnotationHub object
  library(AnnotationHub)
  ah = AnnotationHub()

  ## Summary of available records
  ah

  ## Detail for a single record
  ah[1]

  ## and what is the date we are using?
  snapshotDate(ah)

  ## how many resources?
  length(ah)

  ## from which data providers is data available?
  head(sort(table(ah$dataprovider), decreasing=TRUE))

  ## from which species is data available?
  head(sort(table(ah$species), decreasing=TRUE))

  ## what web service and local cache does this AnnotationHub point to?
  hubUrl(ah)
  hubCache(ah)

  ### Examples ###

  ## One can search the hub for multiple strings
  ahs2 <- query(ah, c("GTF", "77","Ensembl", "Homo sapiens"))
  
  ## information about the file can be retrieved using 
  ahs2[1]

  ## individual metadata columns, such as the sourceurl, can be
  ## extracted with $:
  ahs2$sourceurl 
  ahs2$description
  ahs2$title

  ## We can download a file by position like this (using list semantics):
  gr <- ahs2[[1]]
  ## And we can also extract it by name like this:
  res <- ah[["AH28812"]]

  ## the gtf file is returned as a GenomicRanges object and contains
  ## data about which organism it belongs to, its seqlevels and seqlengths
  seqinfo(gr) 

  ## each GenomicRanges contains a metadata slot which can be used to get 
  ## the name of the hub object and other associated metadata. 
  metadata(gr) 
  ah[metadata(gr)$AnnotationHubName]
   
  ## And we can also use "[" to restrict the things that are in the
  ## AnnotationHub object (by position, character, or logical vector).
  ## Here is a demo of position:
  subHub <- ah[1:3]

  if(interactive()) {
    ## Display method involves user interaction through web interface
    ah2 <- display(ah)
  }

  ## recordStatus
  recordStatus(ah, "TEST")
  recordStatus(ah, "AH7220")