getURL: Download a URI

Description

These functions download one or more URIs (a.k.a. URLs). It uses libcurl under the hood to perform the request and retrieve the response. There are a myriad of options that can be specified using the … mechanism to control the creation and submission of the request and the processing of the response.

getURLContent has been added as a high-level function like getURL and getBinaryURL but which determines the type of the content being downloaded by looking at the resulting HTTP header's Content-Type field. It uses this to determine whether the bytes are binary or "text".

The request supports any of the facilities within the version of libcurl that was installed. One can examine these via curlVersion.

getURLContent doesn't perform asynchronous or multiple concurrent requests at present.

Usage

getURL(url, ..., .opts = list(),
        write = basicTextGatherer(.mapUnicode = .mapUnicode),
         curl = getCurlHandle(), async = length(url) > 1,
           .encoding = integer(), .mapUnicode = TRUE)
getURI(url, ..., .opts = list(), 
        write = basicTextGatherer(.mapUnicode = .mapUnicode),
         curl = getCurlHandle(), async = length(url) > 1,
          .encoding = integer(), .mapUnicode = TRUE)
getURLContent(url, ..., curl = getCurlHandle(.opts = .opts), .encoding = NA,
               binary = NA, .opts = list(...),
               header = dynCurlReader(curl, binary = binary,
                                        baseURL = url, isHTTP = isHTTP,
                                         encoding = .encoding),
               isHTTP = length(grep('^[[:space:]]*http', url)) > 0)

Arguments

url

a string giving the URI

…

named values that are interpreted as CURL options governing the HTTP request.

.opts

a named list or CURLOptions object identifying the curl options for the handle. This is merged with the values of … to create the actual options for the curl handle in the request.

write

if explicitly supplied, this is a function that is called with a single argument each time the the HTTP response handler has gathered sufficient text. The argument to the function is a single string. The default argument provides both a function for cumulating this text and is then used to retrieve it as the return value for this function.

curl

the previously initialized CURL context/handle which can be used for multiple requests.

async

a logical value that determines whether the download request should be done via asynchronous,concurrent downloading or a serial download. This really only arises when we are trying to download multiple URIs in a single call. There are trade-offs between concurrent and serial downloads, essentially trading CPU cycles for shorter elapsed times. Concurrent downloads reduce the overall time waiting for getURI/getURL to return.

.encoding

an integer or a string that explicitly identifies the encoding of the content that is returned by the HTTP server in its response to our query. The possible strings are ‘UTF-8’ or ‘ISO-8859-1’ and the integers should be specified symbolically as CE_UTF8 and CE_LATIN1. Note that, by default, the package attempts to process the header of the HTTP response to determine the encoding. This argument is used when such information is erroneous and the caller knows the correct encoding. The default value leaves the decision to this default mechanism. This does however currently involve processing each line/chunk of the header (with a call to an R function). As a result, if one knows the encoding for the resulting response, specifying this avoids this slight overhead which is probably quite small relative to network latency and speed.

.mapUnicode

a logical value that controls whether the resulting text is processed to map components of the form \uxxxx to their appropriate Unicode representation.

binary

a logical value indicating whether the caller knows whether the resulting content is binary (TRUE) or not (FALSE) or unknown (NA).

header

this is made available as a parameter of the function to allow callers to construct different readers for processing the header and body of the (HTTP) response. Callers specifying this will typically only adjust the call to dynCurlReader, e.g. to specify a function for its value parameter to control how the body is post-processed.

The caller can specify a value of TRUE or FALSE for this parameter. TRUE means that the header will be returned along with the body; FALSE corresponds to the default and only the body will be returned. When returning the header, it is first parsed via parseHTTPHeader, unless the value of header is of class AsIs. So to get the raw header, pass the argument as header = I(TRUE).

isHTTP

a logical value that indicates whether the request an HTTP request. This is used when determining how to process the response.

Value

If no value is supplied for write, the result is the text that is the HTTP response. (HTTP header information is included if the header option for CURL is set to TRUE and no handler for headerfunction is supplied in the CURL options.)

Alternatively, if a value is supplied for the write parameter, this is returned. This allows the caller to create a handler within the call and get it back. This avoids having to explicitly create and assign it and then call getURL and then access the result. Instead, the 3 steps can be inlined in a single call.

References

Curl homepage http://curl.haxx.se

Examples

Run this code

# NOT RUN {
  omegahatExists = url.exists("http://www.omegahat.net")

   # Regular HTTP
  if(omegahatExists) {
     txt = getURL("http://www.omegahat.net/RCurl/")
        # Then we could parse the result.
     if(require(XML))
         htmlTreeParse(txt, asText = TRUE)
  }


        # HTTPS. First check to see that we have support compiled into
        # libcurl for ssl.
  if(interactive() && ("ssl" %in% names(curlVersion()$features))
         && url.exists("https://sourceforge.net/")) {
     txt = tryCatch(getURL("https://sourceforge.net/"),
                    error = function(e) {
                                  getURL("https://sourceforge.net/",
                                            ssl.verifypeer = FALSE)
                              })

  }


     # Create a CURL handle that we will reuse.
  if(interactive() && omegahatExists) {
     curl = getCurlHandle()
     pages = list()
     for(u in c("http://www.omegahat.net/RCurl/index.html",
                "http://www.omegahat.net/RGtk/index.html")) {
         pages[[u]] = getURL(u, curl = curl)
     }
  }


    # Set additional fields in the header of the HTTP request.
    # verbose option allows us to see that they were included.
  if(omegahatExists)
     getURL("http://www.omegahat.net", httpheader = c(Accept = "text/html", 
                                                      MyField = "Duncan"), 
               verbose = TRUE)



    # Arrange to read the header of the response from the HTTP server as
    # a separate "stream". Then we can break it into name-value
    # pairs. (The first line is the HTTP/1.1 200 Ok or 301 Moved Permanently
    # status line)
  if(omegahatExists) {
     h = basicTextGatherer()
     txt = getURL("http://www.omegahat.net/RCurl/index.html",
                  header= TRUE, headerfunction = h$update, 
                  httpheader = c(Accept="text/html", Test=1), verbose = TRUE) 
     print(paste(h$value(NULL)[-1], collapse=""))
     con <- textConnection(paste(h$value(NULL)[-1], collapse=""))
     read.dcf(con)
     close(con)
  }



   # Test the passwords.
  if(omegahatExists) {
     x = getURL("http://www.omegahat.net/RCurl/testPassword/index.html",  userpwd = "bob:duncantl")

       # Catch an error because no authorization
       # We catch the generic HTTPError, but we could catch the more specific "Unauthorized" error
       # type.
      x = tryCatch(getURLContent("http://www.omegahat.net/RCurl/testPassword/index.html"),
                    HTTPError = function(e) {
                                   cat("HTTP error: ", e$message, "\n")
                                })
  }

# }
# NOT RUN {
  #  Needs specific information from the cookie file on a per user basis
  #  with a registration to the NY times.
  x = getURL("http://www.nytimes.com",
                 header = TRUE, verbose = TRUE,
                 cookiefile = "/home/duncan/Rcookies",
                 netrc = TRUE,
                 maxredirs = as.integer(20),
                 netrc.file = "/home2/duncan/.netrc1",
                 followlocation = TRUE)
# }
# NOT RUN {
   if(interactive() && omegahatExists) {
       d = debugGatherer()
       x = getURL("http://www.omegahat.net", debugfunction = d$update, verbose = TRUE)
       d$value()
   }

    #############################################
    #  Using an option set in R

   if(interactive() && omegahatExists) {
      opts = curlOptions(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
      getURL("http://www.omegahat.net/RCurl/testPassword/index.html", verbose = TRUE, .opts = opts)

         # Using options in the CURL handle.
      h = getCurlHandle(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
      getURL("http://www.omegahat.net/RCurl/testPassword/index.html", verbose = TRUE, curl = h)
   }



   # Use a C routine as the reader. Currently gives a warning.
  if(interactive() && omegahatExists) {
     routine = getNativeSymbolInfo("R_internalWriteTest", PACKAGE = "RCurl")$address
     getURL("http://www.omegahat.net/RCurl/index.html", writefunction = routine)
  }



  # Example
  if(interactive() && omegahatExists) {
     uris = c("http://www.omegahat.net/RCurl/index.html",
              "http://www.omegahat.net/RCurl/philosophy.xml")
     txt = getURI(uris)
     names(txt)
     nchar(txt)

     txt = getURI(uris, async = FALSE)
     names(txt)
     nchar(txt)


     routine = getNativeSymbolInfo("R_internalWriteTest", PACKAGE = "RCurl")$address
     txt = getURI(uris, write = routine, async = FALSE)
     names(txt)
     nchar(txt)


         # getURLContent() for text and binary
     x = getURLContent("http://www.omegahat.net/RCurl/index.html")
     class(x)

     x = getURLContent("http://www.omegahat.net/RCurl/data.gz")
     class(x)
     attr(x, "Content-Type")

     x = getURLContent("http://www.omegahat.net/Rcartogram/demo.jpg")
     class(x)
     attr(x, "Content-Type")


     curl = getCurlHandle()
     dd = getURLContent("http://www.omegahat.net/RJSONIO/RJSONIO.pdf",
                        curl = curl,
                        header = dynCurlReader(curl, binary = TRUE,
                                           value = function(x) {
                                                    print(attributes(x)) 
                                                    x}))
   }



  # FTP
  # Download the files within a directory.
if(interactive() && url.exists('ftp://ftp.wcc.nrcs.usda.gov')) {

   url = 'ftp://ftp.wcc.nrcs.usda.gov/data/snow/snow_course/table/history/idaho/'
   filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)

      # Deal with newlines as \n or \r\n. (BDR)
      # Or alternatively, instruct libcurl to change \n's to \r\n's for us with crlf = TRUE
      # filenames = getURL(url, ftp.use.epsv = FALSE, ftplistonly = TRUE, crlf = TRUE)
   filenames = paste(url, strsplit(filenames, "\r*\n")[[1]], sep = "")
   con = getCurlHandle( ftp.use.epsv = FALSE)

      # there is a slight possibility that some of the files that are
      # returned in the directory listing and in filenames will disappear
      # when we go back to get them. So we use a try() in the call getURL.
   contents = sapply(filenames[1:5], function(x) try(getURL(x, curl = con)))
   names(contents) = filenames[1:length(contents)]
}
   
# }

Run the code above in your browser using DataLab