getLinkContent: Get main content for corpus items, specified by links.

Description

getLinkContent downloads and extracts content from web links for Corpus objects. It is typically called as a post-processing function (field $postFUN) of most WebSource objects. Content is downloaded in chunks, which has proven more stable for large content requests.
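The chunked download described above can be illustrated with a minimal base R sketch. It is not the package's implementation: fetch_chunk is a hypothetical stand-in for the real download call, and the helper name download_in_chunks is an assumption made for illustration only.

```r
# Minimal sketch of chunked retrieval (assumption: not the package's own code).
# fetch_chunk is a hypothetical function that takes a character vector of
# links and returns a character vector of the same length.
download_in_chunks <- function(links, fetch_chunk, chunksize = 20,
                               sleep.time = 0) {
  if (length(links) == 0) return(character(0))
  # Assign each link to a chunk: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(links) / chunksize)
  chunks <- split(links, chunk_id)
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    results[[i]] <- fetch_chunk(chunks[[i]])
    # Pause between chunks, but not after the last one
    if (i < length(chunks)) Sys.sleep(sleep.time)
  }
  unlist(results, use.names = FALSE)
}

# Usage with a fake fetcher that needs no network access
links <- sprintf("http://example.com/%d", 1:45)
fake_fetch <- function(x) paste("content of", x)
content <- download_in_chunks(links, fake_fetch, chunksize = 20)
```

With 45 links and chunksize = 20, the fake fetcher is called three times (20, 20, and 5 links), and the results come back in the original link order.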

Usage

getLinkContent(corpus, links = sapply(corpus, meta, "Origin"),
  timeout.request = 30, chunksize = 20, verbose = getOption("verbose"),
  curlOpts = curlOptions(verbose = FALSE, followlocation = TRUE, maxconnects =
  5, maxredirs = 10, timeout = timeout.request, connecttimeout =
  timeout.request, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, useragent =
  "R"), retry.empty = 3, sleep.time = 3, extractor = ArticleExtractor,
  .encoding = integer(), ...)

Arguments

corpus

object of class Corpus for which link content should be downloaded

links

character vector specifying links to be used for download, defaults to sapply(corpus, meta, "Origin")

timeout.request

timeout (in seconds) to be used for connections/requests, defaults to 30

curlOpts

curl options to be passed to getURL

chunksize

Size of download chunks to be used for parallel retrieval, defaults to 20

verbose

Specifies if retrieval info should be printed, defaults to getOption("verbose")

retry.empty

Specifies the number of times sites that return empty content should be retried, defaults to 3

sleep.time

Sleep time to be used between chunk downloads, defaults to 3 (seconds)

extractor

Extractor to be used for content extraction, defaults to ArticleExtractor

...

additional parameters to getURL

.encoding

encoding to be used for getURL, defaults to integer() (=autodetect)
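One way to picture the retry.empty and sleep.time arguments is the following base R sketch. It is an assumption about the general pattern, not the package's actual logic: fetch_fun is a hypothetical function mapping links to content strings, and retry_empty is an illustrative helper name.

```r
# Sketch of retrying links that returned empty content
# (assumption: illustrative only, not the package's implementation).
retry_empty <- function(links, fetch_fun, retry.empty = 3, sleep.time = 0) {
  content <- fetch_fun(links)
  for (attempt in seq_len(retry.empty)) {
    # Treat NA and "" as empty content
    empty <- is.na(content) | !nzchar(content)
    if (!any(empty)) break
    Sys.sleep(sleep.time)
    # Re-request only the links whose content is still empty
    content[empty] <- fetch_fun(links[empty])
  }
  content
}

# Usage with a fake fetcher that fails once for link "b"
calls <- 0
flaky_fetch <- function(x) {
  calls <<- calls + 1
  # First call returns empty content for "b"; later calls succeed
  ifelse(calls == 1 & x == "b", "", paste("content of", x))
}
result <- retry_empty(c("a", "b", "c"), flaky_fetch)
```

Only the links with empty content are requested again, so a single failing link does not trigger a full re-download.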

Value

corpus including downloaded link content
