A function that takes a character URL as input, fetches its HTML document, and extracts all links following a set of rules.
LinkExtractor(url, id, lev, IndexErrPages, Useragent, Timeout = 5,
URLlenlimit = 255, urlExtfilter, statslinks = FALSE, encod, urlbotfiler,
removeparams)

url: character, the URL to fetch and extract links from.
id: numeric, an id to identify a specific web page in a website collection; auto-generated by default.
lev: numeric, the depth level of the web page; auto-generated by the Rcrawler function.
IndexErrPages: character vector of HTML error status codes to process; by default it's c(200), e.g. to include 404 and 403 pages use c(404, 403) (illustrated in the sketch after this list).
Useragent: character, defaults to "Rcrawler".
Timeout: numeric, defaults to 5 seconds.
URLlenlimit: integer, the URL character-length limit to index, defaults to 255 characters (to avoid spider traps).
urlExtfilter: character vector, the list of file extensions to exclude from indexing. By default a large list is defined (only HTML pages are permitted) in order to prevent downloading large files; to define your own, use c(ext1, ext2, ext3, ...).
statslinks: boolean, specifies whether input and output links should be counted; works only when the function is called from the main Rcrawler function.
encod: character, specifies the encoding of the web page.
urlbotfiler: character vector, directories/files restricted by robots.txt.
removeparams: character vector, the list of URL parameters to be removed/ignored.
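For illustration, a call overriding some of these defaults might look like the following sketch; the status codes, timeout, and parameter names passed here are placeholder values chosen for the example, not defaults from the package documentation:

# Sketch: also process 404 pages, allow 10 seconds per request, and ignore
# two hypothetical tracking parameters when comparing URLs.
page <- LinkExtractor(url = "http://www.glofile.com",
                      IndexErrPages = c(200, 404),
                      Timeout = 10,
                      removeparams = c("utm_source", "utm_medium"))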
Returns a list of two elements: the first is a list containing the web page details (url, encoding type, content type, content, etc.); the second is a character vector containing the retrieved URLs.
# NOT RUN {
pageinfo <- LinkExtractor(url = "http://www.glofile.com")
# }
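Assuming the two-element return structure described above, the page details and the extracted links can then be inspected; this continuation is a sketch, not part of the original example:

pageinfo[[1]]           # list of web page details (url, encoding type, content type, content, ...)
links <- pageinfo[[2]]  # character vector of the retrieved URLs
head(links)             # preview the first few extracted links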