A function that takes a character URL as input, fetches its HTML document, and extracts all links following a set of rules.
LinkExtractor(url, id, lev, IndexErrPages, Useragent, Timeout = 5,
URLlenlimit = 255, urlExtfilter, statslinks = FALSE, encod, urlbotfiler,
removeparams)

url: character, the URL to fetch and extract links from.
id: numeric, an id to identify a specific web page in a website collection; auto-generated by default.
lev: numeric, the depth level of the web page; auto-generated by the Rcrawler function.
IndexErrPages: character vector of HTML error status codes to process; by default it's c(200), e.g. to include 404 and 403 pages use c(404, 403) (illustrated in the sketch after this list).
Useragent: character, defaults to "Rcrawler".
Timeout: numeric, defaults to 5 seconds.
URLlenlimit: integer, the URL character-length limit to index, defaults to 255 characters (to avoid spider traps).
urlExtfilter: character vector, the list of file extensions to exclude from indexing. By default a large list is defined (only HTML pages are permitted) in order to prevent downloading large files; to define your own, use c(ext1, ext2, ext3, ...).
statslinks: boolean, specifies whether input and output links should be counted; works only when the function is called from the main Rcrawler function.
encod: character, specifies the encoding of the web page.
urlbotfiler: character vector, directories/files restricted by robots.txt.
removeparams: character vector, the list of URL parameters to be removed/ignored.
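For illustration, a call overriding some of these defaults might look like the following sketch; the status codes, timeout, and parameter names passed here are placeholder values chosen for the example, not defaults from the package documentation:

# Sketch: also process 404 pages, allow 10 seconds per request, and ignore
# two hypothetical tracking parameters when comparing URLs.
page <- LinkExtractor(url = "http://www.glofile.com",
                      IndexErrPages = c(200, 404),
                      Timeout = 10,
                      removeparams = c("utm_source", "utm_medium"))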
Returns a list of two elements: the first is a list containing the web page details (url, encoding type, content type, content, etc.); the second is a character vector containing the retrieved URLs.
# NOT RUN {
pageinfo <- LinkExtractor(url = "http://www.glofile.com")
# }
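Assuming the two-element return structure described above, the page details and the extracted links can then be inspected; this continuation is a sketch, not part of the original example:

pageinfo[[1]]           # list of web page details (url, encoding type, content type, content, ...)
links <- pageinfo[[2]]  # character vector of the retrieved URLs
head(links)             # preview the first few extracted links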