
Rcrawler (version 0.1.5)

Rcrawler: Rcrawler

Description

The crawler's main function. By providing only the website URL and the XPath patterns to extract, this function can crawl the whole website (traverse web pages and collect links) and scrape/extract its content in an automated manner to produce a structured dataset. The crawling operation is performed by several concurrent processes or nodes in parallel, so it is recommended to use the 64-bit version of R.

Usage

Rcrawler(Website, no_cores, no_conn, MaxDepth, DIR, RequestsDelay = 0,
  Obeyrobots = FALSE, Useragent, Timeout = 5, URLlenlimit = 255,
  urlExtfilter, urlregexfilter, ignoreUrlParams, KeywordsFilter,
  KeywordsAccuracy, statslinks = FALSE, Encod, ExtractPatterns, PatternsNames,
  ExcludePatterns, ExtractAsText = TRUE, ManyPerPattern = FALSE,
  NetworkData = FALSE)

Arguments

Website

character, the root URL of the website to crawl and scrape.

no_cores

integer, the number of clusters (logical CPUs) to use for parallel crawling; by default, the number of available cores.

no_conn

integer, the number of concurrent connections per core; by default it takes the same value as no_cores.

MaxDepth

integer, the maximum depth level for the crawler. This is not the file depth in a directory structure, but 1 + the number of links between this document and the root document; defaults to 10.

DIR

character, the path of the local repository where all crawled data will be stored, e.g. "C:/collection"; by default, the R working directory.

RequestsDelay

integer, the time interval between each round of parallel HTTP requests, in seconds, used to avoid overloading the website server; defaults to 0.

Obeyrobots

boolean, if TRUE, the crawler will parse the website's robots.txt file and obey its rules (allowed and disallowed directories).

Useragent

character, the User-Agent HTTP header that is supplied with any HTTP requests made by this function. It can be important to simulate different browsers' user agents to continue crawling without getting banned.

Timeout

integer, the maximum request time: the number of seconds to wait for a response before giving up, in order to avoid wasting time on slow servers or huge pages; defaults to 5 seconds.

URLlenlimit

integer, the maximum URL length to crawl, used to avoid spider traps; defaults to 255.

urlExtfilter

character vector, by default the crawler avoids file types that are irrelevant for data scraping, such as xml, js, css, pdf, zip, etc. It is not recommended to change the default value unless you can provide the complete list of file types to be excluded.

urlregexfilter

character vector, filter crawled URLs by a regular expression pattern. This is useful when you want to scrape content or index only specific web pages (e.g., product or post pages).

ignoreUrlParams

character vector, the list of URL parameters to be ignored during crawling.

KeywordsFilter

character vector, for users who want to scrape or collect only web pages containing one or more given keywords. Rcrawler calculates an accuracy score based on the number of keywords found. This parameter must be a vector with at least one keyword, e.g. c("mykeyword").

KeywordsAccuracy

integer value between 0 and 100, used only with the KeywordsFilter parameter to determine the accuracy threshold of web pages to collect. A web page's accuracy value is calculated from the number of matched keywords and their occurrences.

statslinks

boolean, if TRUE, the crawler counts the number of input and output links of each crawled web page.

Encod

character, sets the website character encoding; by default the crawler automatically detects the website's declared character encoding.

ExtractPatterns

character vector, the XPath patterns to use for the data extraction process.

PatternsNames

character vector, the names given to each XPath pattern to extract.

ExcludePatterns

character vector, XPath patterns to exclude from the selected ExtractPatterns.

ExtractAsText

boolean, default TRUE; HTML and PHP tags are stripped from the extracted content.

ManyPerPattern

boolean, if FALSE only the first element matched by the pattern is extracted (as in blogs, where one page has one article/post and one title). If TRUE, all nodes matching the pattern are extracted (as in galleries, listings, or comments, where one page has many elements with the same pattern).

NetworkData

boolean, if TRUE, the crawler maps all the internal hyperlink connections within the given website and returns data for network construction using igraph or other tools (two global variables are returned; see Details and the sketch below).
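
As a quick illustration of how these arguments combine, here is a minimal sketch of a single call; the URL, regular expression, and XPath patterns are placeholders rather than values taken from a real site:

Rcrawler(Website = "http://www.example.com/",       # placeholder root URL
         no_cores = 4, no_conn = 4,                 # 4 workers, 4 connections per worker
         Obeyrobots = TRUE,                         # honour robots.txt rules
         RequestsDelay = 1,                         # pause 1 second between request rounds
         urlregexfilter = "/blog/",                 # index only URLs containing /blog/ (illustrative)
         ExtractPatterns = c("//h1", "//article"),  # XPath patterns to scrape (illustrative)
         PatternsNames = c("title", "content"),     # names for the extracted fields
         ManyPerPattern = FALSE,                    # keep only the first match per pattern
         DIR = "./myrepo")                          # store downloaded pages in ./myrepo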

Value

The crawling and scraping process may take a long time to finish; therefore, to avoid data loss in case the function crashes or is stopped mid-run, some important data are exported at every iteration to the R global environment:

- INDEX: A data frame in the global environment representing the generic URL index, including the list of fetched URLs and page details (content type, HTTP status, number of out-links and in-links, encoding type, and level).

- A repository in the workspace that contains all downloaded pages (.html files).

If data scraping is enabled by setting the ExtractPatterns parameter:

- DATA: A list of lists in the global environment holding the scraped content.

- A CSV file 'extracted_contents.csv' holding all extracted data.

If NetworkData is set to TRUE, two additional global variables are returned by the function:

- NetwIndex: A vector mapping all hyperlinks (nodes) to a unique integer ID.

- NetwEdges: A data.frame representing the edges of the network, with the columns From, To, Weight (the depth level at which the link connection was discovered), and Type (which currently has a fixed value).
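
After a crawl finishes (or is interrupted), these objects can be inspected directly from the global environment. The following is a minimal sketch assuming the crawl was run with ExtractPatterns and NetworkData = TRUE; igraph is used here only as one possible way to build the graph:

head(INDEX)    # fetched URLs with their HTTP status, depth level, and link counts
length(DATA)   # one element per scraped page
DATA[[1]]      # extracted fields of the first scraped page

# Build and plot the internal-link network from the returned edge list
library(igraph)
net <- graph_from_data_frame(NetwEdges[, c("From", "To")], directed = TRUE)
plot(net)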

Details

To start an Rcrawler task, you need to provide the root URL of the website you want to scrape; it can be a domain, a subdomain, or a website section (e.g. http://www.domain.com, http://sub.domain.com, or http://www.domain.com/section/). The crawler will then go through all of its internal links. The crawling process is performed by several concurrent processes or nodes in parallel, so it is recommended to use the 64-bit version of R.
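
For instance, any of the following root URLs would be accepted (placeholder addresses):

Rcrawler(Website = "http://www.example.com/")           # a whole domain
Rcrawler(Website = "http://blog.example.com/")          # a subdomain
Rcrawler(Website = "http://www.example.com/section/")   # a single website section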

For complex character content such as Arabic, execute Sys.setlocale("LC_CTYPE","Arabic_Saudi Arabia.1256") first, then set the encoding of the web page in the Rcrawler function.
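
For example, on Windows the locale could be set before crawling an Arabic-language site as follows; the URL and the encoding value are placeholders to adapt to the target website:

Sys.setlocale("LC_CTYPE", "Arabic_Saudi Arabia.1256")    # Windows-specific locale name
Rcrawler(Website = "http://www.example-arabic.com/", Encod = "windows-1256")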

If you want to learn more about web scraper/crawler architecture, functional properties, and implementation using the R language, follow the link below and download the published paper for free.

Link: http://www.sciencedirect.com/science/article/pii/S2352711017300110

Don't forget to cite Rcrawler paper:

Khalil, S., & Fakir, M. (2017). RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6, 98-106.

Examples

Rcrawler(Website = "http://glofile.com/", no_cores = 4, no_conn = 4)
# Crawl, index, and store web pages using 4 cores and 4 parallel requests.

Rcrawler(Website = "http://glofile.com/", urlregexfilter = "/[0-9]{4}/[0-9]{2}/",
         ExtractPatterns = c("//*/article", "//*/h1"), PatternsNames = c("content", "title"))
# Crawl the website using the default configuration and scrape content matching two XPath
# patterns, but only from post pages matching the regular expression "/[0-9]{4}/[0-9]{2}/".
# Note that the ExcludePatterns parameter can be used to exclude a node from being extracted,
# e.g., when a desired node includes (is a parent of) an undesired "child" node.

Rcrawler(Website = "http://www.example.com/", no_cores = 8, no_conn = 8, Obeyrobots = TRUE,
         Useragent = "Mozilla 3.11")
# Crawl and index the website using 8 cores and 8 parallel requests while respecting
# robots.txt rules.

Rcrawler(Website = "http://www.example.com/", no_cores = 4, no_conn = 4,
         urlregexfilter = "/[0-9]{4}/[0-9]{2}/", DIR = "./myrepo", MaxDepth = 3)
# Crawl the website using 4 cores and 4 parallel requests. Only URLs matching the regular
# expression "/[0-9]{4}/[0-9]{2}/" are indexed, and pages are stored in the custom
# directory "./myrepo". The crawler stops after reaching the third level of website depth.

Rcrawler(Website = "http://www.example.com/", KeywordsFilter = c("keyword1", "keyword2"))
# Crawl the website and collect only web pages containing keyword1, keyword2, or both.

Rcrawler(Website = "http://www.example.com/", KeywordsFilter = c("keyword1", "keyword2"),
         KeywordsAccuracy = 50)
# Crawl the website and collect only web pages with an accuracy score higher than 50%
# for matching keyword1 and keyword2.

Rcrawler(Website = "http://glofile.com/", no_cores = 4, no_conn = 4, NetworkData = TRUE)
# Crawl the entire website and build network edge data of its internal links.
# Using igraph, for example, you can plot the network with the following commands:
# library(igraph)
# network <- graph.data.frame(NetwEdges, directed = TRUE)
# plot(network)
