Rcrawler(Website ="http://glofile.com/", no_cores = 4, no_conn = 4)
#Crawl, index, and store web pages using 4 cores and 4 parallel requests
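# Once the crawl completes, the collected pages can be examined. A minimal
# sketch, assuming this Rcrawler version creates an INDEX data frame in the
# global environment and stores the downloaded HTML files in the workspace:
head(INDEX)    # one row per crawled page (URL, HTTP status, crawl level, ...)
nrow(INDEX)    # total number of pages collected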
Rcrawler(Website = "http://glofile.com/", urlregexfilter = "/[0-9]{4}/[0-9]{2}/",
ExtractPatterns = c("//*/article","//*/h1"), PatternsNames = c("content","title"))
# Crawl the website using the default configuration and scrape content matching
# two XPath patterns, only from post pages whose URLs match the regular
# expression "/[0-9]{4}/[0-9]{2}/". Note that the excludepattern parameter can
# be used to exclude a node from extraction, e.g., when a desired node contains
# (is the parent of) an undesired child node, as in the sketch below.
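# A minimal sketch of such an exclusion, assuming excludepattern accepts a
# vector of XPath patterns; the div[@class='comments'] node is a hypothetical
# unwanted child of the extracted article element:
Rcrawler(Website = "http://glofile.com/", urlregexfilter = "/[0-9]{4}/[0-9]{2}/",
         ExtractPatterns = c("//*/article"), PatternsNames = c("content"),
         excludepattern = c("//*/div[@class='comments']"))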
Rcrawler(Website = "http://www.example.com/", no_cores=8, no_conn=8, Obeyrobots = TRUE,
Useragent="Mozilla 3.11")
# Crawl and index the website using 8 cores and 8 parallel requests with respect to
robot.txt rules.
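# To crawl politely beyond obeying robots.txt, requests can also be throttled;
# a minimal sketch, assuming this Rcrawler version supports a RequestsDelay
# argument (seconds to wait between rounds of parallel requests):
Rcrawler(Website = "http://www.example.com/", no_cores = 2, no_conn = 2,
         Obeyrobots = TRUE, RequestsDelay = 1)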
Rcrawler(Website = "http://www.example.com/", no_cores = 4, no_conn = 4,
urlregexfilter = "/[0-9]{4}/[0-9]{2}/", DIR = "./myrepo", MaxDepth=3)
# Crawl the website using 4 cores and 4 parallel requests. Only URLs matching
# the regular expression "/[0-9]{4}/[0-9]{2}/" are indexed, pages are stored in
# the custom directory "./myrepo", and the crawler stops on reaching level 3.
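# A quick way to verify the stored repository afterwards; a minimal sketch,
# assuming the crawled pages are written as files under the DIR path "./myrepo":
list.files("./myrepo")    # downloaded page files
head(INDEX)               # and the corresponding URL index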