

README


Status: Feature complete and part of the ROpenSci network.

Author: Peter Meißner

Contributors: Oliver Keys (code review and improvements), Rich FitzJohn (code review and improvements)

License: MIT

Description:

The robotstxt package provides functions to download and parse robots.txt files. Ultimately, the package makes it easy to check whether bots (spiders, scrapers, ...) are allowed to access specific resources on a domain.
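A minimal sketch of that check, using the package's paths_allowed() function (shown in more detail under Usage below; the URL is only an illustration):

library(robotstxt)

# returns TRUE if bots may access the resource, FALSE otherwise
paths_allowed("https://wikipedia.org/w/")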

Installation and start - stable version

install.packages("robotstxt")
library(robotstxt)

Installation and start - development version

devtools::install_github("petermeissner/robotstxt")
library(robotstxt)

Robotstxt class documentation

?robotstxt

Usage

library(robotstxt)

paths_allowed(
  paths  = c("/api/rest_v1/?doc", "/w/"), 
  domain = "wikipedia.org", 
  bot    = "*"
)
## [1]  TRUE FALSE
paths_allowed(
  paths = c(
    "https://wikipedia.org/api/rest_v1/?doc", 
    "https://wikipedia.org/w/"
  )
)
## [1]  TRUE FALSE

... or use it this way ...

library(robotstxt)

rtxt <- robotstxt(domain = "wikipedia.org")
rtxt$check(paths = c("/api/rest_v1/?doc", "/w/"), bot = "*")
## /api/rest_v1/?doc               /w/ 
##              TRUE             FALSE
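Besides check(), the object returned by robotstxt() also keeps the downloaded file and its parsed parts, so both can be inspected directly. A small sketch, assuming the fields are called text and permissions (field names not verified here):

# raw robots.txt content (assumed field name)
rtxt$text

# parsed Allow/Disallow rules (assumed field name)
rtxt$permissions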

More information

See the package vignette for more details.

Contribution - AKA The-Think-Twice-Be-Nice-Rule

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms:

As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant (http://contributor-covenant.org), version 1.0.0, available at http://contributor-covenant.org/version/1/0/0/

Install: install.packages('robotstxt')
Monthly Downloads: 2,706
Version: 0.3.2
License: MIT + file LICENSE
Maintainer: Peter Meissner
Last Published: December 5th, 2016

Functions in robotstxt (0.3.2)

sanitize_path: making paths uniform
sanitize_permission_values: transforming permissions into regular expressions (values)
sanitize_permissions: transforming permissions into regular expressions (whole permission)
rt_get_rtxt: load robots.txt files saved along with the package
rt_get_useragent: extracting HTTP useragents from robots.txt
print.robotstxt_text: printing robotstxt_text
rt_get_fields: extracting permissions from robots.txt
rt_get_fields_worker: extracting robotstxt fields
robotstxt: generate a representation of a robots.txt file
parse_robotstxt: function parsing robots.txt
paths_allowed: check if a bot has permissions to access page(s)
rt_list_rtxt: list robots.txt files saved along with the package
guess_domain: function guessing domain from path
remove_domain: function to remove domain from path
rt_get_comments: extracting comments from robots.txt
named_list: make automatically named list
path_allowed: check if a bot has permissions to access page
get_robotstxt: downloading robots.txt file
rt_cache: get_robotstxt() cache
print.robotstxt: printing robotstxt
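Most of these helpers are used internally by paths_allowed() and robotstxt(), but the lower-level building blocks can also be combined by hand. A sketch, assuming get_robotstxt() returns the robots.txt file as text and parse_robotstxt() returns its parsed fields (e.g. permissions):

library(robotstxt)

# download the raw robots.txt of a domain as text (assumed return type)
txt <- get_robotstxt("wikipedia.org")

# parse the text; the permissions element is assumed to hold the
# Allow/Disallow rules
parsed <- parse_robotstxt(txt)
parsed$permissions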