htmltab: Hassle-free HTML tables in R

HTML tables are a valuable data source, but extracting these data and recasting them into a useful format can be tedious. htmltab is a package for extracting structured information from HTML tables. It is similar to readHTMLTable() from the XML package but provides two major advantages:

  1. The function automatically expands row and column spans in the header and body cells.
  2. Users have more control over which header and body rows end up in the R table.

Additionally, the function preprocesses the table code and removes unneeded parts, which reduces the need for tedious post-processing.
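As a rough illustration of the span expansion described above, here is a minimal sketch. The table contents are made up for the example, and the HTML string is parsed with XML::htmlParse() before being handed to htmltab(), on the assumption that an already parsed document is an accepted input.

```r
library(XML)      # htmlParse()
library(htmltab)  # htmltab()

# Hypothetical toy table: the "Europe" cell spans two body rows via rowspan.
html <- '
<table>
  <tr><th>Region</th><th>Country</th></tr>
  <tr><td rowspan="2">Europe</td><td>France</td></tr>
  <tr><td>Spain</td></tr>
  <tr><td>Asia</td><td>Japan</td></tr>
</table>'

# Parse the string explicitly so it is not mistaken for a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# which = 1 selects the first table in the document.
tab <- htmltab(doc = doc, which = 1)
print(tab)
```

If the span is expanded as described, the row for Spain should carry its own Region value instead of a missing cell.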

Installation

You can install the released version of htmltab from CRAN with:

install.packages("htmltab")

And the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("htmltab/htmltab")

Usage

To see htmltab in action, take a look at the case studies in this blog post, the package vignette, or the package manual.
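For a quick local experiment, the following sketch shows one way to tell htmltab() which rows form the header. The table and its values are invented for illustration; the only package arguments used are doc, which, and header, and passing row positions to header is assumed to work as described in the package manual.

```r
library(XML)      # htmlParse()
library(htmltab)  # htmltab()

# Hypothetical table with a two-row header: "Population" spans two columns,
# and "Country" spans both header rows.
html <- '
<table>
  <tr><th rowspan="2">Country</th><th colspan="2">Population</th></tr>
  <tr><th>2010</th><th>2020</th></tr>
  <tr><td>A</td><td>1</td><td>2</td></tr>
  <tr><td>B</td><td>3</td><td>4</td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# header = 1:2 marks the first two table rows as header rows;
# the remaining rows are treated as the body.
tab <- htmltab(doc = doc, which = 1, header = 1:2)

# Inspect how the two header rows were combined into column names.
print(names(tab))
print(tab)
```

Stating the header rows explicitly is mainly useful when a page marks headers inconsistently and the automatic detection picks the wrong rows.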

Monthly Downloads

12

Version

0.8.2

License

MIT + file LICENSE

Maintainer

Gerhard Burger

Last Published

September 16th, 2021

Functions in htmltab (0.8.2)

eval_body: Evaluate and deparse the body argument
htmltab: Assemble a data frame from HTML table data
identify_elements: Assemble XPath expressions for header and body
get_header_elements: Extracts header elements
rm_empty_cols: Remove columns which do not have data values
check_type: Produce the table node
create_inbody: Reshape in-table header information into wide format
get_span: Extracts rowspan information
rm_nuisance: Remove nuisance elements from the table code
get_trindex: Return table row index given an XPath
rm_empty_rows: Remove rows which do not have data values
get_head_xpath: Return header XPath
get_cell_element: Extracts cell elements
get_body_xpath: Return body XPath
select_tab: Selects the table from the HTML code
normalize_tr: Normalizes rows to be nested in tr tags, header in thead, body in tbody, and numbers them
num_xpath: Generate numeric XPath expression
eval_header: Evaluate and deparse the header argument