extractTables: Extract tables from a given words-dataframe

Description

This function extracts order-position tables from PDF-based order documents. It first identifies table rows using a clustering approach and then identifies the column structure. A table row can consist of multiple text rows, and the text rows can span different columns. The function also tries to identify the meaning of each column (position, articleID, description, quantity, quantity unit, unit price, total price, currency, date).
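
The idea of clustering words into rows and columns can be illustrated with a toy example. This is only a sketch under stated assumptions, not the algorithm used by extractTables: the words data frame, its x/y columns, the cut heights and the rule "a text line starts a new table row if it has a word in the first column" are all hypothetical.

```r
# Toy word positions (hypothetical data, not produced by extractText)
words <- data.frame(
  text = c("1", "A-100", "Bolt", "2", "B-200", "Nut", "M8"),
  x    = c(10, 50, 120, 10, 50, 120, 120),
  y    = c(100, 100, 101, 130, 130, 130, 142)
)

# Group words into text lines and columns with single-linkage clustering
# on the y and x coordinates (cut heights are arbitrary for this toy data)
lineId <- cutree(hclust(dist(words$y), method = "single"), h = 5)
colId  <- cutree(hclust(dist(words$x), method = "single"), h = 15)

# Assumed rule: a text line starts a new table row if it has a word in column 1;
# other lines are continuations of the previous table row
startsRow <- vapply(split(colId == 1, lineId), any, logical(1))
rowOfLine <- cumsum(startsRow)
words$row <- unname(rowOfLine[as.character(lineId)])
words$col <- colId

# Paste the words of each (row, column) cell together
cells <- tapply(words$text, list(words$row, words$col), paste, collapse = " ")
cells
```

In this toy data the line containing only "M8" has no word in the first column, so it is merged into the description cell of the second table row.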

Usage

extractTables(text, minCols = 3, maxDistance = 20, entityNames = NA)

Value

A list of lists, one per identified table. Each sublist contains a data frame (data) holding the identified table, the positions of the text lines that constitute the table, and the positions of the significant lines.
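
As a rough sketch of how such a result could be walked, the snippet below summarises the size of every table. The tables object here is a mock built by hand; only the element name data is taken from this page, and the column names inside the mock data frames are hypothetical.

```r
# Mock result with the outer shape described above (hypothetical data);
# a real result would come from extractTables(text)
tables <- list(
  list(data = data.frame(position = 1:2, quantity = c(5, 10))),
  list(data = data.frame(position = 1:3, quantity = c(1, 2, 3)))
)

# One row per table: number of rows and columns of its data frame
tableSizes <- t(vapply(tables, function(tbl) dim(tbl$data), integer(2)))
colnames(tableSizes) <- c("rows", "cols")
tableSizes
```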

Arguments

text: List including several representations of text extracted from a PDF file. This list is generated by the function extractText.
minCols: Minimum number of columns a table must have
maxDistance: Maximum number of text lines allowed between the starts of two consecutive table rows
entityNames: A list of four name vectors (currencyUnits, quantityUnits, headerNames, noTableNames). Each vector contains strings that correspond to currency units, quantity units, header names, or names of entities that are not tables, respectively.
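
A minimal sketch of how an entityNames list could be assembled and checked before it is passed to extractTables. The helper makeEntityNames is hypothetical and not part of orderanalyzer; the requirement that each vector is a non-empty character vector is an assumption made for this sketch.

```r
# Hypothetical helper (not part of orderanalyzer): builds the four-element
# list described above and converts every string to UTF-8
makeEntityNames <- function(currencyUnits, quantityUnits, headerNames, noTableNames) {
  entityNames <- list(
    currencyUnits = currencyUnits,
    quantityUnits = quantityUnits,
    headerNames   = headerNames,
    noTableNames  = noTableNames
  )
  # Assumption of this sketch: every entry must be a non-empty character vector
  isValid <- vapply(entityNames,
                    function(v) is.character(v) && length(v) > 0,
                    logical(1))
  if (!all(isValid)) {
    stop("Not a non-empty character vector: ",
         paste(names(entityNames)[!isValid], collapse = ", "))
  }
  lapply(entityNames, enc2utf8)
}

entityNames <- makeEntityNames(
  currencyUnits = c("eur", "euro", "\u20AC"),
  quantityUnits = c("pcs", "pcs."),
  headerNames   = c("pos", "item", "quantity"),
  noTableNames  = c("order total", "supplier number")
)
```

The resulting list can then be supplied as extractTables(text, entityNames = entityNames).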

Examples


file <- system.file("extdata", "OrderDocument_en.pdf", package = "orderanalyzer")
text <- extractText(file)

# Extracting order tables without any further information
tables <- extractTables(text)
tables[[1]]$data

# Extracting order tables with further information
tables <- extractTables(text,
  entityNames = list(currencyUnits = enc2utf8(c("eur", "euro", "\u20AC")),
                     quantityUnits = enc2utf8(c("pcs", "pcs.")),
                     headerNames = enc2utf8(c("pos", "item", "quantity")),
                     noTableNames = enc2utf8(c("order total", "supplier number")))
)
tables[[1]]$data

# Extracting order tables from a German document
file <- system.file("extdata", "OrderDocument_de.pdf", package = "orderanalyzer")
text <- extractText(file)
tables <- extractTables(text)
tables[[1]]$data
