extractText

This function extracts text from PDF documents and returns it in three forms:
as a single string, as a list of lines, and as a list of words. It uses
'pdftools' to extract the content of text-based PDF files and 'tesseract' to
extract the content of image-based PDF files.
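
The three return forms can be illustrated with a minimal base R sketch. The helper `split_text()` and its element names (`text`, `lines`, `words`) are hypothetical and chosen for illustration only; the names actually returned by `extractText` are not stated on this page.

```r
# Hypothetical helper, not part of 'orderanalyzer': turns a character vector
# with one element per page into a single string, a vector of lines and a
# vector of words.
split_text <- function(pages) {
  # Join all pages into one string
  text <- paste(pages, collapse = "\n")

  # Split into lines, trim surrounding whitespace, drop empty lines
  lines <- trimws(unlist(strsplit(text, "\r?\n")))
  lines <- lines[nzchar(lines)]

  # Split every line at runs of whitespace to obtain the words
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]

  list(text = text, lines = lines, words = words)
}

# Made-up page contents, for illustration only
pages <- c("Order 4711\nPos  Article   Qty", "1    Widget    10\n")
res <- split_text(pages)
res$lines
res$words
```

Splitting at whitespace is the simplest possible tokenisation; a real order document may need extra handling for hyphenation or for numbers with thousands separators.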

The package provides functions for extracting text and tables from PDF-based
order documents. It offers an n-gram-based approach for identifying the
language of an order document and uses the R package 'pdftools' to extract the
text from an order document. If the PDF document contains only an image
(because it is a scanned document), the R package 'tesseract' is used for OCR.
The package also provides functionality for identifying and extracting order
position tables in order documents based on a clustering approach.
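
The fallback from 'pdftools' to 'tesseract' described above might look roughly like the following sketch. The function `read_pdf_pages()`, its parameters, and the rule "no page has any text, so treat the file as scanned" are assumptions made for illustration; how 'orderanalyzer' makes this decision internally is not stated here. Both 'pdftools' and 'tesseract' must be installed for the function to run.

```r
# Hypothetical helper, not part of 'orderanalyzer'.
# Returns a character vector with one element per page.
read_pdf_pages <- function(file, language = "eng", dpi = 300) {
  stopifnot(file.exists(file))

  # First attempt: read the embedded text layer
  pages <- pdftools::pdf_text(file)
  if (any(nzchar(trimws(pages)))) {
    return(pages)
  }

  # Assumed rule: no text layer at all means the PDF is a scan.
  # Render every page to a temporary PNG file ...
  n <- pdftools::pdf_info(file)$pages
  targets <- file.path(tempdir(), sprintf("page_%d.png", seq_len(n)))
  images <- pdftools::pdf_convert(file, format = "png", dpi = dpi,
                                  filenames = targets, verbose = FALSE)
  on.exit(unlink(images), add = TRUE)

  # ... and run OCR on each rendered page
  engine <- tesseract::tesseract(language)
  vapply(images, function(img) tesseract::ocr(img, engine = engine),
         character(1), USE.NAMES = FALSE)
}
```

A call such as `read_pdf_pages("order.pdf")` (with a file name of your own) would then return the page texts regardless of whether the document is text-based or scanned. A higher `dpi` usually improves OCR quality at the cost of run time.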

Michael Scholz

orderanalyzer

Extracting Order Position Tables from PDF-Based Order Documents

Joerg Bauer

extractText: Extracts the text from a PDF file

Description

Usage

Value

Arguments

Examples