
daiR: OCR with Google Document AI in R

daiR is an R package for Google Document AI, a powerful server-based OCR service with support for over 60 languages. The package provides an interface to the Document AI API and comes with additional tools for parsing output files and reconstructing text. See the daiR website and the journal article cited below for more details.

Use

Quickly OCR short documents:

## NOT RUN:
library(daiR)
get_text(dai_sync("file.pdf"))

Turn images of tables into R dataframes:

## NOT RUN:
# Assumes a default processor of type "FORM_PARSER_PROCESSOR"
get_tables(dai_sync("file.pdf"))

Draw bounding boxes on the source image:

## NOT RUN:
draw_blocks(dai_sync("file.pdf"))
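
Each of the snippets above sends its own request to the service. Because Document AI is billed per use, it can be worth sending the document once and reusing the response. The following is a minimal sketch of that idea, not an official daiR pattern. `ocr_document` is a hypothetical helper that does not exist in daiR, and its `ocr` argument is there only so the helper can be tried with a stand-in function instead of a live call. The commented-out lines assume that daiR is installed and authenticated, and that the default processor is a form parser.

```r
# Hypothetical helper (not part of daiR): check the file, then send it once.
# The default `ocr` function is daiR's dai_sync(); it is only looked up when
# the helper is actually called without an explicit `ocr` argument.
ocr_document <- function(file, ocr = daiR::dai_sync) {
  if (!file.exists(file)) {
    stop("File not found: ", file)
  }
  ocr(file)  # single request; keep the returned response for later steps
}

## NOT RUN:
# resp   <- ocr_document("file.pdf")
# text   <- daiR::get_text(resp)     # OCR text from the stored response
# tables <- daiR::get_tables(resp)   # tables; needs a form parser processor
# daiR::draw_blocks(resp)            # bounding boxes from the same response
```

Storing the response in a variable means that text extraction, table extraction, and box drawing all work from the same result rather than triggering three separate requests.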

Requirements

Google Document AI is a paid service that requires a Google Cloud account and a Google Storage bucket. I recommend using Mark Edmondson's googleCloudStorageR package in combination with daiR.
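
Before making any requests, it may help to confirm that the configuration is in place. The sketch below uses only base R and makes one assumption that should be checked against the daiR documentation: that daiR reads its service account key path, default processor ID, and default bucket from the environment variables `GCS_AUTH_FILE`, `DAI_PROCESSOR_ID`, and `GCS_DEFAULT_BUCKET`. `missing_dai_env` is a hypothetical helper, not a daiR function.

```r
# Hypothetical helper (not part of daiR): return the names of the environment
# variables that are unset or empty. The default variable names are an
# assumption about daiR's configuration; adjust them if your setup differs.
missing_dai_env <- function(vars = c("GCS_AUTH_FILE",
                                     "DAI_PROCESSOR_ID",
                                     "GCS_DEFAULT_BUCKET")) {
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

missing <- missing_dai_env()
if (length(missing) > 0) {
  message("Unset: ", paste(missing, collapse = ", "))
} else {
  message("All assumed variables are set.")
}
```

If the variables are set, calling daiR's own `dai_auth()` is the more direct way to check that authentication actually works.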

Installation

Install daiR from CRAN:

install.packages("daiR")

Or install the latest development version from GitHub:

devtools::install_github("hegghammer/daiR")

Citation

To cite daiR in publications, please use:

Hegghammer, T., (2021). daiR: an R package for OCR with Google Document AI. Journal of Open Source Software, 6(68), 3538, https://doi.org/10.21105/joss.03538

BibTeX:

@article{Hegghammer2021,
  doi = {10.21105/joss.03538},
  url = {https://doi.org/10.21105/joss.03538},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {68},
  pages = {3538},
  author = {Thomas Hegghammer},
  title = {daiR: an R package for OCR with Google Document AI},
  journal = {Journal of Open Source Software}
}

Acknowledgments

Thanks to Mark Edmondson, Hallvar Gisnås, Will Hanley, Neil Ketchley, Trond Arne Sørby, Chris Barrie, and Geraint Palmer for contributions to the project.

Code of conduct

Please note that the daiR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Monthly downloads: 553

Version: 1.0.0

License: MIT + file LICENSE

Maintainer: Thomas Hegghammer

Last published: February 12th, 2024

Functions in daiR (1.0.0)

get_tables: Get tables
get_project_id: Get project id
defunct: Defunct functions
get_processor_versions: List available versions of processor
get_processors: List created processors
merge_shards: Merge shards
text_from_dai_file: Get text from output file
pdf_to_binbase: PDF to base64 tiff
get_text: Get text
img_to_binbase: Image to base64 tiff
is_colour: Check that a string is a valid colour representation
image_to_pdf: Convert images to PDF
text_from_dai_response: Get text from HTTP response object
draw_blocks: Draw block bounding boxes
.onAttach: Run when daiR is attached
delete_processor: Delete processor
draw_paragraphs: Draw paragraph bounding boxes
get_entities: Get entities
get_processor_info: Get information about processor
reassign_tokens: Assign tokens to new blocks
enable_processor: Enable processor
draw_tokens: Draw token bounding boxes
from_labelme: Extract block coordinates from labelme files
make_hocr: Make hOCR file
list_processor_types: List available processor types
tables_from_dai_response: Get tables from response object
tables_from_dai_file: Get tables from output file
reassign_tokens2: Assign tokens to a single new block
redraw_blocks: Inspect revised block bounding boxes
split_block: Split a block bounding box
is_json: Check that a file is JSON
is_pdf: Check that a file is PDF
dai_sync: OCR document synchronously
dai_status: Check job status
build_token_df: Build token dataframe
build_block_df: Build block dataframe
dai_auth: Check authentication
dai_notify: Notify on job completion
deprecated: Deprecated functions
disable_processor: Disable processor
draw_entities: Draw entity bounding boxes
create_processor: Create processor
dai_async: OCR documents asynchronously
draw_lines: Draw line bounding boxes
dai_user: Get user information
dai_token: Produce access token