epubr

Author: Matthew Leonawicz

License: MIT

Read EPUB files in R

Read EPUB text and metadata.

The epubr package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame.

E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed content from an arbitrary collection of e-books into a single, consistently formatted output that is both accurate and fully general. Many EPUB files do not even contain the same metadata fields.
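As a rough illustration of the metadata problem, the base R sketch below aligns two metadata tables that do not share the same fields, filling the gaps with NA. The two data frames and the `align_fields()` helper are hypothetical and exist only for this example; they are not part of epubr.

```r
# Two hypothetical metadata records with different sets of fields.
meta_a <- data.frame(title = "Dracula", creator = "Bram Stoker",
                     language = "en", stringsAsFactors = FALSE)
meta_b <- data.frame(title = "Some Book", publisher = "Some Press",
                     stringsAsFactors = FALSE)

# Hypothetical helper: add any missing field as NA, then fix the column order.
align_fields <- function(df, fields) {
  for (f in setdiff(fields, names(df))) df[[f]] <- NA_character_
  df[fields]
}

all_fields <- union(names(meta_a), names(meta_b))
combined <- rbind(align_fields(meta_a, all_fields),
                  align_fields(meta_b, all_fields))
combined
```

The result has one row per book and one column per field seen in any book, which is the general shape needed before comparing metadata across a collection.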

EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with epubr.
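When working through a collection that may include unreadable files, one option is to wrap the reader so that a failure on one file does not stop the rest. The sketch below assumes that an unreadable file surfaces as an ordinary R error, which the text above does not confirm. `read_all()` and `fake_reader()` are hypothetical names used only for illustration.

```r
# Hypothetical helper: apply `reader` to each path, recording failures
# instead of stopping at the first error.
read_all <- function(paths, reader) {
  out <- lapply(paths, function(p) {
    tryCatch(reader(p), error = function(e) {
      message("Skipping ", p, ": ", conditionMessage(e))
      NULL
    })
  })
  names(out) <- paths
  failed <- vapply(out, is.null, logical(1))
  list(ok = out[!failed], failed = paths[failed])
}

# Stand-in reader for demonstration only; it fails on paths containing "bad".
fake_reader <- function(p) {
  if (grepl("bad", p)) stop("cannot parse file")
  data.frame(title = p, stringsAsFactors = FALSE)
}

res <- read_all(c("good.epub", "bad.epub"), fake_reader)
res$failed
```

In practice the `reader` argument could be `epub` from this package, with the list of failed paths kept for later inspection.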

Text is read ‘as is’ for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user’s discretion, such as with functions from packages like tm or qdap.
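As a minimal sketch of the kind of follow-up cleaning left to the user, the base R example below strips punctuation, lowercases the text, and collapses whitespace. The sample string and the `clean_text()` helper are assumptions made for illustration; a real analysis would apply whichever steps it needs to the parsed text column.

```r
# Toy input standing in for one element of the parsed text column.
txt <- "CHAPTER I\n\"Welcome   to my house!\"  Enter freely.\n"

# Hypothetical helper performing a few common cleaning steps.
clean_text <- function(x) {
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- tolower(x)                    # normalize case
  x <- gsub("[[:space:]]+", " ", x)  # collapse newlines and repeated spaces
  trimws(x)                          # remove leading and trailing whitespace
}

cleaned <- clean_text(txt)
words <- strsplit(cleaned, " ", fixed = TRUE)[[1]]
cleaned
```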

Installation

Install epubr from CRAN with:

install.packages("epubr")

Install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("ropensci/epubr")

Example

Bram Stoker’s novel Dracula, sourced from Project Gutenberg, is a good example of an EPUB file with unfortunate formatting. The first thing that stands out is that the section naming convention, item followed by ordered digits, does not differentiate sections like the book preamble from the chapters. The numbering also starts at an arbitrary value (item6 in the output below). But it is actually worse than this: sections are not broken into chapters; they can begin and end in the middle of chapters!

These annoyances aside, the metadata and contents can still be read into a convenient table. Text mining analyses can still be performed on the overall book, if not so easily on individual chapters. See the package vignette for examples of how to further improve the structure of an e-book with formatting like this.

file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
#> # A tibble: 1 x 9
#>   rights                    identifier                          creator     title   language subject      date       source                                             data            
#>   <chr>                     <chr>                               <chr>       <chr>   <chr>    <chr>        <chr>      <chr>                                              <list>          
#> 1 Public domain in the USA. http://www.gutenberg.org/ebooks/345 Bram Stoker Dracula en       Horror tales 1995-10-01 http://www.gutenberg.org/files/345/345-h/345-h.htm <tibble [15 x 4~

x$data[[1]]
#> # A tibble: 15 x 4
#>    section          text                                                                                                                                                     nword nchar
#>    <chr>            <chr>                                                                                                                                                    <int> <int>
#>  1 item6            "The Project Gutenberg EBook of Dracula, by Bram StokerThis eBook is for the use of anyone anywhere at no cost and withalmost no restrictions whatsoeve~ 11446 60972
#>  2 item7            "But I am not in heart to describe beauty, for when I had seen the view I explored further; doors, doors, doors everywhere, and all locked and bolted. ~ 13879 71798
#>  3 item8            "\" 'Lucy, you are an honest-hearted girl, I know. I should not be here speaking to you as I am now if I did not believe you clean grit, right through ~ 12474 65522
#>  4 item9            "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day, 11 o'clock p. m.-Oh, but I am tired! If it were not that I had made my diary a duty I should not open it ~ 12177 62724
#>  5 item10           "CHAPTER X\nLetter, Dr. Seward to Hon. Arthur Holmwood.\n\"6 September.\n\"My dear Art,-\n\"My news to-day is not so good. Lucy this morning had gone b~ 12806 66678
#>  6 item11           "Once again we went through that ghastly operation. I have not the heart to go through with the details. Lucy had got a terrible shock and it told on h~ 12103 62949
#>  7 item12           "CHAPTER XIVMINA HARKER'S JOURNAL\n23 September.-Jonathan is better after a bad night. I am so glad that he has plenty of work to do, for that keeps hi~ 12214 62234
#>  8 item13           "CHAPTER XVIDR. SEWARD'S DIARY-continued\nIT was just a quarter before twelve o'clock when we got into the churchyard over the low wall. The night was ~ 13990 72903
#>  9 item14           "\"Thus when we find the habitation of this man-that-was, we can confine him to his coffin and destroy him, if we obey what we know. But he is clever. ~ 13356 69779
#> 10 item15           "\"I see,\" I said. \"You want big things that you can make your teeth meet in? How would you like to breakfast on elephant?\"\n\"What ridiculous nonse~ 12866 66921
#> 11 item16           "CHAPTER XXIIIDR. SEWARD'S DIARY\n3 October.-The time seemed terrible long whilst we were waiting for the coming of Godalming and Quincey Morris. The P~ 11928 61550
#> 12 item17           "CHAPTER XXVDR. SEWARD'S DIARY\n11 October, Evening.-Jonathan Harker has asked me to note this, as he says he is hardly equal to the task, and he wants~ 13119 68564
#> 13 item18           " \nLater.-Dr. Van Helsing has returned. He has got the carriage and horses; we are to have some dinner, and to start in an hour. The landlady is putti~  8435 43464
#> 14 item19           "End of the Project Gutenberg EBook of Dracula, by Bram Stoker*** END OF THIS PROJECT GUTENBERG EBOOK DRACULA ******** This file should be named 345-h.~  2665 18541
#> 15 coverpage-wrapp~ ""                                                                                                                                                           0     0
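The section boundaries above do not line up with chapters. One way to think about repairing this is to collapse all section text into one string and cut it again at each chapter heading. The following is a minimal base R sketch of that idea on toy data; it is not the vignette's method. The `sections` data frame and the heading pattern `CHAPTER [IVX]+` are assumptions chosen to resemble the output shown.

```r
# Toy stand-in for x$data[[1]]: sections that begin and end mid-chapter.
sections <- data.frame(
  section = c("item6", "item7", "item8"),
  text = c(
    "Preamble text. CHAPTER I\nFirst chapter begins",
    " and continues here. CHAPTER II\nSecond chapter begins",
    " and ends here. CHAPTER III\nThird chapter."
  ),
  stringsAsFactors = FALSE
)

# 1. Collapse all sections into a single string, preserving order.
full_text <- paste(sections$text, collapse = "")

# 2. Locate every chapter heading (assumed: "CHAPTER" plus a Roman numeral).
starts <- as.integer(gregexpr("CHAPTER [IVX]+", full_text)[[1]])

# 3. Cut at each heading; text before the first heading (the preamble) is dropped.
bounds <- c(starts, nchar(full_text) + 1L)
chapters <- data.frame(
  section = sprintf("ch%02d", seq_along(starts)),
  text = substring(full_text, bounds[-length(bounds)], bounds[-1] - 1L),
  stringsAsFactors = FALSE
)
chapters$nchar <- nchar(chapters$text)
chapters
```

Each row of `chapters` now starts at a heading, so per-chapter analysis becomes possible. A real book would need a pattern matched to its own headings.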

Related packages

tesseract by @jeroen for more direct control of the OCR process.

pdftools for extracting metadata and text from PDF files (therefore more specific to PDF, and without a Java dependency).

tabulizer by @leeper and @tpaskhalis, bindings for the Tabula PDF Table Extractor Library, for extracting tables (rather than text) from PDF files.

rtika by @goodmansasha for more general text parsing.

gutenbergr by @dgrtwo for searching and downloading public domain texts from Project Gutenberg.


Please note that the epubr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Version

0.6.2

License

MIT + file LICENSE

Last Published

February 20th, 2021

Functions in epubr (0.6.2)