On a case-by-case basis, you can always select columns and filter rows of the resulting data frame for a single e-book after visual inspection.
However, the optional arguments fields, drop_sections and chapter_pattern allow you to do some of this as part of the EPUB file reading process.
You can ignore these arguments and do all your own post-processing of the resulting data frame. If you do use them, they are most likely to be useful for bulk e-book processing, where file is a vector of like-formatted files.
Main columns
The fields argument can be used to limit the columns returned in the primary data frame, e.g., fields = c("title", "creator", "date", "identifier", "publisher", "file"). Some fields will be returned even if not in fields, such as data and title.
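As a minimal sketch of the fields argument, the call below reads an example EPUB and keeps only a few metadata columns. It assumes the epubr package is installed and that an example book is bundled at system.file("dracula.epub", package = "epubr"); the fields chosen here are illustrative only.

```r
# Metadata columns to keep; "file" adds the original file name.
wanted <- c("title", "creator", "file")

if (requireNamespace("epubr", quietly = TRUE)) {
  # Assumption: epubr ships an example book at this location.
  file <- system.file("dracula.epub", package = "epubr")
  if (nzchar(file)) {
    x <- epubr::epub(file, fields = wanted)
    # Inspect which columns came back; some (such as data) are
    # returned even though they were not requested.
    print(names(x))
  }
}
```

Printing the column names is a quick way to confirm which fields were actually available in the book before relying on them in later steps.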
Ideally, you should already know which metadata fields are in the EPUB file. This may not be practical for large collections with possibly different formatting.
Note that when "file" is included in fields, the output will include a column of the original file names, in case this is different from the content of a source field that may be present in the metadata.
So this field is always available even if not part of the file metadata.
Additionally, if there is no title field in the metadata, the output data frame will include a title column filled in with the same file names, unless you pass the additional optional title argument, e.g. title = "TitleFieldID", so that another field can be mapped to title.
If you supply a title argument that also does not match an existing field in the e-book, the output title column will again default to file names.
File names are the fallback option because, unlike e-book metadata fields, file names always exist and should also always be unique when performing vectorized reads over multiple books, ensuring that title can be a column in the output data frame that uniquely identifies different e-books even if the books did not have a title field in their metadata.
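The fallback order described above can be sketched in base R. The helper resolve_title below is hypothetical and is not part of epubr; it only illustrates the logic. Two assumptions are made: the fallback is the bare file name, and an existing title field takes precedence over a mapped field.

```r
# Hypothetical helper, not part of epubr: choose a title for one book.
# meta  : named list of metadata fields found in the book
# file  : path to the EPUB file
# title : optional name of another field to map to title
resolve_title <- function(meta, file, title = NULL) {
  # Assumption: an existing title field wins over a mapped field.
  if (!is.null(meta[["title"]])) return(meta[["title"]])
  # Use the mapped field only if it actually exists in the metadata.
  if (!is.null(title) && !is.null(meta[[title]])) return(meta[[title]])
  # Assumption: the fallback is the bare file name.
  basename(file)
}

resolve_title(list(title = "Dracula"), "books/a.epub")
resolve_title(list(TitleFieldID = "Emma"), "books/b.epub", title = "TitleFieldID")
resolve_title(list(creator = "Anon"), "books/c.epub", title = "TitleFieldID")
```

The third call has neither a title field nor the mapped field, so it falls through to the file name.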
Columns of the nested data frames in data are fixed. Select from these in subsequent data frame manipulations.
Nested rows
The chapter_pattern argument may be helpful for bulk processing of similarly formatted EPUB files. It should be ignored for poorly formatted EPUB files or where naming is inconsistent across an e-book collection.
As with fields, you should explore file metadata in advance or this argument will not be useful. If provided, a column nchap is added to the output data frame giving the guessed number of chapters.
In the data column, the section column of the nested data frames will also be updated to reflect guessed chapters with new, consistent chapter IDs, always beginning with ch and ending with digits.
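To illustrate the idea, here is a base R sketch of pattern-based chapter relabeling on a made-up section vector. It is not epubr's internal code: the section names, the pattern, and the two-digit zero padding are all assumptions chosen for the example.

```r
# Hypothetical section IDs from one book's nested data frame.
section <- c("cover", "titlepage", "item5", "item6", "item7", "copyright")

# Assumed pattern marking chapter sections in this particular book.
chapter_pattern <- "^item\\d+$"

is_chapter <- grepl(chapter_pattern, section)
nchap <- sum(is_chapter)  # guessed number of chapters

# Relabel matches with consistent IDs that begin with "ch" and end
# with digits; the two-digit padding is an assumption.
section[is_chapter] <- sprintf("ch%02d", seq_len(nchap))
section
```

Sections that do not match the pattern keep their original IDs, which is why a pattern tuned to one book's naming may guess poorly on another.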
The drop_sections argument also uses regular expression pattern matching like chapter_pattern and operates on the same section column. It simply filters out any matched rows.
This is useful for dropping rows that pertain to the book cover, copyright and acknowledgements pages, and other e-book sections that are clearly not chapter text.
An example that might work for many books is drop_sections = "^co(v|p)|^ack".
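The effect of that pattern can be checked in base R on a made-up vector of section names. The names below are hypothetical; the point is only to show which ones the regular expression matches.

```r
# Hypothetical section IDs; real books will differ.
section <- c("cover", "copyright", "acknowledgements", "contents",
             "ch01", "ch02", "colophon")

drop_sections <- "^co(v|p)|^ack"

# Keep only the rows whose section ID does not match the pattern.
keep <- !grepl(drop_sections, section)
section[keep]
```

Note that "contents" and "colophon" survive, since the pattern only matches IDs starting with "cov", "cop" or "ack". Inspect the section IDs of your own files before settling on a pattern.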
Rows of the primary data frame are fixed. There is one row per file, so filtering rows while reading files does not make sense; filter or otherwise manipulate them in subsequent data frame operations.
Unzipping EPUB files
Using epub_unzip directly on individual EPUB files gives you control over where to extract archive files to and what to do with them subsequently.
epub and epub_meta use epub_unzip internally to extract EPUB archive files to the R session temp directory (with tempdir()).
You do not need to use epub_unzip directly prior to using these other functions. It is only needed if you want the internal files for some other purpose in or out of R.
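A minimal sketch of extracting an archive to a directory of your choosing follows. It assumes epubr is installed, that epub_unzip accepts an exdir argument for the output location, and that an example book is bundled at system.file("dracula.epub", package = "epubr"); the output directory name is arbitrary.

```r
# Directory to extract into; the name is arbitrary.
out_dir <- file.path(tempdir(), "epub_contents")
dir.create(out_dir, showWarnings = FALSE)

if (requireNamespace("epubr", quietly = TRUE)) {
  # Assumption: epubr ships an example book at this location.
  file <- system.file("dracula.epub", package = "epubr")
  if (nzchar(file)) {
    # Assumption: exdir sets where the archive contents are written.
    epubr::epub_unzip(file, exdir = out_dir)
    # List a few of the extracted internal files.
    print(head(list.files(out_dir, recursive = TRUE)))
  }
}
```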