A 'Shiny' App for Exploration of Text Collections
Facilitates dynamic exploration of text collections through an
intuitive graphical user interface and the power of regular expressions.
The package contains 1) a helper function to convert a data frame to a
'corporaexplorerobject', 2) a 'Shiny' app for fast and flexible exploration
of a 'corporaexplorerobject', and 3) a 'Shiny' app for simple
retrieval/extraction of documents from a 'corporaexplorerobject' in a
reading-friendly format. The package also includes demo apps with which
one can explore Jane Austen's novels and the State of the Union Addresses
(data from the 'janeaustenr' and 'sotu' packages respectively).
corporaexplorer: An R package for dynamic exploration of text collections
“I really like the application and its simplicity. It looks great and is very functional. … a nice addition to text analysis tools.”
–Kenneth Benoit, creator of quanteda, professor of computational social science at LSE
– Featured in RStudio’s “R Views” blog’s “Top 40 New R Packages” for September 2019
What is corporaexplorer?
corporaexplorer is an R package that uses the
Shiny graphical user
interface framework for dynamic exploration of text collections.
corporaexplorer is designed for use with a wide range of text collections: for example, a collection of tens of thousands of documents scraped from a governmental website, the collected works of a novelist, or the chapters of a single book.
corporaexplorer’s intended primary audience is qualitatively oriented researchers who rely on close reading of textual documents as part of their academic activity, but the package should also be a useful supplement for those doing quantitative textual research who wish to visit the texts under study. Finally, by offering a convenient way to explore any character vector, it can also be useful for a wide range of other R users.
While collecting and preparing the text collections to be explored requires some familiarity with R programming, using the Shiny apps for exploring and extracting documents from the corpus should be fairly intuitive even for those with no programming knowledge, once the apps have been set up by a collaborator. Thus, the aim is for the package to be useful for anyone with a rudimentary knowledge of R – or with collaborators who have such knowledge.
The released version is available from CRAN and can be installed from an R console.
Alternatively, the development version can be installed from the package's GitHub repository.
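A minimal sketch of both installation routes follows. The GitHub repository path `kgjerde/corporaexplorer` is an assumption inferred from the documentation URLs on this page, and the second command assumes that the 'remotes' package is already installed.

```r
# Released version from CRAN
install.packages("corporaexplorer")

# Development version from GitHub
# (assumes the 'remotes' package is installed and that the repository
# lives at kgjerde/corporaexplorer)
remotes::install_github("kgjerde/corporaexplorer")
```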
corporaexplorer works on Mac OS, Windows and Linux. (The Shiny apps look much clunkier on Windows than on the other platforms, but the apps are fully functional.)
How to cite
Please cite the following paper if you use corporaexplorer in your research.
Gjerde, Kristian Lundby. 2019. “corporaexplorer: An R package for dynamic exploration of text collections.” Journal of Open Source Software 4 (38): 1342. https://doi.org/10.21105/joss.01342.
For a BibTeX entry, use the output from citation(package = "corporaexplorer").
For usage instructions and example corpora, see the package web page.
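As a rough orientation, the basic workflow (prepare a corpus, explore it, extract documents) might look like the sketch below. The toy data frame and its contents are made up for illustration. The argument values passed to prepare_data() are assumptions about a corpus without dates, so check the function reference before relying on them.

```r
# A small, made-up corpus: one row per document, texts in a "Text" column.
df <- data.frame(
  Title = c("First note", "Second note", "Third note"),
  Text = c(
    "The committee met on Monday to discuss the budget.",
    "No decision was reached; the budget debate continues.",
    "A revised budget proposal is expected next week."
  ),
  stringsAsFactors = FALSE
)

if (requireNamespace("corporaexplorer", quietly = TRUE)) {
  # 1) Convert the data frame to a 'corporaexplorerobject'.
  #    date_based_corpus = FALSE because this toy corpus has no "Date" column.
  corpus <- corporaexplorer::prepare_data(
    df,
    date_based_corpus = FALSE,
    columns_doc_info = "Title"
  )

  if (interactive()) {
    # 2) Explore the corpus in the Shiny app (blocks until the app is closed).
    corporaexplorer::explore(corpus)

    # 3) Retrieve documents in a reading-friendly format.
    corporaexplorer::run_document_extractor(corpus)
  }
}
```

The apps are only launched in an interactive session, so the script can also be sourced non-interactively to build the corpus object.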
The package includes two demo apps.
To explore Jane Austen’s novels (data accessed through the janeaustenr package), run demo_jane_austen().
To explore the US presidents’ State of the Union addresses (data accessed through the sotu package), run demo_sotu().
For more info, see https://kgjerde.github.io/corporaexplorer/articles/jane_austen.html and https://kgjerde.github.io/corporaexplorer/articles/sotu.html, and also the function references.
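Both demo apps can be launched from an interactive R session roughly as follows. This is a sketch that assumes the 'janeaustenr' and 'sotu' data packages must be installed for the respective demo to run; the `demo_requirements` vector is a hypothetical helper, not part of corporaexplorer.

```r
# Hypothetical helper: which data package each demo function relies on.
demo_requirements <- c(demo_jane_austen = "janeaustenr", demo_sotu = "sotu")

# Check which of the data packages are installed.
available <- vapply(demo_requirements, requireNamespace, logical(1),
                    quietly = TRUE)

if (interactive() && requireNamespace("corporaexplorer", quietly = TRUE)) {
  # Jane Austen's novels
  if (available[["demo_jane_austen"]]) corporaexplorer::demo_jane_austen()

  # State of the Union addresses
  if (available[["demo_sotu"]]) corporaexplorer::demo_sotu()
}
```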
A note on platforms and encoding
corporaexplorer works on Mac OS, Windows and Linux, and there are some important differences in how R handles text on the different platforms. If you are working with plain English text, there will most likely be no issues with encoding on any platform. Unfortunately, working with non-ASCII text in R (e.g. text containing non-English characters) can be complicated – in particular on Windows.
On Mac OS or Linux, problems with encoding will likely not arise at
all. If problems do arise, they can typically be solved by making the R
“locale” Unicode-friendly (e.g. with
Sys.setlocale("LC_ALL", "en_US.UTF-8")). NB! This assumes that the text is UTF-8 encoded, so
if changing the locale in this way does not help, make sure that the
text is encoded as UTF-8 characters. Alternatively, if you can ascertain
the character encoding, set the locale correspondingly.
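To check whether a text is valid UTF-8 and to convert it when the source encoding is known, base R offers validUTF8(), Encoding() and iconv(). The following is a minimal sketch using a hypothetical Latin-1 encoded string; the variable names are illustrative and not part of corporaexplorer.

```r
# Inspect the character-type locale of the current R session.
Sys.getlocale("LC_CTYPE")

# Hypothetical example: the Norwegian word "blåbær" stored as Latin-1 bytes.
latin1_text <- rawToChar(as.raw(c(0x62, 0x6c, 0xe5, 0x62, 0xe6, 0x72)))

# The raw Latin-1 bytes do not form valid UTF-8.
validUTF8(latin1_text)

# Convert from the known source encoding to UTF-8.
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

# The converted string is valid UTF-8 and is marked as such.
validUTF8(utf8_text)
Encoding(utf8_text)
```

In a real corpus, the same iconv() call can be applied to the whole text column before the corpus is prepared, provided the source encoding has been ascertained.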
On Windows, things can be much more complicated. The most important
thing is to check carefully that the texts appear as expected in
corporaexplorer’s apps, and that the searches function as expected. If
there are problems, a good place to start is a blog post with the
telling title “Escaping from character encoding hell in R on Windows”.
For (a lot) more information about encoding, see this informative article by David C. Zentgraf.
Contributions in the form of feedback, bug reports and code are most welcome.
Functions in corporaexplorer
| Name | Description |
|---|---|
| create_sotu_df | Create a data frame with State of the Union texts and metadata |
| test_data | A tiny test dataset to test basic functionality |
| get_df | Retrieve the document data frame from a corporaexplorerobject |
| transform_365 | Convert "data_dok" tibble to "data_365" tibble |
| prepare_data | Prepare data for corpus exploration |
| get_matrix | Split up returned list from matrix_via_r() |
| run_document_extractor | Launch Shiny app for retrieval of documents from text collection |
| demo_sotu | Demo apps: State of the Union addresses |
| explore | Launch Shiny app for exploration of text collection |
| get_term_vector | Split up returned list from matrix_via_r() |
| demo_jane_austen | Demo app: Jane Austen's novels |
| corporaexplorer-deprecated | Deprecated functions in package corporaexplorer |
| include_columns_for_ui_checkboxes | Values for custom UI sidebar checkbox filtering |
| transform_regular | Adjusts data frame to corporaexplorer format |
| matrix_via_r | Create document term matrix for fast search of single words |
License: GPL-3 | file LICENSE
Date/Publication: 2020-02-07 17:50:02 UTC
Packaged: 2020-02-07 08:49:49 UTC; Kristian