CorporaCoCo
The package implements the method introduced in Wiegand and Hennessey et al. (2017a). It identifies significant difference in co-occurrence counts for a given node or set of nodes across two corpora, using a Fisher’s Exact test.
A good place to start is the ‘Introduction to CorporaCoCo’ vignette. You can open the vignette with vignette("intro", package = "CorporaCoCo"). For a list of all documentation use library(help="CorporaCoCo"). For updates on development versions of the package and documentation, please watch this GitHub page.
References
Wiegand, V., Hennessey, A., Tench, C. R., & Mahlberg, M. (2017a, May 24). Comparing co-occurrences between corpora. 38th ICAME conference, Charles University, Prague.
Wiegand, V., Hennessey, A., Tench, C. R., & Mahlberg, M. (2017b, July 24). A cookbook of co-occurrence comparison techniques and how they relate to the subtleties in your research question. 9th International Corpus Linguistics Conference, University of Birmingham, Birmingham.
A very simple example of usage
This example takes the two Dickens novels 'Great Expectations' and 'A Tale of Two Cities' and compares the co-occurrences of a set of body part nouns. The idea is that since body part nouns are common in suspensions the statistically significant co-occurrence differences should include personal pronouns reflecting the differing narrative voices of the texts. We use the texts here only as sample data; we retrieve them from the CLiC API using the clicclient package (the other functions of that package are still under development; please see the clicclient GitHub page for details).
#devtools::install_github("mahlberg-lab/clicclient")
library(clicclient)
# retrieve texts for 'A Tale of two Cities' (TTC) and 'Great Expectations' from the CLiC corpora using the 'clicclient' package
TTC <- clic_texts("TTC")
GE <- clic_texts("GE")
# tokenize / create corp_text objects
TTC_text <- corp_text(TTC)
GE_text <- corp_text(GE)
# count co-occurrences / create corp_surface objects
TTC_cooccurs <- corp_surface(TTC_text, span = "5LR")
GE_cooccurs <- corp_surface(GE_text, span = "5LR")
# set the body part nodes
nodes <- c('back', 'eye', 'eyes', 'forehead', 'hand', 'hands', 'head', 'shoulder')
# run co-occurrence comparison with corp_coco
results <- corp_coco(TTC_cooccurs, GE_cooccurs, nodes = nodes, fdr = 0.01)
results
## x y H_A M_A H_B M_B effect_size CI_lower CI_upper p_value p_adjusted
## 1: back me 3 1347 49 2391 3.2014513 1.565327 5.5296252 5.771195e-07 5.638457e-04
## 2: back my 1 1349 31 2409 4.1171190 1.528647 9.4632893 1.931898e-05 9.437321e-03
## 3: eyes i 10 1640 53 1747 2.3142117 1.318110 3.4597915 1.320717e-07 6.114918e-05
## 4: eyes joe 0 1650 16 1784 Inf 1.831254 Inf 3.641000e-05 7.000398e-03
## 5: eyes me 3 1647 25 1775 2.9502767 1.233703 5.3219224 3.779912e-05 7.000398e-03
## 6: eyes my 5 1645 58 1742 3.4522796 2.142760 5.1326458 1.058629e-11 9.802907e-09
## 7: eyes the 122 1528 62 1738 -1.1620034 -1.638598 -0.6973805 2.839285e-07 8.763925e-05
## 8: hand his 175 2315 114 2586 -0.7778652 -1.140960 -0.4192322 1.183505e-05 4.540716e-03
## 9: hand i 19 2471 75 2625 1.8932508 1.146902 2.7069140 2.704759e-08 1.556589e-05
## 10: hand my 12 2478 86 2614 2.7637906 1.880930 3.7755453 5.884107e-14 6.772608e-11
## 11: hands my 5 1125 45 1775 2.5113109 1.177037 4.2063750 1.127123e-05 9.321308e-03
## 12: head my 10 1760 62 2258 2.2723508 1.292764 3.4061853 1.024855e-07 1.111968e-04
plot(results)Further examples of how the method has been used can be found in:
Mahlberg, M., Wiegand, V., & Hennessey, A. (2020). Eye language – body part collocations and textual contexts in the nineteenth-century novel. In L. Fesenmeier & I. Novakova (Eds.), Phraseology and Stylistics of Literary Language/Phraséologie et Stylistique de la Langue Littéraire (pp. 143–176). Peter Lang. https://www.academia.edu/45152494/Eye_language_body_part_collocations_and_textual_contexts_in_the_nineteenth_century_novel
Wiegand, V. (2019). A Corpus Linguistic Approach to Meaning-Making Patterns in Surveillance Discourse [PhD, University of Birmingham]. https://etheses.bham.ac.uk/id/eprint/9778
Installing from CRAN
In an R session type
install.packages('CorporaCoCo')Installing the latest development version directly from GitHub
Linux
In an R session type:
pkg_file <- tempfile()
download.file(url = 'https://github.com/mahlberg-lab/CorporaCoCo/archive/master.tar.gz', mode = 'wb', method = 'wget', destfile = pkg_file)
install.packages(pkg_file, repos = NULL, type = 'source')Mac OSX / Windows
download.file may not support fetching https URLs. Alternatively, you
can use the the CRAN package downloader
to fetch the archive instead:
# install.packages("downloader")
pkg_file <- tempfile()
downloader::download(url = 'https://github.com/mahlberg-lab/CorporaCoCo/archive/master.tar.gz', mode = 'wb', destfile = pkg_file)
install.packages(pkg_file, repos = NULL, type = 'source')Alternatively use the devtools CRAN package
If you have the CRAN package devtools you can use this to install directly from github:
# install.packages("devtools")
devtools::install_github("mahlberg-lab/CorporaCoCo")Testing
Unit tests are located in the /tests/testthat directory. We use the 'testthat' package to generate tests.
To run the tests yourself, just do:
devtools::test()
ℹ Loading CorporaCoCo
ℹ Testing CorporaCoCo
✔ | OK F W S | Context
✔ | 39 | coco [0.5 s]
✔ | 4 | corp_concordance [0.1 s]
✔ | 10 | corp_cooccurrence
✔ | 12 | corp_text
✔ | 2 | surface_coco [0.1 s]
✔ | 44 | surface [0.3 s]══ Results ══════════════════════════════════
Duration: 1.2 s
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 111 ]
Continuous integration testing is set up using GitHub Actions - see the .github directory in the root of this project for more information.