bibles: A selection of bible-texts

Description

A selection of six bible texts as prepared by the paralleltext.info project.

Usage

data(bibles)

Arguments

docType

data

source

All data is taken from paralleltext.info. For further information and links to the original versions, please check the detailed meta-information for all texts as provided on the webpage.

Details

For details on the formatting, see paralleltext.info. Basically, all verse-numbering is harmonized, the text is unicode normalized, translations that capture multiple verses are included in the first of those verses, with the others left empty. Empty verses are thus a sign of combined translations. Verses that are not translated simply do not occur in the original files. Most importantly, the text are tokenized as to wordform, i.e. all punctuation and other non-word-based symbols are separated by spaces. In this way, space can be used for a quick wordform-based tokenization. The addition of spaces has been manually corrected to achieve a high precision of language-specific wordform tokenization.

The Bible texts are provided as named vectors of strings, each containing one verse. The names of the vector are codes for the verses. See Mayer & Cysouw (2014) for more information about the verse IDs and other formatting issues.

References

Mayer, Thomas and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. Proceedings of LREC 2014.

Examples

Run this code

# ----- load data -----

data(bibles)

# ----- separate into sparse matrices -----

# use splitText to turn a bible into a sparse matrix of wordforms x verses
E <- splitText(bibles$eng, simplify = TRUE, lowercase = FALSE)

# all wordforms from the first verse
# (internally using pure Unicode collation, i.e. ordering is determined by Unicode numbering)
which(E[,1] > 0)

# ----- co-occurrence across text -----

# how often do 'father' and 'mother' co-occur in one verse?
# (ignore warnings of chisq.test, because we are not interested in p-values here)

( cooc <- table(E["father",] > 0, E["mother",] > 0) )
suppressWarnings( chisq.test(cooc)$residuals )

# the function 'sim.words' does such computations efficiently 
# for all 15000 x 15000 pairs of words at the same time

system.time( sim <- sim.words(bibles$eng, lowercase = FALSE) )
sim["father", "mother"]

Run the code above in your browser using DataLab