
qlcMatrix (version 0.9.2)

splitText: Construct sparse matrices from parallel texts

Description

Convenience functions to read parallel texts and to split parallel texts into sparse matrices.

Usage

splitText(text, globalSentenceID = NULL, localSentenceID = names(text), sep = " ",
	simplify = FALSE, lowercase = TRUE)

read.text(file)

Arguments

text: vector of strings, typically sentences with wordforms separated by sep (as returned by read.text).

globalSentenceID: vector of all sentence IDs that may occur across a collection of parallel texts, used to align the columns of the resulting matrices. When NULL, the sentences of text itself are used.

localSentenceID: vector with the IDs of the sentences in text, to be matched against globalSentenceID. Defaults to names(text).

sep: separator used to split the sentences into wordforms. Defaults to a space.

simplify: when TRUE, return a single sparse matrix linking wordforms to sentences instead of a list of matrices.

lowercase: when TRUE, also link lowercased wordforms to the original wordforms (see Value).

file: for read.text, the name of the file to be read.

Value

When simplify = FALSE, a list is returned with the following elements:

runningWords: single vector with the complete text (ignoring original sentence breaks), separated into strings according to sep.

wordforms: vector with all wordforms as attested in the text (according to the specified separator). Ordering of wordforms is done by ttMatrix, which by default uses the "C" collation locale.

lowercase: only returned when lowercase = TRUE. Vector with all unique wordforms after conversion to lowercase.

RS: sparse pattern matrix of class ngCMatrix with running words (R) as rows and sentence IDs (S) as columns. When globalSentenceID = NULL, the sentences are the elements of the original text; otherwise they are the specified global sentence IDs.

WR: sparse pattern matrix of class ngCMatrix with wordforms (W) as rows and running words (R) as columns.

wW: only returned when lowercase = TRUE. Sparse pattern matrix of class ngCMatrix linking lowercased wordforms to the original wordforms.

When simplify = TRUE, the result is a single sparse matrix (of class dgCMatrix) linking wordforms (either with or without case) to sentences (either global or local). Note that with simplify = TRUE and lowercase = FALSE the result is the sparse matrix as available at paralleltext.info (there in .mtx format), with the wordforms included as row names. However, the matrix produced here includes frequencies for words that occur more than once per sentence; these have been removed from the .mtx version available online.
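The relation between these matrices can be illustrated without qlcMatrix itself. Below is a minimal sketch, using only base R and the Matrix package, of how WR and RS fit together for the trivial example from the Examples section; names like forms and running are purely illustrative, and splitText actually builds these matrices via pwMatrix and ttMatrix.

```r
library(Matrix)

text <- c("This is a sentence .", "A sentence this is !")
splitted <- strsplit(text, split = " ")
running  <- unlist(splitted)                         # cf. runningWords
# method = "radix" mimics the "C" collation used by ttMatrix
forms    <- sort(unique(running), method = "radix")  # cf. wordforms

# WR: wordforms (rows) x running words (columns), pattern matrix
WR <- sparseMatrix(i = match(running, forms), j = seq_along(running),
                   dims = c(length(forms), length(running)))

# RS: running words (rows) x sentences (columns)
sent <- rep(seq_along(text), times = lengths(splitted))
RS <- sparseMatrix(i = seq_along(running), j = sent,
                   dims = c(length(running), length(text)))

# multiplying links wordforms directly to sentences,
# analogous to the result with simplify = TRUE
WS <- (WR * 1) %*% (RS * 1)
rownames(WS) <- forms
WS["sentence", ]   # "sentence" occurs once in each sentence
```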

Details

The function splitText is actually just a nice example of how pwMatrix, jMatrix, and ttMatrix can be used to work with parallel texts.

The function read.text is a convenience function to read parallel texts as prepared in the paralleltext.info project.
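Concretely, the files in that project are plain text with one verse per line, the verse ID and the verse text separated by a tab; read.text returns a named character vector that can be passed directly to splitText. A small base-R sketch of that parsing, assuming this two-column tab-separated layout (the temporary file here is purely illustrative):

```r
# write a tiny file in the assumed "ID<tab>text" layout
tmp <- tempfile()
writeLines(c("40001001\tThis is a sentence .",
             "40001002\tA sentence this is !"), tmp)

# read it back into a named character vector,
# roughly what read.text() returns
lines <- readLines(tmp)
parts <- strsplit(lines, split = "\t")
txt   <- vapply(parts, `[`, character(1), 2)
names(txt) <- vapply(parts, `[`, character(1), 1)

txt["40001001"]   # the verse text, named by its ID
```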

See Also

bibles for some texts that led to the development of this function. sim.words for a convenience function to easily extract possible translation equivalents through co-occurrence (using splitText for the data preparation).

Examples

# a trivial example to see the results of this function:
text <- c("This is a sentence .","A sentence this is !","Is this a sentence ?")
splitText(text)
splitText(text, simplify = TRUE, lowercase = FALSE)

# reasonably quick with complete bibles (about 1-2 seconds per complete bible)
# texts with only the New Testament are even quicker
data(bibles)
system.time(eng <- splitText(bibles$eng, bibles$verses))
system.time(deu <- splitText(bibles$deu, bibles$verses))

# Usage example: number of co-occurrences between two bibles
# How often do words from one language co-occur with words from the other language?
ENG <- (eng$wW * 1) %*% (eng$WR * 1) %*% (eng$RS * 1)
DEU <- (deu$wW * 1) %*% (deu$WR * 1) %*% (deu$RS * 1)
C <- tcrossprod(ENG,DEU)
rownames(C) <- eng$lowercase
colnames(C) <- deu$lowercase
C[	c("father","father's","son","son's"),
	c("vater","vaters","sohn","sohne","sohnes","sohns")
	]

# Pure counts are not very interesting. This is better:
R <- assocSparse(t(ENG), t(DEU))
rownames(R) <- eng$lowercase
colnames(R) <- deu$lowercase
R[	c("father","father's","son","son's"),
	c("vater","vaters","sohn","sohne","sohnes","sohns")
	]

# For example: best co-occurrences for the english word "mine"
sort(R["mine",], decreasing = TRUE)[1:10]

# To get a quick-and-dirty translation matrix:
# adding maxima from both sides works quite well
# (note: the "+" must end the first line, or R treats
#  the second line as a separate expression)
best <- colMax(R, which = TRUE, ignore.zero = FALSE)$which +
	rowMax(R, which = TRUE, ignore.zero = FALSE)$which
best <- as(best, "nMatrix")

which(best["your",])
which(best["went",])

# all of the above is also performed by the function sim.words

# A final speed check:
# split all 4 texts, and simplify them into one matrix
# They all have the same columns, so they can be combined with rBind
system.time(all <- sapply(bibles[-1], function(x){splitText(x, bibles$verses, simplify = TRUE)}))
all <- do.call(rBind, all)

# then try a single co-occurrence measure on all pairs from these 72K words
# (so we are doing about 2.6e9 comparisons here!)
system.time( S <- cosSparse(t(all)) )

# this goes extremely fast! As long as everything fits into RAM this works nicely.
# Note that S quickly gets large
print(object.size(S), units = "auto")

# but most of it can be thrown away, because the values are too low anyway
# this leads to a factor 10 reduction in size:
S <- drop0(S, tol = 0.2)
print(object.size(S), units = "auto")
