sim.words: Similarity measures for words between two languages, based on co-occurrences in parallel text

Description

Based on co-occurrences in a parallel text, this convenience function (a wrapper around various other functions from this package) efficiently computes word-to-word similarities that approximate translational equivalence.

Usage

sim.words(text1, text2 = NULL, method = res, weight = NULL, 
	lowercase = TRUE, best = FALSE, tol = 0)

Arguments

Value

When best = FALSE, a single sparse matrix of type dgCMatrix is returned, holding the values of the chosen statistic. All unique wordforms of text1 are included as row names, and those of text2 as column names.

When best = TRUE, a list of two sparse matrices is returned:

sim: the same matrix as above

best: a sparse pattern matrix of type ngCMatrix with the same dimensions as sim. Only the 'best' translations between the two languages are marked.
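To make the idea of a 'best' pattern matrix concrete, here is a minimal base R sketch on a small dense matrix. It does not call sim.words. The toy words and values are made up, and the rule used here (mark a pair when it is the maximum of its row or of its column) is an assumption for illustration, not a statement about the package internals.

```r
# Toy similarity matrix: rows are words of language 1, columns of language 2.
# All names and values are made up for illustration.
sim <- matrix(
  c(0.9, 0.1, 0.0,
    0.2, 0.8, 0.3,
    0.1, 0.7, 0.2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("your", "house", "home"), c("dein", "Haus", "und"))
)

# TRUE where a cell holds the maximum of its row
row_best <- sim == apply(sim, 1, max)

# TRUE where a cell holds the maximum of its column
col_best <- t(t(sim) == apply(sim, 2, max))

# Assumption: a pair counts as 'best' when it wins in either direction
best <- row_best | col_best

# Marked translations for one word, analogous to which(sim2$best["your", ])
names(which(best["home", ]))
```

In this toy case both "house" and "home" are linked to "Haus", which is the many-to-one situation. The pair "house"/"und" is also marked, only because it is the largest value in a weak column, which shows why the approach is crude.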

Details

This function takes care to match multiple verses that are translated into one verse. See bibles for a survey of the encoding assumptions made here.

The parameter method can take anything that is also available for assocSparse. Similarities are computed using that function.

When weight is specified, the similarities are computed using cosSparse with the default setting norm = norm2. All weights available for that function can also be used here.
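As a rough illustration of the cosine-based case, the following base R sketch computes cosine similarities between word occurrence vectors over aligned sentences. It is not the cosSparse implementation and it leaves out the weighting step entirely. The toy matrices and the helper unit are hypothetical.

```r
# Rows are words, columns are aligned sentences (1 = word occurs in sentence).
# Toy data, made up for illustration.
eng <- rbind(your  = c(1, 1, 0, 1),
             house = c(0, 1, 1, 0))
deu <- rbind(dein = c(1, 1, 0, 1),
             Haus = c(0, 1, 1, 0),
             und  = c(1, 1, 1, 1))

# Hypothetical helper: scale every word vector (row) to Euclidean length 1
unit <- function(m) m / sqrt(rowSums(m^2))

# Cosine similarity between each English and each German word
cosine <- unit(eng) %*% t(unit(deu))
round(cosine, 2)
```

Words with identical distributions over the sentences ("your" and "dein") reach the maximum of 1, while a word that occurs everywhere ("und") still scores fairly high with every word. Reducing that effect is the kind of thing a weighting such as idf is meant for.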

The option best = TRUE uses rowMax and colMax. This approach to finding the 'best' translation is crude, but it works reasonably well in one-to-one and many-to-one situations. The option takes considerably more time to finish, because computing row-wise maxima of sparse matrices is not trivial to optimize. Consider raising tol, as this removes low values that will not matter for the maxima anyway. See the examples below. Guidelines for the value of tol are difficult to give, as it depends on the method used and on the distribution of the data (i.e. the number of sentences and the frequency distribution of the words in the text). Some suggestions:

When weight is specified, results range between -1 and +1. In that case tol = 0.1 should never lead to problems, and often even tol = 0.3 or higher will lead to identical results.

When weight is not specified (i.e. assocSparse will be used), results range between -Inf and +Inf, so the tolerance is more problematic. In general, tol = 2 seems to be unproblematic. A higher tolerance, e.g. tol = 10, can be used to find the 'obvious' translations, but you will lose some of the more incidental co-occurrences.
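The effect of a tolerance can be sketched with drop0 from the Matrix package. This assumes that tol acts as a simple threshold below which entries are removed from the sparse matrix. The values are made up, and the example does not call sim.words.

```r
library(Matrix)

# Toy sparse similarity matrix; values are made up for illustration.
sim <- Matrix(
  c(12.0, 0.5, 0.0,
     1.5, 9.0, 0.3,
     0.2, 7.0, 1.0),
  nrow = 3, byrow = TRUE, sparse = TRUE
)

# Assumption: entries with absolute value <= 2 are dropped
pruned <- drop0(sim, tol = 2)

# Fewer stored values after pruning ...
c(before = nnzero(sim), after = nnzero(pruned))

# ... but every row still has its maximum in the same column
identical(
  max.col(as.matrix(sim), ties.method = "first"),
  max.col(as.matrix(pruned), ties.method = "first")
)
```

The row maxima survive here because they all lie well above the threshold. With a tolerance above 7, the third row would lose its maximum, which is the risk of choosing tol too high.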

References

Mayer, Thomas and Michael Cysouw. 2012. Language comparison through sparse multilingual word alignment. Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, 54-62. Avignon: Association for Computational Linguistics.

Examples


data(bibles)

# ----- small example of co-occurrences -----

# as an example, just take partially overlapping parts of two bibles
# sim.words uses the names to get the parallelism right, so this works
eng <- bibles$eng[1:5000]
deu <- bibles$deu[2000:7000]
sim <- sim.words(eng, deu, method = res)

# but the statistics are not perfect (because there is too little data)
# sorted co-occurrences for the English word "your" in German:
sort(sim["your",], decreasing = TRUE)[1:10]

# ----- complete example of co-occurrences -----

# running the complete bibles takes a bit more time (but still manageable)
system.time(sim <- sim.words(bibles$eng, bibles$deu, method = res))

# results are much better
# sorted co-occurrences for the English word "your" in German:
sort(sim["your",], decreasing = TRUE)[1:10]

# ----- look for 'best' translations -----

# note that selecting the 'best' takes even more time
system.time(sim2 <- sim.words(bibles$eng, bibles$deu, method = res, best = TRUE))

# best co-occurrences for the English word "your"
which(sim2$best["your",])

# but this can be made faster by removing low values
# (though the threshold tol = 5 depends on the method used)
system.time(sim3 <- sim.words(bibles$eng, bibles$deu, best = TRUE, method = res, tol = 5))

# note that the decision on the 'best' remains the same here
all.equal(sim2$best, sim3$best)

# ----- computations also work with other languages -----

# everything works completely language-independently
# translations for 'we' in Tagalog:
sim <- sim.words(bibles$eng, bibles$tgl, best = TRUE, weight = idf, tol = 0.1)
which(sim$best["we",])
