bidictionary: Building an Automatic Bilingual Dictionary

Description

Builds an automatic bilingual dictionary for two languages from a given sentence-aligned parallel corpus.

Usage

bidictionary (..., n = -1L, iter = 15, prob = 0.8,  
              dtfile.path = NULL, name.sorc = 'f', name.trgt = 'e')

Arguments

...

Further arguments to be passed to prepare.data.

n

Number of sentences to be read.

iter

the number of iterations for IBM Model 1.

prob

the minimum word translation probability.

dtfile.path

If NULL (usually for the first run), a data.table is created containing the cross words of all sentences with their matched probabilities. It is saved to a file whose name combines name.sorc, name.trgt, n and iter, as in "f.e.n.iter.RData".

If a specific file name is given, that file is read and the function continues from there, i.e. it finds the dictionary of the two given languages.

name.sorc

source language's name in mydictionary.

name.trgt

target language's name in mydictionary.
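
The file name pattern described under dtfile.path can be sketched as follows. The helper dt_file_name is hypothetical (it is not part of the package), and joining the parts with dots in exactly this order is an assumption inferred only from the pattern "f.e.n.iter.RData".

```r
# Hypothetical helper, NOT part of the package: rebuilds the documented
# "f.e.n.iter.RData" pattern from the argument values.
# Assumption: the parts are joined with '.' in this order.
dt_file_name <- function(name.sorc = 'f', name.trgt = 'e', n = -1L, iter = 15) {
  paste(name.sorc, name.trgt, n, iter, 'RData', sep = '.')
}

# Name that a first run with these settings would be expected to produce,
# under the assumption above.
saved <- dt_file_name('BULGARIAN', 'ENGLISH', n = 2000, iter = 15)
print(saved)
```

Under that assumption, a later call could pass such a name as dtfile.path so that the saved data.table is read instead of being rebuilt.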

Value

A list.

time

A number (in seconds, minutes, or hours).

number_input

An integer.

Value_prob

A decimal number between 0 and 1.

iterIBM1

An integer.

dictionary

A matrix.
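
As a rough illustration of how the returned list might be inspected, the sketch below uses a hand-built stand-in object. Only the component names come from the list above; every value, and the two-column layout of dictionary, is a placeholder assumption rather than real output.

```r
# Stand-in for a returned list: the component names follow the Value section,
# but all values here are hypothetical placeholders, not real results.
res <- list(
  time         = 1.2,    # placeholder elapsed time
  number_input = 2000L,  # placeholder number of sentences read
  Value_prob   = 0.8,    # placeholder probability threshold
  iterIBM1     = 15L,    # placeholder number of IBM Model 1 iterations
  # Assumed layout: one row per word pair (source token, target token).
  dictionary   = matrix(c('f1', 'f2', 'e1', 'e2'), ncol = 2)
)

# Components are accessed by name, like any R list.
str(res, max.level = 1)
n_pairs <- nrow(res$dictionary)
```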

Details

The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and has a size of 170 MB. Using only the first 10,000 sentences already gives a good dictionary, and it takes just 1.356784 minutes on an ordinary computer. The results can be found at

http://www.um.ac.ir/~sarmad/word.a/bidictionary.pdf
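
To give a feel for what the iter and prob arguments control, here is a minimal sketch of IBM Model 1 expectation-maximization on a toy corpus, followed by a probability cut-off. It is not the package's implementation: the function ibm1_sketch, the toy sentences, the omission of a NULL word, and the way pairs are filtered are all assumptions made for illustration.

```r
# Toy sentence-aligned corpus (hypothetical data, not Mizan or Europarl).
src <- list(c('das', 'haus'), c('das', 'buch'), c('ein', 'buch'))
trg <- list(c('the', 'house'), c('the', 'book'), c('a', 'book'))

# Hypothetical helper: plain IBM Model 1 EM without a NULL word.
ibm1_sketch <- function(src, trg, iter = 15) {
  f_words <- unique(unlist(src))
  e_words <- unique(unlist(trg))
  # tr[f, e] approximates P(f | e); start from a uniform table.
  tr <- matrix(1 / length(f_words),
               nrow = length(f_words), ncol = length(e_words),
               dimnames = list(f_words, e_words))
  for (it in seq_len(iter)) {
    count <- tr * 0                      # expected counts, reset each iteration
    # E-step: distribute each source word over the target words of its sentence.
    for (s in seq_along(src)) {
      f <- src[[s]]
      e <- trg[[s]]
      for (fw in f) {
        z <- sum(tr[fw, e])              # normaliser for this source word
        for (ew in e) {
          count[fw, ew] <- count[fw, ew] + tr[fw, ew] / z
        }
      }
    }
    # M-step: renormalise every target-word column so it sums to 1.
    tr <- sweep(count, 2, colSums(count), '/')
  }
  tr
}

tr <- ibm1_sketch(src, trg, iter = 15)

# Keep only the word pairs whose probability reaches the threshold
# (assumed here to be how a cut-off such as prob would be applied).
prob <- 0.8
keep <- which(tr >= prob, arr.ind = TRUE)
dict <- data.frame(source = rownames(tr)[keep[, 'row']],
                   target = colnames(tr)[keep[, 'col']],
                   prob   = tr[keep])
print(dict)
```

More iterations sharpen the table towards the pairs that co-occur consistently, while a higher threshold keeps fewer but more reliable pairs. This is one reason the results depend on the corpus.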

References

Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

Europarl Parallel Corpus, version 7, Bulgarian-English. Retrieved from http://statmt.org/europarl/v7/bg-en.tgz

Examples

# NOT RUN {
# Since extracting bg-en.tgz from the Europarl corpus is time consuming,
# the unzipped files have been temporarily exported to
# http://www.um.ac.ir/~sarmad/... .

# }
# NOT RUN {
dic1 = bidictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en', 
                      n = 2000, encode.sorc = 'UTF-8', 
                      name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH')
              
dic2 = bidictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en', 
                      n = 2000, encode.sorc = 'UTF-8', 
                      name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH',
                      remove.pt = FALSE)
# }