Learn R Programming

word.alignment (version 1.1)

bidictionary: Building an Automatic Bilingual Dictionary

Description

It builds an automatic bilingual dictionary of two languages based on given sentence-aligned parallel corpus.

Usage

bidictionary (..., n = -1L, iter = 15, prob = 0.8,  
              dtfile.path = NULL, name.sorc = 'f', name.trgt = 'e')

Arguments

...

Further arguments to be passed to prepare.data.

n

Number of sentences to be read.

iter

the number of iterations for IBM Model 1.

prob

the minimum word translation probanility.

dtfile.path

if NULL (usually for the first time), a data.table will be created contaning cross words of all sentences with their matched probabilities. It saves into a file named as a combination of name.sorc, name.trgt, n and iter as "f.e.n.iter.RData".

If specific file name is set, it will be read and continue the rest of the function, i.e. : finding dictionary of two given languages.

name.sorc

source language's name in mydictionary.

name.trgt

traget language's name in mydictionary.

Value

A list.

time

A number. (in second/minute/hour)

number_input

An integer.

Value_prob

A decimal number between 0 and 1.

iterIBM1

An integer.

dictionary

A matrix.

Details

The results depend on the corpus. As an example, we have used English-Persian parallel corpus named Mizan which consists of more than 1,000,000 sentence pairs with a size of 170 Mb. For the 10,000 first sentences, we have a nice dictionary. It just takes 1.356784 mins using an ordinary computer. The results can be found at

http://www.um.ac.ir/~sarmad/word.a/bidictionary.pdf

References

Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

http://statmt.org/europarl/v7/bg-en.tgz

See Also

scan

Examples

Run this code
# NOT RUN {
# Since the extraction of  bg-en.tgz in Europarl corpus is time consuming, 
# so the aforementioned unzip files have been temporarily exported to 
# http://www.um.ac.ir/~sarmad/... .

# }
# NOT RUN {
dic1 = bidictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en', 
                      n = 2000, encode.sorc = 'UTF-8', 
                      name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH')
              
dic2 = bidictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en', 
                      n = 2000, encode.sorc = 'UTF-8', 
                      name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH',
                      remove.pt = FALSE)
# }

Run the code above in your browser using DataLab