mydictionary: Building an Automatic Bilingual Dictionary

Description

Builds an automatic bilingual dictionary for two languages from a given sentence-aligned parallel corpus.

Usage

mydictionary(file_train1, file_train2, nrec = -1,
             encode.sorc = 'unknown', encode.trgt = 'unknown',
             iter = 15, prob = 0.8, minlen = 5, maxlen = 40,
             lang1 = 'Farsi', lang2 = 'English', removePt = TRUE,
             dtfile_path = NULL, f1 = 'fa', e1 = 'en',
             result_file = 'mydictionaryResults')

Arguments

file_train1

the name of the source language file in the training set.

file_train2

the name of the target language file in the training set.

nrec

the number of sentences to be read. If -1, all sentences are read.

encode.sorc

encoding to be assumed for the source language. If the value is "latin1" or "UTF-8", it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details, see the scan function.

encode.trgt

encoding to be assumed for the target language. If the value is "latin1" or "UTF-8", it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details, see the scan function.

iter

the number of iterations for IBM Model 1.

prob

the minimum word translation probability.

minlen

the minimum sentence length.

maxlen

the maximum sentence length.

lang1

the source language's name in mydictionary.

lang2

the target language's name in mydictionary.

removePt

logical. If TRUE, it removes all punctuation marks.

dtfile_path

if NULL (usually the first time), a data.table is created containing the cross words of all sentences with their matched probabilities. It is saved to a file whose name combines f1, e1, nrec and iter as "f1.e1.nrec.iter.RData".

If a specific file name is given, that file is read and the function continues from there, i.e. it goes on to find the dictionary of the two given languages.

f1

a notation for the source language (default = 'fa').

e1

a notation for the target language (default = 'en').

result_file

the output results file name.
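The naming pattern described for dtfile_path can be illustrated with a short sketch. It assumes the file name is formed exactly as the pattern "f1.e1.nrec.iter.RData" suggests; the variable dtfile and the chosen values are illustrative only, and the commented-out call shows how such a file might be passed back on a later run.

```r
# Illustrative values; f1 and e1 are the function defaults.
f1   <- 'fa'
e1   <- 'en'
nrec <- 2000
iter <- 15

# Assumption: the saved file name follows "f1.e1.nrec.iter.RData" literally.
dtfile <- paste(f1, e1, nrec, iter, 'RData', sep = '.')
print(dtfile)
# [1] "fa.en.2000.15.RData"

# A later call could then reuse the saved table instead of rebuilding it
# (not run here; file_train1 and file_train2 stand for your own corpus files):
# dic <- mydictionary(file_train1, file_train2, nrec = nrec, iter = iter,
#                     dtfile_path = dtfile)
```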

Value

A list.

time

A number (in seconds, minutes or hours).

number_input

An integer.

Value_prob

A decimal number between 0 and 1.

iterIBM1

An integer.

dictionary

A matrix.

Details

The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs with a size of 170 Mb. For the first 10,000 sentences, the function produces a good dictionary in 1.356784 minutes on an ordinary computer. The results can be found at

http://www.um.ac.ir/~sarmad/word.a/mydictionary.pdf
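To give a feel for what iter and prob control, here is a minimal sketch of IBM Model 1 on a toy corpus, written in base R. It is not the package's implementation: the function ibm1_sketch, the toy sentences, and the omission of the NULL word are all assumptions made for illustration. The sketch runs the expectation-maximization updates iter times and then keeps the word pairs whose translation probability is at least prob.

```r
# Toy sentence-aligned corpus (hypothetical data, not Mizan or Europarl).
src <- list(c("das", "haus"), c("das", "buch"), c("ein", "buch"))
trg <- list(c("the", "house"), c("the", "book"), c("a", "book"))

# Hypothetical helper: plain IBM Model 1 without a NULL word.
ibm1_sketch <- function(src, trg, iter = 15) {
  s_vocab <- unique(unlist(src))
  t_vocab <- unique(unlist(trg))
  # t_prob[f, e] estimates P(f | e); start from a uniform table.
  t_prob <- matrix(1 / length(s_vocab),
                   nrow = length(s_vocab), ncol = length(t_vocab),
                   dimnames = list(s_vocab, t_vocab))
  for (i in seq_len(iter)) {
    counts <- t_prob * 0
    # E-step: each source word spreads one unit of count over the
    # target words of its sentence, in proportion to t_prob.
    for (k in seq_along(src)) {
      for (fw in src[[k]]) {
        z <- sum(t_prob[fw, trg[[k]]])
        for (ew in trg[[k]]) {
          counts[fw, ew] <- counts[fw, ew] + t_prob[fw, ew] / z
        }
      }
    }
    # M-step: renormalize so that each target word's column sums to 1.
    t_prob <- sweep(counts, 2, colSums(counts), "/")
  }
  t_prob
}

t_prob <- ibm1_sketch(src, trg, iter = 15)

# Keep only pairs whose probability reaches the threshold.
prob <- 0.8
hits <- which(t_prob >= prob, arr.ind = TRUE)
dictionary <- cbind(source = rownames(t_prob)[hits[, "row"]],
                    target = colnames(t_prob)[hits[, "col"]])
print(dictionary)
```

Raising iter lets the probabilities concentrate further on the best pairs, while raising prob makes the resulting dictionary smaller and stricter.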

References

Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

http://statmt.org/europarl/v7/bg-en.tgz

Examples


# NOT RUN {
# Extracting bg-en.tgz from the Europarl corpus is time consuming,
# so the unzipped files have been temporarily exported to
# http://www.um.ac.ir/~sarmad/... .

# }
# NOT RUN {
dic1 <- mydictionary('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                     nrec = 2000, encode.sorc = 'UTF-8', lang1 = 'BULGARIAN')

dic2 <- mydictionary('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                     nrec = 2000, encode.sorc = 'UTF-8', lang1 = 'BULGARIAN',
                     removePt = FALSE)
# }
