align_test.set: Computing One-to-Many Word Alignment Using a Parallel Corpus for a Given Test Set

Description

Trains IBM Model 1 on a given parallel corpus and uses it to align the words of a given sentence-aligned test set.

Usage

align_test.set(file_train1, file_train2, 
              tst.set_sorc, tst.set_trgt, 
              nrec = -1, nlen = -1, 
              encode.sorc = 'unknown', encode.trgt = 'unknown',
              minlen1 = 5, maxlen1 = 40, minlen2 = 5, maxlen2 = 40, 
              removePt = TRUE, all = FALSE, null.tokens = TRUE, 
              iter = 3, f1 = 'fa', e1 = 'en', 
              dtfile_path = NULL, file_align = 'alignment')

Arguments

file_train1

the name of source language file in training set.

file_train2

the name of target language file in training set.

tst.set_sorc

the name of source language file in test set.

tst.set_trgt

the name of target language file in test set.

nrec

the number of sentences in the training set to be read. If -1, it considers all sentences.

nlen

the number of sentences in the test set to be read. If -1, it considers all sentences.

encode.sorc

encoding to be assumed for the source language. If the value is "latin1" or "UTF-8" it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details please see scan function.

encode.trgt

encoding to be assumed for the target language. If the value is "latin1" or "UTF-8" it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details please see scan function.

minlen1

a minimum length of sentences in training set.

maxlen1

a maximum length of sentences in training set.

minlen2

a minimum length of sentences in test set.

maxlen2

a maximum length of sentences in test set.

removePt

logical. If TRUE, it removes all punctuation marks.

all

logical. If TRUE, it considers the third argument (lower = TRUE) in culf function.

null.tokens

logical. If TRUE, "null" is added at the beginning of each source sentence of the test set.

iter

the number of iterations for IBM Model 1.

f1

a label for the source language (default = 'fa').

e1

a label for the target language (default = 'en').

dtfile_path

if NULL (usually the first time the function is run), a data.table is created containing the cross words of all sentences with their matched probabilities. It is saved to a file whose name combines f1, e1, nrec and iter as "f1.e1.nrec.iter.RData".

If a specific file name is given, that file is read and the function continues with the remaining step, i.e. finding the word alignments for the test set.

file_align

the output results file name.
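
The effect of the filtering arguments can be illustrated with a small base R sketch. prep_sentences is a hypothetical helper, not a function of this package, and it assumes that sentence length is counted in words and that punctuation is removed before the length check; the package may do either differently.

```r
# Hypothetical helper (not part of the package): mimics the documented
# effect of removePt, minlen/maxlen and null.tokens on a character vector.
prep_sentences <- function(x, minlen = 5, maxlen = 40,
                           removePt = TRUE, null.tokens = FALSE) {
  if (removePt) x <- gsub("[[:punct:]]+", " ", x)  # drop punctuation marks
  x <- trimws(gsub("\\s+", " ", x))                # collapse repeated spaces
  tokens <- strsplit(x, " ", fixed = TRUE)
  n <- lengths(tokens)                             # assumption: length in words
  tokens <- tokens[n >= minlen & n <= maxlen]      # keep sentences within bounds
  if (null.tokens) tokens <- lapply(tokens, function(w) c("null", w))
  tokens
}

sents <- c("This is a short, simple test sentence.", "Too short!")
out <- prep_sentences(sents, minlen = 5, maxlen = 40, null.tokens = TRUE)
print(out)
```

Here the second sentence has only two words and is dropped, while the first keeps its seven words and gains a leading "null" token.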

Value

an RData file whose name combines file_align, nrec and iter as "file_align.nrec.iter.Rdata".
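
As a rough illustration, the documented naming patterns can be reproduced with paste. The separators and the case of the file extension are read off the descriptions on this page and are assumptions; the files actually written by the package may be named slightly differently.

```r
# Assumed naming pattern, taken from the argument descriptions; the exact
# separators and extension case used by the package may differ.
f1 <- "fa"; e1 <- "en"; nrec <- 100; iter <- 3
file_align <- "alignment"

dt_file    <- paste(f1, e1, nrec, iter, "RData", sep = ".")
align_file <- paste(file_align, nrec, iter, "RData", sep = ".")

print(dt_file)     # "fa.en.100.3.RData"
print(align_file)  # "alignment.100.3.RData"

# load() returns the names of the restored objects, so the saved result can
# be inspected without knowing its object name in advance.
if (file.exists(align_file)) {
  restored <- load(align_file)
  print(restored)
}
```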

Details

If dtfile_path = NULL, the following question will be asked:

"Are you sure that you want to run the word_alignIBM1 function (It takes time)? (Yes/ No: if you want to specify word alignment path, please press 'No'.)"

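To give a feel for why the training step takes time, here is a toy IBM Model 1 in base R. It is a minimal sketch of the textbook EM procedure, not the package's implementation: ibm1 and align_sentence are hypothetical names, no "null" token is used, words unseen in training are not handled, and choosing the best source word for each target word is only one possible way to read alignments off the table.

```r
# Toy IBM Model 1 (illustrative only). src and trg are lists of token vectors.
ibm1 <- function(src, trg, iter = 3) {
  f_voc <- unique(unlist(src))
  e_voc <- unique(unlist(trg))
  # tab[e, f] approximates P(e | f); start from a uniform table
  tab <- matrix(1 / length(e_voc), length(e_voc), length(f_voc),
                dimnames = list(e_voc, f_voc))
  for (it in seq_len(iter)) {
    count <- tab * 0
    for (k in seq_along(src)) {
      f <- src[[k]]; e <- trg[[k]]
      for (ew in e) {
        p <- tab[ew, f]            # P(ew | each source word in the sentence)
        z <- sum(p)
        for (j in seq_along(f)) {  # E-step: accumulate expected counts
          count[ew, f[j]] <- count[ew, f[j]] + p[j] / z
        }
      }
    }
    tab <- sweep(count, 2, colSums(count), "/")  # M-step: renormalise per source word
  }
  tab
}

# For each target word, pick the source word of the same sentence pair with
# the highest probability; one source word may be chosen by several targets.
align_sentence <- function(tab, f, e) {
  vapply(e, function(ew) f[which.max(tab[ew, f])], character(1))
}

src <- list(c("das", "haus"), c("das", "buch"), c("ein", "buch"))
trg <- list(c("the", "house"), c("the", "book"), c("a", "book"))
tab <- ibm1(src, trg, iter = 5)
a <- align_sentence(tab, src[[1]], trg[[1]])
print(a)
```

Every iteration visits each source and target word pair of every sentence pair, so the cost grows with both nrec and iter.
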
References

Koehn P. (2010), "Statistical Machine Translation.", Cambridge University Press, New York.

Lopez A. (2008), "Statistical Machine Translation.", ACM Computing Surveys, 40(3).

Brown P. F., et al. (1990), "A Statistical Approach to Machine Translation.", Computational Linguistics, 16(2), 79-85.

Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

http://statmt.org/europarl/v7/bg-en.tgz

Examples

# NOT RUN {
# Extracting bg-en.tgz from the Europarl corpus is time consuming, so the
# unzipped files have been temporarily exported to
# http://www.um.ac.ir/~sarmad/... .
# In this example, the first five sentence pairs of the training set are
# used as the test set.
# }
# NOT RUN {
ats = align_test.set('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                     nrec = 100, nlen = 5, encode.sorc = 'UTF-8')
# }
