Symmetrization: Calculating Symmetric Word Alignment

Description

Computes source-to-target and target-to-source word alignments using IBM Model 1, and combines them into a symmetric word alignment using the intersection, union, or grow-diag method.
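As a rough illustration of what the union and intersection methods mean, the base R sketch below combines two toy directional alignments for a single sentence pair. The link strings ("sourceIndex-targetIndex") and the variable names are invented for this example; they are not output of Symmetrization and do not reflect how the package stores alignments internally.

```r
# Hypothetical directional alignments for one sentence pair,
# written as "sourceIndex-targetIndex" links (toy data).
src2trg <- c("1-1", "2-2", "3-4")
trg2src <- c("1-1", "2-2", "3-3", "3-4")

# Intersection: only links proposed by both directions.
inter <- intersect(src2trg, trg2src)

# Union: links proposed by at least one direction.
uni <- union(src2trg, trg2src)

print(inter)
print(uni)
```

The intersection keeps fewer, more reliable links, while the union keeps more links at the cost of more noise; grow-diag lies between the two.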

Usage

Symmetrization(file_train1, file_train2,
               method = c('union', 'intersection', 'grow-diag'),
               nrec = -1, encode.sorc = 'unknown', encode.trgt = 'unknown',
               iter = 4, minlen = 5, maxlen = 40, removePt = TRUE,
               all = FALSE, f1 = 'fa', e1 = 'en')
               
# S3 method for symmet
print(x, ...)

Arguments

file_train1

the name of source language file in training set.

file_train2

the name of target language file in training set.

method

character string specifying the symmetric word alignment method (union, intersection, or grow-diag alignment).

nrec

the number of sentences to be read. If -1, all sentences are considered.

encode.sorc

encoding to be assumed for the source language. If the value is "latin1" or "UTF-8", it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details, see the scan function.

encode.trgt

encoding to be assumed for the target language. If the value is "latin1" or "UTF-8", it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details, see the scan function.

iter

the number of iterations for IBM Model 1.

minlen

the minimum sentence length.

maxlen

the maximum sentence length.

removePt

logical. If TRUE, it removes all punctuation marks.

all

logical. If TRUE, it considers the third argument (lower = TRUE) of the culf function.

f1

a notation for the source language (default = 'fa').

e1

a notation for the target language (default = 'en').

x

an object of class 'symmet'.

…

further arguments passed to or from other methods.

Value

Symmetrization returns an object of class 'symmet'.

An object of class 'symmet' is a list containing the following components:

time

A number (in seconds, minutes, or hours).

method

symmetric word alignment method (union, intersection, or grow-diag alignment).

alignment

A list of alignments, one for each sentence pair.

a vector of source sentences.

Details

Here, word alignment is not merely a one-directional map from the target language to the source language; it is treated as a symmetric alignment, namely the union, intersection, or grow-diag alignment of the two directional alignments.
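To make the grow-diag idea concrete, here is a minimal base R sketch of one common formulation of the heuristic, following the description in Koehn (2010): start from the intersection, then repeatedly add union points that neighbour an existing point (including diagonally) when their source word or target word is still unaligned. The functions links_to_matrix and grow_diag and the toy links are assumptions made for this illustration; this is not the package's own implementation.

```r
# Build a logical alignment matrix (rows = source words, columns = target words)
# from a two-column matrix of (source, target) index pairs. Hypothetical helper.
links_to_matrix <- function(links, n_src, n_trg) {
  m <- matrix(FALSE, n_src, n_trg)
  m[links] <- TRUE
  m
}

# Sketch of grow-diag on two directional alignment matrices.
grow_diag <- function(s2t, t2s) {
  aligned <- s2t & t2s        # start from the intersection
  candidates <- s2t | t2s     # only points in the union may be added
  # the eight neighbours of a point, diagonals included
  steps <- expand.grid(di = -1:1, dj = -1:1)
  steps <- steps[!(steps$di == 0 & steps$dj == 0), ]
  repeat {
    added <- FALSE
    for (i in seq_len(nrow(aligned))) {
      for (j in seq_len(ncol(aligned))) {
        if (!aligned[i, j]) next
        for (k in seq_len(nrow(steps))) {
          ni <- i + steps$di[k]
          nj <- j + steps$dj[k]
          if (ni < 1 || ni > nrow(aligned) || nj < 1 || nj > ncol(aligned)) next
          if (aligned[ni, nj] || !candidates[ni, nj]) next
          # add the point if its source word or its target word is unaligned
          if (!any(aligned[ni, ]) || !any(aligned[, nj])) {
            aligned[ni, nj] <- TRUE
            added <- TRUE
          }
        }
      }
    }
    if (!added) break         # stop once a full pass adds nothing
  }
  aligned
}

# Toy directional alignments for a 4-word source and 4-word target sentence.
s2t <- links_to_matrix(rbind(c(1, 1), c(2, 2), c(3, 2), c(4, 4)), 4, 4)
t2s <- links_to_matrix(rbind(c(1, 1), c(2, 2), c(2, 3)), 4, 4)

grown <- grow_diag(s2t, t2s)
print(which(grown, arr.ind = TRUE))
```

In this toy case the result contains the two intersection points plus the neighbouring union points (3, 2) and (2, 3), while the isolated union point (4, 4) is left out, so the grown alignment sits between the intersection and the union.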

References

Koehn, P. (2010), "Statistical Machine Translation", Cambridge University Press, New York.

http://statmt.org/europarl/v7/bg-en.tgz

Examples

# NOT RUN {
# Since extracting bg-en.tgz from the Europarl corpus is time consuming,
# the unzipped files have been temporarily exported to
# http://www.um.ac.ir/~sarmad/... .

# }
# NOT RUN {
S1 = Symmetrization('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                    'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                    nrec = 200, encode.sorc = 'UTF-8')

S2 = Symmetrization('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                    'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                    nrec = 200, encode.sorc = 'UTF-8', method = 'grow-diag')
# }
