prepareData: Initial Preparations of Bitext before the Word Alignment and the Evaluation of Word Alignment Quality

Description

For a given sentence-aligned parallel corpus, it prepares sentence pairs as input for the word_alignIBM1 and Evaluation1 functions in this package.

Usage

prepareData(file1, file2, nrec = -1,
            encode.sorc = 'unknown', encode.trgt = 'unknown',
            minlen = 5, maxlen = 40, all = FALSE,
            removePt = TRUE, word_align = TRUE)

Arguments

file1

the name of source language file.

file2

the name of target language file.

nrec

the number of sentences to be read. If -1, all sentences are read.

encode.sorc

encoding to be assumed for the source language. If the value is "latin1" or "UTF-8", it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details, see the scan function.

encode.trgt

encoding to be assumed for the target language. If the value is "latin1" or "UTF-8", it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details, see the scan function.

minlen

a minimum length of sentences.

maxlen

a maximum length of sentences.

all

logical. If TRUE, it sets the third argument of the culf function (lower = TRUE).

removePt

logical. If TRUE, it removes all punctuation marks.

word_align

logical. If FALSE, it divides each sentence into its words. The results can be used in the Symmetrization, cons.agn, align_test.set and Evaluation1 functions.
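The encode.sorc and encode.trgt arguments are described in terms of scan. As a minimal base-R sketch of what that marking means (it only calls scan directly and is not code from this package; the temporary file and its two lines are made up for illustration), scan's encoding argument declares how the strings it reads should be marked, without re-encoding them:

```r
# Illustrative only: base R, not code from this package.
tmp <- tempfile(fileext = ".txt")            # hypothetical temporary file

# Write two made-up lines as raw UTF-8 bytes, one sentence per line.
writeLines(c("caf\u00e9 au lait", "plain ascii"), tmp, useBytes = TRUE)

# Read one sentence per line and declare the input to be UTF-8.
x <- scan(tmp, what = character(), sep = "\n",
          encoding = "UTF-8", quiet = TRUE)

# Strings with non-ASCII characters carry the declared mark;
# pure ASCII strings are never marked and stay "unknown".
enc <- Encoding(x)

unlink(tmp)
```

Leaving the value at 'unknown' makes R interpret the bytes in the native encoding of the session, which is why the examples for non-Latin text pass 'UTF-8' explicitly.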

Value

A list.

if word_align = TRUE

len1

An integer.

A matrix (n*2), where n is the number of sentence pairs remaining after preprocessing.

otherwise,

initial

An integer.

used

An integer.

source.tok

A list of words for each source sentence.

target.tok

A list of words for each target sentence.
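To make the shape of source.tok and target.tok concrete, here is a small sketch that builds comparable lists with base R's strsplit. The sentences are invented, and splitting on whitespace is an assumption about tokenization, not a description of how this package splits words:

```r
# Illustrative only: toy sentences and base R, not the package's tokenizer.
src <- c("das ist ein haus", "guten morgen")    # made-up source sentences
trg <- c("this is a house", "good morning")     # made-up target sentences

# Split each sentence on runs of whitespace.
# The result is a list with one character vector of words per sentence.
source.tok <- strsplit(src, "[[:space:]]+")
target.tok <- strsplit(trg, "[[:space:]]+")

# Element i of each list holds the words of sentence pair i.
n_src <- lengths(source.tok)
n_trg <- lengths(target.tok)
```

Because both lists are indexed by sentence pair, source.tok[[i]] and target.tok[[i]] can be passed together to anything that works on one pair at a time.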

Details

It balances the source and target languages as much as possible. For example, it removes extra blank sentences and equalizes the sentence pairs. Using the culf function, it also converts the first letter of each sentence to lowercase. Finally, it removes sentences that are too short or too long.
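The steps above can be sketched roughly in base R. This is not the package's implementation: prep_sketch and its helpers are hypothetical names, the lowercase step is a stand-in for culf, and measuring sentence length in words is an assumption.

```r
# Illustrative only: hypothetical helper, not the package's implementation.
prep_sketch <- function(src, trg, minlen = 5, maxlen = 40, removePt = TRUE) {
  stopifnot(length(src) == length(trg))

  # Drop pairs where either side is blank.
  keep <- nzchar(trimws(src)) & nzchar(trimws(trg))
  src <- src[keep]
  trg <- trg[keep]

  # Stand-in for culf: lowercase only the first letter of each sentence.
  lower_first <- function(s) paste0(tolower(substr(s, 1, 1)), substring(s, 2))

  clean <- function(s) {
    s <- lower_first(trimws(s))
    if (removePt) s <- gsub("[[:punct:]]+", " ", s)   # strip punctuation marks
    trimws(gsub("[[:space:]]+", " ", s))              # collapse repeated spaces
  }
  src <- clean(src)
  trg <- clean(trg)

  # Assumption: sentence length is counted in words.
  nwords <- function(s) lengths(strsplit(s, " ", fixed = TRUE))
  ok <- nwords(src) >= minlen & nwords(src) <= maxlen &
        nwords(trg) >= minlen & nwords(trg) <= maxlen

  # One row per remaining sentence pair.
  cbind(source = src[ok], target = trg[ok])
}

# Made-up sentence pairs: one usable, one with a blank side, one too short.
src <- c("Das ist ein kleines Haus.", "", "Ja.")
trg <- c("This is a small house.", "Blank on the other side", "Yes.")
out <- prep_sketch(src, trg, minlen = 2, maxlen = 40)
```

In this toy run only the first pair survives: the second is dropped because its source side is blank, and the third because it has fewer than minlen words.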

References

Koehn P. (2010), "Statistical Machine Translation", Cambridge University Press, New York.

Examples

# NOT RUN {
# Since extracting bg-en.tgz from the Europarl corpus is time consuming,
# the unzipped files have been temporarily exported to
# http://www.um.ac.ir/~sarmad/... .
# }
# NOT RUN {
aa1 = prepareData('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                  'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                  nrec = 20, encode.sorc = 'UTF-8')

aa2 = prepareData('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                  'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                  nrec = 20, encode.sorc = 'UTF-8', word_align = FALSE)

aa3 = prepareData('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                  'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                  nrec = 20, encode.sorc = 'UTF-8', removePt = FALSE)
# }
