udpipe_train: Train a UDPipe model

Description

Train a UDPipe model which allows to do Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing or a combination of those.

This function allows you to build models based on data in in CONLL-U format as described at https://universaldependencies.org/format.html. At the time of writing open data in CONLL-U format for more than 50 languages are available at https://universaldependencies.org. Most of these are distributed under the CC-BY-SA licence or the CC-BY-NC-SA license.

This function allows to build annotation tagger models based on these data in CONLL-U format, allowing you to have your own tagger model. This is relevant if you want to tune the tagger to your needs or if you don't want to use ready-made models provided under the CC-BY-NC-SA license as shown at udpipe_load_model

Usage

udpipe_train(
  file = file.path(getwd(), "my_annotator.udpipe"),
  files_conllu_training,
  files_conllu_holdout = character(),
  annotation_tokenizer = "default",
  annotation_tagger = "default",
  annotation_parser = "default"
)

Value

A list with elements

file: The path to the model, which can be used in udpipe_load_model
annotation_tokenizer: The input argument annotation_tokenizer
annotation_tagger: The input argument annotation_tagger
annotation_parser: The input argument annotation_parser
errors: Messages from the UDPipe process indicating possible errors for example when passing the wrong arguments to the annotation_tokenizer, annotation_tagger or annotation_parser

Arguments

file: full path where the model will be saved. The model will be stored as a binary file which udpipe_load_model can handle. Defaults to 'my_annotator.udpipe' in the current working directory.
files_conllu_training: a character vector of files in CONLL-U format used for training the model
files_conllu_holdout: a character vector of files in CONLL-U format used for holdout evalution of the model. This argument is optional.
annotation_tokenizer: a string containing options for the tokenizer. This can be either 'none' or 'default' or a list of options as mentioned in the UDPipe manual. See the vignette vignette("udpipe-train", package = "udpipe") or go directly to https://ufal.mff.cuni.cz/udpipe/1/users-manual#model_training_tokenizer for a full description of the options or see the examples below. Defaults to 'default'. If you specify 'none', the model will not be able to perform tokenization.
annotation_tagger: a string containing options for the pos tagger and lemmatiser. This can be either 'none' or 'default' or a list of options as mentioned in the UDPipe manual. See the vignette vignette("udpipe-train", package = "udpipe") or go directly to https://ufal.mff.cuni.cz/udpipe/1/users-manual#model_training_tagger for a full description of the options or see the examples below. Defaults to 'default'. If you specify 'none', the model will not be able to perform POS tagging or lemmatization.
annotation_parser: a string containing options for the dependency parser. This can be either 'none' or 'default' or a list of options as mentioned in the UDPipe manual. See the vignette vignette("udpipe-train", package = "udpipe") or go directly to https://ufal.mff.cuni.cz/udpipe/1/users-manual#model_training_parser for a full description of the options or see the examples below. Defaults to 'default'. If you specify 'none', the model will not be able to perform dependency parsing.

Details

In order to train a model, you need to provide files which are in CONLL-U format in argument files_conllu_training. This can be a vector of files or just one file. If you do not have your own CONLL-U files, you can download files for your language of choice at https://universaldependencies.org.

At the time of writing open data in CONLL-U format for 50 languages are available at https://universaldependencies.org, namely for: ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin, latvian, lithuanian, norwegian, old_church_slavonic, persian, polish, portuguese, romanian, russian, sanskrit, slovak, slovenian, spanish, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese.

References

https://ufal.mff.cuni.cz/udpipe/1/users-manual

Examples

Run this code

## You need to have a file on disk in CONLL-U format, taking the toy example file put in the package
file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
file_conllu
cat(head(readLines(file_conllu), 3), sep="\n")

if (FALSE) {
##
## This is a toy example showing how to build a model, it is not a good model whatsoever, 
##   because model building takes more than 5 seconds this model is saved also in 
##   the file at system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
##
m <- udpipe_train(file = "toymodel.udpipe", files_conllu_training = file_conllu, 
  annotation_tokenizer = list(dimension = 16, epochs = 1, batch_size = 100, dropout = 0.7), 
  annotation_tagger = list(iterations = 1, models = 1, 
     provide_xpostag = 1, provide_lemma = 0, provide_feats = 0, 
     guesser_suffix_rules = 2, guesser_prefix_min_count = 2), 
  annotation_parser = list(iterations = 2, 
     embedding_upostag = 20, embedding_feats = 20, embedding_xpostag = 0, embedding_form = 50, 
     embedding_lemma = 0, embedding_deprel = 20, learning_rate = 0.01, 
     learning_rate_final = 0.001, l2 = 0.5, hidden_layer = 200, 
     batch_size = 10, transition_system = "projective", transition_oracle = "dynamic", 
     structured_interval = 10))
}

file_model <- system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
ud_toymodel <- udpipe_load_model(file_model)
x <- udpipe_annotate(object = ud_toymodel, x = "Ik ging deze morgen naar de bakker brood halen.")
x <- as.data.frame(x)

##
## The above was a toy example showing how to build a model, if you want real-life scenario's
## look at the training parameter examples given below and train it on your CONLL-U file
##
## Example training arguments used for the models available at udpipe_download_model
data(udpipe_annotation_params)
head(udpipe_annotation_params$tokenizer)
head(udpipe_annotation_params$tagger)
head(udpipe_annotation_params$parser)
if (FALSE) {
## More details in the package vignette:
vignette("udpipe-train", package = "udpipe")
}

Run the code above in your browser using DataLab