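Description

Initializes a jiebaR worker. The available worker types are mix, mp, hmm,
query, tag, simhash, and keywords.
See Details for more information.

Usage

worker(type = "mix", dict = DICTPATH, hmm = HMMPATH, user = USERPATH,
  idf = IDFPATH, stop_word = STOPPATH, write = T, qmax = 20, topn = 5,
  encoding = "UTF-8", detect = T, symbol = F, lines = 1e+05,
  output = NULL, bylines = F)

Arguments

type: The type of jiebaR worker: one of mix, mp, hmm, query, tag, simhash, and keywords.
dict: Path to the main dictionary. The default value is DICTPATH, and the value is used for mix, mp, query, tag, simhash and keywords workers.
hmm: Path to the Hidden Markov Model. The default value is HMMPATH, and the value is used for mix, hmm, query, tag, simhash and keywords workers.
user: Path to the user dictionary. The default value is USERPATH, and the value is used for mix, tag and mp workers.
idf: Path to the inverse document frequency (IDF) file. The default value is IDFPATH, and the value is used for simhash and keywords workers.
stop_word: Path to the stop word list. The default value is STOPPATH, and the value is used for simhash, keywords, tagger and segment workers. Encoding of this file is checked by filecoding.
qmax: The value is used for query workers.
topn: The number of keywords to return. The value is used for simhash and keywords workers.
encoding: The encoding of the input file. If encoding detection is enabled, the value of encoding is ignored.
detect: Whether to detect the encoding of the input file using the filecoding function. If encoding detection is enabled, the value of encoding is ignored.

Value

Public settings can be read and modified using $, such as
WorkerName$symbol = T. Some private settings are fixed
when an engine is initialized, and you can get them by
WorkerName$PrivateVariable.

Details
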
The maximum probability segmentation model uses a trie to construct
a directed acyclic graph and applies dynamic programming to it. It
is the core segmentation algorithm. dict and user
should be provided when initializing a jiebaR worker.
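
The idea of choosing the most probable path through a word graph can be sketched in base R. This is a toy illustration, not jiebaR's implementation: the frequency table `freq`, the function name `max_prob_segment`, and the `unknown_lp` fallback for characters missing from the dictionary are all assumptions made for the example, and a plain name lookup stands in for the trie.

```r
# Toy dictionary: word -> frequency (hypothetical values, not jiebaR's data).
freq <- c(a = 5, b = 5, c = 5, d = 5, ab = 20, cd = 30, abc = 10)
logp <- log(freq / sum(freq))

max_prob_segment <- function(sentence, logp, unknown_lp = min(logp) - 1) {
  chars <- strsplit(sentence, "")[[1]]
  n <- length(chars)
  if (n == 0) return(character(0))
  # best[i] is the best log-probability of segmenting chars[i..n];
  # best[n + 1] = 0 stands for the empty suffix.
  best <- c(rep(-Inf, n), 0)
  nxt <- integer(n)  # nxt[i]: end index of the word chosen at position i
  for (i in n:1) {
    for (j in i:n) {
      word <- paste(chars[i:j], collapse = "")
      # An edge i -> j exists if the word is in the dictionary; a single
      # unknown character is always allowed, with a low fallback score.
      lp <- if (word %in% names(logp)) logp[[word]] else if (j == i) unknown_lp else NA
      if (!is.na(lp) && lp + best[j + 1] > best[i]) {
        best[i] <- lp + best[j + 1]
        nxt[i] <- j
      }
    }
  }
  # Follow the recorded choices from the start to recover the words.
  words <- character(0)
  i <- 1
  while (i <= n) {
    words <- c(words, paste(chars[i:nxt[i]], collapse = ""))
    i <- nxt[i] + 1
  }
  words
}

max_prob_segment("abcd", logp)
```

With these toy frequencies, "ab" followed by "cd" scores higher than any other split of "abcd", so the dynamic program selects that path.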

The Hidden Markov Model segmenter uses an HMM to determine the state set
and the observation set of words. The default HMM is based on the People's Daily
corpus. hmm should be provided when initializing
a jiebaR worker.
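
A minimal sketch of HMM-based segmentation, assuming the common four-state labeling in which each character is the Beginning, Middle, or End of a word, or a Single-character word (B, M, E, S). The probabilities below are invented for the example and are not taken from the model file at HMMPATH; `viterbi_segment` is a hypothetical helper, and it only handles characters that appear as columns of `emit`.

```r
states <- c("B", "M", "E", "S")

# Hypothetical log-probabilities; impossible moves get a tiny probability.
start <- log(c(B = 0.6, M = 1e-10, E = 1e-10, S = 0.4))
trans <- matrix(log(1e-10), 4, 4, dimnames = list(states, states))
trans["B", c("M", "E")] <- log(c(0.3, 0.7))
trans["M", c("M", "E")] <- log(c(0.3, 0.7))
trans["E", c("B", "S")] <- log(c(0.5, 0.5))
trans["S", c("B", "S")] <- log(c(0.5, 0.5))
emit <- log(matrix(c(0.7, 0.1, 0.2,
                     0.2, 0.6, 0.2,
                     0.1, 0.7, 0.2,
                     0.1, 0.1, 0.8),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(states, c("a", "b", "c"))))

viterbi_segment <- function(sentence, start, trans, emit) {
  chars <- strsplit(sentence, "")[[1]]
  n <- length(chars)
  if (n == 0) return(character(0))
  states <- names(start)
  score <- matrix(-Inf, length(states), n, dimnames = list(states, NULL))
  back <- matrix(NA_character_, length(states), n, dimnames = list(states, NULL))
  score[, 1] <- start + emit[states, chars[1]]
  if (n > 1) for (t in 2:n) {
    for (s in states) {
      # Best previous state for landing in state s at position t.
      cand <- score[, t - 1] + trans[states, s]
      back[s, t] <- states[which.max(cand)]
      score[s, t] <- max(cand) + emit[s, chars[t]]
    }
  }
  # A sentence has to finish a word, so the last state is E or S.
  final <- c("E", "S")
  path <- character(n)
  path[n] <- final[which.max(score[final, n])]
  if (n > 1) for (t in n:2) path[t - 1] <- back[path[t], t]
  # Cut the character sequence after every E or S.
  ends <- which(path %in% c("E", "S"))
  starts <- c(1, ends[-length(ends)] + 1)
  vapply(seq_along(ends),
         function(k) paste(chars[starts[k]:ends[k]], collapse = ""),
         character(1))
}

viterbi_segment("abc", start, trans, emit)
```

Here the most probable state sequence is B, E, S, which yields the words "ab" and "c".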

The MixSegment model uses both the maximum probability segmentation model
and the Hidden Markov Model to construct a segmentation. dict,
hmm and user should be provided when initializing
a jiebaR worker.

The QuerySegment model uses MixSegment to construct a segmentation and then
enumerates all the possible long words in the dictionary. dict,
hmm and qmax should be provided when initializing
a jiebaR worker.
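
The enumeration step can be pictured with a small base R sketch. It shows one plausible reading of the description above (emit the dictionary words found inside an already segmented long word, then the word itself) and is not jiebaR's implementation. The helper `expand_long_words` and its `min_len` threshold are hypothetical, and the sketch does not model how qmax is applied.

```r
# For each segmented word of at least min_len characters, list every shorter
# dictionary word found inside it, followed by the word itself.
expand_long_words <- function(words, dict, min_len = 3) {
  out <- character(0)
  for (w in words) {
    chars <- strsplit(w, "")[[1]]
    n <- length(chars)
    if (n >= min_len) {
      for (i in seq_len(n)) {
        for (j in i:n) {
          # Skip the full word here; it is appended once below.
          if (j - i + 1 < n) {
            sub <- paste(chars[i:j], collapse = "")
            if (sub %in% dict) out <- c(out, sub)
          }
        }
      }
    }
    out <- c(out, w)
  }
  out
}

expand_long_words(c("newyork", "city"), dict = c("new", "york", "city"))
```

A search index built from this output would match a query for "york" as well as one for "newyork", which is the usual motivation for a query-oriented segmenter.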

The part-of-speech tagging worker uses the MixSegment model to segment text and
then tags each word using labels compatible with
ictclas. dict,
hmm and user should be provided when initializing
a jiebaR worker.

The keyword extraction worker uses the MixSegment model to segment text and uses
the TF-IDF algorithm to find the keywords. dict, hmm,
idf, stop_word and topn should be provided when initializing
a jiebaR worker.
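
As a rough illustration of TF-IDF ranking on already segmented words, the base R sketch below scores each word by its count multiplied by an IDF weight and keeps the top `topn`. The IDF values and stop words are invented for the example, `tfidf_keywords` is a hypothetical helper rather than a jiebaR function, and giving words missing from the IDF table the mean IDF is an assumption of this sketch.

```r
tfidf_keywords <- function(words, idf, stop_words = character(0), topn = 5) {
  # Drop stop words before counting.
  words <- words[!(words %in% stop_words)]
  if (length(words) == 0) return(numeric(0))
  tf <- table(words)
  terms <- names(tf)
  # Words missing from the IDF table get the mean IDF (assumption of this sketch).
  idf_val <- ifelse(terms %in% names(idf), idf[terms], mean(idf))
  score <- as.numeric(tf) * idf_val
  names(score) <- terms
  score <- sort(score, decreasing = TRUE)
  score[seq_len(min(topn, length(score)))]
}

words <- c("the", "cat", "sat", "on", "the", "mat", "cat")
idf <- c(cat = 2.0, sat = 1.5, mat = 3.5, on = 0.2)  # hypothetical weights
tfidf_keywords(words, idf, stop_words = c("the", "on"), topn = 2)
```

In this example "cat" occurs twice with weight 2.0 and "mat" once with weight 3.5, so they rank first and second.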

The Simhash worker uses the keyword extraction worker to find the keywords
and then uses the simhash algorithm to compute a simhash value. dict,
hmm, idf and stop_word should be provided when initializing
a jiebaR worker.

Examples

### Note: Chinese characters cannot be displayed on Windows here.
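library(jiebaR)

words = "hello world"
# Create a small input file so the file-path examples below can run.
writeLines(words, "./temp.txt")

# Default worker (type = "mix").
test1 = worker()
test1
test1 <= words           # segment a string
test1 <= "./temp.txt"    # segment a file

# Mix worker that keeps symbols in the output.
engine2 = worker("mix", symbol = T)
engine2 <= "./temp.txt"
engine2
engine2$symbol = T       # public settings can be changed with $
engine2
engine2 <= words

# "dict_path" is a placeholder; replace it with the path to a real dictionary.
engine3 = worker(type = "mix", dict = "dict_path", symbol = T)
engine3 <= "./temp.txt"

# Keyword extraction, keeping only the top keyword.
keys = worker("keywords", topn = 1)
keys <= words

# Part-of-speech tagging.
tagger = worker("tag")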
tagger <= words