This function initializes jiebaR workers. You can initialize different kinds of workers, including mix, mp, hmm, query, full, tag, simhash, and keywords. See Details for more information.
worker(type = "mix", dict = DICTPATH, hmm = HMMPATH, user = USERPATH,
idf = IDFPATH, stop_word = STOPPATH, write = T, qmax = 20, topn = 5,
encoding = "UTF-8", detect = T, symbol = F, lines = 1e+05,
output = NULL, bylines = F, user_weight = "max")
type: The type of jiebaR worker: mix, mp, hmm, full, query, tag, simhash, or keywords.
dict: A path to the main dictionary. The default value is DICTPATH, and the value is used for mix, mp, query, full, tag, simhash and keywords workers.
hmm: A path to the Hidden Markov Model. The default value is HMMPATH, and the value is used for mix, hmm, query, tag, simhash and keywords workers.
user: A path to the user dictionary. The default value is USERPATH, and the value is used for mix, full, tag and mp workers.
idf: A path to the inverse document frequency file. The default value is IDFPATH, and the value is used for simhash and keywords workers.
stop_word: A path to the stop word dictionary. The default value is STOPPATH, and the value is used for simhash, keywords, tagger and segment workers. The encoding of this file is checked by file_coding, and it should be UTF-8. For segment workers, the default STOPPATH is not used, so you should provide another file path.
write: Whether to write the output to a file or return the result in an object. This value is only used when the input is a file path. The default value is TRUE. The value is used for segment and speech tagging workers.
qmax: Max query length of words. The value is used for query workers.
topn: The number of keywords. The value is used for simhash and keywords workers.
encoding: The encoding of the input file. If encoding detection is enabled, the value of encoding is ignored.
detect: Whether to detect the encoding of the input file using the file_coding function. If encoding detection is enabled, the value of encoding is ignored.
symbol: Whether to keep symbols in the sentence.
lines: The maximum number of lines to read at one time when the input is a file. The value is used for segmentation and speech tagging workers.
output: A path to the output file. By default the worker generates a file name from the system time stamp. The value is used for segmentation and speech tagging workers.
bylines: Whether to return the result line by line, following the lines of the input file.
user_weight: The weight of the user dictionary words: "min", "max" or "median".
This function returns an environment containing segmentation settings and a worker. Public settings can be modified using $.
The package uses initialized engines for word segmentation, and you can initialize multiple engines simultaneously. You can also reset the public settings of a model using $, such as WorkerName$symbol = T. Some private settings are fixed when an engine is initialized, and you can get them with WorkerName$PrivateVariable.
Maximum probability segmentation model uses a Trie tree to construct a directed acyclic graph and uses a dynamic programming algorithm. It is the core segmentation algorithm. dict and user should be provided when initializing a jiebaR worker.
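The idea behind maximum probability segmentation can be sketched in base R. This is a minimal illustration on a toy dictionary, not jiebaR's implementation: the names `freq`, `logp` and `mp_segment` and all frequencies are hypothetical, and a plain name lookup stands in for the Trie tree. Dictionary words form the edges of the graph, and dynamic programming picks the path with the highest total log-probability.

```r
# Toy dictionary: word -> frequency (hypothetical values, not jiebaR's data)
freq <- c(a = 5, b = 5, c = 5, d = 5, ab = 20, cd = 30, abc = 10)
logp <- log(freq / sum(freq))

mp_segment <- function(text, logp) {
  chars <- strsplit(text, "")[[1]]
  n <- length(chars)
  if (n == 0) return(character(0))
  # best[i] is the best log-probability of segmenting chars[i..n]
  best <- c(rep(-Inf, n), 0)
  nxt <- integer(n)  # nxt[i] is the end index of the word chosen at i
  for (i in n:1) {
    for (j in i:n) {
      w <- paste(chars[i:j], collapse = "")
      # Graph edges are dictionary words; an unknown single character
      # gets a penalty so that a path always exists
      score <- if (w %in% names(logp)) {
        logp[[w]]
      } else if (j == i) {
        min(logp) - 1
      } else {
        NA
      }
      if (!is.na(score) && score + best[j + 1] > best[i]) {
        best[i] <- score + best[j + 1]
        nxt[i] <- j
      }
    }
  }
  # Follow the chosen edges from left to right
  out <- character(0)
  i <- 1
  while (i <= n) {
    out <- c(out, paste(chars[i:nxt[i]], collapse = ""))
    i <- nxt[i] + 1
  }
  out
}

mp_segment("abcd", logp)
```

With these toy frequencies the split "ab" + "cd" scores higher than "abc" + "d" or any split into single characters.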
Hidden Markov Model uses an HMM to determine the status set and observed set of words. The default HMM is based on the People's Daily language library. hmm should be provided when initializing a jiebaR worker.
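A small base R sketch may help show how an HMM turns characters into words. It assumes the common four-state character tagging scheme (B = begin, M = middle, E = end, S = single) and decodes with the Viterbi algorithm. All probabilities and the names `start`, `trans`, `emit`, `viterbi` and `hmm_cut` are hypothetical, chosen only for illustration; a real model is estimated from a tagged corpus and covers a full character set.

```r
states <- c("B", "M", "E", "S")
# Hypothetical start probabilities: a word cannot start in M or E
start <- log(c(B = 0.6, M = 1e-12, E = 1e-12, S = 0.4))
# Hypothetical transition probabilities (rows = from, columns = to)
trans <- log(matrix(c(
  1e-12, 0.3,   0.7,   1e-12,   # from B
  1e-12, 0.4,   0.6,   1e-12,   # from M
  0.5,   1e-12, 1e-12, 0.5,     # from E
  0.5,   1e-12, 1e-12, 0.5      # from S
), nrow = 4, byrow = TRUE, dimnames = list(states, states)))
# Hypothetical emission probabilities over a three-letter alphabet
emit <- log(matrix(c(
  0.98, 0.01, 0.01,   # B
  1/3,  1/3,  1/3,    # M
  0.01, 0.98, 0.01,   # E
  0.01, 0.01, 0.98    # S
), nrow = 4, byrow = TRUE, dimnames = list(states, c("a", "b", "c"))))

viterbi <- function(chars) {
  n <- length(chars)
  score <- matrix(-Inf, 4, n)
  back <- matrix(NA_integer_, 4, n)
  score[, 1] <- start + emit[, chars[1]]
  for (pos in seq_len(n)[-1]) {
    for (s in 1:4) {
      cand <- score[, pos - 1] + trans[, s]
      back[s, pos] <- which.max(cand)
      score[s, pos] <- max(cand) + emit[s, chars[pos]]
    }
  }
  # The last character must close a word, so it is tagged E or S
  path <- integer(n)
  path[n] <- c(3L, 4L)[which.max(score[c(3, 4), n])]
  for (pos in rev(seq_len(n)[-1])) path[pos - 1] <- back[path[pos], pos]
  states[path]
}

hmm_cut <- function(text) {
  chars <- strsplit(text, "")[[1]]
  tags <- viterbi(chars)
  # A new word starts after every character tagged E or S
  word_id <- cumsum(c(TRUE, head(tags %in% c("E", "S"), -1)))
  unname(vapply(split(chars, word_id), paste, character(1), collapse = ""))
}

hmm_cut("abcab")
```

The sketch only handles the letters in `emit`; characters outside that toy alphabet would need a fallback emission value.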
MixSegment model uses both the Maximum probability segmentation model and the Hidden Markov Model to construct the segmentation. dict, hmm and user should be provided when initializing a jiebaR worker.
QuerySegment model uses MixSegment to construct the segmentation and then enumerates all the possible long words in the dictionary. dict, hmm and qmax should be provided when initializing a jiebaR worker.
FullSegment model enumerates all the possible words in the dictionary.
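As a rough illustration of what "enumerating all possible words" means, the following base R sketch lists every substring of the input that appears in a toy dictionary. The function name `full_segment` and the dictionary are hypothetical, and the ordering and details of the real worker's output may differ.

```r
full_segment <- function(text, dict) {
  chars <- strsplit(text, "")[[1]]
  n <- length(chars)
  out <- character(0)
  for (i in seq_len(n)) {
    for (j in i:n) {
      # Keep every substring chars[i..j] that is a dictionary word
      w <- paste(chars[i:j], collapse = "")
      if (w %in% dict) out <- c(out, w)
    }
  }
  out
}

# Hypothetical dictionary with overlapping entries
dict <- c("data", "base", "database")
full_segment("database", dict)
```

Unlike the maximum probability model, nothing is chosen here: overlapping words such as "data" and "database" are all returned.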
Speech Tagging worker uses the MixSegment model to cut words and tags each word after segmentation using labels compatible with ictclas. dict, hmm and user should be provided when initializing a jiebaR worker.
Keyword Extraction worker uses the MixSegment model to cut words and uses the TF-IDF algorithm to find the keywords. dict, hmm, idf, stop_word and topn should be provided when initializing a jiebaR worker.
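The TF-IDF step can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: `tfidf_keywords` is a hypothetical helper, the IDF values are made up, and giving words that are missing from the IDF table the mean IDF is a choice made for this sketch. The input is assumed to be already segmented.

```r
tfidf_keywords <- function(words, idf, stop_words = character(0), topn = 5) {
  # Drop stop words before counting
  words <- words[!words %in% stop_words]
  tf <- table(words)
  # Assumption of this sketch: unknown words get the mean IDF
  w_idf <- ifelse(names(tf) %in% names(idf), idf[names(tf)], mean(idf))
  score <- as.numeric(tf) * w_idf
  names(score) <- names(tf)
  # Keep the topn highest-scoring words
  head(sort(score, decreasing = TRUE), topn)
}

words <- c("the", "cat", "sat", "on", "the", "mat", "cat")
idf <- c(the = 0.1, cat = 2.0, sat = 1.5, on = 0.2, mat = 2.5)  # made-up values
kw <- tfidf_keywords(words, idf, stop_words = c("the", "on"), topn = 2)
kw
```

Here "cat" occurs twice, so its score (2 x 2.0) beats "mat" (1 x 2.5), and topn = 2 drops "sat".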
Simhash worker uses the keyword extraction worker to find the keywords and uses the simhash algorithm to compute the simhash. dict, hmm, idf and stop_word should be provided when initializing a jiebaR worker.
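The simhash idea can be sketched in base R. Each weighted keyword is hashed to a fixed number of bits, the weights are added or subtracted per bit position, and the sign of each sum gives one bit of the fingerprint. Two fingerprints are then compared by Hamming distance. The 16-bit rolling hash below (`toy_hash_bits`) is a stand-in invented for this example, and `simhash_bits` and `hamming` are hypothetical names; the real worker uses its own hash function and fingerprint size.

```r
# Toy hash: polynomial rolling hash reduced to nbits bits (nbits <= 31)
toy_hash_bits <- function(word, nbits = 16) {
  h <- 0
  for (code in utf8ToInt(word)) h <- (h * 31 + code) %% 2^nbits
  # intToBits returns the bits least significant first
  as.integer(intToBits(as.integer(h))[seq_len(nbits)])
}

# weights: named numeric vector, keyword -> weight (e.g. TF-IDF scores)
simhash_bits <- function(weights, nbits = 16) {
  acc <- numeric(nbits)
  for (w in names(weights)) {
    bits <- toy_hash_bits(w, nbits)
    # Add the weight where the bit is 1, subtract it where the bit is 0
    acc <- acc + weights[[w]] * ifelse(bits == 1, 1, -1)
  }
  as.integer(acc > 0)
}

# Number of differing bit positions between two fingerprints
hamming <- function(a, b) sum(a != b)

fp1 <- simhash_bits(c(hello = 2, world = 1))
fp2 <- simhash_bits(c(hello = 2, world = 1, again = 0.5))
hamming(fp1, fp2)
```

Documents that share their heavily weighted keywords tend to agree on most bits, so a small Hamming distance suggests similar content.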
# NOT RUN {
### Note: Can not display Chinese characters here.
# }
# NOT RUN {
words = "hello world"
engine1 = worker()
segment(words, engine1)
# "./temp.txt" is a file path
segment("./temp.txt", engine1)
engine2 = worker("hmm")
segment("./temp.txt", engine2)
engine2$write = T
segment("./temp.txt", engine2)
engine3 = worker(type = "mix", dict = "dict_path", symbol = T)
segment("./temp.txt", engine3)
# }
# NOT RUN {
### Keyword Extraction
engine = worker("keywords", topn = 1)
keywords(words, engine)
### Speech Tagging
tagger = worker("tag")
tagging(words, tagger)
### Simhash
simhasher = worker("simhash", topn = 1)
simhash(words, simhasher)
distance("hello world", "hello world!", simhasher)
show_dictpath()
# }