jiebaR: A package for Chinese text segmentation

Description

This is a package for Chinese text segmentation, keyword extraction and part-of-speech tagging, built on Rcpp and cppjieba. jiebaR supports four segmentation modes: Maximum Probability, Hidden Markov Model, Query Segment and Mix Segment.

Details

You can supply a custom dictionary to be used alongside the jiebaR default dictionary. jiebaR can also identify new words on its own, but adding your own new words to the dictionary gives higher accuracy.
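As a minimal sketch of the custom-dictionary workflow described above, the following writes a small user dictionary and passes it to a worker through the `user` argument. The file contents, the sample words (given as Unicode escapes so the code stays ASCII), and the entry format (one word per line, optionally followed by a part-of-speech tag) are assumptions made for illustration; check the user dictionary bundled with your installed version before relying on them.

```r
# Hypothetical user dictionary entries: one word per line, optionally
# followed by a part-of-speech tag (assumed format).
entries <- c("\u4e91\u8ba1\u7b97", "\u84dd\u7fd4 nz")

# Write the entries as UTF-8 bytes to a temporary file.
user_dict <- tempfile(fileext = ".utf8")
writeLines(enc2utf8(entries), user_dict, useBytes = TRUE)

# Only touch jiebaR if it is installed; the steps above use base R only.
if (requireNamespace("jiebaR", quietly = TRUE)) {
  # Assumption: `user` is the path to the user dictionary.
  engine <- jiebaR::worker(user = user_dict)
  print(jiebaR::segment("\u4e91\u8ba1\u7b97", engine))
}
```

Words can likely also be added to an existing worker at run time; the file-based route is shown here because the dictionary can then be reused across sessions.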

References

CppJieba: https://github.com/aszxqw/cppjieba

Examples

### Note: Chinese characters cannot be displayed here.
## Not run: 
words = "hello world"
test1 = worker()
test1 <= words            # segment a character string
## End(Not run)

## Not run: 
test1 <= "./temp.txt"     # the input here is a file path
engine2 = worker("hmm")
engine2 <= "./temp.txt"
engine2$write = TRUE
engine2 <= "./temp.txt"
# "dict_path" is a placeholder for the path to a dictionary file
engine3 = worker(type = "mix", dict = "dict_path", symbol = TRUE)
engine3 <= "./temp.txt"
## End(Not run)

## Not run: 
### Keyword extraction
keys = worker("keywords", topn = 1)
keys <= words

### Part-of-speech tagging
tagger = worker("tag")
tagger <= words

### Simhash
simhasher = worker("simhash", topn = 1)
simhasher <= words
distance("hello world", "hello world!", simhasher)

show_dictpath()
## End(Not run)
