jiebaR: A package for Chinese text segmentation

Description

jiebaR is a package for Chinese text segmentation, keyword extraction, and part-of-speech tagging, built on Rcpp and CppJieba.

Details

You can supply a custom dictionary. jiebaR can also identify new words on its own, but adding them to the dictionary generally gives more accurate segmentation.
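
As a minimal sketch of adding a word to the dictionary at run time, the snippet below assumes jiebaR is installed and that `new_user_word()` is available in your version. The sample word and sentence are illustrative only and do not come from the package documentation.

```r
library(jiebaR)

# Default "mix" segmentation engine.
engine <- worker()

# Register a word with the part-of-speech tag "n" (noun) in the engine's
# user dictionary. The word is only an illustration.
new_user_word(engine, "云计算", "n")

# Segment a sentence that contains the registered word.
tokens <- segment("我们正在学习云计算", engine)
print(tokens)
```

For a larger vocabulary, a user dictionary file can instead be passed through the `user` argument of `worker()`.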

References

CppJieba: https://github.com/aszxqw/cppjieba

Examples

# NOT RUN {
### Note: Chinese characters cannot be displayed here.
# }
# NOT RUN {
words = "hello world"
engine1 = worker()
segment(words, engine1)

# "./temp.txt" is a file path; when given a path, segment() segments the file's contents

segment("./temp.txt", engine1)

engine2 = worker("hmm")
segment("./temp.txt", engine2)

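# write controls whether file input is written to an output file (TRUE)
# or returned as an object (FALSE)
engine2$write = TRUE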
segment("./temp.txt", engine2)

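# "dict_path" is a placeholder for the path to a dictionary file;
# symbol = TRUE keeps symbols in the output
engine3 = worker(type = "mix", dict = "dict_path", symbol = TRUE)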
segment("./temp.txt", engine3)
 
# }
# NOT RUN {
### Keyword Extraction
engine = worker("keywords", topn = 1)
keywords(words, engine)

### Part-of-Speech Tagging
tagger = worker("tag")
tagging(words, tagger)

### Simhash
simhasher = worker("simhash", topn = 1)
simhash(words, simhasher)
distance("hello world", "hello world!", simhasher)

show_dictpath()
# }
