Note: a newer version (0.11.1) of this package is available.

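jiebaR: Chinese Text Segmentation

An R version of the "jieba" (结巴) Chinese word segmenter. It supports multiple segmentation modes, along with part-of-speech tagging, keyword extraction, and Simhash text similarity comparison. The project is developed with Rcpp and CppJieba.

Cell dictionaries (细胞词库) can be converted with the cidian package: https://github.com/qinwf/cidian/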

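Features

  • Supports Windows, Linux, and Mac.
  • Uses Rcpp to load several segmentation workers at the same time, each with its own segmentation mode and dictionary.
  • Supports multiple segmentation modes, Chinese name recognition, keyword extraction, part-of-speech tagging, and Simhash text similarity comparison.
  • Supports loading custom user dictionaries, including word frequencies and part-of-speech tags.
  • Supports both Simplified and Traditional Chinese.
  • Detects text encoding automatically.
  • Faster than the original "jieba" segmenter, and 5 to 20 times faster than other R segmentation packages.
  • Easy to install, with no complex setup.
  • Can be called from other languages through Rpy2, jvmr, and similar bridges.
  • Released under the MIT license.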
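The feature list mentions holding several workers at once, each with its own mode and settings. A minimal sketch of that idea, assuming jiebaR is installed and using its default dictionaries (the sample sentence and the variable names are only illustrative):

```r
library(jiebaR)

# Two independent workers alive at the same time.
mix_worker <- worker(type = "mix")                 # default mixed mode
mp_worker  <- worker(type = "mp", symbol = TRUE)   # max-probability mode, keeps punctuation

text <- "这是一个测试,它包含标点。"

# Each call uses only the worker passed to it.
mix_words <- segment(text, mix_worker)
mp_words  <- segment(text, mp_worker)

print(mix_words)
print(mp_words)
```

Because each worker carries its own configuration, changing a setting on one of them should not affect the other.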

Installation

Install from CRAN:

install.packages("jiebaR")
library("jiebaR")

cc = worker()
cc["这是一个测试"] # or segment("这是一个测试", cc)

# [1] "这是" "一个" "测试"

The development version can also be installed from GitHub. Compiling with gcc >= 4.9 is recommended, and Windows requires Rtools.

library(devtools)
install_github("qinwf/jiebaRD")
install_github("qinwf/jiebaR")
library("jiebaR")

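User Guide and Demo

User guide: http://qinwenfeng.com/jiebaR/

Documentation in progress: https://jiebaR.qinwf.com/

Shiny demo: https://qinwf.shinyapps.io/jiebaR-shiny/

Cell dictionary conversion: https://github.com/qinwf/cidian/

Issues

If you run into any problem while using the package, you can: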

jiebaR

This is a package for Chinese text segmentation, keyword extraction, and part-of-speech tagging. jiebaR supports four segmentation modes: Maximum Probability, Hidden Markov Model, Query Segment, and Mix Segment.
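To compare the four modes side by side, one option is to build one worker per mode and segment the same sentence with each. This is a sketch, assuming the `type` values "mp", "hmm", "query", and "mix" of `worker()` correspond to the four modes named above; the sample sentence is arbitrary:

```r
library(jiebaR)

text <- "江州市长江大桥参加了长江大桥的通车仪式"

# Assumed mapping of mode names to worker() type strings.
modes <- c(mp = "mp", hmm = "hmm", query = "query", mix = "mix")

# One worker per mode.
workers <- lapply(modes, function(m) worker(type = m))

# Segment the same sentence with every worker; the result is a named list.
results <- lapply(workers, function(w) segment(text, w))
print(results)
```

The outputs will generally differ between modes, which makes this a quick way to pick the one that suits a given corpus.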

Features

  • Supports Windows, Linux, and Mac.
  • Uses Rcpp to load several segmentation workers at the same time.
  • Supports Chinese text segmentation, keyword extraction, part-of-speech tagging, and Simhash computation.
  • Custom dictionary paths.
  • Supports Simplified Chinese and Traditional Chinese.
  • New word identification.
  • Automatic encoding detection.
  • Fast text segmentation.
  • Easy installation.
  • MIT license.
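Beyond plain segmentation, the features above can be tried with dedicated worker types. The following is a rough sketch, assuming the default dictionaries shipped with the package; the texts and the `topn` values are illustrative choices:

```r
library(jiebaR)

text <- "我爱北京天安门,天安门上太阳升"

# Part-of-speech tagging: the names of the result hold the tags.
tagger <- worker(type = "tag")
tagged <- tagging(text, tagger)

# Keyword extraction: topn limits how many keywords come back.
key_worker <- worker(type = "keywords", topn = 2)
keys <- keywords(text, key_worker)

# Simhash fingerprint of one text, and Hamming distance between two texts.
sim_worker <- worker(type = "simhash", topn = 2)
fingerprint <- simhash(text, sim_worker)
hamming <- distance(text, "我爱北京,北京有天安门", sim_worker)

print(tagged)
print(keys)
print(fingerprint)
print(hamming)
```

A smaller Hamming distance suggests the two texts share more of their top keywords.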

Installation

Install the latest development version from GitHub:

devtools::install_github("qinwf/jiebaR")

Install from CRAN:

install.packages("jiebaR")

Version: 0.9.1

License: MIT + file LICENSE

Maintainer: Qin Wenfeng

Last Published: September 28th, 2016

Monthly Downloads: 163

Functions in jiebaR (0.9.1)

  • apply_list: Apply list input to a worker
  • get_tuple: Get tuples from the segmentation result
  • get_idf: Generate an IDF dictionary
  • file_coding: File encoding detection
  • DICTPATH: The path of the dictionaries
  • distance: Hamming distance of words
  • jiebaR: A package for Chinese text segmentation
  • freq: The frequency of words
  • new_user_word: Add a user word
  • keywords: Keyword extraction
  • print.inv: Print worker settings
  • <=.keywords: Keywords symbol
  • query_threshold: Set query threshold
  • <=.qseg: Quick mode symbol
  • <=.segment: Text segmentation symbol
  • <=.tagger: Tagger symbol
  • <=.simhash: Simhash symbol
  • segment: Chinese text segmentation function
  • simhash: Simhash computation
  • tagging: Part-of-speech tagging
  • get_qsegmodel: Set quick mode model
  • simhash_dist: Compute the Hamming distance of Simhash values
  • vector_tag: Tag a character vector
  • words_locate: Get text location
  • show_dictpath: Show the default path of the dictionaries
  • tobin: Convert a Simhash value to binary
  • worker: Initialize a jiebaR worker
  • filter_segment: Filter the segmentation result
  • edit_dict: Edit the default user dictionary
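Several of the helper functions listed here are meant to be combined with a worker. The sketch below chains `new_user_word`, `segment`, `filter_segment`, and `freq`; the added word, its tag "n", and the filter list are illustrative assumptions rather than recommended values:

```r
library(jiebaR)

cc <- worker()

# Register a custom word (tagged "n") with this worker.
new_user_word(cc, "结巴分词", "n")

words <- segment("结巴分词是一个分词工具,这是一个测试", cc)

# Drop unwanted tokens from the segmentation result.
kept <- filter_segment(words, c("是", "一个"))

# Count how often each remaining word occurs; returns a data frame.
counts <- freq(kept)

print(kept)
print(counts)
```

Filtering before counting keeps common function words from dominating the frequency table.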