Note: a newer version (0.11.1) of this package is available.

jiebaR Chinese Text Segmentation

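jiebaR is the R version of the "Jieba" Chinese text segmentation library. It supports multiple segmentation modes and also provides part-of-speech tagging, keyword extraction, and Simhash-based text similarity comparison. The project is built with Rcpp and CppJieba.

Cell dictionaries (细胞词库) can be converted with the cidian package: https://github.com/qinwf/cidian/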

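Features

  • Supports Windows, Linux, and Mac.
  • Uses Rcpp to load several segmentation engines at the same time, each with its own segmentation mode and dictionaries.
  • Supports multiple segmentation modes, Chinese name recognition, keyword extraction, part-of-speech tagging, and Simhash-based text similarity comparison.
  • Supports loading custom user dictionaries and setting word frequencies and part-of-speech tags.
  • Supports both Simplified Chinese and Traditional Chinese segmentation.
  • Supports automatic encoding detection.
  • Faster than the original "Jieba" segmenter, and 5-20 times faster than other R segmentation packages.
  • Simple to install, with no complex configuration.
  • Can be called from other languages through Rpy2, jvmr, and similar tools.
  • Released under the MIT license.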

Installation

Install from CRAN:

install.packages("jiebaR")
library("jiebaR")

cc = worker()
cc["这是一个测试"] # or segment("这是一个测试", cc)

# [1] "这是" "一个" "测试"

You can also install the development version from GitHub. Compiling with gcc >= 4.6 is recommended, and Windows requires Rtools.

library(devtools)
install_github("qinwf/jiebaRD")
install_github("qinwf/jiebaR")
library("jiebaR")

User Guide and Demo

User guide: http://qinwenfeng.com/jiebaR/

Shiny demo: https://qinwf.shinyapps.io/jiebaR-shiny/

Cell dictionary conversion: https://github.com/qinwf/cidian/

Issues

If you run into any problems while using the package, you can:

jiebaR

jiebaR is an R package for Chinese text segmentation, keyword extraction, and part-of-speech tagging. It supports four segmentation modes: Maximum Probability, Hidden Markov Model, Query Segment, and Mix Segment.
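As a rough illustration of the four modes, the sketch below creates one worker per mode and runs the same sentence through each. It assumes the `type` values "mix", "mp", "hmm", and "query" accepted by `worker()`; the variable names are illustrative only, and the output is not shown because it depends on the installed dictionaries.

```r
library(jiebaR)

# One worker per segmentation mode; each worker holds its own engine.
mix_seg   <- worker(type = "mix")    # Mix Segment
mp_seg    <- worker(type = "mp")     # Maximum Probability
hmm_seg   <- worker(type = "hmm")    # Hidden Markov Model
query_seg <- worker(type = "query")  # Query Segment

text <- "这是一个测试"

# Segment the same sentence with every worker to compare the modes.
results <- lapply(
  list(mix = mix_seg, mp = mp_seg, hmm = hmm_seg, query = query_seg),
  function(w) segment(text, w)
)
print(results)
```

Keeping a separate worker per mode avoids reloading dictionaries each time the mode changes.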

Features

  • Supports Windows, Linux, and Mac.
  • Uses Rcpp to load several segmentation workers at the same time.
  • Supports Chinese text segmentation, keyword extraction, part-of-speech tagging, and Simhash computation.
  • Custom dictionary path.
  • Supports Simplified Chinese and Traditional Chinese.
  • New word identification.
  • Automatic encoding detection.
  • Fast text segmentation.
  • Easy installation.
  • MIT license.
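The features beyond plain segmentation can be sketched as follows, using the `keywords`, `tagging`, `simhash`, and `distance` functions from the package index. This is a minimal sketch rather than a reference usage: the sample sentence and variable names are arbitrary, and the `topn` values are assumptions chosen for illustration.

```r
library(jiebaR)

# Arbitrary sample sentence, used only for illustration.
text <- "江州市长江大桥参加了长江大桥的通车仪式"

# Keyword extraction: topn = 1 asks for the single top-ranked keyword.
key_worker <- worker(type = "keywords", topn = 1)
top_keywords <- keywords(text, key_worker)

# Part-of-speech tagging: a character vector of words whose names are the tags.
tag_worker <- worker(type = "tag")
tagged <- tagging(text, tag_worker)

# Simhash: fingerprint one text, then compare two texts by Hamming distance.
sim_worker <- worker(type = "simhash", topn = 2)
fingerprint <- simhash(text, sim_worker)
hamming <- distance(text, "hello world!", sim_worker)

print(top_keywords)
print(tagged)
print(fingerprint)
print(hamming)
```

Each task uses its own worker type, so the three workers can coexist in one session.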

Installation

Install the latest development version from GitHub:

devtools::install_github("qinwf/jiebaR")

Install from CRAN:

install.packages("jiebaR")

Version: 0.8.1

License: MIT + file LICENSE

Maintainer: Qin Wenfeng

Last published: April 18th, 2016

Monthly downloads: 163

Functions in jiebaR (0.8.1)

  • filecoding: Files encoding detection
  • get_qsegmodel: Set quick mode model
  • query_threshold: Set query threshold
  • new_user_word: Add user word
  • distance: Hamming distance of words
  • simhash: Simhash computation
  • show_dictpath: Show default path of dictionaries
  • print.inv: Print worker settings
  • words_locate: Get text location
  • <=.qseg: Quick mode symbol
  • get_idf: Generate IDF dict
  • <=.tagger: Tagger symbol
  • DICTPATH: The path of dictionary
  • <=.simhash: Simhash symbol
  • <=.keywords: Keywords symbol
  • tobin: Simhash value to binary
  • segment: Chinese text segmentation function
  • freq: The frequency of words
  • edit_dict: Edit default user dictionary
  • worker: Initialize jiebaR worker
  • tagging: Speech Tagging
  • filter_segment: Filter segmentation result. This function helps remove some words in the segmentation result.
  • <=.segment: Text segmentation symbol
  • keywords: Keyword extraction
  • get_tuple: Get tuple from the segmentation result
  • jiebaR: A package for Chinese text segmentation
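
A few of the helper functions listed above can be combined as in the sketch below. It assumes the argument order `new_user_word(worker, words, tags)` and `filter_segment(input, filter_words)`; the user word "测试词" is a made-up example, and no output is shown because it depends on the installed dictionaries.

```r
library(jiebaR)

seg <- worker()

# Register an extra word on this worker with the part-of-speech tag "n" (noun).
# "测试词" is a hypothetical word chosen for illustration.
new_user_word(seg, "测试词", "n")

words <- segment("这是一个测试", seg)

# Drop unwanted tokens from the segmentation result.
kept <- filter_segment(words, c("一个"))
print(kept)

# Print the default dictionary directory that ships with the package.
show_dictpath()
```

Filtering after segmentation keeps the worker's dictionaries unchanged, which is convenient when the list of unwanted words varies between analyses.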