CTDM: Term Document Matrix

Description

Constructs a term-document matrix from Chinese text documents.

Usage

CTDM(doc, weighting, EngTermDeleted = TRUE, NumTermDeleted = TRUE, shortTermDeleted = TRUE)

Arguments

doc

The Chinese text document. A vector of Chinese strings.

weighting

The weighting scheme applied to the matrix. One of "binary", "count", "tf", or "tfidf". See Details.

EngTermDeleted

Logical. If TRUE, remove English terms from the text documents.

NumTermDeleted

Logical. If TRUE, remove numbers from the text documents.

shortTermDeleted

Logical. If TRUE, delete short words, as determined by their character count (nchar).

Details

This function runs Chinese word segmentation with jiebaR and builds a term-document matrix. Four weighting schemes are available. "binary" sets an entry to 1 if the term occurs in the document and 0 otherwise; "count" is the number of times the term occurs in the document; "tf" is the term frequency; and "tfidf" is the term frequency-inverse document frequency.
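The four weightings can be illustrated on a toy corpus in base R. This is only a sketch and not the CTM implementation: the documents are pre-split into terms by hand (standing in for jiebaR segmentation), and the exact formulas for "tf" (counts divided by document length) and "tfidf" (tf multiplied by log2 of the inverse document frequency) are assumptions, since the normalization CTDM uses is not stated here. All variable names are hypothetical.

```r
# Toy corpus, already split into terms. In CTDM this step is done by jiebaR;
# the manual split here is an assumption for illustration only.
docs <- list(
  d1 = c("hello", "taiwan"),
  d2 = c("world", "of", "tank"),
  d3 = c("taiwan", "weather", "weather"),
  d4 = c("local", "weather")
)
terms <- sort(unique(unlist(docs)))

# "count": number of times each term occurs in each document
# (rows are terms, columns are documents).
count <- vapply(
  docs,
  function(d) tabulate(match(d, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(count) <- terms

# "binary": 1 if the term occurs in the document, 0 otherwise.
binary <- (count > 0) * 1L

# "tf" (assumed definition): count divided by the document's total term count.
tf <- sweep(count, 2, colSums(count), "/")

# "tfidf" (assumed definition): tf scaled by log2(N / document frequency).
doc_freq <- rowSums(count > 0)
idf <- log2(ncol(count) / doc_freq)
tfidf <- tf * idf  # idf has one value per row, so it recycles down each column

print(round(tfidf, 3))
```

Under these assumed definitions, a term that appears in every document gets a tfidf weight of 0, while a term confined to one document is weighted most heavily.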

Examples


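library(CTM)
# Four short documents (English text is used here for illustration)
a1 <- "hello taiwan"
b1 <- "world of tank"
c1 <- "taiwan weather"
d1 <- "local weather"
# Combine the documents into a one-column character matrix, one row per document
text1 <- t(data.frame(a1, b1, c1, d1))
# Keep English and short terms, since the sample documents consist only of those
tdm1 <- CTDM(doc = text1, weighting = "tfidf", EngTermDeleted = FALSE, shortTermDeleted = FALSE)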
