
CTM (version 0.2)

CTDM: Term Document Matrix

Description

Constructs a term-document matrix from Chinese text documents.

Usage

CTDM(doc, weighting, EngTermDeleted = TRUE, NumTermDeleted = TRUE, shortTermDeleted = TRUE)

Arguments

doc
The Chinese text documents. A vector of Chinese strings.
weighting
The weighting function applied to the matrix. Available options are "binary", "count", "tf", and "tfidf". See Details.
EngTermDeleted
Remove English terms from the text documents.
NumTermDeleted
Remove numbers from the text documents.
shortTermDeleted
Remove short terms, judged by their character count (nchar).

Details

This function runs Chinese word segmentation with jiebaR and builds a term-document matrix. Four weighting functions are available: "binary" sets the value to 1 if the term occurs in a document, "count" gives the number of times the term occurs in a document, "tf" gives the term frequency, and "tfidf" gives the term frequency-inverse document frequency.
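The four weightings can be illustrated with a minimal base R sketch. It does not call CTM or jiebaR: whitespace splitting stands in for word segmentation, and the formulas are assumptions, since this page does not state the ones CTDM uses. Here "tf" is taken to be the count divided by the document length and "tfidf" is tf multiplied by log2(N / df), where N is the number of documents and df is the number of documents containing the term.

```r
# Toy documents; whitespace splitting is a stand-in for jiebaR segmentation.
docs <- c(a1 = "hello taiwan", b1 = "world of tank",
          c1 = "taiwan weather", d1 = "local weather")
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# "count": how many times each term occurs in each document
# (rows are terms, columns are documents).
count <- vapply(tokens,
                function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                integer(length(terms)))
rownames(count) <- terms

# "binary": 1 if the term occurs in the document, 0 otherwise.
binary <- (count > 0) * 1L

# "tf" (assumed formula): count divided by the number of tokens in the document.
tf <- sweep(count, 2, colSums(count), "/")

# "tfidf" (assumed formula): tf scaled by log2(N / df).
df <- rowSums(count > 0)          # number of documents containing each term
idf <- log2(ncol(count) / df)
tfidf <- tf * idf                 # idf is recycled down the rows (one value per term)

print(round(tfidf, 3))
```

Under these assumptions, a term that appears in only one document (such as "hello") receives a higher tfidf weight than a term shared by several documents (such as "taiwan" or "weather"). The values returned by CTDM may differ if the package uses other formulas.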

Examples

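library(CTM)

# Four short toy documents
a1 <- "hello taiwan"
b1 <- "world of tank"
c1 <- "taiwan weather"
d1 <- "local weather"

# Combine the documents into a one-column character matrix, one row per document
text1 <- t(data.frame(a1,b1,c1,d1))

# Build a tf-idf weighted term-document matrix; English and short terms are
# kept so that the English toy documents are not filtered out
tdm1 <- CTDM(doc = text1, weighting = "tfidf", EngTermDeleted = FALSE, shortTermDeleted = FALSE)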
