textTinyR package - RDocumentation

Note: there is a newer version (1.1.8) of this package.

textTinyR

The textTinyR package provides text processing functions for both small and large data files. More details on its functionality can be found in the package Documentation and Vignettes. The package can be installed on Linux, macOS and Windows. One limitation applies: Chinese, Japanese, Korean, Thai and other languages with ambiguous word boundaries are not supported.

UPDATE 01-04-2018: boost-locale is no longer a system requirement for the textTinyR package.

Installation of the textTinyR package (CRAN, GitHub)

To install the package from CRAN use,

install.packages('textTinyR')

and to download the latest version from GitHub use the install_github function of the devtools package,

devtools::install_github(repo = 'mlampros/textTinyR')

Bugs and issues can be reported at https://github.com/mlampros/textTinyR/issues

Monthly Downloads: 1,052
Version: 1.1.2
License: GPL-3
Repository: https://github.com/mlampros/textTinyR
Maintainer: Lampros Mouselimis
Last Published: July 25th, 2018

Functions in textTinyR (1.1.2)

Cosine similarity for text documents
token statistics
tokenize_transform_text: String tokenization and transformation (character string or path to a file)
text_file_parser: text file parser
intersection of words or letters in tokenized text
TEXT_DOC_DISSIM: Dissimilarity calculation of text documents
big_tokenize_transform: String tokenization and transformation for big data sets
cosine_distance: cosine distance of two character strings (each string consists of more than one word)
convert a dense matrix to a sparse matrix
matrix_sparsity: sparsity percentage of a sparse matrix
read_characters: read a specific number of characters from a text file
select_predictors: Exclude highly correlated predictors
RowMeans and colMeans for a sparse matrix
vocabulary_parser: returns the vocabulary counts for small or medium files (XML and other formats)
Conversion of text documents to word-vector-representation features (Doc2Vec)
Jaccard or Dice similarity for text documents
bytes_converter: bytes converter of a text file (KB, MB or GB)
dice similarity of words using n-grams
cluster_frequency: Frequencies of an existing cluster object
read a specific number of rows from a text file
dims_of_word_vecs: dimensions of a word vectors file
save_sparse_binary: save a sparse matrix in binary format
RowSums and colSums for a sparse matrix
sparse_term_matrix: Term matrices and statistics (document-term-matrix, term-document-matrix)
tokenize_transform_vec_docs: String tokenization and transformation (vector of documents)
utf-locale for the available languages
Number of rows of a file
levenshtein_distance: levenshtein distance of two words
load_sparse_binary: load a sparse matrix in binary format
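
As a rough illustration of two of the string measures listed above, the following base R sketch computes a word-level cosine similarity and a Levenshtein distance without calling textTinyR. The helper name cosine_sketch and the single-space tokenization are assumptions made for this example only; the package's own cosine_distance and levenshtein_distance functions may tokenize, normalize or scale their results differently.

```r
# Illustrative base-R sketch; NOT the textTinyR implementation.
# cosine_sketch is a hypothetical helper introduced for this example.
cosine_sketch <- function(s1, s2, sep = " ") {
  # Split each string into lower-cased words on the separator
  w1 <- strsplit(tolower(s1), sep, fixed = TRUE)[[1]]
  w2 <- strsplit(tolower(s2), sep, fixed = TRUE)[[1]]

  # Build term-frequency vectors over the shared vocabulary
  vocab <- union(w1, w2)
  v1 <- as.numeric(table(factor(w1, levels = vocab)))
  v2 <- as.numeric(table(factor(w2, levels = vocab)))

  # Cosine similarity: dot product divided by the product of the norms
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}

sim <- cosine_sketch("the cat sat", "the cat ran")
print(sim)  # 2 shared words out of 3 in each string, so 2/3

# Levenshtein (edit) distance of two words, via utils::adist from base R
lev <- as.vector(utils::adist("kitten", "sitting"))
print(lev)  # 3
```

A cosine distance, if that is what is needed, can be derived from the similarity as 1 - sim.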