
Note: a newer version (1.1.8) of this package is available.

textTinyR

The textTinyR package consists of text pre-processing functions for small or big data files. More details on the functionality of textTinyR can be found in the package vignette. The package can be installed on the following operating systems: Linux, Mac and Windows. However, there are some limitations:

  • there is no support for Chinese, Japanese, Korean, Thai or other languages with ambiguous word boundaries.
  • there are no support functions for utf-locale on Windows, meaning that only English character strings or files can be input and pre-processed.

System Requirements (for Unix operating systems)

Debian/Ubuntu

sudo apt-get update

sudo apt-get install libboost-all-dev

sudo apt-get install libboost-locale-dev

Fedora

yum install boost-devel

Macintosh OSX/brew

UPDATE 25-05-2017: The current CRAN version of the package can only be installed on Linux and Windows. If boost locale is installed properly on your operating system, use the devtools::install_github(repo = 'mlampros/textTinyR', clean = TRUE) function to download the textTinyR package.

On Mac OS X the boost library is installed using the Homebrew package manager.

If the boost library has already been installed with brew install boost, it must first be removed using the following command:

brew uninstall boost

Then the formula for the boost library should be modified using a text editor (TextEdit, TextMate, etc.). On macOS Sierra the formula is saved in:

/usr/local/Homebrew/Library/Taps/homebrew/homebrew-core/Formula/boost.rb

The user should open the boost.rb formula and replace the following code chunk, which begins at approximately line 71:


# layout should be synchronized with boost-python
args = ["--prefix=#{prefix}",
        "--libdir=#{lib}",
        "-d2",
        "-j#{ENV.make_jobs}",
        "--layout=tagged",
        "--user-config=user-config.jam",
        "install"]

if build.with? "single"
  args << "threading=multi,single"
else
  args << "threading=multi"
end

with the following code chunk:


# layout should be synchronized with boost-python
args = ["--prefix=#{prefix}",
        "--libdir=#{lib}",
        "-d2",
        "-j#{ENV.make_jobs}",
        "--layout=system",
        "--user-config=user-config.jam",
        "threading=multi",
        "install"]

#if build.with? "single"
#  args << "threading=multi,single"
#else
#  args << "threading=multi"
#end

Then the user should save the changes, close the file and run

brew update

to apply the changes.

Then the user should open a new terminal (console) and run the following command, which installs the boost library from source using the modified formula (note that build-from-source is preceded by two dashes):

brew install /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core/Formula/boost.rb --build-from-source

That's it.

Installation of the textTinyR package (CRAN, GitHub)

To install the package from CRAN use:


install.packages('textTinyR', clean = TRUE)

and to download the latest version from GitHub use the install_github function of the devtools package:


devtools::install_github(repo = 'mlampros/textTinyR', clean = TRUE)
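
After either command finishes, a quick check along the following lines can confirm that the package is visible to R. This is a minimal sketch rather than part of the package documentation; the variable name installed is purely illustrative.

```r
# Check whether textTinyR can be loaded, without attaching it to the search path.
installed <- requireNamespace("textTinyR", quietly = TRUE)

if (installed) {
  # Report the installed version (this page documents version 1.0.9).
  message("textTinyR version: ",
          as.character(utils::packageVersion("textTinyR")))
} else {
  message("textTinyR is not installed; see the installation commands above.")
}
```

If the check fails on Linux or Mac even though installation appeared to succeed, the boost system requirements listed earlier are a reasonable first thing to re-verify.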

Bugs and issues can be reported at: https://github.com/mlampros/textTinyR/issues

Version: 1.0.9

License: GPL-3

Maintainer: Lampros Mouselimis

Last Published: January 16th, 2018

Monthly Downloads: 359

Functions in textTinyR (1.0.9)

cosine_distance
cosine distance of two character strings (each string consists of more than one word)

dense_2sparse
convert a dense matrix to a sparse matrix

dice_distance
Dice similarity of words using n-grams

levenshtein_distance
Levenshtein distance of two words

load_sparse_binary
load a sparse matrix in binary format

matrix_sparsity
sparsity percentage of a sparse matrix

big_tokenize_transform
string tokenization and transformation for big data sets

bytes_converter
bytes converter of a text file (KB, MB or GB)

read_characters
read a specific number of characters from a text file

read_rows
read a specific number of rows from a text file

text_file_parser
text file parser

save_sparse_binary
save a sparse matrix in binary format

utf_locale
utf-locale for the available languages

sparse_Sums
rowSums and colSums for a sparse matrix

sparse_Means
rowMeans and colMeans for a sparse matrix

sparse_term_matrix
term matrices and statistics (document-term matrix, term-document matrix)

token_stats
token statistics

vocabulary_parser
returns the vocabulary counts for small or medium files (XML and other formats)

tokenize_transform_text
string tokenization and transformation (character string or path to a file)

tokenize_transform_vec_docs
string tokenization and transformation (vector of documents)
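
To make the three string metrics named in the index above more concrete, the following base-R sketch computes a Levenshtein distance, a Dice coefficient on character n-grams, and a cosine similarity on word counts. It deliberately does not call textTinyR, and it should not be read as the package's implementation or API: the helpers char_ngrams, dice_similarity and cosine_similarity are hypothetical names introduced only for this illustration, and the package's own functions may tokenize, normalize or scale their results differently (for example, by returning a distance rather than a similarity).

```r
# Levenshtein distance of two words, using base R's utils::adist().
lev <- drop(utils::adist("kitten", "sitting"))

# Character n-grams of a single word (hypothetical helper, not from textTinyR).
char_ngrams <- function(word, n = 2) {
  chars <- strsplit(word, "")[[1]]
  if (length(chars) < n) return(character(0))
  vapply(seq_len(length(chars) - n + 1),
         function(i) paste(chars[i:(i + n - 1)], collapse = ""),
         character(1))
}

# Dice coefficient on sets of unique n-grams: 2 * |A and B| / (|A| + |B|).
# A corresponding distance would be one minus this value.
dice_similarity <- function(word1, word2, n = 2) {
  a <- unique(char_ngrams(word1, n))
  b <- unique(char_ngrams(word2, n))
  if (length(a) + length(b) == 0) return(NA_real_)
  2 * length(intersect(a, b)) / (length(a) + length(b))
}

# Cosine similarity of the word-count vectors of two multi-word strings.
cosine_similarity <- function(s1, s2, sep = " ") {
  tok1 <- strsplit(s1, sep, fixed = TRUE)[[1]]
  tok2 <- strsplit(s2, sep, fixed = TRUE)[[1]]
  vocab <- union(tok1, tok2)
  # Count each vocabulary term in each string (zero when absent).
  v1 <- as.numeric(table(factor(tok1, levels = vocab)))
  v2 <- as.numeric(table(factor(tok2, levels = vocab)))
  sum(v1 * v2) / sqrt(sum(v1^2) * sum(v2^2))
}

dice <- dice_similarity("night", "nacht")              # shared bigram: "ht"
cosv <- cosine_similarity("the cat sat", "the cat ran") # two of three words shared
```

Here lev evaluates to 3, dice to 0.25 and cosv to 2/3. The sketch splits on a single space and compares raw counts, which is the simplest choice for illustration; real text usually needs lower-casing, punctuation removal and similar pre-processing first, which is the kind of work the tokenization functions in the index are described as handling.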