Jan Wijffels

31 packages on CRAN

BTM

Biterm Topic Models find topics in collections of short texts. It is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns, which are called biterms. This is in contrast to traditional topic models like Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis, which are word-document co-occurrence topic models. A biterm consists of two words co-occurring in the same short text window. This context window can for example be a Twitter message, a short answer on a survey, a sentence of a text or a document identifier. The techniques are explained in detail in the paper 'A Biterm Topic Model For Short Text' by Xiaohui Yan, Jiafeng Guo, Yanyan Lan and Xueqi Cheng (2013) <https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf>.
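
A minimal sketch of fitting such a model in R, assuming a tiny tokenised data.frame with a document identifier and a token column (the toy data below is invented for illustration):

library(BTM)
x <- data.frame(
  doc_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  token  = c("short", "text", "topics",
             "biterm", "topic", "model",
             "word", "cooccurrence", "topics"),
  stringsAsFactors = FALSE)
set.seed(123)
model <- BTM(x, k = 2, iter = 100)   # learn 2 topics from the word-word biterms
terms(model, top_n = 3)              # top terms per topic
predict(model, newdata = x)          # topic probabilities per document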

crfsuite

Wraps the 'CRFsuite' library <https://github.com/chokkan/crfsuite>, allowing users to fit a Conditional Random Field model and to apply it on existing data. The focus of the implementation is on Natural Language Processing, where the package allows you to easily build and apply models for named entity recognition, text chunking, part of speech tagging, intent recognition or classification of any category you have in mind. Next to training, the package includes a small web application that allows you to easily construct training data.
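
A hedged toy sketch of training and applying a model; the labels and the single token attribute below are invented for illustration, and real applications would add richer attributes (neighbouring tokens, parts of speech, ...):

library(crfsuite)
train <- data.frame(
  doc_id = c(1, 1, 1, 2, 2, 2),
  token  = c("John", "lives", "here", "Mary", "works", "there"),
  label  = c("B-PER", "O", "O", "B-PER", "O", "O"),
  stringsAsFactors = FALSE)
attribs <- as.matrix(data.frame(token = train$token, stringsAsFactors = FALSE))
model  <- crf(y = train$label, x = attribs, group = train$doc_id, method = "lbfgs")
scores <- predict(model, newdata = attribs, group = train$doc_id)
head(scores)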

cronR

Create, edit, and remove 'cron' jobs on your unix-alike system. The package provides a set of easy-to-use wrappers to 'crontab'. It also provides an RStudio add-in to easily launch and schedule your scripts.
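
A hedged sketch of scheduling an existing script; the script path and the job id are placeholders:

library(cronR)
cmd <- cron_rscript("/home/user/scripts/refresh_data.R")
cron_add(cmd, frequency = "daily", at = "7AM", id = "refresh_data",
         description = "Daily data refresh")
cron_ls()                       # list the scheduled cron jobs
# cron_rm(id = "refresh_data")  # remove the job again when no longer needed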

dlib

Interface for 'Rcpp' users to 'dlib' <http://dlib.net> which is a 'C++' toolkit containing machine learning algorithms and computer vision tools. It is used in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments. This package allows R users to use 'dlib' through 'Rcpp'.

ETLUtils

Provides functions to facilitate the use of the 'ff' package in interaction with big data in 'SQL' databases (e.g. in 'Oracle', 'MySQL', 'PostgreSQL', 'Hive') by allowing easy importing directly into 'ffdf' objects using 'DBI', 'RODBC' and 'RJDBC'. Also contains some basic utility functions to do fast left outer join merging based on 'match', factorisation of data and a basic function for re-coding vectors.
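
A hedged sketch of importing a large SQL table straight into an 'ffdf' object over 'DBI'; the database, query and chunk sizes are placeholders:

library(ETLUtils)
library(RSQLite)
sales <- read.dbi.ffdf(
  query = "SELECT * FROM sales",
  dbConnect.args = list(drv = SQLite(), dbname = "sales.sqlite"),
  first.rows = 10000, next.rows = 50000)
class(sales)   # an ffdf: the data live in files on disk instead of in RAM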

image.binarization

Improve optical character recognition by binarizing images. The package focuses primarily on local adaptive thresholding algorithms. In English, this means that it has the ability to turn a color or gray scale image into a black and white image. This is particularly useful as a preprocessing step for optical character recognition or handwritten text recognition.

image.CannyEdges

An implementation of the Canny Edge Detector for detecting edges in images. The package provides an interface to the algorithm available at <https://github.com/Neseb/canny>.

image.ContourDetector

An implementation of the Unsupervised Smooth Contour Detection algorithm for digital images as described in the paper: "Unsupervised Smooth Contour Detection" by Rafael Grompone von Gioi and Gregory Randall (2016). The algorithm is explained at <doi:10.5201/ipol.2016.175>.

image.CornerDetectionF9

An implementation of the "FAST-9" corner detection algorithm explained in the paper 'FASTER and better: A machine learning approach to corner detection' by Rosten E., Porter R. and Drummond T. (2008), available at <arXiv:0810.2434>. The package allows you to detect corners in digital images.

image.CornerDetectionHarris

An implementation of the Harris Corner Detection as described in the paper "An Analysis and Implementation of the Harris Corner Detector" by Sánchez J. et al. (2018), available at <doi:10.5201/ipol.2018.229>. The package allows you to detect relevant points in images which are characteristic of the digital image.

image.dlib

Facility wrappers around the image processing functionality of 'dlib'. 'Dlib' <http://dlib.net> is a 'C++' toolkit containing machine learning algorithms and computer vision tools. Currently the package allows you to extract feature descriptors of digital images, in particular 'SURF' and 'HOG' descriptors.

image.libfacedetection

An open source library for face detection in images. Provides a pretrained convolutional neural network based on <https://github.com/ShiqiYu/libfacedetection> which can be used to detect faces with a size greater than 10x10 pixels.

image.LineSegmentDetector

An implementation of the Line Segment Detector on digital images described in the paper: "LSD: A Fast Line Segment Detector with a False Detection Control" by Rafael Grompone von Gioi et al (2012). The algorithm is explained at <doi:10.5201/ipol.2012.gjmr-lsd>.

image.Otsu

An implementation of Otsu's image segmentation method as described in the paper "A C++ Implementation of Otsu's Image Segmentation Method". The algorithm is explained at <doi:10.5201/ipol.2016.158>.

Myrrix

Recommendation engine based on 'Myrrix'. 'Myrrix' is a complete, real-time, scalable clustering and recommender system, evolved from 'Apache Mahout'. It uses Alternating Least Squares to build a recommendation engine.

Myrrixjars

External jars required for package 'Myrrix'. 'Myrrix' is a recommendation engine.

nametagger

Wraps the 'nametag' library <https://github.com/ufal/nametag>, allowing users to find and extract entities (names, persons, locations, addresses, ...) in raw text and to build their own entity recognition models. Based on a maximum entropy Markov model which is described in Strakova J., Straka M. and Hajic J. (2013) <http://ufal.mff.cuni.cz/~straka/papers/2013-tsd_ner.pdf>.

RMOA

Connect R with MOA (Massive Online Analysis - <http://moa.cms.waikato.ac.nz>) to build classification and regression models on streaming data or out-of-RAM data. Streaming recommendation models are also made available.

RMOAjars

External jars required for package RMOA. RMOA is a framework to build data stream models on top of MOA (Massive Online Analysis - <http://moa.cms.waikato.ac.nz>). The jar files are contained in this R package; the modelling logic can be found in the RMOA package.

ruimtehol

Wraps the 'StarSpace' library <https://github.com/facebookresearch/StarSpace> allowing users to calculate word, sentence, article, document, webpage, link and entity 'embeddings'. By using the 'embeddings', you can perform text based multi-label classification, find similarities between texts and categories, do collaborative-filtering based recommendation as well as content-based recommendation, find out relations between entities, calculate graph 'embeddings' as well as perform semi-supervised learning and multi-task learning on plain text. The techniques are explained in detail in the paper: 'StarSpace: Embed All The Things!' by Wu et al. (2017), available at <arXiv:1709.03856>.
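
A hedged toy sketch of multi-label text classification with 'embed_tagspace'; the texts, labels and hyperparameters below are invented for illustration:

library(ruimtehol)
x <- c("the food was great and cheap",
       "terrible service and long waits",
       "lovely staff and tasty meals")
y <- list("positive", "negative", c("positive", "service"))
set.seed(321)
model <- embed_tagspace(x = x, y = y, dim = 10, epoch = 5, minCount = 1)
predict(model, "the meal was tasty")        # predicted labels for new text
starspace_embedding(model, "tasty meals")   # embedding of a piece of text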

sentencepiece

Unsupervised text tokenizer allowing you to perform byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language independent tokenizer to split text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. The package also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
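
A hedged sketch of training a byte pair encoding model on a plain-text corpus and tokenising new text; 'corpus.txt' and the vocabulary size are placeholders:

library(sentencepiece)
model <- sentencepiece("corpus.txt", type = "bpe", vocab_size = 5000,
                       model_dir = tempdir())
sentencepiece_encode(model, "tokenization free subword embeddings", type = "subwords")
sentencepiece_encode(model, "tokenization free subword embeddings", type = "ids")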

spark.sas7bdat

Read in 'SAS' data ('.sas7bdat' files) into 'Apache Spark' from R. 'Apache Spark' is an open source cluster computing framework available at <http://spark.apache.org>. This R package uses the 'spark-sas7bdat' 'Spark' package (<https://spark-packages.org/package/saurfang/spark-sas7bdat>) to import and process 'SAS' data in parallel using 'Spark', thereby allowing you to execute 'dplyr' statements in parallel on top of 'SAS' data.
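
A hedged sketch of reading a SAS file into Spark through 'sparklyr'; the file path is a placeholder:

library(sparklyr)
library(spark.sas7bdat)
sc <- spark_connect(master = "local")
x  <- spark_read_sas(sc, path = "/data/transactions.sas7bdat", table = "sas_data")
dplyr::count(x)         # dplyr verbs are executed by Spark, in parallel
spark_disconnect(sc)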

taskscheduleR

Schedule R scripts/processes with the Windows task scheduler. This allows R users to automate R processes at specific time points from R itself.
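
A hedged sketch (Windows only); the task name, script path and start time are placeholders:

library(taskscheduleR)
taskscheduler_create(taskname = "daily_report",
                     rscript  = "C:/scripts/build_report.R",
                     schedule = "DAILY", starttime = "09:10")
taskscheduler_ls()                      # list the scheduled tasks
# taskscheduler_delete("daily_report")  # remove the task again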

text.alignment

Find similarities between texts using the Smith-Waterman algorithm. The algorithm performs local sequence alignment and determines similar regions between two strings. The Smith-Waterman algorithm is explained in the paper: "Identification of common molecular subsequences" by T.F. Smith and M.S. Waterman (1981), available at <doi:10.1016/0022-2836(81)90087-5>. This package implements the same logic for sequences of words and letters instead of molecular sequences.
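
A hedged sketch aligning two noisy spellings of the same phrase, at the word level and at the character level; the example strings are invented:

library(text.alignment)
a <- "Gemeentehuis aan de Grote Markt"
b <- "het gemeentehuys aen de Groote Marct"
smith_waterman(a, b, type = "words")        # local alignment on words
smith_waterman(a, b, type = "characters")   # local alignment on letters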

textplot

Visualise complex relations in texts. This is done by providing functionalities for displaying text co-occurrence networks, text correlation networks, dependency relationships as well as text clustering. Feel free to join the effort of providing interesting text visualisations.

textrank

The 'textrank' algorithm is an extension of the 'Pagerank' algorithm for text. The algorithm allows you to summarize text by calculating how sentences are related to one another. This is done by looking at overlapping terminology used in sentences in order to set up links between them. The resulting sentence network is next plugged into the 'Pagerank' algorithm, which identifies the most important sentences in your text and ranks them. In a similar way 'textrank' can also be used to extract keywords: a word network is constructed by looking at whether words follow one another, the 'Pagerank' algorithm is applied on top of that network to extract relevant words, and relevant words which follow one another are combined to obtain keywords. More information can be found in the paper by Mihalcea, Rada & Tarau, Paul (2004) <http://www.aclweb.org/anthology/W04-3252>.
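
A hedged toy sketch of ranking sentences; the sentences and the terminology table (which in practice would come from a part-of-speech tagger such as 'udpipe') are invented for illustration:

library(textrank)
sentences <- data.frame(
  textrank_id = 1:3,
  sentence = c("Biterm topic models handle short texts",
               "Topic models learn topics from texts",
               "The weather is nice today"),
  stringsAsFactors = FALSE)
terminology <- data.frame(
  textrank_id = c(1, 1, 1, 2, 2, 2, 3, 3),
  lemma = c("biterm", "topic", "text",
            "topic", "model", "text",
            "weather", "nice"),
  stringsAsFactors = FALSE)
ranked <- textrank_sentences(data = sentences, terminology = terminology)
summary(ranked, n = 2)   # the 2 most central sentences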

tokenizers.bpe

Unsupervised text tokenizer focused on computational efficiency. Wraps the 'YouTokenToMe' library <https://github.com/VKCOM/YouTokenToMe> which is an implementation of fast Byte Pair Encoding (BPE) <https://www.aclweb.org/anthology/P16-1162>.
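
A hedged sketch; 'corpus.txt' and the vocabulary size are placeholders:

library(tokenizers.bpe)
model <- bpe("corpus.txt", vocab_size = 5000,
             model_path = file.path(tempdir(), "youtokentome.bpe"))
bpe_encode(model, x = "byte pair encoding splits words into subword units",
           type = "subwords")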

udpipe

This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Next to text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at <http://universaldependencies.org/format.html>. The techniques are explained in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>.
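
A short sketch of the typical annotation flow, assuming an internet connection to download the pretrained English model:

library(udpipe)
dl    <- udpipe_download_model(language = "english")
model <- udpipe_load_model(dl$file_model)
anno  <- udpipe_annotate(model, x = "The package parses raw text into tokens, lemmas and dependencies.")
anno  <- as.data.frame(anno)
head(anno[, c("token", "lemma", "upos", "dep_rel")])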

word2vec

Learn vector representations of words by continuous bag of words and skip-gram implementations of the 'word2vec' algorithm. The techniques are detailed in the paper "Distributed Representations of Words and Phrases and their Compositionality" by Mikolov et al. (2013), available at <arXiv:1310.4546>.
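
A hedged toy sketch; the corpus is invented and far too small for meaningful embeddings, but it shows the API flow:

library(word2vec)
txt <- c("cats purr and chase mice",
         "dogs bark and chase cats",
         "mice hide from cats and dogs")
set.seed(42)
model     <- word2vec(x = txt, type = "skip-gram", dim = 10, iter = 20, min_count = 1)
embedding <- as.matrix(model)                         # one row per vocabulary word
predict(model, "cats", type = "nearest", top_n = 3)   # most similar words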

ffbase

Extends the out-of-memory vectors of 'ff' with statistical functions and other utilities to ease their usage.

imager

Fast image processing for images in up to 4 dimensions (two spatial dimensions, one time/depth dimension, one colour dimension). Provides most traditional image processing tools (filtering, morphology, transformations, etc.) as well as various functions for easily analysing image data using R. The package wraps 'CImg', <http://cimg.eu>, a simple, modern C++ library for image processing.
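
A short sketch using the 'boats' example image that ships with the package:

library(imager)
plot(boats)                      # a colour image: width x height x depth x colour channels
gr    <- grayscale(boats)
blur  <- isoblur(gr, sigma = 3)  # Gaussian-type blur
edges <- imgradient(gr, "xy")    # image gradients along x and y
plot(blur)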