ft_regex_tokenizer

x

An object (usually a <code>spark_tbl</code>) coercible to a Spark DataFrame.

input.col

output.col

pattern

The regular expression pattern to be used.

...

Optional arguments; currently unused.

A regex-based tokenizer that extracts tokens either by using the provided
regex pattern to split the text (the default) or by repeatedly matching the
regex (if gaps is false). Optional parameters also allow filtering out tokens
shorter than a minimum length. It returns an array of strings, which may be
empty.
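
The two modes described above can be illustrated in base R without a Spark connection. This is only a sketch of the split-versus-match idea: <code>regex_tokenize</code> and its <code>gaps</code> and <code>min_token_length</code> arguments are hypothetical names chosen for illustration, not part of this package, and the helper does not attempt to reproduce any other behavior of Spark's RegexTokenizer (such as case conversion).

```r
# Hypothetical helper (not part of sparklyr): mimics the two tokenizing
# modes using base R regular expression functions.
regex_tokenize <- function(text, pattern, gaps = TRUE, min_token_length = 1L) {
  tokens <- if (gaps) {
    # Split mode: the pattern describes the separators between tokens.
    strsplit(text, pattern, perl = TRUE)[[1]]
  } else {
    # Match mode: the pattern describes the tokens themselves.
    regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  }
  # Drop tokens shorter than the minimum length; the result may be empty.
  tokens[nchar(tokens) >= min_token_length]
}

sentence <- "spark, dplyr and ml"

# Split on runs of non-word characters.
split_tokens <- regex_tokenize(sentence, "\\W+")

# Match runs of word characters instead of splitting on separators.
match_tokens <- regex_tokenize(sentence, "\\w+", gaps = FALSE)

# Keep only tokens with at least three characters.
long_tokens <- regex_tokenize(sentence, "\\W+", min_token_length = 3L)

# No word characters to match, so the result is an empty character vector.
no_tokens <- regex_tokenize("!!!", "\\w+", gaps = FALSE)
```

For this input, splitting on <code>\\W+</code> and matching <code>\\w+</code> produce the same tokens; the two modes differ mainly in whether it is easier to describe the separators or the tokens themselves.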

R interface to Apache Spark, a fast and general engine for big data
processing, see <http://spark.apache.org>. This package supports connecting to
local and remote Apache Spark clusters, provides a 'dplyr' compatible back-end,
and provides an interface to Spark's built-in machine learning algorithms.

Javier Luraschi

ft_regex_tokenizer: Feature Transformation -- RegexTokenizer

Description

Usage

Arguments

See Also