ft_chisq_selector

A <code>spark_connection</code>, <code>ml_pipeline</code>, or a <code>tbl_spark</code>.

Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by <code><a rd-options="" href="/link/ft_r_formula?package=sparklyr&version=1.5.0" data-mini-rdoc="sparklyr::ft_r_formula">ft_r_formula</a></code>.

features_col

output_col

Label column name. The column should be a numeric column. Usually this column is output by <code><a rd-options="" href="/link/ft_r_formula?package=sparklyr&version=1.5.0" data-mini-rdoc="sparklyr::ft_r_formula">ft_r_formula</a></code>.

label_col

(Spark 2.1.0+) The selector type of the ChiSqSelector. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".

selector_type

(Spark 2.2.0+) The upper bound of the expected false discovery rate. Only applicable when selector_type = "fdr". Default value is 0.05.

(Spark 2.1.0+) The highest p-value for features to be kept. Only applicable when selector_type = "fpr". Default value is 0.05.

(Spark 2.2.0+) The upper bound of the expected family-wise error rate. Only applicable when selector_type = "fwe". Default value is 0.05.

Number of features that the selector will select, ordered by ascending p-value. If the number of available features is less than <code>num_top_features</code>, all features are selected. Only applicable when selector_type = "numTopFeatures". The default value of <code>num_top_features</code> is 50.

num_top_features
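The ranking idea behind "numTopFeatures" can be illustrated without Spark. The following is a minimal base R sketch, not Spark's implementation: it runs a chi-squared test of independence between each feature and the label, then keeps the features with the smallest p-values. The data frame, the column names `f1` and `f2`, and the value of `k` are made up for illustration.

```r
# Toy data (hypothetical): f1 tracks the label exactly, f2 is unrelated to it.
df <- data.frame(
  f1    = c("a", "a", "a", "a", "b", "b", "b", "b"),
  f2    = c("x", "y", "x", "y", "x", "y", "x", "y"),
  label = c(1, 1, 1, 1, 0, 0, 0, 0)
)

feature_cols <- c("f1", "f2")

# Chi-squared test of independence between each feature and the label.
# Warnings about small expected counts are suppressed for this tiny example.
p_values <- sapply(feature_cols, function(col) {
  suppressWarnings(
    chisq.test(table(df[[col]], df$label), correct = FALSE)$p.value
  )
})

# Rank features by ascending p-value and keep the top k.
k <- 1
ranked <- names(sort(p_values))
selected <- head(ranked, k)
print(selected)
```

If `k` exceeds the number of features, `head()` simply returns all of them, which mirrors the behavior described for <code>num_top_features</code>.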

(Spark 2.1.0+) Percentile of features that the selector will select, expressed as a fraction and ordered by descending statistic value. Only applicable when selector_type = "percentile". Default value is 0.1.

percentile
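Each threshold argument only takes effect with its matching <code>selector_type</code>. The sketch below shows two of the alternatives side by side. It assumes a local Spark installation reachable through `spark_connect(master = "local")`. The table name `iris_tbl` and the output column names are illustrative. The argument name `fpr` comes from the sparklyr function signature rather than from the text above.

```r
library(sparklyr)

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# ft_r_formula() assembles the "features" vector column and a numeric "label".
prepared <- ft_r_formula(iris_tbl, Species ~ .)

# Keep a fraction of the features (here half of them).
by_percentile <- ft_chisq_selector(
  prepared,
  output_col = "selected_pct",
  selector_type = "percentile",
  percentile = 0.5
)

# Keep every feature whose p-value falls below the given threshold.
by_fpr <- ft_chisq_selector(
  prepared,
  output_col = "selected_fpr",
  selector_type = "fpr",
  fpr = 0.01
)

spark_disconnect(sc)
```

Setting `percentile` while `selector_type` is left at "numTopFeatures" would have no effect, since each threshold is documented as applicable only to its own selector type.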

A character string used to uniquely identify the feature transformer.

Optional arguments; currently unused.

Chi-squared feature selection, which selects the categorical features to use for predicting a categorical label.
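As a rough usage sketch, the selector can be applied directly to a <code>tbl_spark</code> after <code>ft_r_formula</code> has produced the features and label columns. This assumes a local Spark installation. The table name `iris_tbl` and the output column name `selected_features` are illustrative choices, not defaults.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

selected <- iris_tbl %>%
  # Builds the "features" vector column and the numeric "label" column.
  ft_r_formula(Species ~ .) %>%
  # Keeps the 2 features with the smallest chi-squared p-values.
  ft_chisq_selector(
    features_col = "features",
    label_col = "label",
    output_col = "selected_features",
    selector_type = "numTopFeatures",
    num_top_features = 2
  )

# Inspect the original and reduced feature vectors.
selected %>%
  select(features, selected_features) %>%
  head(3)

spark_disconnect(sc)
```

Because the input is a <code>tbl_spark</code>, the selector is fitted against the table and the transformed table is returned in a single call.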

R interface to Apache Spark, a fast and general engine for big data
processing, see <http://spark.apache.org>. This package supports connecting to
local and remote Apache Spark clusters, provides a 'dplyr' compatible back-end,
and provides an interface to Spark's built-in machine learning algorithms.

Yitao Li

sparklyr

R Interface to Apache Spark

Javier Luraschi

Kevin Kuo

Kevin Ushey

JJ Allaire

Samuel Macedo

Hossein Falaki

Lu Wang

Andy Zhang

Jozef Hajnala

Maciej Szymkiewicz

Wil Davis

RStudio

The Apache Software Foundation



ft_chisq_selector: Feature Transformation -- ChiSqSelector (Estimator)
