ft_chisq_selector

x: A <code>spark_connection</code>, <code>ml_pipeline</code>, or a <code>tbl_spark</code>.

features_col: Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by <code>ft_r_formula</code>.

output_col: The name of the output column.

label_col: Label column name. The column should be a numeric column. Usually this column is output by <code>ft_r_formula</code>.

selector_type: (Spark 2.1.0+) The selector type of the ChiSqSelector. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".

fdr: (Spark 2.2.0+) The upper bound of the expected false discovery rate. Only applicable when selector_type = "fdr". Default value is 0.05.

fpr: (Spark 2.1.0+) The highest p-value for features to be kept. Only applicable when selector_type = "fpr". Default value is 0.05.

fwe: (Spark 2.2.0+) The upper bound of the expected family-wise error rate. Only applicable when selector_type = "fwe". Default value is 0.05.

num_top_features: Number of features that the selector will select, ordered by ascending p-value. If the total number of features is less than <code>num_top_features</code>, all features are selected. Only applicable when selector_type = "numTopFeatures". Default value is 50.

percentile: (Spark 2.1.0+) Percentile of features that the selector will select, ordered by descending statistic value. Only applicable when selector_type = "percentile". Default value is 0.1.
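The selection rules described above can be illustrated with a small base R sketch. It is not how Spark implements the selector; it only mimics three of the rules ("numTopFeatures", "percentile", "fpr") on a named vector of p-values. The helper name <code>select_features</code>, the simulated data, the choice to rank every rule by ascending p-value, and the rounding down of the percentile count are all assumptions made for illustration. The "fdr" and "fwe" rules are left out.

```r
# Hypothetical helper (not part of sparklyr): pick feature names from a
# named vector of p-values according to a selector rule.
select_features <- function(p_values,
                            selector_type = "numTopFeatures",
                            num_top_features = 50,
                            percentile = 0.1,
                            fpr = 0.05) {
  ranked <- sort(p_values)  # ascending p-value; names are preserved
  switch(selector_type,
    # Keep the k best features; if fewer than k exist, keep all of them.
    numTopFeatures = names(head(ranked, num_top_features)),
    # Keep a fraction of the features (assumption: count is rounded down).
    percentile = names(head(ranked, floor(length(ranked) * percentile))),
    # Keep every feature whose p-value is below the threshold.
    fpr = names(ranked[ranked < fpr]),
    stop("unsupported selector_type: ", selector_type)
  )
}

# Simulated categorical data: one feature determined by the label, two noise features.
set.seed(42)
n <- 200
label <- sample(c("a", "b"), n, replace = TRUE)
features <- data.frame(
  informative = ifelse(label == "a", "x", "y"),
  noise1 = sample(c("u", "v"), n, replace = TRUE),
  noise2 = sample(c("p", "q", "r"), n, replace = TRUE)
)

# Chi-squared test of independence between each feature and the label.
p_values <- vapply(
  features,
  function(col) chisq.test(table(col, label))$p.value,
  numeric(1)
)

top_one <- select_features(p_values, "numTopFeatures", num_top_features = 1)
all_kept <- select_features(p_values, "numTopFeatures", num_top_features = 50)
```

With only three features available, the default <code>num_top_features</code> of 50 keeps all of them, which matches the behaviour described for "numTopFeatures".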

uid: A character string used to uniquely identify the feature transformer.

...: Optional arguments; currently unused.

Chi-squared feature selection, which selects categorical features to use for predicting a categorical label.
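A minimal usage sketch follows. It assumes a local Spark installation reachable with <code>spark_connect(master = "local")</code>, and it uses the built-in <code>iris</code> data and the output column name "selected_features" purely for illustration. The iris measurements are continuous, so the selector treats each distinct value as a category here; real use would normally start from categorical features.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Copy a small example data set to Spark.
iris_tbl <- copy_to(sc, iris, name = "iris", overwrite = TRUE)

selected <- iris_tbl %>%
  # Assemble the "features" vector column and the numeric "label" column.
  ft_r_formula(Species ~ .) %>%
  # Keep the two features with the smallest chi-squared p-values.
  ft_chisq_selector(
    features_col = "features",
    output_col = "selected_features",  # illustrative column name
    label_col = "label",
    selector_type = "numTopFeatures",
    num_top_features = 2
  )
```

When <code>x</code> is a <code>tbl_spark</code>, as here, the estimator is fitted and applied in one step, so <code>selected</code> is a <code>tbl_spark</code> with the extra vector column. Call <code>spark_disconnect(sc)</code> when finished.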

R interface to Apache Spark, a fast and general engine for big data
processing, see <http://spark.apache.org>. This package supports connecting to
local and remote Apache Spark clusters, provides a 'dplyr' compatible back-end,
and provides an interface to Spark's built-in machine learning algorithms.

Yitao Li

sparklyr

R Interface to Apache Spark

Javier Luraschi

Kevin Kuo

Kevin Ushey

JJ Allaire

Samuel Macedo

Hossein Falaki

Lu Wang

Andy Zhang

 RStudio

 The Apache Software Foundation




ft_chisq_selector: Feature Transformation -- ChiSqSelector (Estimator)

Description

Usage

Arguments

Value

Details

See Also