ft_quantile_discretizer
Feature Transformation -- QuantileDiscretizer (Estimator)
ft_quantile_discretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the num_buckets parameter. The number of buckets actually used may be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
Usage
ft_quantile_discretizer(x, input_col = NULL, output_col = NULL,
num_buckets = 2, input_cols = NULL, output_cols = NULL,
num_buckets_array = NULL, handle_invalid = "error",
relative_error = 0.001, uid = random_string("quantile_discretizer_"),
...)
Arguments
- x
A spark_connection, ml_pipeline, or a tbl_spark.
- input_col
The name of the input column.
- output_col
The name of the output column.
- num_buckets
Number of buckets (quantiles, or categories) into which data points are grouped. Must be greater than or equal to 2.
- input_cols
Names of input columns.
- output_cols
Names of output columns.
- num_buckets_array
Array of number of buckets (quantiles, or categories) into which data points are grouped. Each value must be greater than or equal to 2.
- handle_invalid
(Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error"
- relative_error
(Spark 2.0.0+) Relative error (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a description). Must be in the range [0, 1]. Default: 0.001
- uid
A character string used to uniquely identify the feature transformer.
- ...
Optional arguments; currently unused.
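As an illustration of the arguments above, the following is a minimal sketch that bins one numeric column of a tbl_spark. It assumes a local Spark installation reachable through spark_connect(master = "local"); the mtcars dataset, the table name mtcars_qd, and the column names hp and hp_bucket are chosen only for the example.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Copy a small example dataset to Spark (mtcars is used only for illustration).
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_qd", overwrite = TRUE)

# Because x is a tbl_spark, the estimator is fitted and applied in one step.
binned <- ft_quantile_discretizer(
  mtcars_tbl,
  input_col = "hp",
  output_col = "hp_bucket",
  num_buckets = 4
)

# Count the rows that fall into each bucket and bring the result into R.
bucket_counts <- binned %>%
  count(hp_bucket) %>%
  arrange(hp_bucket) %>%
  collect()

print(bucket_counts)

spark_disconnect(sc)
```

Since the bin boundaries are approximate, the counts per bucket are not guaranteed to be exactly equal.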
Details
NaN handling: null and NaN values are ignored in the column during QuantileDiscretizer fitting. Fitting produces a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handle_invalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relative_error parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
Note that the result may be different every time you run it, since the sample strategy behind it is non-deterministic.
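The bucket layout described above can be sketched in plain R without Spark. This is only a conceptual illustration, not Spark's implementation: it uses exact quantiles from quantile() and findInterval() rather than the approximate algorithm, and the helper name assign_buckets is hypothetical.

```r
# Hypothetical helper: an exact-quantile analogue of fitting followed by bucketing.
assign_buckets <- function(values, num_buckets = 4) {
  # Fitting ignores NA and NaN values.
  clean <- values[!is.na(values)]
  probs <- seq(0, 1, length.out = num_buckets + 1)
  # Interior split points; duplicates are dropped, so fewer buckets may result.
  inner <- unique(quantile(clean, probs = probs[-c(1, length(probs))], names = FALSE))
  # Outer bounds are -Inf and +Inf so that every real value falls into a bucket.
  splits <- c(-Inf, inner, Inf)
  # findInterval() is 1-based; subtract 1 to get zero-based bucket indices.
  bucket <- findInterval(values, splits, rightmost.closed = TRUE) - 1
  # Analogue of handle_invalid = "keep": invalid values get one extra bucket.
  bucket[is.na(values)] <- length(splits) - 1
  list(splits = splits, bucket = bucket)
}

result <- assign_buckets(c(1, 2, 3, 4, 5, 6, 7, 8, NaN), num_buckets = 4)
print(result$splits)
print(result$bucket)
```

With this input the eight non-NaN values land in buckets 0 to 3 and the NaN lands in bucket 4, mirroring the "keep" behaviour described in the NaN handling paragraph.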
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, which is then immediately used to transform x, returning a tbl_spark.
Value
The object returned depends on the class of x.
- spark_connection: When x is a spark_connection, the function returns a ml_transformer, a ml_estimator, or one of their subclasses. The object contains a pointer to a Spark Transformer or Estimator object and can be used to compose Pipeline objects.
- ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the transformer or estimator appended to the pipeline.
- tbl_spark: When x is a tbl_spark, a transformer is constructed then immediately applied to the input tbl_spark, returning a tbl_spark.
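The first two return types can be seen in a short sketch. It assumes a local Spark installation; the mtcars data, the table name mtcars_qd_pipeline, and the column names are illustrative only. ml_pipeline(), ml_fit(), and ml_transform() are the sparklyr pipeline helpers used to fit the estimator and apply the resulting transformer.

```r
library(sparklyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_qd_pipeline", overwrite = TRUE)

# x is a spark_connection: an unfitted estimator object is returned.
discretizer <- ft_quantile_discretizer(
  sc,
  input_col = "hp",
  output_col = "hp_bucket",
  num_buckets = 4,
  handle_invalid = "keep"
)

# Fitting the estimator yields a transformer, which is then applied explicitly.
model <- ml_fit(discretizer, mtcars_tbl)
binned <- ml_transform(model, mtcars_tbl)

# x is a ml_pipeline: the estimator is appended as a pipeline stage.
pipeline <- ml_pipeline(sc) %>%
  ft_quantile_discretizer(
    input_col = "hp",
    output_col = "hp_bucket",
    num_buckets = 4
  )
pipeline_model <- ml_fit(pipeline, mtcars_tbl)
binned_from_pipeline <- ml_transform(pipeline_model, mtcars_tbl)

# Record the classes and column names before disconnecting.
discretizer_class <- class(discretizer)
pipeline_class <- class(pipeline)
binned_cols <- colnames(binned)
pipeline_cols <- colnames(binned_from_pipeline)

spark_disconnect(sc)
```

Keeping the estimator separate from the data in this way lets the same fitted model be applied to other tables with the same input column.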
See Also
See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark.
Other feature transformers: ft_binarizer, ft_bucketizer, ft_chisq_selector, ft_count_vectorizer, ft_dct, ft_elementwise_product, ft_feature_hasher, ft_hashing_tf, ft_idf, ft_imputer, ft_index_to_string, ft_interaction, ft_lsh, ft_max_abs_scaler, ft_min_max_scaler, ft_ngram, ft_normalizer, ft_one_hot_encoder_estimator, ft_one_hot_encoder, ft_pca, ft_polynomial_expansion, ft_r_formula, ft_regex_tokenizer, ft_sql_transformer, ft_standard_scaler, ft_stop_words_remover, ft_string_indexer, ft_tokenizer, ft_vector_assembler, ft_vector_indexer, ft_vector_slicer, ft_word2vec