# ft_quantile_discretizer

##### Feature Transformation -- QuantileDiscretizer (Estimator)

`ft_quantile_discretizer` takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the `num_buckets` parameter. The number of buckets actually used may be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
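As a rough illustration of why fewer buckets can result, the base R sketch below derives split points from exact quantiles and drops duplicates. It is not Spark's implementation: `quantile_splits()` is a hypothetical helper, and exact quantiles stand in for Spark's approximate ones.

```r
# Illustrative only: exact quantiles in base R stand in for Spark's
# approximate quantile computation. quantile_splits() is a hypothetical helper.
quantile_splits <- function(x, num_buckets) {
  probs <- seq(0, 1, length.out = num_buckets + 1)
  candidates <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # Keep only the interior candidates; the outer bounds become -Inf and Inf.
  interior <- candidates[-c(1, length(candidates))]
  # Duplicate candidate splits collapse, which reduces the bucket count.
  c(-Inf, unique(interior), Inf)
}

many_values <- c(1, 2, 3, 4, 5, 6, 7, 8)
few_values  <- c(1, 1, 1, 1, 1, 1, 2, 2)

splits_many <- quantile_splits(many_values, num_buckets = 4)
splits_few  <- quantile_splits(few_values, num_buckets = 4)

buckets_many <- length(splits_many) - 1  # all 4 requested buckets
buckets_few  <- length(splits_few) - 1   # fewer than the 4 requested
```

With only two distinct input values, several quantile candidates coincide, so the second call ends up with fewer than four buckets.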

##### Usage

```
ft_quantile_discretizer(x, input_col = NULL, output_col = NULL,
  num_buckets = 2L, input_cols = NULL, output_cols = NULL,
  num_buckets_array = NULL, handle_invalid = "error",
  relative_error = 0.001, dataset = NULL,
  uid = random_string("quantile_discretizer_"), ...)
```

##### Arguments

- x
A `spark_connection`, `ml_pipeline`, or a `tbl_spark`.

- input_col
The name of the input column.

- output_col
The name of the output column.

- num_buckets
Number of buckets (quantiles, or categories) into which data points are grouped. Must be greater than or equal to 2.

- input_cols
Names of input columns.

- output_cols
Names of output columns.

- num_buckets_array
Array of number of buckets (quantiles, or categories) into which data points are grouped. Each value must be greater than or equal to 2.

- handle_invalid
(Spark 2.1.0+) Param for how to handle invalid entries. Options are "skip" (filter out rows with invalid values), "error" (throw an error), or "keep" (keep invalid values in a special additional bucket). Default: "error".

- relative_error
(Spark 2.0.0+) Relative error (see the documentation for `org.apache.spark.sql.DataFrameStatFunctions.approxQuantile` for a description). Must be in the range [0, 1]. Default: 0.001

- dataset
(Optional) A `tbl_spark`. If provided, eagerly fit the (estimator) feature "transformer" against `dataset`. See details.

- uid
A character string used to uniquely identify the feature transformer.

- ...
Optional arguments; currently unused.

##### Details

NaN handling: null and NaN values are ignored in the column during `QuantileDiscretizer` fitting. Fitting produces a `Bucketizer` model for making predictions. During the transformation, `Bucketizer` will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting `handle_invalid`. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
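The three `handle_invalid` modes can be sketched in base R as follows. This is only a model of the behaviour described above, not the `Bucketizer` code itself: `bucketize_with_invalid()` and the split points are hypothetical, and R's `is.na()` is used to catch both `NA` and `NaN`.

```r
# Hypothetical helper that mimics the "error", "skip" and "keep" modes.
bucketize_with_invalid <- function(x, splits,
                                   handle_invalid = c("error", "skip", "keep")) {
  handle_invalid <- match.arg(handle_invalid)
  invalid <- is.na(x)                 # TRUE for both NA and NaN in R
  n_regular <- length(splits) - 1L    # number of regular buckets

  if (handle_invalid == "error" && any(invalid)) {
    stop("invalid (NA/NaN) values found in input")
  }

  # findInterval() is 1-based; subtract 1 to get 0-based bucket labels.
  bucket <- findInterval(x, splits, all.inside = TRUE) - 1L

  if (handle_invalid == "keep") {
    bucket[invalid] <- n_regular      # extra bucket after the regular ones
    return(bucket)
  }
  bucket[!invalid]                    # "skip" drops the invalid entries
}

splits <- c(-Inf, 2.5, 5, 7.5, Inf)   # 4 regular buckets: 0, 1, 2, 3
x <- c(1, 3, NaN, 6, 9)

kept    <- bucketize_with_invalid(x, splits, "keep")  # NaN goes to bucket 4
skipped <- bucketize_with_invalid(x, splits, "skip")  # NaN entry is removed
```

Calling the helper with `"error"` on the same input stops with an error, matching the default behaviour.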

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for `org.apache.spark.sql.DataFrameStatFunctions.approxQuantile` for a detailed description). The precision of the approximation can be controlled with the `relative_error` parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
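To give a feel for what `relative_error` controls, the sketch below computes the range of ranks an approximate quantile may fall in. It assumes the guarantee stated in the Spark `approxQuantile` documentation, namely that for N rows and target probability p the returned value has a rank between floor((p - err) * N) and ceil((p + err) * N). `rank_bounds()` is a hypothetical helper, not part of sparklyr.

```r
# Hypothetical helper: acceptable rank range for an approximate quantile,
# assuming the bound documented for Spark's approxQuantile.
rank_bounds <- function(n, p, relative_error) {
  c(lower = floor((p - relative_error) * n),
    upper = ceiling((p + relative_error) * n))
}

n <- 1e6
bounds_default <- rank_bounds(n, p = 0.5, relative_error = 0.001)
bounds_exact   <- rank_bounds(n, p = 0.5, relative_error = 0)

# Width of the acceptable rank window around the exact median rank.
window_default <- unname(bounds_default["upper"] - bounds_default["lower"])
window_exact   <- unname(bounds_exact["upper"] - bounds_exact["lower"])
```

A smaller `relative_error` narrows the window, at the cost of more work on the Spark side; a value of 0 requests exact quantiles.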

Note that the result may differ between runs, since the underlying sampling strategy is non-deterministic.

When `dataset` is provided for an estimator transformer, the function internally calls `ml_fit()` against `dataset`. Hence, the methods for `spark_connection` and `ml_pipeline` will then return a `ml_transformer` and a `ml_pipeline` with a `ml_transformer` appended, respectively. When `x` is a `tbl_spark`, the estimator will be fit against `dataset` before transforming `x`.

When `dataset` is not specified, the constructor returns a `ml_estimator`, and, in the case where `x` is a `tbl_spark`, the estimator is fit against `x` to obtain a transformer, which is then immediately used to transform `x`.
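The different calling modes might look like the following in practice. This is a minimal sketch that assumes a local Spark installation is available (for example via `spark_install()`); the table name `iris_tbl` and the output column name `petal_bucket` are arbitrary choices for illustration.

```r
library(sparklyr)

# Assumes a local Spark installation is available.
sc <- spark_connect(master = "local")

# Copying iris to Spark renames "Petal.Length" to "Petal_Length".
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# x is a spark_connection and no dataset is given: an estimator is returned.
estimator <- ft_quantile_discretizer(
  sc,
  input_col = "Petal_Length",
  output_col = "petal_bucket",
  num_buckets = 4L
)

# Fit the estimator explicitly, then apply the resulting transformer.
model <- ml_fit(estimator, iris_tbl)
binned_explicit <- ml_transform(model, iris_tbl)

# x is a tbl_spark: the estimator is fit against x and applied in one call.
binned_direct <- ft_quantile_discretizer(
  iris_tbl,
  input_col = "Petal_Length",
  output_col = "petal_bucket",
  num_buckets = 4L
)

spark_disconnect(sc)
```

The explicit two-step form is useful when the fitted model should be reused on other data; the one-call form is convenient for exploratory work.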

##### Value

The object returned depends on the class of `x`.

- `spark_connection`: When `x` is a `spark_connection`, the function returns a `ml_transformer`, a `ml_estimator`, or one of their subclasses. The object contains a pointer to a Spark `Transformer` or `Estimator` object and can be used to compose `Pipeline` objects.

- `ml_pipeline`: When `x` is a `ml_pipeline`, the function returns a `ml_pipeline` with the transformer or estimator appended to the pipeline.

- `tbl_spark`: When `x` is a `tbl_spark`, a transformer is constructed then immediately applied to the input `tbl_spark`, returning a `tbl_spark`.

##### See Also

See http://spark.apache.org/docs/latest/ml-features.html for more information on the set of transformations available for DataFrame columns in Spark.

Other feature transformers: `ft_binarizer`, `ft_bucketizer`, `ft_chisq_selector`, `ft_count_vectorizer`, `ft_dct`, `ft_elementwise_product`, `ft_feature_hasher`, `ft_hashing_tf`, `ft_idf`, `ft_imputer`, `ft_index_to_string`, `ft_interaction`, `ft_lsh`, `ft_max_abs_scaler`, `ft_min_max_scaler`, `ft_ngram`, `ft_normalizer`, `ft_one_hot_encoder`, `ft_pca`, `ft_polynomial_expansion`, `ft_r_formula`, `ft_regex_tokenizer`, `ft_sql_transformer`, `ft_standard_scaler`, `ft_stop_words_remover`, `ft_string_indexer`, `ft_tokenizer`, `ft_vector_assembler`, `ft_vector_indexer`, `ft_vector_slicer`, `ft_word2vec`

*Documentation reproduced from package sparklyr, version 0.8.2, License: Apache License 2.0 | file LICENSE*