spark_pipeline_stage: Create a Pipeline Stage Object

Description

Helper function to create pipeline stage objects with common parameter setters.

Usage

spark_pipeline_stage(sc, class, uid, features_col = NULL,
  label_col = NULL, prediction_col = NULL, probability_col = NULL,
  raw_prediction_col = NULL, k = NULL, max_iter = NULL,
  seed = NULL, input_col = NULL, input_cols = NULL,
  output_col = NULL, output_cols = NULL)
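
As a rough illustration of the signature above, the following sketch wraps the call in a small helper that builds an unfitted KMeans stage. The helper name `make_kmeans_stage`, the column names, and the parameter values are hypothetical and chosen only for this example. The class string assumes Spark's `org.apache.spark.ml.clustering.KMeans`, and the uid is generated with sparklyr's `random_string()`.

```r
# Hypothetical helper (not part of sparklyr): builds an unfitted KMeans stage.
# Assumes `sc` is an open connection, e.g. from sparklyr::spark_connect().
make_kmeans_stage <- function(sc, k = 3L, seed = 42L) {
  sparklyr::spark_pipeline_stage(
    sc,
    class = "org.apache.spark.ml.clustering.KMeans", # assumed Spark class name
    uid = sparklyr::random_string("kmeans_"),        # unique stage identifier
    features_col = "features",                       # illustrative column names
    prediction_col = "prediction",
    k = k,
    max_iter = 20L,
    seed = seed
  )
}
```

Given a live connection `sc`, `make_kmeans_stage(sc)` should return a reference to the underlying Java stage object. Arguments left as `NULL` are presumably skipped, so only the setters relevant to the chosen class need to be supplied.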

Arguments

sc

A `spark_connection` object.

class

Class name for the pipeline stage.

uid

A character string used to uniquely identify the ML estimator.

features_col

Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by `ft_r_formula`.

label_col

Label column name. The column should be a numeric column. Usually this column is output by `ft_r_formula`.

prediction_col

Prediction column name.

probability_col

Column name for predicted class conditional probabilities.

raw_prediction_col

Raw prediction (a.k.a. confidence) column name.

k

The number of clusters to create.

max_iter

The maximum number of iterations to use.

seed

A random seed. Set this value if you need your results to be reproducible across repeated calls.

input_col

The name of the input column.

input_cols

Names of input columns.

output_col

The name of the output column.

thresholds

Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.

output_cols

Names of output columns.
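
The `thresholds` rule described above (predict the class with the largest p/t) can be worked through in plain R. This is only a sketch of the arithmetic with made-up probabilities and thresholds. It does not call Spark, and the variable names are illustrative.

```r
# Illustrative class probabilities for one observation (three classes).
p <- c(0.5, 0.3, 0.2)
# Hypothetical thresholds, one per class, all greater than 0.
thr <- c(0.8, 0.3, 0.4)

# Scale each probability by its class threshold.
scores <- p / thr

# The class with the largest p/t is predicted (R indices are 1-based).
predicted <- which.max(scores)
# Without thresholds, the most probable class would be chosen instead.
unadjusted <- which.max(p)
```

Here the first class has the highest raw probability, but its larger threshold lowers its scaled score, so the second class is selected. Note that Spark class labels are 0-based, whereas the index returned by `which.max()` is 1-based.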