civis_ml: Interface for modeling in the Civis Platform

Description

An interface for training models and scoring data on Civis Platform using a set of Scikit-Learn estimators.

Usage

civis_ml(x, dependent_variable, model_type, primary_key = NULL,
  excluded_columns = NULL, parameters = NULL, fit_params = NULL,
  cross_validation_parameters = NULL, calibration = NULL,
  oos_scores_table = NULL, oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL, cpu_requested = NULL, memory_requested = NULL,
  disk_requested = NULL, notifications = NULL, polling_interval = NULL,
  validation_data = c("train", "skip"), n_jobs = NULL, verbose = FALSE)
civis_ml_fetch_existing(model_id, run_id = NULL)
# S3 method for civis_ml
predict(object, newdata, primary_key = NA,
  output_table = NULL, output_db = NULL, if_output_exists = c("fail",
  "append", "drop", "truncate"), n_jobs = NULL, cpu_requested = NULL,
  memory_requested = NULL, disk_requested = NULL, polling_interval = NULL,
  verbose = FALSE, ...)

Arguments

x, newdata

See the Data Sources section below.

dependent_variable

The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables.

model_type

The name of the CivisML workflow. See the Workflows section below.

primary_key

Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In predict.civis_ml, the default primary_key = NA uses the primary_key of the training task. Use primary_key = NULL to explicitly indicate that the data have no primary_key.

excluded_columns

Optional, a vector of columns which will be considered ineligible to be independent variables.

parameters

Optional, parameters for the final stage estimator in a predefined model, e.g. list(C = 2) for a "sparse_logistic" model.

fit_params

Optional, a mapping from parameter names in the model's fit method to the column names which hold the data, e.g. list(sample_weight = 'survey_weight_column').

cross_validation_parameters

Optional, parameter grid for learner parameters, e.g. list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3)) or "hyperband" for supported models.

calibration

Optional, if not NULL, calibrate output probabilities with the selected method, either "sigmoid" or "isotonic". Valid only with classification models.

oos_scores_table

Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".

oos_scores_db

Optional, the name of the database where the oos_scores_table will be created. If not provided, this will default to database_name.

oos_scores_if_exists

Optional, action to take if oos_scores_table already exists. One of "fail", "append", "drop", or "truncate". The default is "fail".

model_name

Optional, the prefix of the Platform modeling jobs. It will have " Train" or " Predict" added to become the Script title.

cpu_requested

Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.

memory_requested

Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.

disk_requested

Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.

notifications

Optional, model status notifications. See scripts_post_custom for further documentation about email and URL notification.

polling_interval

Optional, the number of seconds to wait between checks for job completion.

validation_data

Optional, source for validation data. There are currently two options: train (the default), which uses training data for validation, and skip, which skips the validation step.

n_jobs

Number of concurrent Platform jobs to use for training and validation, or multi-file / large table prediction.

verbose

Optional, if TRUE, supply debug outputs in Platform logs and make prediction child jobs visible.

model_id

The id of a CivisML model built previously.

run_id

Optional, the id of a CivisML model run. If NULL, defaults to fetching the latest run.

object

A civis_ml object.

output_table

The table in which to put predictions.

output_db

The database containing output_table. If not provided, this will default to the database_name specified when the model was built.

if_output_exists

Action to take if the prediction table already exists. One of "fail", "append", "drop", or "truncate". The default is "fail".

…

Unused
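
The resource arguments above use three different units (CPU shares, MiB, and GB), which are easy to mix up. The following is a minimal sketch of the conversions, assuming a hypothetical job that needs 2 CPUs, 8 GiB of memory, and 16 GB of disk. The variable names and values are illustrative only, df is assumed to be a data frame as in the Examples section, and the civis_ml call is guarded so that it is not run.

```r
# Hypothetical resource needs for a training job (illustrative values).
cpus_needed <- 2    # whole CPUs
memory_gib  <- 8    # GiB of memory
disk_gb     <- 16   # GB of disk

# Convert to the units the arguments expect.
cpu_requested    <- cpus_needed * 1024   # 1024 shares = 1 CPU
memory_requested <- memory_gib * 1024    # MiB (1 GiB = 1024 MiB)
disk_requested   <- disk_gb              # already in GB

if (FALSE) {  # not run: needs the civis package and Civis Platform credentials
  m <- civis::civis_ml(df, dependent_variable = "Species",
                       model_type = "sparse_logistic",
                       cpu_requested = cpu_requested,
                       memory_requested = memory_requested,
                       disk_requested = disk_requested)
}
```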

Value

A civis_ml object, a list containing the following elements:

job

job metadata from scripts_get_custom.

run

run metadata from scripts_get_custom_runs.

outputs

CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.

metrics

Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:

run: list, metadata about the run.
data: list, metadata about the training data.
model: list, the fitted scikit-learn model with CV results.
metrics: list, validation metrics (accuracy, confusion, ROC, AUC, etc).
warnings: list.
data_platform: list, training data location.

model_info

Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:

run: list, metadata about the run.
data: list, metadata about the training data.
model: list, the fitted scikit-learn model.
metrics: empty list.
warnings: list.
data_platform: list, training data location.

CivisML Workflows

You can use the following pre-defined models with civis_ml. All models start by imputing missing values with the mean of non-null values in a column. The "sparse_*" models include a LASSO regression step (using glmnet) to do feature selection before passing data to the final model. In some models, CivisML uses default parameters that differ from those in Scikit-Learn, as indicated in the "Altered Defaults" column. All models also have random_state=42. Note that "multilayer_perceptron_classifier" and "multilayer_perceptron_regressor" can only be used with hyperband.

Specific workflows can also be called directly using the R workflow functions.

Name	R Workflow	Model Type	Algorithm	Altered Defaults
`sparse_logistic`	`civis_ml_sparse_logistic`	classification	LogisticRegression	`C=499999950, tol=1e-08`
`gradient_boosting_classifier`	`civis_ml_gradient_boosting_classifier`	classification	GradientBoostingClassifier	`n_estimators=500, max_depth=2`
`random_forest_classifier`	`civis_ml_random_forest_classifier`	classification	RandomForestClassifier	`n_estimators=500`
`extra_trees_classifier`	`civis_ml_extra_trees_classifier`	classification	ExtraTreesClassifier	`n_estimators=500`
`multilayer_perceptron_classifier`		classification	MLPClassifier
`stacking_classifier`		classification	StackedClassifier
`sparse_linear_regressor`	`civis_ml_sparse_linear_regressor`	regression	LinearRegression
`sparse_ridge_regressor`	`civis_ml_sparse_ridge_regressor`	regression	Ridge
`gradient_boosting_regressor`	`civis_ml_gradient_boosting_regressor`	regression	GradientBoostingRegressor	`n_estimators=500, max_depth=2`
`random_forest_regressor`	`civis_ml_random_forest_regressor`	regression	RandomForestRegressor	`n_estimators=500`
`extra_trees_regressor`	`civis_ml_extra_trees_regressor`	regression	ExtraTreesRegressor	`n_estimators=500`
`multilayer_perceptron_regressor`		regression	MLPRegressor
`stacking_regressor`		regression	StackedRegressor

Model names can be easily accessed using the global variables CIVIS_ML_REGRESSORS and CIVIS_ML_CLASSIFIERS.

Stacking

The "stacking_classifier" model stacks together the "sparse_logistic", "gradient_boosting_classifier", and "random_forest_classifier" models, using altered defaults as listed for each in the "Altered Defaults" column of the table above. The models are combined using a pipeline containing a Normalizer step, followed by a LogisticRegressionCV with penalty='l2' and tol=1e-08. The "stacking_regressor" works similarly, stacking together the "sparse_linear_regressor", "gradient_boosting_regressor", and "random_forest_regressor" models, and combining them using NonNegativeLinearRegression.

Hyperparameter Tuning

You can tune hyperparameters using one of two methods: grid search or hyperband. CivisML will perform grid search if you pass a list of hyperparameters to the cross_validation_parameters parameter, where list elements are hyperparameter names, and the values are vectors of hyperparameter values to grid search over. You can run hyperparameter optimization in parallel by setting the n_jobs parameter to however many jobs you would like to run in parallel. n_jobs defaults to 1 (no parallelization).
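
As a rough illustration of grid search, the sketch below builds the grid given in the cross_validation_parameters argument description and counts how many hyperparameter combinations an exhaustive search over it would cover. The use of expand.grid here only illustrates the size of the grid; it is not a statement about how CivisML enumerates candidates internally. The civis_ml call is guarded so that it is not run, and df and n_jobs = 4 are assumed values.

```r
# Grid taken from the cross_validation_parameters argument description.
cv_grid <- list(
  n_estimators  = c(100, 200, 500),
  learning_rate = c(0.01, 0.1),
  max_depth     = c(2, 3)
)

# An exhaustive grid search covers every combination of the values above.
candidates   <- expand.grid(cv_grid)
n_candidates <- nrow(candidates)   # 3 * 2 * 2 combinations

if (FALSE) {  # not run: needs the civis package and Civis Platform credentials
  m <- civis::civis_ml(df, dependent_variable = "Species",
                       model_type = "gradient_boosting_classifier",
                       cross_validation_parameters = cv_grid,
                       n_jobs = 4)   # assumed: run up to 4 jobs in parallel
}
```

Because the number of candidates grows multiplicatively with each added hyperparameter, counting them first can help in choosing a sensible n_jobs.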

Hyperband is an efficient approach to hyperparameter optimization, and recommended over grid search where possible. CivisML will perform hyperband optimization if you pass the string "hyperband" to cross_validation_parameters. Hyperband is currently only supported for the following models: "gradient_boosting_classifier", "random_forest_classifier", "extra_trees_classifier", "multilayer_perceptron_classifier", "gradient_boosting_regressor", "random_forest_regressor", "extra_trees_regressor", and "multilayer_perceptron_regressor".
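
One way to act on the list of supported models is to pick the tuning method from the model type. The sketch below is an assumption-laden illustration: choose_cv_parameters is a hypothetical helper, not part of the civis package, and the vector of model names is copied from the paragraph above.

```r
# Models listed above as supporting hyperband.
hyperband_models <- c(
  "gradient_boosting_classifier", "random_forest_classifier",
  "extra_trees_classifier", "multilayer_perceptron_classifier",
  "gradient_boosting_regressor", "random_forest_regressor",
  "extra_trees_regressor", "multilayer_perceptron_regressor"
)

# Hypothetical helper: return "hyperband" when the model supports it,
# otherwise fall back to the supplied grid (or NULL for no tuning).
choose_cv_parameters <- function(model_type, grid = NULL) {
  if (model_type %in% hyperband_models) "hyperband" else grid
}

if (FALSE) {  # not run: needs the civis package and Civis Platform credentials
  model_type <- "random_forest_classifier"
  m <- civis::civis_ml(df, dependent_variable = "Species",
                       model_type = model_type,
                       cross_validation_parameters =
                         choose_cv_parameters(model_type))
}
```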

Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV file on the local disk, a data.frame in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

data.frame: civis_ml(x = df, ...)
local csv file: civis_ml(x = "path/to/data.csv", ...)
file in Civis Platform: civis_ml(x = civis_file(1234))
table in Civis Platform: civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))

Out of sample scores

Model outputs will always contain out-of-sample (or out-of-fold) scores, which are accessible through fetch_oos_scores. These may be stored in a Civis table on Redshift using the oos_scores_table, oos_scores_db, and oos_scores_if_exists parameters.

Predictions

A fitted model can be used to make predictions for data residing in any of the sources above, as well as in a civis_file_manifest. As with civis_ml, use the data source helpers as the newdata argument to predict.civis_ml.

A manifest file is a JSON file which specifies the location of many shards of the data to be used for prediction. A manifest file is the output of a Civis export job with force_multifile = TRUE set, e.g. from civis_to_multifile_csv. Large Civis tables (provided using table_name) will automatically be exported to manifest files.

Prediction outputs will always be stored as gzipped CSVs in one or more Civis files. Provide an output_table (and optionally an output_db, if it's different from database_name) to copy these predictions into a table on Redshift.
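
The sketch below illustrates predicting from a manifest and copying the results into a table. It assumes a fitted model m, a hypothetical manifest file id of 1234, and an illustrative table name; the Platform calls are guarded so that they are not run. Only the argument list is built eagerly, with if_output_exists validated against the choices documented above.

```r
# Allowed values for if_output_exists, as documented in the Arguments section.
if_exists_choices <- c("fail", "append", "drop", "truncate")

# Illustrative prediction settings (table name and n_jobs are assumptions).
pred_args <- list(
  output_table     = "schema.scores_table",
  if_output_exists = match.arg("drop", if_exists_choices),
  n_jobs           = 4
)

if (FALSE) {  # not run: needs the civis package and Civis Platform credentials
  manifest <- civis::civis_file_manifest(1234)  # hypothetical manifest file id
  pred_job <- do.call(predict, c(list(m, newdata = manifest), pred_args))
  yhat     <- civis::fetch_predictions(pred_job)
}
```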

Examples

# NOT RUN {
# From a data frame:
m <- civis_ml(df, model_type = "sparse_logistic",
              dependent_variable = "Species")

# From a table:
m <- civis_ml(civis_table("schema.table", "database_name"),
              model_type = "sparse_logistic", dependent_variable = "Species",
              oos_scores_table = "schema.scores_table",
              oos_scores_if_exists = "drop")

# From a local file:
m <- civis_ml("path/to/file.csv", model_type = "sparse_logistic",
              dependent_variable = "Species")

# From a Civis file:
file_id <- write_civis_file("path/to/file.csv", name = "file.csv")
m <- civis_ml(civis_file(file_id), model_type = "sparse_logistic",
              dependent_variable = "Species")

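# Predict on new data from a data frame, a table, or a Civis file:
pred_job <- predict(m, newdata = df)
pred_job <- predict(m, civis_table("schema.table", "database_name"),
                    output_table = "schema.scores_table")
pred_job <- predict(m, civis_file(file_id),
                    output_table = "schema.scores_table")

# Retrieve a previously trained model by its job id and run id:
m <- civis_ml_fetch_existing(model_id = m$job$id, run_id = m$run$id)

# Fetch the logs, out-of-sample scores, and predictions:
logs <- fetch_logs(m)
yhat <- fetch_oos_scores(m)
yhat <- fetch_predictions(pred_job)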
# }
