Background
Automated Machine Learning - In my view, AutoML should consist of functions that make professional model development and operationalization more efficient. Most ML projects include at least one of the following: data wrangling, feature engineering, feature selection, model development, model evaluation, model interpretation, model optimization, and model operationalization. The functions in this package have been tested across a variety of industries and have consistently outperformed "state of the art" deep learning methods. I've watched coworkers spend months tuning and reconfiguring deep learning models only to have them lose, in a matter of a day or two, to the functions here. My recommendation is to first utilize the functions here to establish a solid baseline performance, then go test out all the other methods.
Package Details
Supervised Learning - Currently, I'm utilizing CatBoost, XGBoost, and H2O for all of the automated Machine Learning functions. GPUs can be utilized with CatBoost and XGBoost. Multi-armed bandit grid tuning is available for the CatBoost and XGBoost models; it relies on randomized probability matching, a concept detailed in the R package "bandit".
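To give a feel for the randomized probability matching idea (a toy sketch only, not RemixAutoML's internal code), each candidate hyperparameter set can be treated as an arm whose chance of beating the current best model gets a Beta posterior, and the next candidate to test is chosen by sampling from those posteriors:
# Toy Thompson sampling over three hypothetical grid candidates
set.seed(42)
arms <- data.frame(
Arm = c("grid_1", "grid_2", "grid_3"),
Wins = c(2, 5, 1),   # times the candidate beat the reigning best model
Losses = c(8, 5, 2))
# Sample once from each arm's Beta posterior and test the arm with the largest draw
draws <- rbeta(nrow(arms), shape1 = arms$Wins + 1, shape2 = arms$Losses + 1)
arms$Arm[which.max(draws)]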
Time series forecasting - Automated functions for single series, panel data, vector autoregression, intermittent demand, and cohort panel data. The panel data models utilize the machine learning algorithms from above and the feature engineering functions below. They are extremely feature rich, and the combination of all possible feature settings is huge. The models for individual series are fully optimized versions of those in the R package "forecast". I take the multi-armed bandit grid tuning algorithm used in the supervised learning models and apply it to the SARIMA and NNETAR models from the forecast package. I also measure performance on holdout data (and training data, or a blend of the two).
Feature Engineering - Some of the feature engineering functions, such as AutoLagRollStats() and AutoLagRollStatsScoring(), can only be found in this package. The feature engineering functions fall into several buckets: categorical encoding, target encoding, and distributed lags. You can generate any number of discontiguous lags and rolling statistics (mean, sd, skewness, kurtosis, and every 5th percentile), along with time-between-records and their associated lags and rolling statistics, for transactional-level data. The function runs extremely fast if the only rolling stat you request is the mean (for the other stats I still use data.table::frollapply(), which the data.table authors acknowledge isn't optimized like data.table::frollmean()). Furthermore, you can generate all of these features by any number of categorical variables and their interactions, PLUS you can request those sets of features for different levels of time aggregation, such as transactional, hourly, daily, weekly, monthly, quarterly, and yearly, all in one shot (that is, you do not have to run the function repeatedly to generate the features). Lastly, generating these kinds of time series features on the fly for only a subset of records in a data.table (typically for on-demand model scoring) is not an easy task to do correctly and quickly. I've spent the time to make it run as fast as I could, but I am open to suggestions for making it faster (that goes for any of the functions in RemixAutoML).
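To see that performance note in action (purely illustrative, not package code), compare the optimized data.table::frollmean() against the generic data.table::frollapply() on the same rolling-mean task:
library(data.table)
x <- rnorm(1e5)
# Optimized C implementation used for the rolling mean
system.time(m1 <- frollmean(x, n = 30))
# Generic rolling apply, which the non-mean stats fall back on; it calls FUN window by window
system.time(m2 <- frollapply(x, n = 30, FUN = mean))
all.equal(m1, m2)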
Data Management - Every function here is written with fully-optimized data.table code so they run blazingly fast and are as memory efficient as possible. The current set of machine learning algorithms were chosen for their ability to work with big data and their ability to outperform other models, as demonstrated across a variety of real world use cases. The focus of the package is quality, not quantity.
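As a tiny, package-independent illustration of the data.table idioms behind that claim (made-up columns), features can be added by reference so no copy of the table is made:
library(data.table)
dt <- data.table(Store = sample(1:50, 1e5, TRUE), Sales = runif(1e5, 0, 100))
# Grouped feature added by reference; dt is modified in place
dt[, MeanStoreSales := mean(Sales), by = Store]
# Keyed aggregation is likewise fast and memory-friendly
setkey(dt, Store)
agg <- dt[, .(TotalSales = sum(Sales)), by = Store]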
Documentation - Each exported function in the package has a help file and can be viewed in your RStudio session, e.g. ?RemixAutoML::ModelDataPrep. Many of them come with examples coded up in the help files (at the bottom) that you can run to get a feel for how to set the parameters. There's also a listing of exported functions by category with code examples at the bottom of this readme. You can also jump into the R folder here to dig into the source code.
Installation
1. First, install R package dependencies:
XGBoost runs significantly faster with GPU (it's already pretty fast on CPU), but it can be tricky to get installed. The blog post below has proven reliable for getting it done: Install XGBoost on Windows for R with GPU Capability
# Install Dependencies----
if(!("remotes" %in% rownames(installed.packages()))) install.packages("remotes"); print("remotes")
if(!("arules" %in% rownames(installed.packages()))) install.packages("arules"); print("arules")
if(!("bit64" %in% rownames(installed.packages()))) install.packages("bit64"); print("bit64")
if(!("caTools" %in% rownames(installed.packages()))) install.packages("caTools"); print("caTools")
if(!("combinat" %in% rownames(install.packages()))) install.packages("combinat"); print("combinat")
if(!("data.table" %in% rownames(installed.packages()))) install.packages("data.table"); print("data.table")
if(!("doParallel" %in% rownames(installed.packages()))) install.packages("doParallel"); print("doParallel")
if(!("e1071" %in% rownames(installed.packages()))) install.packages("e1071"); print("e1071")
if(!("fBasics" %in% rownames(installed.packages()))) install.packages("fBasics"); print("fBasics")
if(!("foreach" %in% rownames(installed.packages()))) install.packages("foreach"); print("foreach")
if(!("forecast" %in% rownames(installed.packages()))) install.packages("forecast"); print("forecast")
if(!("fpp" %in% rownames(installed.packages()))) install.packages("fpp"); print("fpp")
if(!("ggplot2" %in% rownames(installed.packages()))) install.packages("ggplot2"); print("ggplot2")
if(!("gridExtra" %in% rownames(installed.packages()))) install.packages("gridExtra"); print("gridExtra")
if(!("here" %in% rownames(installed.packages()))) install.packages("here"); print("here")
if(!("itertools" %in% rownames(installed.packages()))) install.packages("itertools"); print("itertools")
if(!("lime" %in% rownames(installed.packages()))) install.packages("lime"); print("lime")
if(!("lubridate" %in% rownames(installed.packages()))) install.packages("lubridate"); print("lubridate")
if(!("Matrix" %in% rownames(installed.packages()))) install.packages("Matrix"); print("Matrix")
if(!("MLmetrics" %in% rownames(installed.packages()))) install.packages("MLmetrics"); print("MLmetrics")
if(!("monreg" %in% rownames(installed.packages()))) install.packages("monreg"); print("monreg")
if(!("nortest" %in% rownames(installed.packages()))) install.packages("nortest"); print("nortest")
if(!("RColorBrewer" %in% rownames(installed.packages()))) install.packages("RColorBrewer"); print("RColorBrewer")
if(!("recommenderlab" %in% rownames(installed.packages()))) install.packages("recommenderlab"); print("recommenderlab")
if(!("ROCR" %in% rownames(installed.packages()))) install.packages("ROCR"); print("ROCR")
if(!("pROC" %in% rownames(installed.packages()))) install.packages("pROC"); print("pROC")
if(!("Rfast" %in% rownames(installed.packages()))) install.packages("Rfast"); print("Rfast")
if(!("scatterplot3d" %in% rownames(installed.packages()))) install.packages("scatterplot3d"); print("scatterplot3d")
if(!("stringr" %in% rownames(installed.packages()))) install.packages("stringr"); print("stringr")
if(!("sde" %in% rownames(installed.packages()))) install.packages("sde"); print("sde")
if(!("timeDate" %in% rownames(installed.packages()))) install.packages("timeDate"); print("timeDate")
if(!("tsoutliers" %in% rownames(installed.packages()))) install.packages("tsoutliers"); print("tsoutliers")
if(!("wordcloud" %in% rownames(installed.packages()))) install.packages("wordcloud"); print("wordcloud")
if(!("xgboost" %in% rownames(installed.packages()))) install.packages("xgboost"); print("xgboost")
for (pkg in c("RCurl","jsonlite")) if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
install.packages("h2o", type = "source", repos = (c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
remotes::install_github('catboost/catboost', subdir = 'catboost/R-package')
remotes::install_github('AdrianAntico/RemixAutoML', upgrade = FALSE, dependencies = FALSE, force = TRUE)
Installation Troubleshooting
The most common issue users run into when installing RemixAutoML is the installation of the catboost package dependency. Since catboost is not on CRAN it can only be installed through GitHub. To install catboost without error (and consequently install RemixAutoML without error), try running the lines of code below first, then restart your R session, then re-run the 2-step installation process above (Reference). If you're still having trouble, submit an issue and I'll work with you to get it installed.
# Be sure to use the version you want versus what is listed below
options(devtools.install.args = c("--no-multiarch", "--no-test-load"))
install.packages("https://github.com/catboost/catboost/releases/download/v0.17.3/catboost-R-Windows-0.17.3.tgz", repos = NULL, type = "source", INSTALL_opts = c("--no-multiarch", "--no-test-load"))
If you're still having trouble installing, see if the issue below helps out:
Common Workflows
Supervised Learning
An example workflow with function references
- Pull in data from your data warehouse (or from wherever) and clean it up
- Run all the applicable feature engineering functions, such as AutoLagRollStats(), AutoInteraction(), AutoWord2VecModeler(), CreateCalendarVariables(), CreateHolidayVariables(), etc.
- Partition your data with AutoDataPartition() if you want to go with a data split other than 70/20/10, which is automatically applied in the supervised learning functions if you don't supply the ValidationData and TestData (and TrainOnFull is set to FALSE).
- Run AutoCatBoostRegression() or AutoCatBoostClassifier() or AutoCatBoostMultiClass() with GPU if you have access to one
- Run AutoXGBoostRegression() or AutoXGBoostClassifier() or AutoXGBoostMultiClass() with GPU if you have access to one
- Run AutoH2oGBMRegression() or AutoH2oGBMClassifier() or AutoH2oGBMMultiClass() if you have the patience to wait for a CPU build.
- Run AutoH2oGLMRegression() or AutoH2oGLMClassifier() or AutoH2oGLMMultiClass() if you want to give a generalized linear model a shot.
- Run AutoH2oMLRegression() or AutoH2oMLClassifier() or AutoH2oMLMultiClass() to run H2O's AutoML function inside the RemixAutoML framework.
- Run AutoH2oDRFRegression() or AutoH2oDRFClassifier() or AutoH2oDRFMultiClass(). H2O's Distributed Random Forest can take a really long time to build; H2O's documentation has a great explanation of why it takes much longer than their GBM algo.
- Investigate model performance contained in the output object returned by those functions. You will be able to look at model calibration plots or box plots, ROC plots, partial dependence calibration plots or box plots, model metrics, etc.
- If you ran one of the Auto__Classifier() functions, supply the validation data to RemixClassificationMetrics() for an exhaustive threshold analysis
- Pick your model of choice and kick off an extended grid tuning and figure out something else to do that week (or run it over the weekend).
- Compare your results with your coworkers' results and see what's working and what isn't. Then you can either move on or continue exploring. Bargain with your boss to get that time to explore so you can learn new things. A minimal sketch of the first few steps is shown below.
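A minimal sketch of the first few steps, assuming fake data and reusing only arguments that appear in the examples further down this README:
# 1. Pull in data (fake here) and add calendar features
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 2L,
ZIP = 0L,
FactorCount = 4L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
data <- RemixAutoML::CreateCalendarVariables(
data = data,
DateCols = "DateTime",
AsFactor = FALSE,
TimeUnits = c("wday", "mday", "month", "quarter", "year"))
# 2. Partition into train / validation / test (optional; the supervised learning
# functions split 70/20/10 for you when ValidationData and TestData are NULL)
dataSets <- RemixAutoML::AutoDataPartition(
data,
NumDataSets = 3L,
Ratios = c(0.70, 0.20, 0.10),
PartitionType = "random",
StratifyColumnNames = NULL,
StratifyNumericTarget = NULL,
StratTargetPrecision = 1L,
TimeColumnName = NULL)
# 3. Hand the pieces to one of the Auto__Regression(), Auto__Classifier(), or
# Auto__MultiClass() functions shown in the Supervised Learning section below
TrainData <- dataSets$TrainData
ValidationData <- dataSets$ValidationData
TestData <- dataSets$TestData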
Forecasting
Single series and panel data using Time Series models or Machine Learning models
Supply a data.table to run the functions below:
- For single series, check out AutoBanditSarima(), AutoBanditNNet(), AutoTBATS(), AutoETS(), AutoArfima(), or AutoTS() (older function; no longer being developed)
- For panel data OR single series, check out AutoCatBoostCARMA(), AutoXGBoostCARMA(), AutoH2OCARMA(), AutoCatBoostHurdleCARMA(), or AutoCatBoostVectorCARMA(), or build a loop around the single-series functions above
- If you have to do any funnel forecasting, check out AutoCatBoostChainLadder(). First, structure your data like the help example. When you think you have found a good configuration, set the parameter SaveModelObjects = TRUE. Then you can run AutoMLChainLadderForecasting().
RemixAutoML Blogs
The Most Feature Rich ML Forecasting Methods Available
AutoML Frameworks in R & Python
AI for Small to Medium Size Businesses: A Management Take On The Challenges...
Why Machine Learning is more Practical than Econometrics in the Real World
Build Thousands of Automated Demand Forecasts in 15 Minutes Using AutoCatBoostCARMA in R
Automate Your KPI Forecasts With Only 1 Line of R Code Using AutoTS
Companies Are Demanding Model Interpretability. Here’s How To Do It Right
The Easiest Way to Create Thresholds And Improve Your Classification Model
Feature Engineering
AutoLagRollStats() and AutoLagRollStatsScoring()
# Create fake Panel Data----
Count <- 1L
for(Level in LETTERS) {
datatemp <- RemixAutoML::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 0L,
ZIP = 0L,
FactorCount = 0L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
datatemp[, Factor1 := eval(Level)]
if(Count == 1L) {
data <- data.table::copy(datatemp)
} else {
data <- data.table::rbindlist(list(data, data.table::copy(datatemp)))
}
Count <- Count + 1L
}
# Build the lag and rolling stat features
data <- RemixAutoML::AutoLagRollStats(
# Data
data = data,
DateColumn = "DateTime",
Targets = "Adrian",
HierarchyGroups = NULL,
IndependentGroups = c("Factor1"),
TimeUnitAgg = "days",
TimeGroups = c("days", "weeks", "months", "quarters"),
TimeBetween = NULL,
TimeUnit = "days",
# Services
RollOnLag1 = TRUE,
Type = "Lag",
SimpleImpute = TRUE,
# Calculated Columns
Lags = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1)), "quarters" = c(seq(1,2,1))),
MA_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1)), "quarters" = c(seq(1,2,1))),
SD_RollWindows = NULL,
Skew_RollWindows = NULL,
Kurt_RollWindows = NULL,
Quantile_RollWindows = NULL,
Quantiles_Selected = NULL,
Debug = FALSE)
# Create fake Panel Data----
Count <- 1L
for(Level in LETTERS) {
datatemp <- RemixAutoML::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 0L,
ZIP = 0L,
FactorCount = 0L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
datatemp[, Factor1 := eval(Level)]
if(Count == 1L) {
data <- data.table::copy(datatemp)
} else {
data <- data.table::rbindlist(list(data, data.table::copy(datatemp)))
}
Count <- Count + 1L
}
# Create ID columns to know which records to score
data[, ID := .N:1L, by = "Factor1"]
data.table::set(data, i = which(data[["ID"]] == 2L), j = "ID", value = 1L)
# Score records
data <- RemixAutoML::AutoLagRollStatsScoring(
# Data
data = data,
RowNumsID = "ID",
RowNumsKeep = 1,
DateColumn = "DateTime",
Targets = "Adrian",
HierarchyGroups = c("Store","Dept"),
IndependentGroups = NULL,
# Services
TimeBetween = NULL,
TimeGroups = c("days", "weeks", "months"),
TimeUnit = "day",
TimeUnitAgg = "day",
RollOnLag1 = TRUE,
Type = "Lag",
SimpleImpute = TRUE,
# Calculated Columns
Lags = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
MA_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
SD_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
Skew_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
Kurt_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
Quantile_RollWindows = list("days" = c(seq(1,5,1)), "weeks" = c(seq(1,3,1)), "months" = c(seq(1,2,1))),
Quantiles_Selected = c("q5","q10","q95"),
Debug = FALSE)
AutoLagRollStats() builds lags and rolling statistics by grouping variables and their interactions along with multiple different time aggregations, if selected. Rolling stats include mean, sd, skewness, kurtosis, and the 5th through 95th percentiles. This function was inspired by the distributed lag modeling framework, but I wanted to use it for time series analysis as well and generalize it as much as possible. The intuition can be illustrated by analyzing whether a baseball player will get a base hit or more in his next at bat. One easy way to get a better idea of the likelihood is to look at his batting average and his career batting average. However, players go into hot streaks and slumps. How do we account for that? That's where these functions come in. You look at the batting average over the last N to N+x at bats, for various N and x. I keep going, though: I want the same windows for calculating the player's standard deviation, skewness, kurtosis, and various quantiles over those time windows. I also want to look at all those measures using weekly data, as in, over the last N weeks, pull in those stats too.
AutoLagRollStatsScoring() builds the above features for a partial set of records in a data set. The function is extremely useful because it can compute these feature vectors at a significantly faster rate than the non-scoring version, which comes in handy when scoring ML models. If you can find a way to make it faster, let me know.
AutoInteraction()
#########################################
# Feature Engineering for Model Training
#########################################
# Create fake data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.70,
N = 50000,
ID = 2L,
FactorCount = 2L,
AddDate = TRUE,
ZIP = 0L,
TimeSeries = FALSE,
ChainLadderData = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Print number of columns
print(ncol(data))
# Store names of numeric and integer cols
Cols <- names(data)[c(which(unlist(lapply(data, is.numeric))),
which(unlist(lapply(data, is.integer))))]
# Model Training Feature Engineering
system.time(data <- RemixAutoML::AutoInteraction(
data = data,
NumericVars = Cols,
InteractionDepth = 4,
Center = TRUE,
Scale = TRUE,
SkipCols = NULL,
Scoring = FALSE,
File = getwd()))
# user system elapsed
# 0.32 0.22 0.53
# Print number of columns
print(ncol(data))
# 16
########################################
# Feature Engineering for Model Scoring
########################################
# Create fake data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.70,
N = 50000,
ID = 2L,
FactorCount = 2L,
AddDate = TRUE,
ZIP = 0L,
TimeSeries = FALSE,
ChainLadderData = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Print number of columns
print(ncol(data))
# 16
# Reduce to single row to mock a scoring scenario
data <- data[1L]
# Model Scoring Feature Engineering
system.time(data <- RemixAutoML::AutoInteraction(
data = data,
NumericVars = names(data)[
c(which(unlist(lapply(data, is.numeric))),
which(unlist(lapply(data, is.integer))))],
InteractionDepth = 4,
Center = TRUE,
Scale = TRUE,
SkipCols = NULL,
Scoring = TRUE,
File = file.path(getwd(), "Standardize.Rdata")))
# user system elapsed
# 0.19 0.00 0.19
# Print number of columns
print(ncol(data))
# 1095
AutoInteraction() will build out any number of interactions you want for numeric variables. You supply a character vector of numeric or integer column names, along with the names of any numeric columns you want to skip (including the interaction column names), and the interactions will be automatically created for you. For example, if you want 4th-degree interactions from 10 numeric columns, you will have 10 choose 2, 10 choose 3, and 10 choose 4 columns created (see the quick calculation below). Now, let's say you build all those features and decide you don't want all 10 base features included: remove the feature name from the NumericVars character vector. Or, let's say you modeled all of the interaction features and want to remove the ones with the lowest scores on the variable importance list: grab those names and run the interaction function again, this time supplying the poor-performing interaction column names to the SkipCols argument so they will be ignored. If you want to interact a categorical variable with a numeric variable, you'll have to dummify the categorical variable first and then include the level-specific dummy variable column names in the NumericVars character vector argument. If you set Center and Scale to TRUE, the interaction multiplication won't create huge numbers.
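As a quick sanity check on that count, the number of interaction columns from 10 numeric inputs can be computed with base R's choose():
# Pairwise, 3-way, and 4-way interactions from 10 numeric columns
choose(10, 2) + choose(10, 3) + choose(10, 4)
# 45 + 120 + 210 = 375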
AutoWord2VecModeler() and AutoWord2VecScoring()
# Create fake data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.70,
N = 1000L,
ID = 2L,
FactorCount = 2L,
AddDate = TRUE,
AddComment = TRUE,
ZIP = 2L,
TimeSeries = FALSE,
ChainLadderData = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Create Model and Vectors
data <- RemixAutoML::AutoWord2VecModeler(
data,
BuildType = "individual",
stringCol = c("Comment"),
KeepStringCol = FALSE,
ModelID = "Model_1",
model_path = getwd(),
vects = 10,
MinWords = 1,
WindowSize = 1,
Epochs = 25,
SaveModel = "standard",
Threads = max(1,parallel::detectCores()-2),
MaxMemory = "28G")
# Remove data
rm(data)
# Create fake data for mock scoring
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.70,
N = 1000L,
ID = 2L,
FactorCount = 2L,
AddDate = TRUE,
AddComment = TRUE,
ZIP = 2L,
TimeSeries = FALSE,
ChainLadderData = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Create vectors for scoring
data <- RemixAutoML::AutoWord2VecScoring(
data,
BuildType = "individual",
ModelObject = NULL,
ModelID = "Model_1",
model_path = getwd(),
stringCol = "Comment",
KeepStringCol = FALSE,
H2OStartUp = TRUE,
H2OShutdown = TRUE,
Threads = max(1L, parallel::detectCores() - 2L),
MaxMemory = "28G")
AutoWord2VecModeler() generates a specified number of word2vec vectors for each text column you specify, and it will save the models (if you tell it to) so they can be re-created later in a model scoring process. You can choose to build individual models for each column or one model for all your columns. If you need to run separate models for groups of text variables, you can run the function several times.
AutoWord2VecScoring() generates word2vec vectors for model scoring situations. The function will load the model, create the vectors, and merge them onto the source data.table just like the training version does.
CreateCalendarVariables()
# Create fake data with a Date column----
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 2L,
ZIP = 0L,
FactorCount = 4L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
for(i in seq_len(20L)) {
print(i)
data <- data.table::rbindlist(list(data, RemixAutoML::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 2L,
ZIP = 0L,
FactorCount = 4L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)))
}
# Create calendar variables - automatically excludes the second, minute, and hour selections since
# it is not timestamp data
runtime <- system.time(
data <- RemixAutoML::CreateCalendarVariables(
data = data,
DateCols = "DateTime",
AsFactor = FALSE,
TimeUnits = c("second", "minute", "hour", "wday", "mday", "yday", "week", "isoweek", "wom", "month", "quarter", "year")))
head(data)
print(runtime)
CreateCalendarVariables() creates numerical columns based on the date columns you supply, such as second, minute, hour, week day, day of month, day of year, week, isoweek, week of month (wom), month, quarter, and year.
CreateHolidayVariables()
# Create fake data with a Date----
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 2L,
ZIP = 0L,
FactorCount = 4L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
for(i in seq_len(20L)) {
print(i)
data <- data.table::rbindlist(list(data, RemixAutoML::FakeDataGenerator(
Correlation = 0.75,
N = 25000L,
ID = 2L,
ZIP = 0L,
FactorCount = 4L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)))
}
# Run function and time it
runtime <- system.time(
data <- RemixAutoML::CreateHolidayVariables(
data,
DateCols = "DateTime",
LookbackDays = 7,
HolidayGroups = c("USPublicHolidays","EasterGroup","ChristmasGroup","OtherEcclesticalFeasts"),
Holidays = NULL,
Print = FALSE))
head(data)
print(runtime)
CreateHolidayVariables() counts up the number of specified holidays between the current record's timestamp and the previous record's timestamp, by group as well if specified.
AutoHierarchicalFourier()
AutoHierarchicalFourier() turns time series data into Fourier series. The function can generate any number of Fourier pairs, and you can run it on grouped time series data. In the grouped case, Fourier pairs can be created for each categorical variable along with the full interactions between the specified categoricals. The process is parallelized to run as fast as possible.
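The exact interface of AutoHierarchicalFourier() isn't reproduced here, but the underlying idea can be sketched with plain data.table code (hypothetical columns; K pairs with period P give sin and cos terms of 2*pi*k*t/P):
library(data.table)
# Hypothetical grouped daily series
dt <- data.table(
Group = rep(c("A", "B"), each = 365L),
Date = rep(seq(as.Date("2020-01-01"), by = "day", length.out = 365L), 2L))
K <- 2L; P <- 365
dt[, TimeIndex := seq_len(.N), by = Group]
for(k in seq_len(K)) {
dt[, (paste0("FourierSin_", k)) := sin(2 * pi * k * TimeIndex / P)]
dt[, (paste0("FourierCos_", k)) := cos(2 * pi * k * TimeIndex / P)]
}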
AutoTransformationCreate() and AutoTransformationScore()
AutoTransformationCreate() is a function for automatically identifying the optimal transformations for numeric features and transforming them once identified. The function loops through your selected transformation options (YeoJohnson, BoxCox, Asinh, Log, LogPlus1, Sqrt, along with Asin and Logit for proportion data) and finds the one that produces the best fit to a normal distribution. It then generates the transformation and collects the metadata needed by the AutoTransformationScore() function, either returning the objects or saving them to file.
AutoTransformationScore() is the complement function to AutoTransformationCreate(). It automatically applies or inverts the transformations identified in AutoTransformationCreate() on other data sets. This is useful for applying transformations to your validation and test data sets for modeling, which is done automatically for you if you specify.
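A rough picture of the selection logic (this is not the package's implementation), using the Shapiro-Wilk W statistic as a stand-in for "best fit to a normal distribution":
set.seed(1)
x <- rexp(500) # skewed fake target
# Candidate transformations (a subset of the options listed above)
candidates <- list(
Identity = function(z) z,
Log = function(z) log(z),
Sqrt = function(z) sqrt(z),
Asinh = function(z) asinh(z))
# Score each candidate by normality and keep the winner
scores <- sapply(candidates, function(f) unname(shapiro.test(f(x))$statistic))
best <- names(which.max(scores))
best
xt <- candidates[[best]](x)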
DummifyDT()
# Create fake data with 10 categorical columns
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 25000,
ID = 2L,
ZIP = 0,
FactorCount = 10L,
AddDate = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Create dummy variables
data <- DummifyDT(
data = data,
cols = c("Factor_1",
"Factor_2",
"Factor_3",
"Factor_4",
"Factor_5",
"Factor_6",
"Factor_8",
"Factor_9",
"Factor_10"),
TopN = c(rep(3,9)),
KeepFactorCols = TRUE,
OneHot = FALSE,
SaveFactorLevels = TRUE,
SavePath = getwd(),
ImportFactorLevels = FALSE,
FactorLevelsList = NULL,
ClustScore = FALSE,
ReturnFactorLevels = FALSE)
# Create Fake Data for Scoring Replication
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 25000,
ID = 2L,
ZIP = 0,
FactorCount = 10L,
AddDate = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Scoring Version (imports csv's to generate matching levels and ordering)
data <- RemixAutoML::DummifyDT(
data = data,
cols = c("Factor_1",
"Factor_2",
"Factor_3",
"Factor_4",
"Factor_5",
"Factor_6",
"Factor_8",
"Factor_9",
"Factor_10"),
TopN = c(rep(3,9)),
KeepFactorCols = TRUE,
OneHot = FALSE,
SaveFactorLevels = TRUE,
SavePath = getwd(),
ImportFactorLevels = TRUE,
FactorLevelsList = NULL,
ClustScore = FALSE,
ReturnFactorLevels = FALSE)
DummifyDT() is used in the AutoXGBoost__() suite of modeling functions to manage categorical variables in your training, validation, and test sets. It rapidly dichotomizes categorical columns in a data.table (N+1 columns for N levels using one hot encoding or N columns for N levels otherwise). Several other arguments exist for outputting and saving factor levels. This is useful in model training, validation, and scoring processes.
DifferenceData() and DifferenceDataReverse()
DifferenceData() creates differences in your data (y1 - y0) for grouped or non-grouped data. DifferenceDataReverse() reverses those differences for grouped or non-grouped data.
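Conceptually (not the functions' exact interface), grouped differencing and its reversal look like this in data.table:
library(data.table)
dt <- data.table(Group = rep(c("A", "B"), each = 5L), y = c(cumsum(rnorm(5L)), cumsum(rnorm(5L))))
# Difference within each group: y_t - y_{t-1}
dt[, y_diff := y - shift(y, 1L), by = Group]
# Reverse: first value plus the cumulative sum of the differences rebuilds the original level
dt[, y_rebuilt := y[1L] + cumsum(fifelse(is.na(y_diff), 0, y_diff)), by = Group]
all.equal(dt$y, dt$y_rebuilt)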
ModelDataPrep()
# Create fake data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.75,
N = 250000L,
ID = 2L,
ZIP = 0L,
FactorCount = 6L,
AddDate = TRUE,
Classification = FALSE,
MultiClass = FALSE)
# Check column types
str(data)
# Convert some factors to character
data <- RemixAutoML::ModelDataPrep(
data,
Impute = TRUE,
CharToFactor = FALSE,
FactorToChar = TRUE,
IntToNumeric = TRUE,
DateToChar = FALSE,
RemoveDates = TRUE,
MissFactor = "0",
MissNum = -1,
IgnoreCols = c("Factor_1"))
# Check column types
str(data)
ModelDataPrep() This function will loop through every column in your data and apply a variety of functions based on argument settings. For all columns not ignored, these tasks include:
- Character type to Factor type conversion
- Factor type to Character type conversion
- Constant value imputation for numeric and categorical columns
- Integer type to Numeric type conversion
- Date type to Character type conversion
- Remove date columns
- Ignore specified columns
AutoDataPartition()
# Create fake data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000,
ID = 2,
ZIP = 0,
AddDate = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Run data partitioning function
dataSets <- RemixAutoML::AutoDataPartition(
data,
NumDataSets = 3L,
Ratios = c(0.70,0.20,0.10),
PartitionType = "random",
StratifyColumnNames = NULL,
StratifyNumericTarget = NULL,
StratTargetPrecision = 1L,
TimeColumnName = NULL)
# Collect data
TrainData <- dataSets$TrainData
ValidationData <- dataSets$ValidationData
TestData <- dataSets$TestData
AutoDataPartition() is designed to achieve a few things that standard data partitioning processes or functions don't handle. First, you can choose to build any number of partitioned data sets beyond the standard train, validation, and test data sets. Second, you can choose between random sampling and time-based partitioning to split your data. Third, for the random partitioning, you can specify stratification columns in your data to stratify by, in order to ensure a proper split amongst your categorical features (e.g., MultiClass targets). Lastly, it's 100% data.table, so it will run fast and with low memory overhead.
Supervised Learning
Regression
AutoCatBoostRegression() GPU Capable
AutoCatBoostRegression() utilizes the CatBoost algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 10000,
ID = 2,
ZIP = 0,
AddDate = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Run function
TestModel <- RemixAutoML::AutoCatBoostRegression(
# GPU or CPU and the number of available GPUs
task_type = "GPU",
NumGPUs = 1,
# Metadata args
ModelID = "Test_Model_1",
model_path = normalizePath("./"),
metadata_path = normalizePath("./"),
SaveModelObjects = FALSE,
SaveInfoToPDF = FALSE,
ReturnModelObjects = TRUE,
# Data args
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
Weights = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in%
c("IDcol_1", "IDcol_2","Adrian")],
PrimaryDateColumn = NULL,
DummifyCols = FALSE,
IDcols = c("IDcol_1","IDcol_2"),
TransformNumericColumns = "Adrian",
Methods = c("BoxCox", "Asinh", "Asin", "Log",
"LogPlus1", "Sqrt", "Logit", "YeoJohnson"),
# Model evaluation
eval_metric = "RMSE",
eval_metric_value = 1.5,
loss_function = "RMSE",
loss_function_value = 1.5,
MetricPeriods = 10L,
NumOfParDepPlots = ncol(data)-1L-2L,
EvalPlots = TRUE,
# Grid tuning args
PassInGrid = NULL,
GridTune = FALSE,
MaxModelsInGrid = 30L,
MaxRunsWithoutNewWinner = 20L,
MaxRunMinutes = 60*60,
Shuffles = 4L,
BaselineComparison = "default",
# ML args
langevin = FALSE,
diffusion_temperature = 10000,
Trees = 1000,
Depth = 6,
L2_Leaf_Reg = 3.0,
RandomStrength = 1,
BorderCount = 128,
LearningRate = NULL,
RSM = 1,
BootStrapType = NULL,
GrowPolicy = "SymmetricTree",
model_size_reg = 0.5,
feature_border_type = "GreedyLogSum",
sampling_unit = "Group",
subsample = NULL,
score_function = "Cosine",
min_data_in_leaf = 1)
# Output
TestModel$Model
TestModel$ValidationData
TestModel$EvaluationPlot
TestModel$EvaluationBoxPlot
TestModel$EvaluationMetrics
TestModel$VariableImportance
TestModel$InteractionImportance
TestModel$ShapValuesDT
TestModel$VI_Plot
TestModel$PartialDependencePlots
TestModel$PartialDependenceBoxPlots
TestModel$GridList
TestModel$ColNames
TestModel$TransformationResults
AutoXGBoostRegression() GPU Capable
AutoXGBoostRegression() utilizes the XGBoost algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000,
ID = 2,
ZIP = 0,
AddDate = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Run function
TestModel <- RemixAutoML::AutoXGBoostRegression(
# GPU or CPU
TreeMethod = "hist",
NThreads = parallel::detectCores(),
LossFunction = 'reg:squarederror',
# Metadata args
model_path = normalizePath("./"),
metadata_path = NULL,
ModelID = "Test_Model_1",
ReturnFactorLevels = TRUE,
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
# Data args
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in%
c("IDcol_1", "IDcol_2","Adrian")],
IDcols = c("IDcol_1","IDcol_2"),
TransformNumericColumns = NULL,
Methods = c("BoxCox", "Asinh", "Asin", "Log",
"LogPlus1", "Sqrt", "Logit", "YeoJohnson"),
# Model evaluation args
eval_metric = "rmse",
NumOfParDepPlots = 3L,
# Grid tuning args
PassInGrid = NULL,
GridTune = FALSE,
grid_eval_metric = "mse",
BaselineComparison = "default",
MaxModelsInGrid = 10L,
MaxRunsWithoutNewWinner = 20L,
MaxRunMinutes = 24L*60L,
Verbose = 1L,
# ML args
Shuffles = 1L,
Trees = 50L,
eta = 0.05,
max_depth = 4L,
min_child_weight = 1.0,
subsample = 0.55,
colsample_bytree = 0.55)
AutoH2oGBMRegression()
AutoH2oGBMRegression() utilizes the H2O Gradient Boosting algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000,
ID = 2,
ZIP = 0,
AddDate = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Run function
TestModel <- RemixAutoML::AutoH2oGBMRegression(
# Compute management
MaxMem = {gc();paste0(as.character(floor(as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE)) / 1000000)),"G")},
NThreads = max(1, parallel::detectCores()-2),
H2OShutdown = TRUE,
H2OStartUp = TRUE,
IfSaveModel = "mojo",
# Model evaluation
NumOfParDepPlots = 3,
# Metadata arguments:
model_path = normalizePath("./"),
metadata_path = file.path(normalizePath("./")),
ModelID = "FirstModel",
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
SaveInfoToPDF = FALSE,
# Data arguments
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in% c("IDcol_1", "IDcol_2","Adrian")],
WeightsColumn = NULL,
TransformNumericColumns = NULL,
Methods = c("BoxCox", "Asinh", "Asin", "Log", "LogPlus1", "Sqrt", "Logit","YeoJohnson"),
# ML grid tuning args
GridTune = FALSE,
GridStrategy = "Cartesian",
MaxRuntimeSecs = 60*60*24,
StoppingRounds = 10,
MaxModelsInGrid = 2,
# Model args
Trees = 50,
LearnRate = 0.10,
LearnRateAnnealing = 1,
eval_metric = "RMSE",
Alpha = NULL,
Distribution = "poisson",
MaxDepth = 20,
SampleRate = 0.632,
ColSampleRate = 1,
ColSampleRatePerTree = 1,
ColSampleRatePerTreeLevel = 1,
MinRows = 1,
NBins = 20,
NBinsCats = 1024,
NBinsTopLevel = 1024,
HistogramType = "AUTO",
CategoricalEncoding = "AUTO")
AutoH2oDRFRegression()
AutoH2oDRFRegression() utilizes the H2o Distributed Random Forest algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000,
ID = 2,
ZIP = 0,
AddDate = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Run function
TestModel <- RemixAutoML::AutoH2oDRFRegression(
# Compute management
MaxMem = {gc();paste0(as.character(floor(as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE)) / 1000000)),"G")},
NThreads = max(1L, parallel::detectCores() - 2L),
H2OShutdown = TRUE,
H2OStartUp = TRUE,
IfSaveModel = "mojo",
# Model evaluation:
eval_metric = "RMSE",
NumOfParDepPlots = 3,
# Metadata arguments:
model_path = normalizePath("./"),
metadata_path = NULL,
ModelID = "FirstModel",
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
SaveInfoToPDF = FALSE,
# Data Args
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in% c("IDcol_1", "IDcol_2","Adrian")],
WeightsColumn = NULL,
TransformNumericColumns = NULL,
Methods = c("BoxCox", "Asinh", "Asin", "Log", "LogPlus1", "Sqrt", "Logit", "YeoJohnson"),
# Grid Tuning Args
GridStrategy = "Cartesian",
GridTune = FALSE,
MaxModelsInGrid = 10,
MaxRuntimeSecs = 60*60*24,
StoppingRounds = 10,
# ML Args
Trees = 50,
MaxDepth = 20,
SampleRate = 0.632,
MTries = -1,
ColSampleRatePerTree = 1,
ColSampleRatePerTreeLevel = 1,
MinRows = 1,
NBins = 20,
NBinsCats = 1024,
NBinsTopLevel = 1024,
HistogramType = "AUTO",
CategoricalEncoding = "AUTO")
AutoH2oGLMRegression()
AutoH2oGLMRegression() utilizes the H2o generalized linear model algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000,
ID = 2,
ZIP = 0,
AddDate = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Run function
TestModel <- RemixAutoML::AutoH2oGLMRegression(
# Compute management
MaxMem = {gc();paste0(as.character(floor(as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE)) / 1000000)),"G")},
NThreads = max(1, parallel::detectCores()-2),
H2OShutdown = TRUE,
H2OStartUp = TRUE,
IfSaveModel = "mojo",
# Model evaluation:
eval_metric = "RMSE",
NumOfParDepPlots = 3,
# Metadata arguments:
model_path = NULL,
metadata_path = NULL,
ModelID = "FirstModel",
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
SaveInfoToPDF = FALSE,
# Data arguments:
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in% c("IDcol_1", "IDcol_2","Adrian")],
RandomColNumbers = NULL,
InteractionColNumbers = NULL,
WeightsColumn = NULL,
TransformNumericColumns = NULL,
Methods = c("BoxCox", "Asinh", "Asin", "Log", "LogPlus1", "Sqrt", "Logit", "YeoJohnson"),
# Model args
GridTune = FALSE,
GridStrategy = "Cartesian",
StoppingRounds = 10,
MaxRunTimeSecs = 3600 * 24 * 7,
MaxModelsInGrid = 10,
Distribution = "gaussian",
Link = "identity",
TweedieLinkPower = NULL,
TweedieVariancePower = NULL,
RandomDistribution = NULL,
RandomLink = NULL,
Solver = "AUTO",
Alpha = NULL,
Lambda = NULL,
LambdaSearch = FALSE,
NLambdas = -1,
Standardize = TRUE,
RemoveCollinearColumns = FALSE,
InterceptInclude = TRUE,
NonNegativeCoefficients = FALSE)
AutoH2oMLRegression()
AutoH2oMLRegression() utilizes the H2o AutoML algorithm in the below steps
# Create some dummy correlated data with numeric and categorical features
data <- RemixAutoML::FakeDataGenerator(Correlation = 0.85, N = 1000, ID = 2, ZIP = 0, AddDate = FALSE, Classification = FALSE, MultiClass = FALSE)
# Run function
TestModel <- RemixAutoML::AutoH2oMLRegression(
# Compute management
MaxMem = "32G",
NThreads = max(1, parallel::detectCores()-2),
H2OShutdown = TRUE,
IfSaveModel = "mojo",
# Model evaluation:
# 'eval_metric' is the measure used when evaluating on holdout data
# 'NumOfParDepPlots' Number of partial dependence calibration plots generated.
# A value of 3 will return plots for the top 3 variables based on variable importance
# Won't be returned if GrowPolicy is either "Depthwise" or "Lossguide"
# Can run the RemixAutoML::ParDepCalPlots() with the outputted ValidationData
eval_metric = "RMSE",
NumOfParDepPlots = 3,
# Metadata arguments:
# 'ModelID' is used to create part of the file names generated when saving to file
# 'model_path' is where the minimal model objects for scoring will be stored
# 'ModelID' will be the name of the saved model object
# 'metadata_path' is where model evaluation and model interpretation files are saved
# objects saved to model_path if metadata_path is null
# Saved objects include:
# 'ModelID_ValidationData.csv' is the supplied or generated TestData with predicted values
# 'ModelID_VariableImportance.csv' is the variable importance.
# This won't be saved to file if GrowPolicy is either "Depthwise" or "Lossguide"
# 'ModelID_ExperimentGrid.csv' if GridTune = TRUE.
# Results of all model builds including parameter settings, bandit probs, and grid IDs
# 'ModelID_EvaluationMetrics.csv' which contains MSE, MAE, MAPE, R2
model_path = NULL,
metadata_path = NULL,
ModelID = "FirstModel",
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
# Data arguments:
# 'TrainOnFull' is to train a model with 100 percent of your data.
# That means no holdout data will be used for evaluation
# If ValidationData and TestData are NULL and TrainOnFull is FALSE then data will be split 70 20 10
# 'PrimaryDateColumn' is a date column in data that is meaningful when sorted.
# CatBoost categorical treatment is enhanced when supplied
# 'IDcols' are columns in your data that you don't use for modeling but get returned with ValidationData
# 'TransformNumericColumns' is for transforming your target variable. Just supply the name of it
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in% c("IDcol_1", "IDcol_2","Adrian")],
TransformNumericColumns = NULL,
Methods = c("BoxCox", "Asinh", "Asin", "Log", "LogPlus1", "Logit", "YeoJohnson"),
# Model args
GridTune = FALSE,
ExcludeAlgos = NULL,
Trees = 50,
MaxModelsInGrid = 10)
AutoH2oGAMRegression()
AutoH2oGAMRegression() utilizes the H2O generalized additive model algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000,
ID = 2,
ZIP = 0,
AddDate = FALSE,
Classification = FALSE,
MultiClass = FALSE)
# Define GAM Columns to use - up to 9 are allowed
GamCols <- names(which(unlist(lapply(data, is.numeric))))
GamCols <- GamCols[!GamCols %in% c("Adrian","IDcol_1","IDcol_2")]
GamCols <- GamCols[1L:(min(9L,length(GamCols)))]
# Run function
TestModel <- RemixAutoML::AutoH2oGAMRegression(
# Compute management
MaxMem = {gc();paste0(as.character(floor(as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE)) / 1000000)),"G")},
NThreads = max(1, parallel::detectCores()-2),
H2OShutdown = TRUE,
H2OStartUp = TRUE,
IfSaveModel = "mojo",
# Model evaluation:
eval_metric = "RMSE",
NumOfParDepPlots = 3,
# Metadata arguments:
model_path = NULL,
metadata_path = NULL,
ModelID = "FirstModel",
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
SaveInfoToPDF = FALSE,
# Data arguments:
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in% c("IDcol_1", "IDcol_2","Adrian")],
InteractionColNumbers = NULL,
WeightsColumn = NULL,
GamColNames = GamCols,
TransformNumericColumns = NULL,
Methods = c("BoxCox", "Asinh", "Asin", "Log", "LogPlus1", "Sqrt", "Logit", "YeoJohnson"),
# Model args
num_knots = NULL,
keep_gam_cols = TRUE,
GridTune = FALSE,
GridStrategy = "Cartesian",
StoppingRounds = 10,
MaxRunTimeSecs = 3600 * 24 * 7,
MaxModelsInGrid = 10,
Distribution = "gaussian",
Link = "Family_Default",
TweedieLinkPower = NULL,
TweedieVariancePower = NULL,
Solver = "AUTO",
Alpha = NULL,
Lambda = NULL,
LambdaSearch = FALSE,
NLambdas = -1,
Standardize = TRUE,
RemoveCollinearColumns = FALSE,
InterceptInclude = TRUE,
NonNegativeCoefficients = FALSE)
The Auto_Regression() models handle a multitude of tasks. In order:
- Convert your data to data.table format for faster processing
- Transform your target variable using the best normalization method based on the AutoTransformationCreate() function
- Create train, validation, and test data, utilizing the AutoDataPartition() function, if you didn't supply those directly to the function
- Consolidate columns that are used for modeling and the metadata you want returned in your test data with predictions
- Dichotomize categorical variables (for AutoXGBoostRegression()) and save the factor levels for scoring in a way that guarantees consistency across training, validation, and test data sets, utilizing the DummifyDT() function
- Save the final modeling column names for reference
- Handle the data conversion to the appropriate modeling type, such as CatBoost, H2O, and XGBoost
- Multi-armed bandit hyperparameter tuning using randomized probability matching, if you choose to grid tune
- Loop through the grid-tuning process, building N models
- Collect the evaluation metrics for each grid tune run
- Identify the best model of the set of models built in the grid tuning search
- Save the hyperparameters from the winning grid tuned model
- Build the final model based on the best model from the grid tuning model search (I remove each model after evaluation metrics are generated in the grid tune to avoid memory overflow)
- Back-transform your predictions based on the best transformation used earlier in the process (see the sketch after this list)
- Collect evaluation metrics based on performance on test data (based on back-transformed data)
- Store the final predictions with the associated test data and other columns you want included in that set
- Save your transformation metadata for recreating them in a scoring process
- Build out and save an Evaluation Calibration Line Plot and Evaluation Calibration Box-Plot, using the EvalPlot() function
- Generate and save Variable Importance
- Generate and save Partial Dependence Calibration Line Plots and Partial Dependence Calibration Box-Plots, using the ParDepPlots() function
- Return all the objects generated in a named list for immediate use and evaluation
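As a tiny illustration of the back-transformation step (purely conceptual; in practice the saved transformation metadata handles this for you), here's a LogPlus1 transform and its inverse:
y <- c(0, 2, 10, 150)        # target on the original scale
y_trans <- log1p(y)          # LogPlus1 transform applied before training
preds_trans <- y_trans       # pretend these are predictions on the transformed scale
preds <- expm1(preds_trans)  # back-transform to the original scale
all.equal(preds, y)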
Binary Classification
AutoCatBoostClassifier() GPU Capable
AutoCatBoostClassifier() utilizes the CatBoost algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 10000,
ID = 2,
ZIP = 0,
AddDate = FALSE,
Classification = TRUE,
MultiClass = FALSE)
# Run function
TestModel <- RemixAutoML::AutoCatBoostClassifier(
# GPU or CPU and the number of available GPUs
task_type = "GPU",
NumGPUs = 1,
# Metadata args
ModelID = "Test_Model_1",
model_path = normalizePath("./"),
metadata_path = normalizePath("./"),
SaveModelObjects = FALSE,
ReturnModelObjects = TRUE,
SaveInfoToPDF = FALSE,
# Data args
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in% c("IDcol_1","IDcol_2","Adrian")],
PrimaryDateColumn = NULL,
ClassWeights = c(1L,1L),
IDcols = c("IDcol_1","IDcol_2"),
# Evaluation args
eval_metric = "AUC",
loss_function = "Logloss",
MetricPeriods = 10L,
NumOfParDepPlots = ncol(data)-1L-2L,
# Grid tuning args
PassInGrid = NULL,
GridTune = TRUE,
MaxModelsInGrid = 30L,
MaxRunsWithoutNewWinner = 20L,
MaxRunMinutes = 24L*60L,
Shuffles = 4L,
BaselineComparison = "default",
# ML args
Trees = seq(100L, 500L, 50L),
Depth = seq(4L, 8L, 1L),
LearningRate = seq(0.01,0.10,0.01),
L2_Leaf_Reg = seq(1.0, 10.0, 1.0),
RandomStrength = 1,
BorderCount = 128,
RSM = c(0.80, 0.85, 0.90, 0.95, 1.0),
BootStrapType = c("Bayesian", "Bernoulli", "Poisson", "MVS", "No"),
GrowPolicy = c("SymmetricTree", "Depthwise", "Lossguide"),
langevin = FALSE,
diffusion_temperature = 10000,
model_size_reg = 0.5,
feature_border_type = "GreedyLogSum",
sampling_unit = "Group",
subsample = NULL,
score_function = "Cosine",
min_data_in_leaf = 1)
# Output
TestModel$Model
TestModel$ValidationData
TestModel$ROC_Plot
TestModel$EvaluationPlot
TestModel$EvaluationMetrics
TestModel$VariableImportance
TestModel$InteractionImportance
TestModel$ShapValuesDT
TestModel$VI_Plot
TestModel$PartialDependencePlots
TestModel$GridMetrics
TestModel$ColNames
AutoXGBoostClassifier() GPU Capable
AutoXGBoostClassifier() utilizes the XGBoost algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000L,
ID = 2L,
ZIP = 0L,
AddDate = FALSE,
Classification = TRUE,
MultiClass = FALSE)
# Run function
TestModel <- RemixAutoML::AutoXGBoostClassifier(
# GPU or CPU
TreeMethod = "hist",
NThreads = parallel::detectCores(),
# Metadata args
model_path = normalizePath("./"),
metadata_path = NULL,
ModelID = "Test_Model_1",
ReturnFactorLevels = TRUE,
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
# Data args
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in%
c("IDcol_1", "IDcol_2","Adrian")],
IDcols = c("IDcol_1","IDcol_2"),
# Model evaluation
LossFunction = 'reg:logistic',
eval_metric = "auc",
NumOfParDepPlots = 3L,
# Grid tuning args
PassInGrid = NULL,
GridTune = FALSE,
BaselineComparison = "default",
MaxModelsInGrid = 10L,
MaxRunsWithoutNewWinner = 20L,
MaxRunMinutes = 24L*60L,
Verbose = 1L,
# ML args
Shuffles = 1L,
Trees = 50L,
eta = 0.05,
max_depth = 4L,
min_child_weight = 1.0,
subsample = 0.55,
colsample_bytree = 0.55)
AutoH2oGBMClassifier()
AutoH2oGBMClassifier() utilizes the H2O Gradient Boosting algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000L,
ID = 2L,
ZIP = 0L,
AddDate = FALSE,
Classification = TRUE,
MultiClass = FALSE)
TestModel <- RemixAutoML::AutoH2oGBMClassifier(
# Compute management
MaxMem = {gc();paste0(as.character(floor(as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE)) / 1000000)),"G")},
NThreads = max(1, parallel::detectCores()-2),
H2OShutdown = TRUE,
H2OStartUp = TRUE,
IfSaveModel = "mojo",
# Model evaluation
NumOfParDepPlots = 3,
# Metadata arguments:
model_path = normalizePath("./"),
metadata_path = file.path(normalizePath("./")),
ModelID = "FirstModel",
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
SaveInfoToPDF = FALSE,
# Data arguments
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in% c("IDcol_1", "IDcol_2","Adrian")],
WeightsColumn = NULL,
# ML grid tuning args
GridTune = FALSE,
GridStrategy = "Cartesian",
MaxRuntimeSecs = 60*60*24,
StoppingRounds = 10,
MaxModelsInGrid = 2,
# Model args
Trees = 50,
LearnRate = 0.10,
LearnRateAnnealing = 1,
eval_metric = "auc",
Distribution = "bernoulli",
MaxDepth = 20,
SampleRate = 0.632,
ColSampleRate = 1,
ColSampleRatePerTree = 1,
ColSampleRatePerTreeLevel = 1,
MinRows = 1,
NBins = 20,
NBinsCats = 1024,
NBinsTopLevel = 1024,
HistogramType = "AUTO",
CategoricalEncoding = "AUTO")
AutoH2oDRFClassifier()
AutoH2oDRFClassifier() utilizes the H2O Distributed Random Forest algorithm in the below steps
# Create some dummy correlated data
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000L,
ID = 2L,
ZIP = 0L,
AddDate = FALSE,
Classification = TRUE,
MultiClass = FALSE)
TestModel <- RemixAutoML::AutoH2oDRFClassifier(
# Compute management
MaxMem = {gc();paste0(as.character(floor(as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE)) / 1000000)),"G")},
NThreads = max(1L, parallel::detectCores() - 2L),
IfSaveModel = "mojo",
H2OShutdown = FALSE,
H2OStartUp = TRUE,
# Model evaluation:
eval_metric = "auc",
NumOfParDepPlots = 3L,
# Metadata arguments:
model_path = normalizePath("./"),
metadata_path = NULL,
ModelID = "FirstModel",
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
SaveInfoToPDF = FALSE,
# Data arguments:
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in% c("IDcol_1", "IDcol_2", "Adrian")],
WeightsColumn = NULL,
# Grid Tuning Args
GridStrategy = "Cartesian",
GridTune = FALSE,
MaxModelsInGrid = 10,
MaxRuntimeSecs = 60*60*24,
StoppingRounds = 10,
# Model args
Trees = 50L,
MaxDepth = 20,
SampleRate = 0.632,
MTries = -1,
ColSampleRatePerTree = 1,
ColSampleRatePerTreeLevel = 1,
MinRows = 1,
NBins = 20,
NBinsCats = 1024,
NBinsTopLevel = 1024,
HistogramType = "AUTO",
CategoricalEncoding = "AUTO")
AutoH2oGLMClassifier()
AutoH2oGLMClassifier() utilizes the H2O generalized linear model algorithm in the below steps
# Create some dummy correlated data with numeric and categorical features
data <- RemixAutoML::FakeDataGenerator(
Correlation = 0.85,
N = 1000L,
ID = 2L,
ZIP = 0L,
AddDate = FALSE,
Classification = TRUE,
MultiClass = FALSE)
# Run function
TestModel <- RemixAutoML::AutoH2oGLMClassifier(
# Compute management
MaxMem = {gc();paste0(as.character(floor(as.numeric(system("awk '/MemFree/ {print $2}' /proc/meminfo", intern=TRUE)) / 1000000)),"G")},
NThreads = max(1, parallel::detectCores()-2),
H2OShutdown = TRUE,
H2OStartUp = TRUE,
IfSaveModel = "mojo",
# Model evaluation args
eval_metric = "auc",
NumOfParDepPlots = 3,
# Metadata args
model_path = NULL,
metadata_path = NULL,
ModelID = "FirstModel",
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
SaveInfoToPDF = FALSE,
# Data args
data = data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in%
c("IDcol_1", "IDcol_2","Adrian")],
RandomColNumbers = NULL,
InteractionColNumbers = NULL,
WeightsColumn = NULL,
TransformNumericColumns = NULL,
Methods = c("BoxCox", "Asinh", "Asin", "Log", "LogPlus1", "Sqrt", "Logit", "YeoJohnson"),
# ML args
GridTune = FALSE,
GridStrategy = "Cartesian",
StoppingRounds = 10,
MaxRunTimeSecs = 3600 * 24 * 7,
MaxModelsInGrid = 10,
Distribution = "binomial",
Link = "logit",
RandomDistribution = NULL,
RandomLink = NULL,
Solver = "AUTO",
Alpha = NULL,
Lambda = NULL,
LambdaSearch = FALSE,
NLambdas = -1,
Standardize = TRUE,
RemoveCollinearColumns = FALSE,
InterceptInclude = TRUE,
NonNegativeCoefficients = FALSE)
AutoH2oMLClassifier()
AutoH2oMLClassifier() utilizes the H2o AutoML algorithm in the below steps
# Create some dummy correlated data with numeric and categorical features
data <- RemixAutoML::FakeDataGenerator(Correlation = 0.85, N = 1000L, ID = 2L, ZIP = 0L, AddDate = FALSE, Classification = TRUE, MultiClass = FALSE)
TestModel <- RemixAutoML::AutoH2oMLClassifier(
data,
TrainOnFull = FALSE,
ValidationData = NULL,
TestData = NULL,
TargetColumnName = "Adrian",
FeatureColNames = names(data)[!names(data) %in% c("IDcol_1", "IDcol_2","Adrian")],
ExcludeAlgos = NULL,
eval_metric = "auc",
Trees = 50,
MaxMem = "32G",
NThreads = max(1, parallel::detectCores()-2),
MaxModelsInGrid = 10,
model_path = normaliz