RemixAutoML (version 0.4.2)

AutoWord2VecModeler: Automated word2vec data generation via H2O

Description

This function builds a word2vec model with H2O and merges the resulting word vectors onto your supplied dataset.

Usage

AutoWord2VecModeler(
  data,
  BuildType = "Combined",
  stringCol = c("Text_Col1", "Text_Col2"),
  KeepStringCol = FALSE,
  model_path = NULL,
  vects = 100,
  MinWords = 1,
  WindowSize = 12,
  Epochs = 25,
  SaveModel = "standard",
  Threads = max(1L, parallel::detectCores() - 2L),
  MaxMemory = "28G",
  ModelID = "Model_1"
)

Arguments

data

Source data.table to merge the word vectors onto

BuildType

Choose from "individual" or "combined". "individual" builds a separate model for each text column; "combined" builds a single model across all text columns.

stringCol

A character vector with the names of the text columns to convert via word2vec

KeepStringCol

Set to TRUE to keep the original string column(s) that you convert via word2vec

model_path

A string path to the location where you want the model and metadata stored

vects

The number of vector dimensions (numeric columns) to generate from the word2vec model

MinWords

Setting for the H2O word2vec model; by its name, the minimum word frequency (min_word_freq in h2o.word2vec)

WindowSize

Setting for the H2O word2vec model; by its name, the context window size (window_size in h2o.word2vec)

Epochs

Setting for the H2O word2vec model; by its name, the number of training epochs (epochs in h2o.word2vec)

SaveModel

Set to "standard" to save the model normally; set to "mojo" to save it as a MOJO. NOTE: while you can save a MOJO, I haven't figured out how to score it in the AutoH2OScoring function.

Threads

Number of available threads you want to dedicate to model building

MaxMemory

Amount of memory you want to dedicate to model building

ModelID

Name for saving to file
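
A minimal sketch of how you might validate Threads and MaxMemory before calling the function. The helpers safe_threads() and valid_memory() are hypothetical and not part of RemixAutoML. The memory check assumes MaxMemory is handed to H2O as a size string such as "28G" or "512M", which this page does not state explicitly.

```r
# Hypothetical helper (not part of RemixAutoML): leave `reserve` cores free
# but never return fewer than one thread.
safe_threads <- function(reserve = 2L) {
  cores <- parallel::detectCores()
  # detectCores() can return NA when the core count cannot be determined
  if (is.na(cores)) return(1L)
  max(1L, as.integer(cores) - as.integer(reserve))
}

# Hypothetical check (assumption): a memory string is digits followed by G or M
valid_memory <- function(x) {
  is.character(x) && length(x) == 1L && grepl("^[0-9]+[GgMm]$", x)
}

Threads <- safe_threads()
MaxMemory <- "28G"
stopifnot(Threads >= 1L, valid_memory(MaxMemory))
```

The default Threads expression in Usage evaluates to NA if detectCores() returns NA, so a guard of this kind may be worth adding on unusual platforms.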

See Also

Other Feature Engineering: AutoDataPartition(), AutoHierarchicalFourier(), AutoInteraction(), AutoLagRollStatsScoring(), AutoLagRollStats(), AutoTransformationCreate(), AutoTransformationScore(), AutoWord2VecScoring(), ContinuousTimeDataGenerator(), CreateCalendarVariables(), CreateHolidayVariables(), DT_GDL_Feature_Engineering(), DifferenceDataReverse(), DifferenceData(), DummifyDT(), H2oAutoencoder(), ModelDataPrep(), Partial_DT_GDL_Feature_Engineering(), TimeSeriesFill()

Examples

# NOT RUN {
# Create fake data
data <- RemixAutoML::FakeDataGenerator(
  Correlation = 0.70,
  N = 1000L,
  ID = 2L,
  FactorCount = 2L,
  AddDate = TRUE,
  AddComment = TRUE,
  ZIP = 2L,
  TimeSeries = FALSE,
  ChainLadderData = FALSE,
  Classification = FALSE,
  MultiClass = FALSE)

# Create Model and Vectors
data <- RemixAutoML::AutoWord2VecModeler(
  data,
  BuildType = "individual",
  stringCol = c("Comment"),
  KeepStringCol = FALSE,
  ModelID = "Model_1",
  model_path = getwd(),
  vects = 10,
  MinWords = 1,
  WindowSize = 1,
  Epochs = 25,
  SaveModel = "standard",
  Threads = max(1L, parallel::detectCores() - 2L),
  MaxMemory = "28G")

# Remove data
rm(data)

# Create fake data for mock scoring
data <- RemixAutoML::FakeDataGenerator(
  Correlation = 0.70,
  N = 1000L,
  ID = 2L,
  FactorCount = 2L,
  AddDate = TRUE,
  AddComment = TRUE,
  ZIP = 2L,
  TimeSeries = FALSE,
  ChainLadderData = FALSE,
  Classification = FALSE,
  MultiClass = FALSE)

# Create vectors for scoring
data <- RemixAutoML::AutoWord2VecScoring(
  data,
  BuildType = "individual",
  ModelObject = NULL,
  ModelID = "Model_1",
  model_path = getwd(),
  stringCol = "Comment",
  KeepStringCol = FALSE,
  H2OStartUp = TRUE,
  H2OShutdown = TRUE,
  Threads = max(1L, parallel::detectCores() - 2L),
  MaxMemory = "28G")
# }
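
The idea of turning a text column into numeric vector columns and merging them onto the data can be illustrated without H2O. The sketch below is a toy in base R, not the package's implementation: the embedding values are made up, the column names are hypothetical, and averaging the word vectors per row is an assumption, since this page does not say how word vectors are aggregated.

```r
# Toy embedding table: 3 words x 2 dimensions (values made up for illustration)
embeddings <- rbind(
  good  = c(1, 0),
  bad   = c(-1, 0),
  movie = c(0, 1)
)

# Average the vectors of the known words in one string (assumed aggregation)
doc_vector <- function(text, emb) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[words %in% rownames(emb)]
  # No known words: return missing values for every dimension
  if (length(words) == 0L) return(rep(NA_real_, ncol(emb)))
  colMeans(emb[words, , drop = FALSE])
}

data <- data.frame(
  ID = 1:3,
  Comment = c("Good movie", "bad bad movie", "unknown"),
  stringsAsFactors = FALSE
)

# One row of vector values per input row
vecs <- t(vapply(data$Comment, doc_vector, numeric(ncol(embeddings)),
                 emb = embeddings))
colnames(vecs) <- paste0("Comment_vec", seq_len(ncol(vecs)))  # hypothetical names
rownames(vecs) <- NULL

# Merge the vectors on and drop the text column (as with KeepStringCol = FALSE)
data <- cbind(data[, setdiff(names(data), "Comment"), drop = FALSE], vecs)
```

With vects = 10 in the example above, the real function would analogously add ten numeric columns for the Comment column.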
