RSofia (version 1.1)

sofia: Fitting sofia-ml models

Description

sofia is used to fit classification and regression models provided by D. Sculley's sofia-ml.

Usage

sofia(x, ...)

# S3 method for formula
sofia(x, data, random_seed = floor(runif(1, 1, 65535)), lambda = 0.1,
  iterations = 1e+05,
  learner_type = c("pegasos", "sgd-svm", "passive-aggressive",
    "margin-perceptron", "romma", "logreg-pegasos"),
  eta_type = c("pegasos", "basic", "constant"),
  loop_type = c("stochastic", "balanced-stochastic", "rank", "roc",
    "query-norm-rank", "combined-ranking", "combined-roc"),
  rank_step_probability = 0.5, passive_aggressive_c = 1e+07,
  passive_aggressive_lambda = 0, perceptron_margin_size = 1,
  training_objective = FALSE, hash_mask_bits = 0, verbose = FALSE,
  reserve = 0, ...)

# S3 method for character
sofia(x, random_seed = floor(runif(1, 1, 65535)), lambda = 0.1,
  iterations = 1e+05,
  learner_type = c("pegasos", "sgd-svm", "passive-aggressive",
    "margin-perceptron", "romma", "logreg-pegasos"),
  eta_type = c("pegasos", "basic", "constant"),
  loop_type = c("stochastic", "balanced-stochastic", "rank", "roc",
    "query-norm-rank", "combined-ranking", "combined-roc"),
  rank_step_probability = 0.5, passive_aggressive_c = 1e+07,
  passive_aggressive_lambda = 0, perceptron_margin_size = 1,
  training_objective = FALSE, no_bias_term = FALSE,
  dimensionality = 150000, hash_mask_bits = 0, verbose = FALSE,
  buffer_mb = 40, ...)

Arguments

x
a formula object, or a character string giving the path to a training data file
data
a data frame containing the variables in the formula, when the model is specified via a formula
random_seed
an integer. Seed for the random number generator. Can be useful in testing and parameter tuning.
lambda
a numeric scalar. Value of lambda for SVM regularization, used by both Pegasos SVM and SGD-SVM.
iterations
an integer. Number of stochastic gradient steps to take.
learner_type
a character string indicating which type of learner to use. One of "pegasos" (default), "sgd-svm", "passive-aggressive", "margin-perceptron", "romma", "logreg-pegasos"
eta_type
a character string indicating the type of update for learning rate to use. One of "pegasos" (default), "basic", "constant"
loop_type
a character string indicating the type of sampling loop to use for training. One of

"stochastic" - Perform normal stochastic sampling for stochastic gradient descent, for training binary classifiers. On each iteration, pick a new example uniformly at random from the data set.

"balanced-stochastic" - Perform a balanced sampling from positives and negatives in data set. For each iteration, samples one positive example uniformly at random from the set of all positives, and samples one negative example uniformly at random from the set of all negatives. This can be useful for training binary classifiers with a minority-class distribution.

"rank" - Perform indexed sampling of candidate pairs for pairwise learning to rank. Useful when there are examples from several different qid groups.

"roc" - Perform indexed sampling to optimize ROC Area.

"query-norm-rank" - Perform sampling of candidate pairs, giving equal weight to each qid group regardless of its size. Currently this is implemented with rejection sampling rather than indexed sampling, so this may run more slowly.

"combined-ranking" - Performs CRR algorithm for combined regression and ranking. Alternates between pairwise rank-based steps and standard stochastic gradient steps on single examples. Relies on "rank_step_probability" to balance between these two kinds of updates.

"combined-roc" - Performs CRR algorithm for combined regression and ROC area optimization. Alternates between pairwise roc-optimization-based steps and standard stochastic gradient steps on single examples. Relies on "rank_step_probability" to balance between these two kinds of updates. This can be faster than the combined-ranking option when there are exactly two classes.

rank_step_probability
a numeric scalar. Probability that we will take a rank step (as opposed to a standard stochastic gradient step) in a combined ranking or combined ROC loop.
passive_aggressive_c
a numeric scalar. Maximum size of any step taken in a single passive-aggressive update.
passive_aggressive_lambda
a numeric scalar. Lambda for pegasos-style projection for passive-aggressive update. When set to 0 (default) no projection is performed.
perceptron_margin_size
a numeric scalar. Width of the margin for perceptron with margins. The default of 1 is equivalent to unregularized SVM loss.
training_objective
logical. When TRUE, computes the value of the standard SVM objective function on training data, after training.
dimensionality
an integer. The largest feature index in the training data set, plus one.
hash_mask_bits
an integer. When set to a non-zero value, causes the use of a hashed weight vector with hashed cross-product features. This allows learning on conjunctions of features, at some increase in computational cost. Note that this flag must be set to the same value in both training and testing to function properly. The size of the hash table is 2^hash_mask_bits. The default value of 0 means that hashed cross products are not used.
verbose
logical.
no_bias_term
logical. When TRUE, causes the bias term x_0 to be set to 0 for every feature vector, rather than the default of x_0 = 1. Setting this flag is equivalent to forcing a decision threshold of exactly 0 to be used. The same setting of this flag should be used for training and testing. Note that this flag has no effect for rank and roc optimization. Default: FALSE. To set this flag using the formula interface, use ( Y ~ -1 + . )
reserve
an integer. Experimental; should the weight vector be explicitly reserved for the data?
buffer_mb
integer. Size of buffer to use in reading/writing to files, in MB.
...
items passed to methods.
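As a sketch of how the main tuning arguments combine, the call below fits a logistic-loss model with balanced sampling, which the loop_type documentation above suggests for minority-class problems (it assumes the irismod data set shipped with RSofia; the specific argument values are illustrative, not recommendations):

```r
library(RSofia)
data(irismod)

# Balanced stochastic sampling: each iteration draws one positive and one
# negative example uniformly at random, which can help when the positive
# class is a small minority of the data.
fit <- sofia(Is.Virginica ~ ., data = irismod,
             learner_type = "logreg-pegasos",
             loop_type    = "balanced-stochastic",
             lambda       = 0.1,   # SVM-style regularization strength
             iterations   = 1e5,   # number of stochastic gradient steps
             random_seed  = 42)    # fixed seed for reproducibility
```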

Value

sofia returns an object of class "sofia". An object of class "sofia" is a list containing at least the following components:
par
a list containing the parameters specified in training the model
weights
a numeric vector of the parameter weights (the model)
training_time
time used to fit the model (does not include I/O time)
If the method was called via the formula interface, it will additionally include:
formula
formula with the specification of the model
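As a sketch of how these components can be inspected after fitting (this assumes the irismod data set shipped with RSofia, as used in the Examples section):

```r
library(RSofia)
data(irismod)

fit <- sofia(Is.Virginica ~ ., data = irismod)

fit$par$lambda        # regularization parameter recorded from the call
head(fit$weights)     # first few entries of the learned weight vector
fit$training_time     # fitting time, excluding I/O
fit$formula           # present because the formula interface was used
```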

References

D. Sculley. Combined Regression and Ranking. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.

D. Sculley. Web-Scale K-Means Clustering. Proceedings of the 19th International Conference on World Wide Web, 2010.

D. Sculley. Large Scale Learning to Rank. NIPS Workshop on Advances in Ranking, 2009. Presents the indexed sampling methods used in learning to rank, including the rank and roc loops.

See Also

http://code.google.com/p/sofia-ml/

Examples

data(irismod)

model.logreg <- sofia(Is.Virginica ~ ., data=irismod, learner_type="logreg-pegasos")
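A possible follow-up to the example above would score new data with the fitted model. The call below assumes RSofia provides a predict method for "sofia" objects with a prediction_type argument; check the package's predict documentation before relying on this exact signature:

```r
# Hypothetical continuation: score the training data with the fitted model.
# "logistic" is assumed to map linear scores through the logistic function.
p <- predict(model.logreg, newdata = irismod, prediction_type = "logistic")

# Cross-tabulate predicted class (threshold at 0.5) against the truth.
table(observed = irismod$Is.Virginica, predicted = p > 0.5)
```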
