RSofia (version 1.1)

sofia: Fitting sofia-ml models

Description

sofia is used to fit classification and regression models provided by D. Sculley's sofia-ml.

Usage

sofia(x, ...)

# S3 method for formula
sofia(x, data, random_seed = floor(runif(1, 1, 65535)), lambda = 0.1,
  iterations = 1e+05,
  learner_type = c("pegasos", "sgd-svm", "passive-aggressive",
    "margin-perceptron", "romma", "logreg-pegasos"),
  eta_type = c("pegasos", "basic", "constant"),
  loop_type = c("stochastic", "balanced-stochastic", "rank", "roc",
    "query-norm-rank", "combined-ranking", "combined-roc"),
  rank_step_probability = 0.5, passive_aggressive_c = 1e+07,
  passive_aggressive_lambda = 0, perceptron_margin_size = 1,
  training_objective = FALSE, hash_mask_bits = 0, verbose = FALSE,
  reserve = 0, ...)

# S3 method for character
sofia(x, random_seed = floor(runif(1, 1, 65535)), lambda = 0.1,
  iterations = 1e+05,
  learner_type = c("pegasos", "sgd-svm", "passive-aggressive",
    "margin-perceptron", "romma", "logreg-pegasos"),
  eta_type = c("pegasos", "basic", "constant"),
  loop_type = c("stochastic", "balanced-stochastic", "rank", "roc",
    "query-norm-rank", "combined-ranking", "combined-roc"),
  rank_step_probability = 0.5, passive_aggressive_c = 1e+07,
  passive_aggressive_lambda = 0, perceptron_margin_size = 1,
  training_objective = FALSE, no_bias_term = FALSE,
  dimensionality = 150000, hash_mask_bits = 0, verbose = FALSE,
  buffer_mb = 40, ...)

Arguments

x
a formula object, or a character string giving the path to a training data file
data
a data frame containing the variables in the formula, when the model is specified via a formula
random_seed
an integer. Seed for the random number generator. Can be useful in testing and parameter tuning.
lambda
a numeric scalar. Value of lambda for SVM regularization, used by both Pegasos SVM and SGD-SVM.
iterations
an integer. Number of stochastic gradient steps to take.
learner_type
a character string indicating which type of learner to use. One of "pegasos" (default), "sgd-svm", "passive-aggressive", "margin-perceptron", "romma", "logreg-pegasos"
eta_type
a character string indicating the type of update for learning rate to use. One of "pegasos" (default), "basic", "constant"
loop_type
a character string indicating the type of sampling loop to use for training. One of

"stochastic" - Perform normal stochastic sampling for stochastic gradient descent, for training binary classifiers. On each iteration, pick a new example uniformly at random from the data set.

"balanced-stochastic" - Perform a balanced sampling from positives and negatives in data set. For each iteration, samples one positive example uniformly at random from the set of all positives, and samples one negative example uniformly at random from the set of all negatives. This can be useful for training binary classifiers with a minority-class distribution.

"rank" - Perform indexed sampling of candidate pairs for pairwise learning to rank. Useful when there are examples from several different qid groups.

"roc" - Perform indexed sampling to optimize ROC Area.

"query-norm-rank" - Perform sampling of candidate pairs, giving equal weight to each qid group regardless of its size. Currently this is implemented with rejection sampling rather than indexed sampling, so this may run more slowly.

"combined-ranking" - Performs CRR algorithm for combined regression and ranking. Alternates between pairwise rank-based steps and standard stochastic gradient steps on single examples. Relies on "rank_step_probability" to balance between these two kinds of updates.

"combined-roc" - Performs CRR algorithm for combined regression and ROC area optimization. Alternates between pairwise roc-optimization-based steps and standard stochastic gradient steps on single examples. Relies on "rank_step_probability" to balance between these two kinds of updates. This can be faster than the combined-ranking option when there are exactly two classes.

rank_step_probability
a numeric scalar. Probability that we will take a rank step (as opposed to a standard stochastic gradient step) in a combined ranking or combined ROC loop.
passive_aggressive_c
a numeric scalar. Maximum size of any step taken in a single passive-aggressive update.
passive_aggressive_lambda
a numeric scalar. Lambda for pegasos-style projection for passive-aggressive update. When set to 0 (default) no projection is performed.
perceptron_margin_size
a numeric scalar. Width of the margin for perceptron with margins. The default of 1 is equivalent to unregularized SVM loss.
training_objective
logical. When TRUE, computes the value of the standard SVM objective function on training data, after training.
dimensionality
an integer. The largest feature index in the training data set, plus one.
hash_mask_bits
an integer. When set to a non-zero value, causes the use of a hashed weight vector with hashed cross-product features. This allows learning on conjunctions of features, at some increase in computational cost. Note that this flag must be set to the same value in both training and testing to function properly. The size of the hash table is 2^hash_mask_bits. The default value of 0 means that hashed cross products are not used.
verbose
logical.
no_bias_term
logical. When TRUE, causes the bias term x_0 to be set to 0 for every feature vector, rather than the default of x_0 = 1. Setting this flag is equivalent to forcing a decision threshold of exactly 0 to be used. The same setting of this flag should be used for training and testing. Note that this flag has no effect for rank and roc optimization. Default: FALSE. To set this flag using the formula interface, use ( Y ~ -1 + . )
reserve
an integer. Experimental; should the weight vector be explicitly reserved for the data?
buffer_mb
integer. Size of buffer to use in reading/writing to files, in MB.
...
items passed to methods.
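As a sketch of how the main tuning arguments combine, the call below fits a logistic-loss model with balanced sampling, which the loop_type documentation above suggests for minority-class problems (it assumes the irismod data set shipped with RSofia; the specific argument values are illustrative, not recommendations):

```r
library(RSofia)
data(irismod)

# Balanced stochastic sampling: each iteration draws one positive and one
# negative example uniformly at random, which can help when the positive
# class is a small minority of the data.
fit <- sofia(Is.Virginica ~ ., data = irismod,
             learner_type = "logreg-pegasos",
             loop_type    = "balanced-stochastic",
             lambda       = 0.1,   # SVM-style regularization strength
             iterations   = 1e5,   # number of stochastic gradient steps
             random_seed  = 42)    # fixed seed for reproducibility
```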

Value

sofia returns an object of class "sofia". An object of class "sofia" is a list containing at least the following components:
par
a list containing the parameters specified in training the model
weights
a numeric vector of the parameter weights (the model)
training_time
time used to fit the model (does not include I/O time)
If the method was called via the formula interface, it will additionally include:
formula
formula with the specification of the model
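As a sketch of how these components can be inspected after fitting (this assumes the irismod data set shipped with RSofia, as used in the Examples section):

```r
library(RSofia)
data(irismod)

fit <- sofia(Is.Virginica ~ ., data = irismod)

fit$par$lambda        # regularization parameter recorded from the call
head(fit$weights)     # first few entries of the learned weight vector
fit$training_time     # fitting time, excluding I/O
fit$formula           # present because the formula interface was used
```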

References

D. Sculley. Combined Regression and Ranking. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.

D. Sculley. Web-Scale K-Means Clustering. Proceedings of the 19th International Conference on World Wide Web, 2010.

D. Sculley. Large Scale Learning to Rank. NIPS Workshop on Advances in Ranking, 2009. Presents the indexed sampling methods used in learning to rank, including the rank and roc loops.

See Also

http://code.google.com/p/sofia-ml/

Examples

data(irismod)

model.logreg <- sofia(Is.Virginica ~ ., data=irismod, learner_type="logreg-pegasos")
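A possible follow-up to the example above would score new data with the fitted model. The call below assumes RSofia provides a predict method for "sofia" objects with a prediction_type argument; check the package's predict documentation before relying on this exact signature:

```r
# Hypothetical continuation: score the training data with the fitted model.
# "logistic" is assumed to map linear scores through the logistic function.
p <- predict(model.logreg, newdata = irismod, prediction_type = "logistic")

# Cross-tabulate predicted class (threshold at 0.5) against the truth.
table(observed = irismod$Is.Virginica, predicted = p > 0.5)
```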
