A function to optimize hyperparameters used in the General Imposters method
(see imposters for further details). Using a grid search approach,
it tries to define a grey area where the attribution scores are not reliable.
imposters.optimize(reference.set,
classes.reference.set = NULL,
parameter.incr = 0.01,
...)
a table containing frequencies/counts for several
variables (e.g. the most frequent words) across a number of texts
written by different authors. Usually, it is a corpus of known authors
(at least two texts per author) that is used to tune the
hyperparameters of the imposters method. The tuning involves
a leave-one-out procedure that identifies a grey area in which the
results returned by the classifier are not particularly reliable.
E.g., if one gets 0.39 and 0.55 as the parameters, one should assume
that any results of the imposters() function that lie within
this range are unreliable.
Make sure that the rows contain samples and the columns contain
variables (words, n-grams, or whatever needs to be analyzed).
a vector containing class identifiers for the reference set. When missing, the row names of the set table will be used; the assumed classes are the strings of characters preceding the first underscore. Consider the following example: c("Sterne_Tristram", "Sterne_Sentimental", "Fielding_Tom", ...), where the classes are the authors' names. Note that only the part up to the first underscore in the sample's name will be included in the class label.
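The naming convention described above can be illustrated with a short base R sketch. It is not the package's internal code, only one way to derive such labels; the regular expression is an assumption, and the sample names are the ones quoted in the example above.

```r
# Sample names following the Author_Title convention quoted above
sample.names <- c("Sterne_Tristram", "Sterne_Sentimental", "Fielding_Tom")

# Drop everything from the first underscore onwards; names without
# an underscore would be left unchanged (assumed behaviour of this sketch)
classes <- sub("_.*", "", sample.names)

print(classes)
# [1] "Sterne"   "Sterne"   "Fielding"
```

If the row names of a table do not follow this convention, passing an explicit vector via classes.reference.set avoids relying on the underscore rule.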
the procedure tries to optimize the hyperparameters via a grid search, which means that it tests the range of values between 0 and 1, incremented by a certain fraction. If parameter.incr is set to 0.01 (the default), it tests 0, 0.01, 0.02, 0.03, ...
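To give a sense of the grid size, the candidate values for the default increment can be generated in base R. This is only a sketch of the grid itself; how the function combines these values internally is not shown here, and the variable name grid is hypothetical.

```r
# Default increment, as in the usage above
parameter.incr <- 0.01

# Candidate values between 0 and 1 (both ends included)
grid <- seq(0, 1, by = parameter.incr)

length(grid)   # number of candidate values
head(grid, 4)  # first few values: 0, 0.01, 0.02, 0.03
```

A smaller increment yields a finer grid at the cost of more evaluations.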
any other argument that can be passed to the classifier; see perform.delta
for the parameters to be tweaked. In the current
version of the function, only the distance measure used for computing
similarities between texts can be set. Available options so far: "delta"
(Burrows's Delta, default), "argamon" (Argamon's Linear Delta),
"eder" (Eder's Delta), "simple" (Eder's Simple Distance),
"canberra" (Canberra Distance), "manhattan" (Manhattan
Distance), "euclidean" (Euclidean Distance), "cosine"
(Cosine Distance), "wurzburg" (Cosine Delta), "minmax"
(Minmax Distance, also known as the Ruzicka measure).
The function returns two scores: the P1 and P2 values, i.e. the lower and upper bounds of the grey area.
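As a rough illustration of how the two values might be used, the sketch below flags attribution scores that fall inside the grey area. The helper in.grey.area() is hypothetical and not part of the package, the scores are made up, and treating both bounds as inclusive is an assumption; 0.39 and 0.55 are the example values mentioned above.

```r
# Example bounds from the text above; in practice they would be
# taken from the output of imposters.optimize()
p1 <- 0.39
p2 <- 0.55

# Hypothetical helper: TRUE when a score lies within [p1, p2]
# (inclusive bounds are an assumption of this sketch)
in.grey.area <- function(score, p1, p2) {
  score >= p1 & score <= p2
}

# Made-up attribution scores, for illustration only
scores <- c(0.12, 0.39, 0.47, 0.80)

in.grey.area(scores, p1, p2)
# [1] FALSE  TRUE  TRUE FALSE
```

Scores flagged TRUE would be treated as unreliable; how scores outside the range are interpreted is left to the imposters documentation.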
Koppel, M. and Winter, Y. (2014). Determining if two documents are written by the same author. "Journal of the Association for Information Science and Technology", 65(1): 178-187.
Kestemont, M., Stover, J., Koppel, M., Karsdorp, F. and Daelemans, W. (2016). Authenticating the writings of Julius Caesar. "Expert Systems with Applications", 63: 86-96.
# NOT RUN {
# loading the stylo package, which provides imposters.optimize() and the dataset
library(stylo)
# activating a dummy dataset, in our case: Harper Lee and her Southern colleagues
data(lee)
# tuning the P1 and P2 hyperparameters on the reference corpus
imposters.optimize(lee)
# }