This function conducts a stable lexical marker analysis.
slma(
x,
y,
file_encoding = "UTF-8",
sig_cutoff = qchisq(0.95, df = 1),
small_pos = 1e-05,
keep_intermediate = FALSE,
verbose = TRUE,
min_rank = 1,
max_rank = 5000,
keeplist = NULL,
stoplist = NULL,
ngram_size = NULL,
max_skip = 0,
ngram_sep = "_",
ngram_n_open = 0,
ngram_open = "[]",
...
)

Value

An object of class slma, which is a named list with at least the following
elements:

- A scores dataframe with information about the stability of the chosen
  lexical items. (See below.)
- An intermediate list with a register of intermediate values if
  keep_intermediate was TRUE.
- Named items registering the values of the arguments with the same name,
  namely sig_cutoff, small_pos, x, and y.
The slma object has as.data.frame() and print() methods,
as well as an ad-hoc details() method. Note that the print
method simply prints the main scores dataframe.
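For instance (a minimal sketch, assuming the a_corp and b_corp objects
built in the Examples section below):

res <- slma(a_corp, b_corp)
print(res)                # simply prints the main scores dataframe
df <- as.data.frame(res)  # scores as a plain data frame for further processing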
scores element
The scores element is a dataframe in which the rows are the linguistic items
for which the stable lexical marker analysis was conducted, and the columns are
different 'stability measures' and related statistics. By default, the
linguistic items are sorted by decreasing 'stability' according to the S_lor
measure.
| Column | Name | Computation | Range of values |
|--------|------|-------------|-----------------|
| S_abs | Absolute stability | S_att - S_rep | \(-(n*m)\) -- \(n*m\) |
| S_nrm | Normalized stability | S_abs / \(n*m\) | -1 -- 1 |
| S_att | Stability of attraction | Number of \((a,b)\) couples in which the linguistic item is a keyword for the A-documents | 0 -- \(n*m\) |
| S_rep | Stability of repulsion | Number of \((a,b)\) couples in which the linguistic item is a keyword for the B-documents | 0 -- \(n*m\) |
| S_lor | Log of odds ratio stability | Mean of log_OR across all \((a,b)\) couples, with the value set to 0 when p_G is larger than sig_cutoff | (see below) |
More precisely, S_lor is computed as a fraction whose numerator is the sum of
the log_OR values across all \((a,b)\) couples for which p_G is lower than
sig_cutoff, and whose denominator is \(n*m\).
For more on log_OR, see the Value section of assoc_scores(). The final
three columns of the output are meant as a tool in support of the interpretation
of the S_lor column. Considering all \((a,b)\) couples for which
p_G is smaller than sig_cutoff, lor_min, lor_max and lor_sd
are, for each item, the minimum, the maximum and the standard deviation of
log_OR across those couples.
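As a sketch of how these columns might be inspected (assuming the slma_ex
object built in the Examples section below):

sc <- slma_ex$scores
# items are already sorted by decreasing S_lor; look at the top candidates
head(sc[, c("S_abs", "S_nrm", "S_att", "S_rep",
            "S_lor", "lor_min", "lor_max", "lor_sd")])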
Arguments

x, y

Character vector or fnames object with filenames for the two
sets of documents.
file_encoding

Encoding of all the files to read.
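For example, if the corpus files happened to be stored in Latin-1 rather than
UTF-8 (a purely hypothetical scenario, reusing a_corp and b_corp from the
Examples section below):

res_latin <- slma(a_corp, b_corp, file_encoding = "latin1")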
sig_cutoff

Numeric value indicating the cutoff value for 'significance'
in the stable lexical marker analysis. The default value is
qchisq(0.95, df = 1), which is about 3.84.
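A stricter threshold can be passed explicitly; for instance, using the 99%
quantile of the same chi-squared distribution (a sketch):

res_strict <- slma(a_corp, b_corp, sig_cutoff = qchisq(0.99, df = 1))  # approx. 6.63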
small_pos

Alternative (but sometimes inferior) approach to dealing with
zero frequencies, compared to haldane. The argument small_pos
only applies when haldane is set to FALSE.
(See the Details section.)
If haldane is FALSE, and there is at least one zero frequency
in a contingency table, adding small positive values to the zero frequency
cells is done systematically for all measures calculated for that table,
not just for the measures that require it.
keep_intermediate

Logical. If TRUE, results from intermediate
calculations are kept in the output as the "intermediate" element. This is
necessary if you want to inspect the object with the details() method.
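A sketch (the item passed to details() is purely illustrative; use a type
that actually occurs in the data):

res <- slma(a_corp, b_corp, keep_intermediate = TRUE)
details(res, "government")  # inspect intermediate results for one candidate marker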
verbose

Logical. Whether progress should be printed to the console during the analysis.
min_rank, max_rank

Minimum and maximum frequency rank in the first
corpus (x) of the items to take into consideration as candidate stable
markers. Only tokens or token n-grams with a frequency rank greater than or
equal to min_rank and lower than or equal to max_rank will be included.
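For instance, to consider only the 200 highest-frequency items of x as
candidate markers (a sketch):

res_top <- slma(a_corp, b_corp, min_rank = 1, max_rank = 200)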
keeplist

List of types that must certainly be included in the list of
candidate markers, regardless of their frequency rank and of stoplist.
stoplist

List of types that must not be included in the list of candidate
markers. However, if a type is listed in keeplist, its presence in
stoplist is disregarded.
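A sketch with illustrative word lists:

res_filt <- slma(a_corp, b_corp,
                 keeplist = c("people", "nation"),   # always kept as candidates
                 stoplist = c("the", "of", "and"))   # excluded unless in keeplist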
ngram_size

Argument in support of ngrams/skipgrams (see also max_skip).
If one wants to identify individual tokens, the value of ngram_size
should be NULL or 1. If one wants to retrieve
token ngrams/skipgrams, ngram_size should be an integer indicating
the size of the ngrams/skipgrams, e.g. 2 for bigrams, 3 for
trigrams, etc.
max_skip

Argument in support of skipgrams. This argument is ignored if
ngram_size is NULL or 1.
If ngram_size is 2 or higher and max_skip
is 0, then regular ngrams are retrieved (although they
may contain open slots; see ngram_n_open).
If ngram_size is 2 or higher and max_skip
is 1 or higher, then skipgrams are retrieved (which in the
current implementation cannot contain open slots; see ngram_n_open).
For instance, if ngram_size is 3 and max_skip is
2, then 2-skip trigrams are retrieved;
if ngram_size is 5 and max_skip is
3, then 3-skip 5-grams are retrieved.
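For example, a call retrieving 2-skip trigrams as candidate markers (a sketch):

res_skip <- slma(a_corp, b_corp, ngram_size = 3, max_skip = 2)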
ngram_sep

Character vector of length 1 containing the string that is used to
separate/link tokens in the representation of ngrams/skipgrams
in the output of this function.
ngram_n_open

If ngram_size is 2 or higher, and moreover
ngram_n_open is a number higher than 0, then
ngrams with 'open slots' are retrieved. These
ngrams with 'open slots' are generalizations of fully lexically specific
ngrams, with the generalization being that one or more of the items
in the ngram are replaced by a notation that stands for 'any arbitrary token'.
For instance, if ngram_size is 4 and ngram_n_open is
1, and if moreover the input contains a
4-gram "it_is_widely_accepted", then the output will contain
all modifications of "it_is_widely_accepted" in which one (since
ngram_n_open is 1) of the items in this n-gram is
replaced by an open slot. The first and the last item inside
an ngram are never turned into an open slot; only the items in between
are candidates for being turned into open slots. Therefore, in this
example, the output will contain "it_[]_widely_accepted" and
"it_is_[]_accepted".
As a second example, if ngram_size is 5 and
ngram_n_open is 2, and if moreover the input contains a
5-gram "it_is_widely_accepted_that", then the output will contain
"it_[]_[]_accepted_that", "it_[]_widely_[]_that", and
"it_is_[]_[]_that".
ngram_open

Character string used to represent open slots in ngrams in the output
of this function.
...

Additional arguments.
Details

A stable lexical marker analysis of the A-documents in x versus the B-documents
in y starts from a separate keyword analysis for all possible document couples
\((a,b)\), with a an A-document and b a B-document. If there are n
A-documents and m B-documents, then \(n*m\) keyword analyses are
conducted. The 'stability' of a linguistic item x, as a marker for the
collection of A-documents (when compared to the B-documents), corresponds
to the frequency and consistency with which x is found to be a keyword for
the A-documents across all these keyword analyses.
In any specific keyword analysis, x is considered a keyword for an A-document
if G_signed is positive and moreover p_G is less than sig_cutoff
(see assoc_scores() for more information on these measures). Likewise, item x is
considered a keyword for a B-document if G_signed is negative and
p_G is less than sig_cutoff.
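This criterion can be expressed in terms of the G_signed and p_G columns
returned by assoc_scores() (a conceptual sketch, not the internal
implementation):

# for one (a,b) couple, given an assoc_scores result `scores`:
keyword_for_a <- scores$G_signed > 0 & scores$p_G < sig_cutoff
keyword_for_b <- scores$G_signed < 0 & scores$p_G < sig_cutoff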
Examples

a_corp <- get_fnames(system.file("extdata", "cleveland", package = "mclm"))
b_corp <- get_fnames(system.file("extdata", "roosevelt", package = "mclm"))
slma_ex <- slma(a_corp, b_corp)