RKorAPClient (version 1.0.0)

collocationAnalysis,KorAPConnection-method: Collocation analysis

Description

Performs a collocation analysis for the given node (or query) in the given virtual corpus.

Usage

# S4 method for KorAPConnection
collocationAnalysis(
  kco,
  node,
  vc = "",
  lemmatizeNodeQuery = FALSE,
  minOccur = 5,
  leftContextSize = 5,
  rightContextSize = 5,
  topCollocatesLimit = 200,
  searchHitsSampleLimit = 20000,
  ignoreCollocateCase = FALSE,
  withinSpan = ifelse(exactFrequencies, "base/s=s", ""),
  exactFrequencies = TRUE,
  stopwords = append(RKorAPClient::synsemanticStopwords(), node),
  seed = 7,
  expand = length(vc) != length(node),
  maxRecurse = 0,
  addExamples = FALSE,
  thresholdScore = "logDice",
  threshold = 2,
  localStopwords = c(),
  collocateFilterRegex = "^[:alnum:]+-?[:alnum:]*$",
  ...
)

Value

Tibble with top collocates, association scores, corresponding URLs for web user interface queries, etc.

Arguments

kco

KorAPConnection() object (obtained e.g. from new("KorAPConnection"))

node

target word (or query) whose collocates are to be found

vc

string describing the virtual corpus in which the query should be performed. An empty string (default) means the whole corpus, as far as it is license-wise accessible.

lemmatizeNodeQuery

if TRUE, node query will be lemmatized, i.e. x -> [tt/l=x]

minOccur

minimum absolute number of observed co-occurrences to consider a collocate candidate

leftContextSize

size of the left context window

rightContextSize

size of the right context window

topCollocatesLimit

limit analysis to the n most frequent collocates in the search hits sample

searchHitsSampleLimit

limit the size of the search hits sample

ignoreCollocateCase

logical, set to TRUE if collocate case should be ignored

withinSpan

KorAP span specification (see https://korap.ids-mannheim.de/doc/ql/poliqarp-plus?embedded=true#spans) for collocations to be searched within. Defaults to base/s=s if exactFrequencies is TRUE, otherwise to an empty string.

exactFrequencies

if FALSE, extrapolate observed co-occurrence frequencies from frequencies in search hits sample, otherwise retrieve exact co-occurrence frequencies

stopwords

vector of stopwords not to be considered as collocates

seed

seed for random page collecting order

expand

if TRUE, node and vc parameters are expanded to all of their combinations. Defaults to TRUE when node and vc differ in length.

maxRecurse

apply collocation analysis recursively maxRecurse times

addExamples

If TRUE, examples for instances of collocations will be added in a column named example. This makes a difference in particular if node is given as a lemma query.

thresholdScore

association score function (see association-score-functions) to use for computing the threshold that is applied for recursive collocation analysis calls

threshold

minimum value of thresholdScore function call to apply collocation analysis recursively

localStopwords

vector of stopwords that will not be considered as collocates in the current function call, but that will not be passed to recursive calls

collocateFilterRegex

allow only collocates matching the regular expression

...

more arguments will be passed to collocationScoreQuery()
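
The interplay of node, vc and expand described above can be pictured with a small base R sketch. It only illustrates what "all combinations" means and how the default for expand is derived in the Usage section; it is not the package's internal code, and the use of expand.grid() and the variable name queries are assumptions made for the illustration.

```r
# Hypothetical inputs, taken from the first example on this page.
node <- "Packung"
vc <- c("textClass=sport", "textClass!=sport")

# Mirrors the default from the Usage section: expand when the lengths differ.
expand <- length(vc) != length(node)

# With expand = TRUE every node is paired with every virtual corpus;
# otherwise node and vc are paired element by element.
queries <- if (expand) {
  expand.grid(node = node, vc = vc, stringsAsFactors = FALSE)
} else {
  data.frame(node = node, vc = vc, stringsAsFactors = FALSE)
}

print(queries)
```

Here one node and two virtual corpora yield two node/vc pairs, one per virtual corpus.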

Details

The collocation analysis is currently implemented on the client side, as some of the functionality is not yet provided by the KorAP backend. Mainly for this reason it is very slow (several minutes, up to hours), but on the other hand very flexible. You can, for example, perform the analysis in arbitrary virtual corpora, use complex node queries, and look for expression-internal collocates using the focus function (see examples and demo).

To increase speed at the cost of accuracy (including possible false negatives), you can decrease searchHitsSampleLimit and/or topCollocatesLimit and/or set exactFrequencies to FALSE.
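
A speed-oriented call might look like the following sketch. The parameter names and their defaults come from the Usage section; the reduced values and the list name fast_args are purely illustrative assumptions, not recommendations. The call itself is guarded with if (FALSE), like the examples below, because it needs the package and network access to a KorAP server.

```r
# Settings that trade accuracy for speed; the concrete numbers are illustrative only.
fast_args <- list(
  searchHitsSampleLimit = 2000,  # default: 20000
  topCollocatesLimit = 50,       # default: 200
  exactFrequencies = FALSE       # default: TRUE
)

if (FALSE) {  # not run: needs RKorAPClient and access to a KorAP server
  library(RKorAPClient)
  kco <- new("KorAPConnection", verbose = TRUE)
  # Pass the connection and the node first, then the speed-related settings.
  result <- do.call(collocationAnalysis, c(list(kco, "Packung"), fast_args))
}
```

Keeping the settings in a list makes it easy to rerun the same query later with the defaults and compare the results.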

Note that some outdated non-DeReKo back-ends might not yet support returning tokenized matches (warning issued). In this case, the client library will fall back to client-side tokenization which might be slightly less accurate. This might lead to false negatives and to frequencies that differ from corresponding ones acquired via the web user interface.

See Also

Other collocation analysis functions: association-score-functions, collocationScoreQuery,KorAPConnection-method, synsemanticStopwords()

Examples

if (FALSE) {

 # Attach the package (the examples also rely on the %>% pipe).
 library(RKorAPClient)

 # Find top collocates of "Packung" inside and outside the sports domain.
 new("KorAPConnection", verbose = TRUE) %>%
   collocationAnalysis("Packung", vc=c("textClass=sport", "textClass!=sport"),
                       leftContextSize=1, rightContextSize=1, topCollocatesLimit=20) %>%
   dplyr::filter(logDice >= 5)
}

if (FALSE) {

# Identify the most prominent light verb construction with "in ... setzen".
# Note that, currently, the use of the focus function disallows exactFrequencies.
new("KorAPConnection", verbose = TRUE) %>%
  collocationAnalysis("focus(in [tt/p=NN] {[tt/l=setzen]})",
    leftContextSize=1, rightContextSize=0, exactFrequencies=FALSE, topCollocatesLimit=20)
}
