SurfaceColloc: A small data set of surface collocations from the English Wikipedia
Description
This data set demonstrates how co-occurrence and marginal frequencies can be provided for collocation analysis with am.score.
It contains surface co-occurrence counts for 7 English nouns as nodes and 7 selected collocates. The counts are based on a collocational span of two tokens to the left and right of the node (L2/R2) in the WP500 corpus.
Marginal frequencies for the nodes are the overall corpus frequencies of the nouns, so the expected co-occurrence frequency needs to be adjusted for the total span size of 4 tokens (2 to the left plus 2 to the right of the node).
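As a sketch of this adjustment, the usual expected frequency for surface co-occurrence multiplies the independence estimate by the span size. The symbols E (expected frequency) and k (span size) are notation assumed here for illustration; f_1, f_2 and N correspond to the components described under Format. Consult the am.score documentation for the exact form it applies.

```latex
E = \frac{k \cdot f_1 \cdot f_2}{N}, \qquad k = 4 \quad \text{(L2/R2 span)}
```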
Usage
SurfaceColloc
Format
A list with the following components:
cooc:
A data frame with 34 rows and the following columns:
w1: node word (noun)
w2: collocate
f: co-occurrence frequency within L2/R2 span
f1:
Labelled integer vector of length 7 specifying the marginal frequencies of the node nouns.
f2:
Labelled integer vector of length 7 specifying the marginal frequencies of the collocates.
N:
Sample size, i.e. the total number of tokens in the WP500 corpus.
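The components above can be combined into a single table, for instance as in the following minimal sketch. It assumes the corpora package is installed and that the labels of f1 and f2 match the words in w1 and w2; the names colloc, span_size and E are introduced here for illustration only and are not part of the data set.

```r
library(corpora)  # provides the SurfaceColloc data set

# Start from the co-occurrence table (columns w1, w2, f).
colloc <- SurfaceColloc$cooc

# Look up marginal frequencies by label. as.character() guards against
# w1/w2 being stored as factors, which would index by integer code instead.
colloc$f1 <- SurfaceColloc$f1[as.character(colloc$w1)]
colloc$f2 <- SurfaceColloc$f2[as.character(colloc$w2)]

# Expected co-occurrence frequency, adjusted for the L2/R2 span
# (assumed form: span size times the independence estimate).
# as.numeric() avoids integer overflow in the product f1 * f2.
span_size <- 4
colloc$E <- span_size * as.numeric(colloc$f1) * as.numeric(colloc$f2) / SurfaceColloc$N

head(colloc)
```

The resulting columns f, f1, f2 together with N are the quantities the Description names as input for am.score; E is shown only to make the span adjustment explicit.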