SurfaceColloc

This data set demonstrates how co-occurrence and marginal frequencies can be provided for collocation analysis with <code>am.score</code>.
It contains surface co-occurrence counts for 7 English nouns as nodes and 7 selected collocates. The counts are based on a collocational span of two tokens to the left and right of the node (L2/R2) in the WP500 corpus.
Marginal frequencies for the nodes are the overall corpus frequencies of the nouns, so the expected co-occurrence frequency must be adjusted for the total span size of 4 tokens.
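
The span-size adjustment can be illustrated with a small base R sketch. It assumes the usual approximation for surface co-occurrence, in which the expected frequency is the span size times the product of the two marginal frequencies divided by the sample size. The list <code>toy</code> only mimics the layout of <code>SurfaceColloc</code>: its words and counts are invented for illustration and are not taken from the data set. The score shown is plain pointwise mutual information computed by hand, not necessarily identical to what <code>am.score</code> returns.

```r
# Toy stand-in with the same layout as SurfaceColloc.
# All words and numbers below are invented for illustration.
toy <- list(
  cooc = data.frame(
    w1 = c("time", "time", "day"),     # node words
    w2 = c("long", "first", "first"),  # collocates
    f  = c(40L, 25L, 10L),             # observed co-occurrence frequencies
    stringsAsFactors = FALSE
  ),
  f1 = c(time = 2000L, day = 1000L),   # marginal frequencies of the nodes
  f2 = c(long = 500L, first = 4000L),  # marginal frequencies of the collocates
  N  = 1000000L                        # sample size in tokens
)

span.size <- 4  # L2/R2 span: 2 tokens to the left + 2 to the right

# Look up the marginal frequencies by name for every co-occurrence pair
f1 <- toy$f1[toy$cooc$w1]
f2 <- toy$f2[toy$cooc$w2]

# Expected co-occurrence frequency, scaled by the total span size
# (as.numeric avoids integer overflow for large frequencies)
E <- span.size * as.numeric(f1) * as.numeric(f2) / toy$N

# Pointwise mutual information: log2 of observed over expected frequency
result <- data.frame(toy$cooc, E = E, MI = log2(toy$cooc$f / E))
print(result)
```

With the real data set, the same name-based lookup should work because <code>f1</code> and <code>f2</code> are labelled vectors. In practice the adjustment would normally be left to <code>am.score</code>, which is understood to accept a <code>span.size</code> argument for this purpose; check the function's help page before relying on that.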

datasets

Utility functions for the statistical analysis of corpus frequency data.
This package is a companion to the open-source course "Statistical Inference:
A Gentle Introduction for Computational Linguists and Similar Creatures" ('SIGIL').

Stephanie Evert

corpora

Statistics and Data Sets for Corpus Frequency Data

SurfaceColloc data set

A list with the following components:
<dl>
<dt><code>cooc</code>:</dt>
<dd>A data frame with 34 rows and the following columns:<ul>
<li><code>w1</code>: node word (noun)</li>
<li><code>w2</code>: collocate</li>
<li><code>f</code>: co-occurrence frequency within L2/R2 span</li>
</ul></dd>
<dt><code>f1</code>:</dt>
<dd>Labelled integer vector of length 7 specifying the marginal frequencies of the node nouns.</dd>
<dt><code>f2</code>:</dt>
<dd>Labelled integer vector of length 7 specifying the marginal frequencies of the collocates.</dd>
<dt><code>N</code>:</dt>
<dd>Sample size, i.e. the total number of tokens in the WP500 corpus.</dd>
</dl>

Format

Stephanie Evert (<a href="https://purl.org/stephanie.evert">https://purl.org/stephanie.evert</a>)

Author


A small data set of surface collocations from the English Wikipedia — SurfaceColloc



