corp_coco: Co-occurrence comparison

Description

Identifies statistically significant differences in co-occurrence counts between two sets of co-occurrence data.

Usage

corp_coco(A, B, nodes, collocates = NULL, fdr = 0.01)

# Deprecated
coco(A, B, nodes, fdr = 0.01, collocates = NULL)

Value

A data.table of the form


    Classes ‘data.table’ and 'data.frame': 11 variables:
     $ x           : chr
     $ y           : chr
     $ H_A         : int
     $ M_A         : int
     $ H_B         : int
     $ M_B         : int
     $ effect_size : num
     $ CI_lower    : num
     $ CI_upper    : num
     $ p_value     : num
     $ p_adjusted  : num
     - attr(*, "sorted")= chr  "x" "y"
     - attr(*, ".internal.selfref")=<externalptr> 
     - attr(*, "coco_metadata")=List of 5
      ..$ nodes      : chr
      ..$ collocates : chr
      ..$ fdr        : num
      ..$ PACKAGE_VERSION:Classes 'package_version', 'numeric_version'
      .. ..$ : int
      ..$ date  : Date, format: "2016-11-01"

Arguments

A: A corp_cooccurrence object. For the deprecated coco function this is a data.frame of co-occurrence counts as returned by corp_get_counts.
B: A corp_cooccurrence object. For the deprecated coco function this is a data.frame of co-occurrence counts as returned by corp_get_counts.
nodes: A character vector of node types, or a character string representing a single node type.
collocates: A character vector of collocate types, or a character string representing a single collocate type. The collocates act as a filter on the y column of the returned data structure. Use collocates to target the testing: reducing the number of tests reduces the loss of power caused by the multiple-test correction.
fdr: The desired level at which to control the False Discovery Rate. Default value is 0.01.
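
The effect of restricting collocates on the multiple-test correction can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: the data frame `tests`, its words and its p-values are invented for illustration, and it assumes the correction behaves like `p.adjust` with `method = "BH"`. In corp_coco the filter is applied before testing; here it is mimicked by subsetting precomputed p-values.

```r
# Hypothetical unadjusted p-values for six collocates of one node
# (all values are made up for illustration).
tests <- data.frame(
  y       = c("hand", "eye", "door", "the", "of", "and"),
  p_value = c(0.002, 0.004, 0.03, 0.5, 0.7, 0.9),
  stringsAsFactors = FALSE
)
fdr <- 0.01

# Correcting over all six tests.
all_adj <- p.adjust(tests$p_value, method = "BH")

# Correcting over only the collocates of interest (assumed filter).
wanted  <- c("hand", "eye", "door")
subset_tests <- tests[tests$y %in% wanted, ]
sub_adj <- p.adjust(subset_tests$p_value, method = "BH")

# Number of tests that pass the FDR threshold in each case.
n_all <- sum(all_adj <= fdr)
n_sub <- sum(sub_adj <= fdr)
```

With these invented values, no test passes the threshold when all six are corrected together, whereas two pass when the correction covers only the three targeted collocates.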

Details

The corp_coco function implements the method introduced in Wiegand, Hennessey et al. (2017a), which is described in more detail from a linguistic perspective in Wiegand (2019).

Because the method carries out a large number of tests, fdr sets the level at which the False Discovery Rate is controlled. For a description of the form of FDR used, see Benjamini and Hochberg (1995). For a description of the p_adjusted column in the returned structure, see p.adjust.

The returned data structure is a data.table. A data.table is also a data.frame and will behave exactly as one if the data.table package is not loaded.

The returned data.table contains details of all the co-occurrences for which there is evidence of a difference in co-occurrence between the two supplied data sets. The effect size is calculated as the log base 2 of the odds ratio. The effect size and its confidence interval are captured in the effect_size, CI_lower and CI_upper columns. The p_value column contains the unadjusted p-value from Fisher's exact test.
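
The quantities in a single row of the result can be approximated with base R. This is a minimal sketch under stated assumptions, not the package's own code: the counts are invented; H and M are read as the number of times the collocate does and does not co-occur with the node in each data set; the odds ratio is taken from `fisher.test` (its conditional maximum likelihood estimate); the confidence level is set to 0.95; and the odds are compared as A relative to B. The package may differ on any of these points.

```r
# Invented counts for one node/collocate pair.
H_A <- 30L; M_A <- 970L   # co-occurrences and non-co-occurrences in A (assumed meaning)
H_B <- 10L; M_B <- 990L   # the same for B

# 2x2 contingency table: rows are outcomes, columns are data sets.
counts <- matrix(
  c(H_A, M_A, H_B, M_B),
  nrow = 2,
  dimnames = list(outcome = c("hit", "miss"), corpus = c("A", "B"))
)

# Fisher's exact test gives the unadjusted p-value, an odds ratio
# estimate and a confidence interval for that odds ratio.
ft <- fisher.test(counts, conf.level = 0.95)

# Effect size and interval on the log base 2 scale.
effect_size <- log2(unname(ft$estimate))
ci          <- log2(as.numeric(ft$conf.int))
p_value     <- ft$p.value
```

On the log base 2 scale an effect size of 1 corresponds to a doubling of the odds, and a value of 0 corresponds to no difference.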

References

* Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289–300.
* Wiegand, V., Hennessey, A., Tench, C. R., & Mahlberg, M. (2017a, May 24). Comparing co-occurrences between corpora. 38th ICAME conference, Charles University, Prague.
* Wiegand, V. (2019). A Corpus Linguistic Approach to Meaning-Making Patterns in Surveillance Discourse [PhD, University of Birmingham]. https://etheses.bham.ac.uk/id/eprint/9778