corp_cooccurrence: Calculate Co-occurrence Counts

Description

Calculates surface co-occurrence counts for a text. For each node and collocate pair, the maximum possible number of co-occurrences is also calculated.

Usage

corp_surface(text, span, nodes = NULL, collocates = NULL)
is.corp_cooccurrence(obj)
is.corp_surface(obj)
# deprecated
surface(x, span, nodes = NULL, collocates = NULL)

Value

corp_surface

Returns a corp_surface object.

The corp_surface object can be interrogated using the corp_get_* accessor functions.

The corp_surface objects are used as arguments to the corp_coco() function.

Arguments

text

A corp_text object.

span

A character string defining the co-occurrence span. See Details.

nodes

A character vector of node types, or a character string representing a single node type. If supplied, only co-occurrences for the specified node types will be calculated. If nodes is not supplied, co-occurrences will be calculated for the set of all node types. Restricting nodes can significantly reduce memory usage and execution times.

collocates

A character vector of collocate types or character string representing a single collocate type. If supplied, only co-occurrences for the specified collocate types will be calculated. If collocates is not supplied, co-occurrences will be calculated for all collocate types with a non-zero co-occurrence count. Restricting collocates can significantly reduce execution times.

obj

A corp_cooccurrence object, as returned by the corp_surface function.

x

In the deprecated surface function, x is an ordered vector of tokenized text. No processing is applied to x before the co-occurrence counts are calculated.

Details

Surface co-occurrence

‘Surface’ co-occurrence is easiest to describe with an example. The following shows a span of '2LR', that is, 2 tokens to the left and 2 tokens to the right of the node.


    ("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
          |___________|____|___________|

In this example the node “plan” would co-occur once each with the collocates “man” and “cat”, and twice with the collocate “a”.
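The counts in this example can be reproduced by hand. The following is a minimal base R sketch of the idea, not the package's implementation; window_collocates is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper (not part of the package): tabulate the tokens that
# fall within `left` positions before and `right` positions after each
# occurrence of `node`.
window_collocates <- function(tokens, node, left = 0L, right = 0L) {
  hits <- which(tokens == node)            # node positions (NAs are skipped)
  found <- character(0)
  for (i in hits) {
    idx <- i + c(-rev(seq_len(left)), seq_len(right))  # window positions
    idx <- idx[idx >= 1 & idx <= length(tokens)]       # clip at the text edges
    found <- c(found, tokens[idx])
  }
  table(found)                             # NA tokens are not tabulated
}

tokens <- c("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")

# span '2LR': 2 to the left and 2 to the right of the node "plan"
counts <- window_collocates(tokens, "plan", left = 2L, right = 2L)
print(counts)
```

The resulting table has "a" counted twice and "cat" and "man" once each, matching the description above.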

Other examples of span:

span = '1L2R'


    ("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
                 |____|____|___________|

span = '2R'


    ("a", "man", "a", "plan", "a", "cat", "a", "canal", "panama")
                      |____|___________|

For a detailed description of ‘surface’ co-occurrence see Evert (2008).
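One way to read the span strings shown above is as an optional left width followed by an optional right width. The sketch below assumes that reading and covers only the forms used on this page ('2LR', '1L2R', '2R'); parse_span is a hypothetical helper, and the package may accept or reject other forms differently.

```r
# Hypothetical helper (not part of the package): split a span string into
# left and right window widths. Assumes the forms '<n>L', '<n>R', '<n>LR'
# and '<n>L<m>R'.
parse_span <- function(span) {
  m <- regmatches(span, regexec("^(([0-9]+)L)?(([0-9]*)R)?$", span))[[1]]
  if (length(m) == 0) stop("unrecognised span: ", span)

  # m[3] holds the digits before 'L'; m[4] is the whole '<m>R' part;
  # m[5] holds the digits before 'R' (empty in the '2LR' form).
  left <- if (nzchar(m[3])) as.integer(m[3]) else 0L
  right <- if (!nzchar(m[4])) {
    0L
  } else if (nzchar(m[5])) {
    as.integer(m[5])
  } else {
    left                                   # '2LR' shares one width
  }
  if (left == 0L && right == 0L) stop("unrecognised span: ", span)
  c(left = left, right = right)
}

parse_span("2LR")    # left = 2, right = 2
parse_span("1L2R")   # left = 1, right = 2
parse_span("2R")     # left = 0, right = 2
```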

Co-occurrence barriers

NAs can be used to implement co-occurrence barriers. For example, if two NA values are inserted into x at each sentence boundary, then with a span of 2 (such as '2LR') co-occurrences will not be counted across sentences. See Evert (2008) for a detailed description of co-occurrence barriers.
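As a rough illustration of the barrier idea, the base R sketch below joins per-sentence token vectors with a run of NA values as wide as the span. The add_barriers helper and the sentence split are assumptions made for this example and are not part of the package.

```r
# Hypothetical helper (not part of the package): concatenate sentences,
# separating each pair with `width` NA tokens.
add_barriers <- function(sentences, width) {
  out <- character(0)
  for (s in sentences) {
    out <- c(out, s, rep(NA_character_, width))
  }
  out[seq_len(length(out) - width)]        # drop the trailing barrier
}

sentences <- list(
  c("a", "man", "a", "plan"),
  c("a", "canal", "panama")
)

# A barrier of 2 NAs is wide enough for any span of 2
x <- add_barriers(sentences, width = 2L)
print(x)

# The 2 positions to the right of "plan" now hold only NAs, so no token
# from the next sentence falls inside that window.
right_of_plan <- x[which(x == "plan") + 1:2]
print(right_of_plan)
```

The resulting vector has the same token sequence as the type column in the co-occurrence barrier example in the Examples section.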

References

S. Evert (2008) Corpora and collocations. Corpus Linguistics: An International Handbook 1212–1248.

Examples


    # =====================
    # surface co-occurrence
    # =====================

    x <- corp_text("A man, a plan, a canal -- Panama!")

    y <- corp_surface(x, span = "2R")
    corp_get_counts(y)

    ##         x      y H M
    ##  1:     a      a 2 4
    ##  2:     a  canal 1 5
    ##  3:     a    man 1 5
    ##  4:     a panama 1 5
    ##  5:     a   plan 1 5
    ##  6: canal panama 1 0
    ##  7:   man      a 1 1
    ##  8:   man   plan 1 1
    ##  9:  plan      a 1 1
    ## 10:  plan  canal 1 1

    # filter on nodes
    y <- corp_surface(x, span = '2R', nodes = c("canal", "man", "plan"))
    corp_get_counts(y)

    ##         x      y H M
    ##  1: canal panama 1 0
    ##  2:   man      a 1 1
    ##  3:   man   plan 1 1
    ##  4:  plan      a 1 1
    ##  5:  plan  canal 1 1

    # filter on nodes and collocates
    y <- corp_surface(x, span = '2R', nodes = c("canal", "man", "plan"),
                      collocates = c("panama", "a"))
    corp_get_counts(y)

    ##         x      y H M
    ##  1: canal panama 1 0
    ##  2:   man      a 1 1
    ##  3:  plan      a 1 1

    # co-occurrence barrier
    tokens_with_barrier <- data.frame(
         type =            c("a", "man", "a", "plan", NA, NA, "a", "canal", "panama"),
        start = as.integer(c( 1,   3,     8,   10,    NA, NA,  16,  18,      27)),
          end = as.integer(c( 1,   5,     8,   13,    NA, NA,  16,  22,      32)),
        stringsAsFactors = FALSE
    )
    x <- corp_text("A man, a plan, a canal -- Panama!", tokens = tokens_with_barrier)

    y <- corp_surface(x, span = '2R')
    corp_get_counts(y)

    ##         x      y H M
    ##  1:     a      a 1 4
    ##  2:     a  canal 1 4
    ##  3:     a    man 1 4
    ##  4:     a panama 1 4
    ##  5:     a   plan 1 4
    ##  6: canal panama 1 0
    ##  7:   man      a 1 1
    ##  8:   man   plan 1 1
