udpipe (version 0.3)

cooccurrence: Create a cooccurrence data.frame

Description

A cooccurrence data.frame indicates how many times each term co-occurs with another term.

There are 3 types of cooccurrences:

  • Looking at which words are in the same document/sentence/paragraph.

  • Looking at which word directly follows each word.

  • Looking at which words are in the neighbourhood of the word (before or after; also known as skipgrams).
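
To make the second type concrete, here is a minimal base R sketch that counts how often each word is directly followed by another word. It is an illustration only, not udpipe's implementation; the helper name count_next_word is hypothetical.

```r
# Hypothetical helper (not part of udpipe): count adjacent word pairs.
count_next_word <- function(words) {
  n <- length(words)
  if (n < 2) {
    # Fewer than 2 words means there are no pairs to count
    return(data.frame(term1 = character(0), term2 = character(0),
                      cooc = integer(0), stringsAsFactors = FALSE))
  }
  # Pair every word with the word that immediately follows it
  pairs <- data.frame(term1 = words[-n], term2 = words[-1],
                      stringsAsFactors = FALSE)
  pairs$cooc <- 1L
  # Sum the occurrences of each (term1, term2) combination
  out <- aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
  # Sort from high to low counts
  out <- out[order(-out$cooc), ]
  rownames(out) <- NULL
  out
}

next_word <- count_next_word(c("A", "B", "A", "B", "c"))
print(next_word)
```

In this toy input, "A" is followed by "B" twice, while "B" is followed by "A" once and by "c" once.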

The result is a data.frame with fields term1, term2 and cooc, where cooc indicates how many times term1 and term2 co-occurred. The dataset can be constructed:

  • based upon a data frame, in which case we check within each group whether 2 terms occurred together.

  • based upon a vector of words, in which case we count how many times each word is followed by another word.

  • based upon a vector of words, in which case we count how many times each word is followed or preceded by another word.

You can also aggregate cooccurrences: if you compute any of these 3 separately per group, you can then combine the results into one overall aggregate.
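
The aggregation step can be pictured as summing cooc over identical term1/term2 combinations. The base R sketch below uses made-up per-document counts (the doc_id values and numbers are illustrative assumptions) and is not udpipe's own code.

```r
# Made-up per-document cooccurrence counts, for illustration only
per_doc <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  term1  = c("A", "B", "A"),
  term2  = c("B", "c", "B"),
  cooc   = c(2L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Drop the grouping column by summing cooc for each term1/term2 combination
overall <- aggregate(cooc ~ term1 + term2, data = per_doc, FUN = sum)
# Sort from high to low counts
overall <- overall[order(-overall$cooc), ]
rownames(overall) <- NULL
print(overall)
```

Here the pair A/B appears in both documents, so its counts (2 and 3) are added into a single row.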

Usage

cooccurrence(x, order = TRUE, ...)

# S3 method for character
cooccurrence(x, order = TRUE, ...)

# S3 method for cooccurrence
cooccurrence(x, order = TRUE, ...)

# S3 method for data.frame
cooccurrence(x, order = TRUE, ..., group, term)

Arguments

x

either

  • a data.frame where the data.frame contains 1 row per document/term, in which case you need to provide group and term. This uses cooccurrence.data.frame.

  • a character vector with terms. This uses cooccurrence.character.

  • an object of class cooccurrence. This uses cooccurrence.cooccurrence.

order

logical indicating whether to sort the output from high cooccurrences to low cooccurrences. Defaults to TRUE.

...

other arguments passed on to the methods

group

character string with the name of a column in the data frame x that defines the group. To be used if x is a data.frame.

term

character string with the name of a column in the data frame x, containing 1 term per row. To be used if x is a data.frame.

Value

a data.frame with columns term1, term2 and cooc indicating for the combination of term1 and term2 how many times this combination occurred

Methods (by class)

  • character: Create a cooccurrence data.frame based on a vector of terms

  • cooccurrence: Aggregate co-occurrence statistics by summing the cooc by term1/term2

  • data.frame: Create a cooccurrence data.frame based on a data.frame where you look within a document / sentence / paragraph / group if terms co-occur
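
As a rough picture of the within-group idea behind the data.frame method, the base R sketch below counts, for each pair of distinct terms, the number of groups in which both appear. The helper count_within_group is hypothetical, and counting each unordered pair at most once per group is an assumption of this sketch; udpipe's actual counting rules may differ.

```r
# Hypothetical helper (not part of udpipe): within-group pair counts.
# Assumption: each unordered pair counts at most once per group.
count_within_group <- function(x, group, term) {
  by_group <- split(x[[term]], x[[group]])
  pairs <- lapply(by_group, function(terms) {
    terms <- sort(unique(terms))
    if (length(terms) < 2) return(NULL)
    # All unordered pairs of distinct terms in this group
    combos <- t(combn(terms, 2))
    data.frame(term1 = combos[, 1], term2 = combos[, 2], cooc = 1L,
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  if (is.null(pairs)) {
    # No group had at least 2 distinct terms
    return(data.frame(term1 = character(0), term2 = character(0),
                      cooc = integer(0), stringsAsFactors = FALSE))
  }
  # Sum over groups and sort from high to low counts
  out <- aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
  out <- out[order(-out$cooc), ]
  rownames(out) <- NULL
  out
}

# Toy data with 1 row per document/term
docs <- data.frame(
  doc_id = c(1, 1, 1, 2, 2),
  lemma  = c("hotel", "clean", "room", "hotel", "clean"),
  stringsAsFactors = FALSE
)
within_doc <- count_within_group(docs, group = "doc_id", term = "lemma")
print(within_doc)
```

In the toy data, "clean" and "hotel" appear together in both documents, so that pair ranks first.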

Examples

# NOT RUN {
data(brussels_reviews_anno)

## By document, which lemmas co-occur
x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr")
x <- cooccurrence(x, group = "doc_id", term = "lemma")
head(x)

## Which words follow each other
x <- c("A", "B", "A", "B", "c")
cooccurrence(x)

data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "es")
x <- cooccurrence(x$lemma)
head(x)

## Which nouns follow each other in the same document
library(data.table)
x <- as.data.table(brussels_reviews_anno)
x <- subset(x, language == "nl" & xpos %in% c("NN"))
x <- x[, cooccurrence(lemma, order = FALSE), by = list(doc_id)]
head(x)

## Aggregate the per-document cooccurrences into overall counts
x_nodoc <- cooccurrence(x)
x_nodoc <- subset(x_nodoc, term1 != "appartement" & term2 != "appartement")
head(x_nodoc)
# }
