nested_nodes: Search for parents or children in tquery

Description

Should only be used inside the tquery function. Enables searching for parents or children, either direct (depth = 1) or up to a given depth (depth = 2 for children and grandchildren, depth = Inf for all).

Usage

children(
  ...,
  g_id = NULL,
  label = NA,
  req = TRUE,
  depth = 1,
  connected = FALSE,
  fill = TRUE,
  block = FALSE,
  max_window = c(Inf, Inf),
  min_window = c(0, 0)
)
not_children(
  ...,
  g_id = NULL,
  depth = 1,
  connected = FALSE,
  max_window = c(Inf, Inf),
  min_window = c(0, 0)
)
parents(
  ...,
  g_id = NULL,
  label = NA,
  req = TRUE,
  depth = 1,
  connected = FALSE,
  fill = TRUE,
  block = FALSE,
  max_window = c(Inf, Inf),
  min_window = c(0, 0)
)
not_parents(
  ...,
  g_id = NULL,
  depth = 1,
  connected = FALSE,
  max_window = c(Inf, Inf),
  min_window = c(0, 0)
)
fill(
  ...,
  g_id = NULL,
  depth = Inf,
  connected = FALSE,
  max_window = c(Inf, Inf),
  min_window = c(0, 0)
)

Arguments

...

Accepts two types of arguments: name-value pairs for finding nodes (i.e. rows), and functions to look for parents/children of these nodes.

The name in each name-value pair needs to match a column in the data.table, and the value needs to be a vector of the same data type as the column. By default, search uses case sensitive matching, with the option of using common wildcards (* for any number of characters, and ? for a single character). Alternatively, flags can be used to change this behavior to 'fixed' (__F), 'ignoring case' (__I) or 'regex' (__R). See details for more information.

If multiple name-value pairs are given, they are considered as AND statements, but see details for syntax on using OR statements, and combinations.

To look for parents and children of the nodes that are found, you can use the parents and children functions as (named or unnamed) arguments. These functions have the same query arguments as tquery, but with some additional arguments.
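As a minimal sketch of this nesting, the query below looks for a verb that has a subject as a direct child. The column names pos and relation, and the values 'VERB' and 'nsubj', are assumptions about how the token data is annotated; replace them with the columns and labels in your own data.

```r
library(rsyntax)

# Assumed columns: 'pos' (part of speech) and 'relation' (dependency label).
# Each line describes one node; the nested children() call is indented
# beneath the node it belongs to.
verb_subject = tquery(pos = 'VERB', label = 'verb',
                      children(relation = 'nsubj', label = 'subject'))
```

Because two name-value pairs on the same node are combined as AND, adding for instance lemma = 'say' to the first line would restrict the match to verbs with that lemma.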

g_id

Find nodes by global id, which is the combination of the doc_id, sentence and token_id. Passed as a data.frame or data.table with 3 columns: (1) doc_id, (2) sentence and (3) token_id.

label

A character vector, specifying the column name under which the selected tokens are returned. If NA, the column is not returned.

req

Can be set to FALSE to make a node optional rather than required. This can be used to include optional nodes in queries. For instance, in a query for finding subject - verb - object triples, the object can be made optional.
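The subject - verb - object case mentioned above could be sketched as follows. The relation labels 'nsubj' and 'dobj' and the pos value 'VERB' are assumed here and depend on the parser that produced the tokens.

```r
library(rsyntax)

# The subject is required (req = TRUE is the default), so verbs without a
# subject do not match. The object is optional: it is labelled when present,
# but its absence does not prevent a match.
svo = tquery(pos = 'VERB', label = 'verb',
             children(relation = 'nsubj', label = 'subject'),
             children(relation = 'dobj', label = 'object', req = FALSE))
```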

depth

A positive integer, determining how deep parents/children are sought. 1 means that only direct parents and children of the node are retrieved. 2 means children and grandchildren (or parents and grandparents), etc. All parents/children must meet the filtering conditions (... or g_id).

connected

Controls behaviour if depth > 1 and filters are used. If FALSE, all parents/children to the given depth are retrieved, and then filtered. This way, grandchildren that satisfy the filter conditions are retrieved even if their parents do not satisfy the conditions. If TRUE, the filter is applied at each level of depth, so that only fully connected branches of nodes that satisfy the conditions are retrieved.
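To illustrate how depth and connected interact, here is a hedged sketch of two otherwise identical queries. The pos values 'VERB' and 'NOUN' are assumptions about the annotation scheme.

```r
library(rsyntax)

# connected = FALSE: collect everything down to two levels below the verb,
# then keep the nouns. A noun grandchild matches even if the node between
# it and the verb is not a noun.
loose = tquery(pos = 'VERB', label = 'verb',
               children(pos = 'NOUN', label = 'noun',
                        depth = 2, connected = FALSE))

# connected = TRUE: the filter is applied at every level, so a noun
# grandchild only matches if its own parent is also a noun.
strict = tquery(pos = 'VERB', label = 'verb',
                children(pos = 'NOUN', label = 'noun',
                         depth = 2, connected = TRUE))
```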

fill

Logical. If TRUE (default), the default fill() will be used (this is identical to nesting fill(); see details). To control fill more specifically, you can nest the fill function (a special version of the children function).

block

Logical. If TRUE, the node will be blocked from being assigned (labelled). This is mainly useful if you have a node that you do not want to be assigned by fill, but also don't want to 'label' it. Essentially, block is shorthand for using label and then removing the node afterwards. If block is TRUE, label has to be NA.
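A possible use of block, sketched under the assumption that adverbial clauses are marked with the relation 'advcl' in your data: the clause is matched so that fill cannot attach it to the verb, yet it does not get a label of its own.

```r
library(rsyntax)

# The third line matches an adverbial clause if there is one (req = FALSE)
# and blocks it. label is left at its default of NA, as block requires.
blocked = tquery(pos = 'VERB', label = 'verb',
                 children(relation = 'nsubj', label = 'subject'),
                 children(relation = 'advcl', req = FALSE, block = TRUE))
```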

max_window

Set the max token distance of the children/parents to the node. Has to be either a numerical vector of length 1 for distance in both directions, or a vector of length 2, where the first value is the max distance to the left, and the second value the max distance to the right. Default is c(Inf, Inf) meaning that no max distance is used.

min_window

Like max_window, but for the min distance. Default is c(0,0) meaning that no min is used.
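The two window arguments can be sketched as below. The relation label 'amod' and pos value 'NOUN' are assumed, and the comment on c(3, 0) reflects our reading of the left/right description above rather than a verified result.

```r
library(rsyntax)

# A single number applies in both directions: the modifier may be
# at most 5 tokens away from the noun on either side.
near = tquery(pos = 'NOUN', label = 'noun',
              children(relation = 'amod', label = 'modifier',
                       max_window = 5))

# Two numbers set left and right separately: here up to 3 tokens to the
# left and none to the right, i.e. only modifiers preceding the noun.
left_only = tquery(pos = 'NOUN', label = 'noun',
                   children(relation = 'amod', label = 'modifier',
                            max_window = c(3, 0)))
```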

Value

Should not be used outside of tquery

Details

Searching for parents/children within find_nodes works as an AND condition: if it is used, the node must have these parents/children. The label argument is used to remember the global token ids (.G_ID) of the parents/children under a given column name.

The not_children and not_parents functions turn the matched children/parents into a NOT condition: the node only matches if it does not have such children/parents.
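For example, a NOT condition could be used to find verbs that have a subject but no direct object. This is only a sketch; 'nsubj' and 'dobj' are assumed relation labels.

```r
library(rsyntax)

# The verb must have a subject child (AND condition) and must not have
# a direct object child (NOT condition). not_children() takes no label,
# since the nodes it describes are by definition absent from a match.
no_object = tquery(pos = 'VERB', label = 'verb',
                   children(relation = 'nsubj', label = 'subject'),
                   not_children(relation = 'dobj'))
```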

The fill() function is used to include the children of a labelled node. It can only be nested in a query if the label argument is not NA, and by default will include all children of the node that have not been assigned to another node. If two nodes have a shared child, the child will be assigned to the closest node.
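A hedged sketch of controlling fill explicitly: the default fill is switched off for the verb, while the subject gets a nested fill() that is limited to determiners and adjectival modifiers. The relation labels 'det' and 'amod' are assumptions about the annotation scheme.

```r
library(rsyntax)

# fill = FALSE on the top node: the verb label is not extended to its
# remaining children. The nested fill() on the subject only adds children
# whose relation is 'det' or 'amod', following connected branches.
custom_fill = tquery(pos = 'VERB', label = 'verb', fill = FALSE,
                     children(relation = 'nsubj', label = 'subject',
                              fill(relation = c('det', 'amod'),
                                   connected = TRUE)))
```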

Having nested queries can be confusing, so we tried to develop the find_nodes function and the accompanying functions in a way that clearly shows the different levels. As shown in the examples, the idea is that each line is a node, and to look for parents or children, we put them on the next line with indentation (in RStudio, it should automatically align correctly when you press enter inside of the children() or parents() functions).

There are several flags that can be used to change the search condition. To specify flags, add a double underscore and the flag character to the name in the name-value pairs (...). By adding the suffix __R, query terms are considered to be regular expressions, and the suffix __I uses case insensitive search (for normal or regex search). If the suffix __F is used, only exact matches are valid (case sensitive, and no wildcards). Multiple flags can be combined, such as lemma__RI or lemma__IR (the order of flags is irrelevant).
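The flags could be used as in the sketch below. The column name lemma is taken from the text above; the search terms are made up for illustration.

```r
library(rsyntax)

# Default: case sensitive, with * and ? treated as wildcards.
q_default = tquery(lemma = 'say*', label = 'verb')

# __I: ignore case.
q_nocase = tquery(lemma__I = 'say', label = 'verb')

# __R: treat the value as a regular expression.
q_regex = tquery(lemma__R = '^(say|tell)$', label = 'verb')

# __F: exact match only, so the * here is a literal asterisk.
q_fixed = tquery(lemma__F = 'say*', label = 'verb')

# Flags can be combined in any order: regex and case insensitive.
q_both = tquery(lemma__RI = '^(say|tell)$', label = 'verb')
```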