string_split2df: Splits a character vector into a data frame

Description

Splits a character vector and formats the resulting substrings into a data.frame.

Usage

string_split2df(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE,
  ignore.case = FALSE,
  word = FALSE,
  envir = parent.frame(),
  dt = FALSE,
  ...
)

string_split2dt(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE
)

Value

Returns a data.frame, or a data.table (if dt = TRUE or when using string_split2dt), with the following columns: i) obs: the observation index; ii) pos: the position of the text element in the initial string (optional, only if add.pos = TRUE); iii) the text element; iv) the identifier(s) (optional, only if identifiers were provided).
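
As an illustration of the layout described above, the following base R sketch builds a table with the same kind of columns from a plain strsplit() call. It is not the package's implementation, and the column name text is an assumption made for the example; the actual name used by string_split2df may differ.

```r
# Base R sketch of the documented layout; NOT the package's implementation.
x <- c("a b c", "d e")

pieces <- strsplit(x, " +")   # one character vector of pieces per input string
n <- lengths(pieces)          # number of pieces obtained from each string

res <- data.frame(
  obs  = rep(seq_along(x), n),  # index of the string each piece comes from
  pos  = sequence(n),           # position of the piece within its string
  text = unlist(pieces),        # the piece itself (hypothetical column name)
  stringsAsFactors = FALSE
)
res
```

Each input string contributes one row per piece, so obs repeats while pos restarts at 1 for every new string.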

Arguments

x: A character vector or a two-sided formula. If a two-sided formula, the argument data must be provided, since the variables are fetched from it. A formula has the form char_var ~ id1 + id2, where char_var on the left-hand side is a character variable and id1 and id2 on the right-hand side are identifiers that are included in the resulting table. Alternatively, identifiers can be provided via the argument id.
data: Optional, only used if the argument x is a formula. It should contain the variables of the formula.
split: A character scalar used to split the character vector. By default it is interpreted as a regular expression. Flags can be prepended to the pattern in the form "flag1, flag2/pattern". Available flags are ignore (case-insensitive), fixed (no regex), word (add word boundaries), and magic (add interpolation with "{}"). Example: with "ignore/hello", a text containing "Hello" is split at "Hello". Shortcut: use the first letters of the flags. Example: "iw/one" splits at the word "one" (flags 'ignore' + 'word').
id: Optional. A character vector or a list of vectors. If provided, the values of id are considered as identifiers that will be included in the resulting table.
add.pos: Logical, default is FALSE. Whether to include the position of each split element.
id_unik: Logical, default is TRUE. When identifiers are provided, whether to trigger a message if they are not unique. If the identifiers are not unique, the original texts cannot be reconstructed from the result.
fixed: Logical, default is FALSE. Whether to consider the argument split as fixed (and not as a regular expression).
ignore.case: Logical scalar, default is FALSE. If TRUE, then case insensitive search is triggered.
word: Logical scalar, default is FALSE. If TRUE, then a) word boundaries are added to the pattern, and b) patterns can be chained by separating them with a comma; they are combined with a logical OR. Example: if word = TRUE, then pattern = "The, mountain" will select strings containing either the word 'The' or the word 'mountain'.
envir: Environment in which to evaluate the interpolations if the flag "magic" is provided. Default is parent.frame().
dt: Logical, default is FALSE. Whether to return a data.table. See also the function string_split2dt.
...: Not currently used.
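
The effect of the ignore, fixed and word flags on split can be approximated in base R, as a rough sketch. The calls below are base R analogues chosen for illustration, not a description of how the package processes the flags, and the magic flag is not covered.

```r
# Base R analogues of three split flags (illustrative only).

# ignore: case-insensitive matching, here via the inline (?i) modifier.
# In the documented flag syntax this would be written "ignore/hello".
ci <- strsplit("say Hello there", "(?i)hello", perl = TRUE)[[1]]

# fixed: the pattern is taken literally, so "." is a plain dot.
# In the documented flag syntax this would be written "fixed/.".
fx <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

# word: word boundaries around the pattern, so "one" inside "stone" is skipped.
# In the documented flag syntax this would be written "word/one".
wd <- strsplit("stone one two", "\\bone\\b", perl = TRUE)[[1]]

ci
fx
wd
```

Without the word boundaries, the last call would also split inside "stone", which is the situation the word flag is meant to avoid.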

Functions

string_split2dt(): Splits a string vector and returns a data.table.

Examples

x = c("Nor rain, wind, thunder, fire are my daughters.",
      "When my information changes, I alter my conclusions.")

id = c("ws", "jmk")

# we split at each word
string_split2df(x, "[[:punct:] ]+")

# we add the 'id'
string_split2df(x, "[[:punct:] ]+", id = id)

# NOTE:
# - the second argument is `data`
# - when `data` is missing, the argument `split` implicitly becomes the second argument
# - ex: above we did not need to write `split = "[[:punct:] ]+"`

#
# using the formula

base = data.frame(text = x, my_id = id)
string_split2df(text ~ my_id, base, "[[:punct:] ]+")

#
# with 2+ identifiers

base = within(mtcars, carname <- rownames(mtcars))

# a message is displayed because the identifiers are not unique
string_split2df(carname ~ am + gear + carb, base, " +")

# adding the position of the words & removing the message
string_split2df(carname ~ am + gear + carb, base, " +", id_unik = FALSE, add.pos = TRUE)
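
The message about non-unique identifiers matters because the pieces can only be pasted back into the original texts when each identifier points to a single text. The base R sketch below illustrates this with small hand-written tables; the tables and the column names id and text are hypothetical and are not output produced by the package.

```r
# Hypothetical split table with unique identifiers (one id per original text).
tab <- data.frame(
  id   = c("ws", "ws", "jmk", "jmk", "jmk"),
  text = c("Nor", "rain", "When", "my", "information"),
  stringsAsFactors = FALSE
)

# With unique ids, pasting the pieces per id recovers each text separately.
rebuilt <- tapply(tab$text, tab$id, paste, collapse = " ")
rebuilt

# Hypothetical table where two different texts ("one two" and "three four")
# were given the same id: their pieces can no longer be told apart.
dup <- data.frame(
  id   = c("a", "a", "a", "a"),
  text = c("one", "two", "three", "four"),
  stringsAsFactors = FALSE
)
merged <- tapply(dup$text, dup$id, paste, collapse = " ")
merged
```

In the second table the two texts collapse into a single string, which is why unique identifiers are needed for reconstruction.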