sjmisc (version 2.8.9)

str_find: Find partial matching and close distance elements in strings

Description

This function finds the indices of elements in a character vector that partially match, or are similar to, a given pattern. It can be used to find exact or slightly mistyped elements in a string vector.

Usage

str_find(string, pattern, precision = 2, partial = 0, verbose = FALSE)

Arguments

string

Character vector with string elements.

pattern

String that should be matched against the elements of string.

precision

Maximum distance ("precision") allowed between two string elements for them to be treated as similar or equal. Smaller values mean less tolerance in matching.

partial

Activates similar matching (close distance strings) for parts (substrings) of the string. The following values are accepted:

  • 0 for no partial distance matching

  • 1 for one-step matching, which means that only substrings of the same length as pattern are extracted from string for matching

  • 2 for two-step matching, which means that substrings of the same length as pattern, as well as slightly wider substrings, are extracted from string for matching

Default value is 0. See 'Details' for more information.

verbose

Logical; if TRUE, a progress bar is displayed while the distance matrix is computed. Default is FALSE, hence the bar is hidden.

Value

A numeric vector with the index positions of the elements in string that partially match or are similar to pattern. Returns -1 if no match was found.
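Because -1 is returned when nothing matches, using the result directly as a subscript would silently drop the first element (in R, string[-1] means "everything except element 1"). A minimal sketch of a guard, using a hypothetical helper pick_matches() that is not part of sjmisc:

```r
# Hypothetical helper (not part of sjmisc): subset `string` by the result
# of str_find(), treating the sentinel value -1 as "no match".
pick_matches <- function(string, idx) {
  if (length(idx) == 1 && idx == -1) {
    return(character(0))
  }
  string[idx]
}

string <- c("Hello", "Helo", "Hole")
pick_matches(string, -1)       # no match: empty character vector
pick_matches(string, c(1, 2))  # regular index vector: elements 1 and 2
```

In practice, idx would be the value returned by str_find(string, pattern).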

Details

Computation Details

Fuzzy string matching is based on regular expressions, in particular grep(pattern = "(<pattern>){~<precision>}", x = string). This means that precision indicates the number of characters in pattern that may differ in string for it still to be considered "matching". The higher precision is, the more tolerant the search (i.e. it yields more possible matches). Likewise, the higher the value of partial, the more matches may be found.
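The regular-expression form quoted above can be tried directly in base R. The following is only a sketch: approx_grep() is a hypothetical name, the pattern is inserted without escaping regex metacharacters, and the actual internals of str_find() may differ.

```r
# Hypothetical helper (not sjmisc code): build the approximate-matching
# regex "(<pattern>){~<precision>}" and pass it to base grep().
approx_grep <- function(string, pattern, precision = 2) {
  rx <- sprintf("(%s){~%d}", pattern, as.integer(precision))
  grep(pattern = rx, x = string)
}

string <- c("Hello", "Helo", "Apple")
# Indices of elements containing "Hello" with at most one differing character
approx_grep(string, "Hello", precision = 1)
```

As with grep() in general, the pattern is matched against substrings of each element, not only against whole elements.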

Partial Distance Matching

For partial = 1, substrings with the same number of characters as pattern are extracted from each element of string, starting at the first character and moving along the element until its end is reached. Each substring is matched against pattern, and substrings within a maximum distance of precision are considered "matching". For partial = 2, the range of the extracted substrings is additionally increased by 2, i.e. the extracted substrings are two characters longer.
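The sliding-window idea can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: sliding_match() and its widen argument are hypothetical, utils::adist() (generalized Levenshtein distance) stands in for the regex-based distance, and the exact placement of the wider window for partial = 2 is a guess.

```r
# Hypothetical sketch (not sjmisc code): slide a window over each element
# of `string` and compare every window to `pattern` by edit distance.
sliding_match <- function(string, pattern, precision = 2, widen = 0) {
  n <- nchar(pattern)
  hits <- vapply(string, function(s) {
    last_start <- nchar(s) - n + 1
    # Elements shorter than the pattern yield no window
    if (last_start < 1) return(FALSE)
    starts <- seq_len(last_start)
    # widen = 0 mimics one-step matching; widen = 2 makes windows two chars longer
    windows <- substring(s, starts, starts + n - 1 + widen)
    any(utils::adist(windows, pattern, ignore.case = TRUE) <= precision)
  }, logical(1), USE.NAMES = FALSE)
  which(hits)
}

string <- c("Hello", "System", "Systemic", "Old")
sliding_match(string, "stem", precision = 0)
```

Unlike str_find(), this sketch returns an empty integer vector rather than -1 when nothing matches.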

See Also

group_str

Examples

# NOT RUN {
library(sjmisc)

string <- c("Hello", "Helo", "Hole", "Apple", "Ape", "New", "Old", "System", "Systemic")
str_find(string, "hel")   # partial match
str_find(string, "stem")  # partial match
str_find(string, "R")     # no match
str_find(string, "saste") # similarity to "System"

# finds two indices, because partial matching now
# also applies to "Systemic"
str_find(string,
        "sytsme",
        partial = 1)

# finds a similar substring within a longer string
str_find("We are Sex Pistols!", "postils")
# }
