
siftr (version 1.1.0)

sift: Find relevant variables in a dataframe using fuzzy searches

Description

It can be hard to find the right column in a dataframe with hundreds or thousands of columns. This function gives you interactive, flexible searching through a dataframe, suggesting columns that are relevant to your query and showing some basic summary stats about what they contain.

Usage

sift(.df, ..., .dist = 0, .rebuild = FALSE)

Value

Invisibly returns a dataframe. The contents of that dataframe depend on the query:

  • If ... is empty, the full data dictionary for .df is returned.

  • If the query was matched, only the matching rows of the data dictionary are returned.

  • If the query was not matched, no rows of the dictionary are returned (but all of its columns are kept).
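
Because the result is returned invisibly, it has to be assigned before it can be inspected. The following is a minimal sketch, assuming the siftr package is installed and using the mtcars_lab dataset from the Examples. The nonsense query string is a made-up placeholder chosen so that nothing matches.

```r
library(siftr)

# Assign the invisible return value to keep the matching dictionary rows.
hits <- sift(mtcars_lab, mileage)
nrow(hits)  # Number of dictionary rows that matched "mileage".

# Per the Value section, an unmatched query should give back zero rows
# but keep every column of the dictionary.
none <- sift(mtcars_lab, "zzz_no_such_label_zzz")
nrow(none)
identical(names(none), names(hits))
```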

Arguments

.df

(Dataframe) A dataframe to search through.

...

(Dots) Search query. Case-insensitive. See Details for more information.

.dist

(Numeric) The maximum distance allowed for a match when searching fuzzily. See max.distance in agrep(). In short: a fraction below 1 is the proportion of the pattern length that can be flexible (e.g. 0.1 = 10% of the pattern length), whereas whole numbers greater than 0 are the number of characters that can be flexible. 0 is an exact search.

.rebuild

(Logical) If TRUE, then force a dictionary rebuild even if it normally would not be triggered. Rebuilds are triggered by changes to a dataframe's dimensions, its columns (names, types, order), and/or its count of NAs in each column.
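
To get a feel for how .dist values behave, it can help to try max.distance directly in base R's agrep(), which the .dist description points to. This sketch uses only base R; the labels vector is invented for illustration, and it is an assumption that sift() treats .dist the same way agrep() treats max.distance.

```r
# Made-up variable labels, including a spacing variant and a typo.
labels <- c("baseline", "base line", "basellne", "weight")

# Whole number: allow up to 1 character to differ.
one_edit <- agrep("baseline", labels, max.distance = 1,
                  value = TRUE, ignore.case = TRUE)

# Fraction: 10% of the 8-character pattern, rounded up to a whole edit.
fraction <- agrep("baseline", labels, max.distance = 0.1,
                  value = TRUE, ignore.case = TRUE)

# Zero: no differences allowed, so this behaves as an exact search.
exact <- agrep("baseline", labels, max.distance = 0,
               value = TRUE, ignore.case = TRUE)

print(one_edit)
print(fraction)
print(exact)
```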

Details

You have three ways to search with sift(): exact search, fuzzy search, or orderless search (also called look-around search).

  • Exact search looks for exact matches to your query. For example, searching for "weight of" will only match weight of.

  • Fuzzy search gives you results that are close, but not exact, matches to your query. This is useful because real-world labelling is not always consistent or even correct, so using a fuzzy search for "baseline" will helpfully match baseline or base line or even OCR errors or typos like basellne.

  • Orderless search matches keywords regardless of the order you give them. This means that you can ask for cow, number and get a match for number of cows. This is useful when you have an idea of what keywords should be in a variable label, but not how those keywords are actually used or phrased. Note that this is not a fuzzy search, so the keywords have to match exactly.

The search that's performed depends on ... and .dist:

  • Orderless search is always used when you pass more than one query term into ....

  • Exact search is used when .dist = 0 (the default) and a single query term is given.

  • Fuzzy search must be opted into by setting the .dist argument to a value > 0. The .dist argument is ignored in orderless searches, which are always exact.
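
The idea behind orderless (look-around) search can be sketched in base R with one regular-expression lookahead per keyword. This is only an illustration of the technique, not siftr's actual implementation: orderless_pattern() is a hypothetical helper and the labels vector is invented.

```r
# Made-up variable labels to search through.
labels <- c("number of cows", "cow weight", "number of sheep")

# Hypothetical helper (not part of siftr): wrap each keyword in a lookahead
# so that every keyword must appear somewhere, in any order.
orderless_pattern <- function(...) {
  keywords <- c(...)
  paste0("(?=.*(?:", keywords, "))", collapse = "")
}

pattern <- orderless_pattern("cow", "number")

# perl = TRUE is needed for lookaheads; ignore.case mirrors the
# case-insensitive queries described above.
matched <- grepl(pattern, labels, perl = TRUE, ignore.case = TRUE)
labels[matched]
```

Each keyword is wrapped in a non-capturing group so that a keyword containing alternation, such as "cyl|gear", stays inside its own lookahead.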

See Also

save_dictionary(), options_sift()

Examples

# \donttest{
sift(mtcars_lab)  # Builds a dictionary without searching.

sift(mtcars_lab, .)  # Show everything up to the print limit (by default, 25 matches).

sift(mtcars_lab, mileage)  # Exact search for "mileage".
sift(mtcars_lab, "above avg", .dist = 1)  # Fuzzy search (here, space -> underscore).

sift(mtcars_lab, "na", "column")  # Orderless searches are always exact.

sift(mtcars_lab, "date|time")  # Regular expression
sift(mtcars_lab, "cyl|gear", number)  # Orderless search with regular expression
# }
