pointblank (version 0.3.0)

rows_distinct: Verify that row data are distinct

Description

The rows_distinct() validation step function checks whether row values (optionally constrained to a selection of specified columns) are, when taken as a complete unit, distinct from all other units in the table. This function can be used directly on a data table or with an agent object (technically, a ptblank_agent object). The number of test units for this validation step equals the number of rows in the table (after any preconditions have been applied).
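To make the idea of a row being distinct "as a complete unit" concrete, here is a minimal base R sketch. It is only an illustration of the concept and not how pointblank implements the check; the helper name is_dup_unit() is hypothetical and the table is the one used in the Examples section.

```r
# Illustration only: base R, not pointblank's internal implementation
tbl <- data.frame(
  a = c(5, 7, 6, 5, 8, 7),
  b = c(7, 1, 0, 0, 8, 3),
  c = c(1, 1, 1, 3, 3, 3)
)

# Hypothetical helper: flag every row that has an identical twin
# elsewhere in the data frame (one logical value per row/test unit)
is_dup_unit <- function(df) {
  duplicated(df) | duplicated(df, fromLast = TRUE)
}

# Rows taken as (a, b) pairs: no pair occurs twice
dup_ab <- is_dup_unit(tbl[, c("a", "b")])
any(dup_ab)

# Rows constrained to column `a` alone: 5 and 7 each occur twice
dup_a <- is_dup_unit(tbl[, "a", drop = FALSE])
sum(dup_a)
```

Constraining the check to fewer columns generally makes duplicates more likely, which is why the choice of columns matters.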

Usage

rows_distinct(
  x,
  columns = NULL,
  preconditions = NULL,
  actions = NULL,
  brief = NULL
)

Arguments

x

A data frame, tibble, or an agent object of class ptblank_agent.

columns

The column (or a set of columns, provided as a character vector) to which this validation should be applied.

preconditions

Expressions used for mutating the input table before proceeding with the validation. This is ideally supplied as a one-sided R formula using a leading ~. In the formula representation, tbl serves as the input data table to be transformed (e.g., ~ tbl %>% dplyr::mutate(col = col + 10)). A series of expressions can be used by enclosing the set of statements with { } but note that the tbl object must be ultimately returned.

actions

A list containing threshold levels so that the validation step can react accordingly when exceeding the set levels. This is to be created with the action_levels() helper function.

brief

An optional, text-based description for the validation step.

Value

Either a ptblank_agent object or a table object, depending on what was passed to x.

Function ID

2-15

Details

We can specify the constraining column names in quotes, in vars(), and with the following tidyselect helper functions: starts_with(), ends_with(), contains(), matches(), and everything().
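As a brief sketch of two of the column-specification styles mentioned above, the following runs the same validation once with quoted column names and once with vars(). It assumes pointblank and dplyr are installed; the table and the object names agent_chr and agent_vars are made up for illustration.

```r
library(pointblank)
library(dplyr)

# Small illustrative table (not from the package)
tbl <- tibble(
  a = c(5, 7, 6, 5, 8, 7),
  b = c(7, 1, 0, 0, 8, 3)
)

# Columns given in quotes, as a character vector
agent_chr <-
  create_agent(tbl = tbl) %>%
  rows_distinct(columns = c("a", "b")) %>%
  interrogate()

# The same columns given with vars()
agent_vars <-
  create_agent(tbl = tbl) %>%
  rows_distinct(columns = vars(a, b)) %>%
  interrogate()

# Both forms should describe the same validation step
identical(all_passed(agent_chr), all_passed(agent_vars))
```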

Having table preconditions means pointblank will mutate the table just before interrogation. The mutation is isolated to the validation steps produced by this validation step function and does not affect other steps. Using dplyr code is suggested here since the statements can be translated to SQL if necessary. The code is to be supplied as a one-sided R formula (using a leading ~). In the formula representation, the obligatory tbl variable will serve as the input data table to be transformed (e.g., ~ tbl %>% dplyr::mutate(col_a = col_b + 10)). A series of expressions can be used by enclosing the set of statements with { } but note that the tbl variable must be ultimately returned.
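A minimal sketch of a precondition in use, assuming pointblank and dplyr are installed. The table below is made up for illustration: rows 1 and 3 share the same (a, b) pair, but the precondition filters row 3 out before the distinctness check is made.

```r
library(pointblank)
library(dplyr)

# Illustrative table: the (a, b) pair (5, 7) appears in rows 1 and 3
tbl <- tibble(
  a = c(5, 7, 5),
  b = c(7, 1, 7),
  c = c(1, 1, 3)
)

# Keep only rows where `c` is 1 before checking that the
# (a, b) pairs are distinct; `tbl` inside the formula stands
# for the input table, as described above
agent <-
  create_agent(tbl = tbl) %>%
  rows_distinct(
    columns = vars(a, b),
    preconditions = ~ tbl %>% dplyr::filter(c == 1)
  ) %>%
  interrogate()

all_passed(agent)
```

Because the filter is applied only within this step, any other validation steps added to the same agent would still see all three rows.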

Often, we will want to specify actions for the validation. This argument, present in every validation step function, takes a specially-crafted list object that is best produced by the action_levels() function. Read that function's documentation for the lowdown on how to create reactions to above-threshold failure levels in validation. The basic gist is that you'll want at least a single threshold level (specified as either the fraction of test units that failed or an absolute number of failing units), often using the warn_at argument. This is especially true when x is a table object because, otherwise, nothing happens. For the col_vals_*()-type functions, action_levels(warn_at = 0.25) and action_levels(stop_at = 0.25) are good choices depending on the situation (the first produces a warning when a quarter of the total test units fail, the second stop()s at the same threshold level).
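The following sketch shows a threshold set through action_levels() when rows_distinct() is used directly on a table. It assumes pointblank and dplyr are installed; the table tbl_dup and the outcome variable are made up for illustration, and the exact wording of any warning produced is not shown because it may differ between versions.

```r
library(pointblank)
library(dplyr)

# Illustrative table in which the first two rows are identical
tbl_dup <- tibble(
  a = c(1, 1, 2),
  b = c("x", "x", "y")
)

# Use the validation function directly on the table, asking for
# a warning once at least one test unit fails (absolute threshold).
# tryCatch() records whether a warning was signalled.
outcome <- tryCatch(
  {
    rows_distinct(
      tbl_dup,
      columns = vars(a, b),
      actions = action_levels(warn_at = 1)
    )
    "no condition raised"
  },
  warning = function(w) "warning raised"
)

outcome
```

Swapping warn_at for stop_at would turn the same threshold into an error, which can be useful for halting a data pipeline.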

Want to describe this validation step in some detail? Keep in mind that this is only useful if x is an agent. If that's the case, brief the agent with some text that fits. Don't worry if you don't want to do it. The autobrief protocol kicks in when brief = NULL and a simple brief will then be automatically generated.

See Also

Other Validation Step Functions: col_exists(), col_is_character(), col_is_date(), col_is_factor(), col_is_integer(), col_is_logical(), col_is_numeric(), col_is_posix(), col_vals_between(), col_vals_equal(), col_vals_gte(), col_vals_gt(), col_vals_in_set(), col_vals_lte(), col_vals_lt(), col_vals_not_between(), col_vals_not_equal(), col_vals_not_in_set(), col_vals_not_null(), col_vals_null(), col_vals_regex(), conjointly()

Examples

library(pointblank)
library(dplyr)

# Create a simple table with three
# columns of numerical values
tbl <-
  tibble(
    a = c(5, 7, 6, 5, 8, 7),
    b = c(7, 1, 0, 0, 8, 3),
    c = c(1, 1, 1, 3, 3, 3)
  )

# Validate that when considering only
# data in columns `a` and `b`, there
# are no duplicate rows (i.e., all
# rows are distinct)
agent <-
  create_agent(tbl = tbl) %>%
  rows_distinct(vars(a, b)) %>%
  interrogate()

# Determine if this validation
# step has passed by using
# `all_passed()`
all_passed(agent)