partition: Convert columns of a data frame to Boolean or fuzzy sets (triangular, trapezoidal, or raised-cosine)

Description

Transform selected columns of a data frame into either dummy logical variables or membership degrees of fuzzy sets, while leaving all remaining columns unchanged. Each transformed column typically produces multiple new columns in the output.

Usage

partition(
  .data,
  .what = everything(),
  ...,
  .breaks = NULL,
  .labels = NULL,
  .na = TRUE,
  .keep = FALSE,
  .method = "crisp",
  .style = "equal",
  .style_params = list(),
  .right = TRUE,
  .span = 1,
  .inc = 1
)

Value

A tibble with .data transformed into Boolean or fuzzy predicates.

Arguments

.data: A data frame to be processed.
.what: A tidyselect expression (see tidyselect syntax) selecting the columns to transform.
...: Additional tidyselect expressions selecting more columns.
.breaks: Ignored if .method = "dummy". For other methods, either an integer (number of intervals/sets) or a numeric vector of breakpoints.
.labels: Optional character vector with labels used for new column names. If NULL, labels are generated automatically.
.na: If TRUE, adds an extra logical column (e.g., x=NA) for each source column that contains NA values.
.keep: If TRUE, keep original columns in the output.
.method: Transformation method for numeric columns: "dummy", "crisp", "triangle", or "raisedcos".
.style: Controls how breakpoints are determined when .breaks is an integer. Values correspond to methods in classInt::classIntervals(), e.g., "equal", "quantile", "kmeans", "sd", "hclust", "bclust", "fisher", "jenks", "dpih", "headtails", "maximum", "box". Defaults to "equal". Used only if .method = "crisp" and .breaks is a single integer.
.style_params: A named list of parameters passed to the interval computation method specified by .style. Used only if .method = "crisp" and .breaks is an integer.
.right: For "crisp", whether intervals are right-closed and left-open (TRUE), or left-closed and right-open (FALSE).
.span: Number of consecutive breaks forming a set. For "crisp", controls interval width. For "triangle"/"raisedcos", .span = 1 produces single-peaked sets (triangular or raised-cosine) and .span = 2 produces flat-topped (trapezoidal) sets.
.inc: Step size for shifting breaks when generating successive sets. With .inc = 1, all possible sets are created; larger values skip sets.

Crisp transformation of numeric data

For .method = "crisp", numeric columns are discretized into a set of dummy logical variables, each representing one interval of values.

If .breaks is an integer, it specifies the number of intervals into which the column should be divided. The intervals are determined using the .style and .style_params arguments, allowing not only equal-width but also data-driven breakpoints (e.g., quantile or k-means based). The first and last intervals automatically extend to infinity.
If .breaks is a numeric vector, it specifies interval boundaries directly. Infinite values are allowed.

The .style argument defines how breakpoints are computed when .breaks is an integer. Supported methods (from classInt::classIntervals()) include:

"equal" – equal-width intervals across the column range (default);
"quantile" – equal-frequency intervals (see quantile() for additional parameters that may be passed through .style_params; note that the probs parameter is set automatically and should not be included in .style_params);
"kmeans" – intervals found by 1D k-means clustering (see kmeans() for additional parameters);
"sd" – intervals based on standard deviations from the mean;
"hclust" – hierarchical clustering intervals (see hclust() for additional parameters);
"bclust" – bagged clustering intervals (see e1071::bclust() for additional parameters);
"fisher" / "jenks" – Fisher–Jenks optimal partitioning;
"dpih" – intervals whose bin width is selected by the direct plug-in method (see KernSmooth::dpih() for additional parameters);
"headtails" – head/tails natural breaks;
"maximum" – maximum-breaks partitioning, which places breaks at the largest gaps between sorted values;
"box" – breaks at boxplot hinges.

Additional parameters for these methods can be passed through .style_params, which should be a named list of arguments accepted by the respective algorithm in classInt::classIntervals(). For example, when .style = "kmeans", one can specify .style_params = list(algorithm = "Lloyd") to request Lloyd's algorithm for k-means clustering.
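As a rough illustration of what an integer .breaks means, the sketch below computes equal-width and equal-frequency breakpoints for CO2$conc in base R and then opens the outer intervals to infinity, as described above. It is for orientation only: the breakpoints partition() obtains through classInt::classIntervals() may differ in detail, and open_ended() is a hypothetical helper, not part of any package.

```r
# Base-R approximation of the "equal" and "quantile" styles for 4 intervals.
x <- CO2$conc
n <- 4

# Equal-width: n + 1 evenly spaced breakpoints across the observed range.
equal_breaks <- seq(min(x), max(x), length.out = n + 1)

# Equal-frequency: breakpoints at evenly spaced quantiles.
quantile_breaks <- unname(quantile(x, probs = seq(0, 1, length.out = n + 1)))

# Hypothetical helper: let the first and last intervals extend to infinity.
open_ended <- function(b) {
  b[1] <- -Inf
  b[length(b)] <- Inf
  b
}

open_ended(equal_breaks)
open_ended(quantile_breaks)
```

Comparing the two vectors shows how a data-driven style moves the inner breakpoints toward regions where the observations are dense.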

With .span = 1 and .inc = 1, the generated intervals are consecutive and non-overlapping. For example, with .breaks = c(1, 3, 5, 7, 9, 11) and .right = TRUE, the intervals are \((1;3]\), \((3;5]\), \((5;7]\), \((7;9]\), and \((9;11]\). If .right = FALSE, the intervals are left-closed: \([1;3)\), \([3;5)\), etc.
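The effect of .right can be previewed with base R's cut(), which uses the same right-closed versus left-closed convention. This is only an analogy: whether partition() relies on cut() internally is not stated here, and cut() prints intervals with a comma rather than a semicolon.

```r
# Values that sit exactly on a breakpoint show the difference most clearly.
x <- c(1, 3, 5)
breaks <- c(1, 3, 5, 7)

# Right-closed, left-open intervals: 1 falls outside (1,3], so it becomes NA.
right_closed <- cut(x, breaks = breaks, right = TRUE)

# Left-closed, right-open intervals: every value lands in the interval it starts.
left_closed <- cut(x, breaks = breaks, right = FALSE)

data.frame(x, right_closed, left_closed)
```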

Larger .span values produce overlapping intervals. For example, with .span = 2, .inc = 1, and .right = TRUE, intervals are \((1;5]\), \((3;7]\), \((5;9]\), \((7;11]\).

The .inc argument controls how far the window shifts along .breaks.

.span = 1, .inc = 2 → \((1;3]\), \((5;7]\), \((9;11]\).
.span = 2, .inc = 3 → \((1;5]\), \((7;11]\).
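The sliding-window behaviour of .span and .inc can be reproduced with a few lines of base R. The function crisp_windows() below is a hypothetical helper written for this illustration, not part of the package, and it only mirrors the windowing rule as described in this section.

```r
# Hypothetical helper: each interval runs from break i to break i + span,
# and the starting index i advances by `inc`.
crisp_windows <- function(breaks, span = 1, inc = 1) {
  starts <- seq(1, length(breaks) - span, by = inc)
  data.frame(lower = breaks[starts], upper = breaks[starts + span])
}

breaks <- c(1, 3, 5, 7, 9, 11)

crisp_windows(breaks, span = 1, inc = 1)  # consecutive, non-overlapping
crisp_windows(breaks, span = 2, inc = 1)  # overlapping
crisp_windows(breaks, span = 1, inc = 2)  # every other interval skipped
```

Each returned row corresponds to one generated logical column.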

Fuzzy transformation of numeric data

For .method = "triangle" or .method = "raisedcos", numeric columns are converted into fuzzy membership degrees in \([0,1]\).

If .breaks is an integer, it specifies the number of fuzzy sets.
If .breaks is a numeric vector, it directly defines fuzzy set boundaries. Infinite values produce open-ended sets.

With .span = 1, each fuzzy set is defined by three consecutive breaks: membership is 0 outside the outer breaks, rises to 1 at the middle break, then decreases back to 0. This yields triangular or raised-cosine sets.

With .span > 1, fuzzy sets use four consecutive breaks: membership increases between the first two, remains 1 between the middle two, and decreases between the last two. This creates trapezoidal sets. Border shapes are linear for .method = "triangle" and cosine for .method = "raisedcos".
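The membership shapes described above can be sketched in base R. All three functions below are hypothetical helpers for illustration and assume finite breaks; in particular, the half-cosine formula is an assumption about what a raised-cosine border looks like, not a statement of the package's exact implementation.

```r
# Trapezoid on breaks a < b <= c < d: linear rise on [a, b], flat top on
# [b, c], linear fall on [c, d], zero elsewhere.
trapezoid_membership <- function(x, a, b, c, d) {
  pmax(0, pmin((x - a) / (b - a), 1, (d - x) / (d - c)))
}

# A triangle is the special case in which the flat top shrinks to one point.
triangle_membership <- function(x, a, b, c) {
  trapezoid_membership(x, a, b, b, c)
}

# Assumed raised-cosine border: reshape a linear degree m in [0, 1]
# into a smooth half-cosine ramp with the same endpoints.
raisedcos_shape <- function(m) (1 - cos(pi * m)) / 2

x <- 1:7
triangle_membership(x, 1, 3, 5)
trapezoid_membership(x, 1, 3, 5, 7)
raisedcos_shape(triangle_membership(x, 1, 3, 5))
```

The raised-cosine variant agrees with the linear one at degrees 0, 0.5, and 1 but changes more gently near the ends of each border.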

The .inc argument defines the step between break windows:

.span = 1, .inc = 1 → \((1;3;5)\), \((3;5;7)\), \((5;7;9)\), \((7;9;11)\).
.span = 2, .inc = 1 → \((1;3;5;7)\), \((3;5;7;9)\), \((5;7;9;11)\).
.span = 1, .inc = 3 → \((1;3;5)\), \((7;9;11)\).
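The break windows listed above can be generated with a small base-R sketch. fuzzy_windows() is a hypothetical helper, and the window width of .span + 2 breaks is inferred from the two cases shown (three breaks for .span = 1, four for .span = 2); behaviour for larger spans is not covered here.

```r
# Hypothetical helper: return the breaks that define each fuzzy set.
fuzzy_windows <- function(breaks, span = 1, inc = 1) {
  width <- span + 2  # inferred: 3 breaks for span = 1, 4 for span = 2
  starts <- seq(1, length(breaks) - width + 1, by = inc)
  lapply(starts, function(i) breaks[i:(i + width - 1)])
}

breaks <- c(1, 3, 5, 7, 9, 11)

fuzzy_windows(breaks, span = 1, inc = 1)  # four triangular sets
fuzzy_windows(breaks, span = 2, inc = 1)  # three trapezoidal sets
fuzzy_windows(breaks, span = 1, inc = 3)  # two triangular sets
```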

Author

Michal Burda

Details

These transformations are most often used as a preprocessing step before calling dig() or one of its derivatives, such as dig_correlations(), dig_paired_baseline_contrasts(), or dig_associations().

The transformation depends on the column type:

logical column x is expanded into two logical columns: x=TRUE and x=FALSE;
factor column x with levels l1, l2, l3 becomes three logical columns: x=l1, x=l2, and x=l3;
numeric column x is transformed according to .method:
- .method = "dummy": the column is treated as a factor with one level per unique value, then expanded into dummy columns;
- .method = "crisp": the column is discretized into intervals (defined by .breaks, .style, and .style_params) and expanded into dummy columns representing those intervals;
- .method = "triangle" or .method = "raisedcos": the column is converted into one or more fuzzy sets, each represented by membership degrees in \([0,1]\) (triangular or raised-cosine shaped).
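The expansion of logical and factor columns can be mimicked in base R to see the naming scheme. expand_dummies() is a hypothetical helper written for this sketch; it returns a plain data frame rather than a tibble and does not handle the .na option.

```r
# Hypothetical helper: one logical column per level, named "<name>=<level>".
expand_dummies <- function(f, name) {
  cols <- lapply(levels(f), function(l) f == l)
  names(cols) <- paste0(name, "=", levels(f))
  data.frame(cols, check.names = FALSE)
}

# A logical column is treated here as a factor with levels TRUE and FALSE.
flag <- c(TRUE, FALSE, TRUE)
expand_dummies(factor(flag, levels = c(TRUE, FALSE)), "flag")

# A factor column yields one column per level.
head(expand_dummies(CO2$Treatment, "Treatment"))
```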

Details of numeric transformations are controlled by .breaks, .labels, .style, .style_params, .right, .span, and .inc.

Crisp partitioning is efficient and works well when attributes have distinct categories or clear boundaries.
Fuzzy partitioning is recommended for modeling gradual changes or uncertainty, allowing smooth category transitions at a higher computational cost.

Examples

# Crisp transformation using equal-width bins
partition(CO2, conc, .method = "crisp", .breaks = 4)

# Crisp transformation using quantile-based bins
partition(CO2, conc, .method = "crisp", .breaks = 4, .style = "quantile")

# Crisp transformation using k-means clustering for breakpoints
partition(CO2, conc, .method = "crisp", .breaks = 4, .style = "kmeans")

# Crisp transformation with k-means breakpoints computed by Lloyd's algorithm
partition(CO2, conc, .method = "crisp", .breaks = 4, .style = "kmeans",
          .style_params = list(algorithm = "Lloyd"))

# Fuzzy triangular transformation
partition(CO2, conc:uptake, .method = "triangle", .breaks = 3)

# Raised-cosine fuzzy sets
partition(CO2, conc:uptake, .method = "raisedcos", .breaks = 3)

# Overlapping trapezoidal fuzzy sets (Ruspini condition)
partition(CO2, conc:uptake, .method = "triangle", .breaks = 3,
          .span = 2, .inc = 2)

# Different settings per column
CO2 |>
  partition(Plant:Treatment) |>
  partition(conc,
            .method = "raisedcos",
            .breaks = c(-Inf, 95, 175, 350, 675, 1000, Inf)) |>
  partition(uptake,
            .method = "triangle",
            .breaks = c(-Inf, 7.7, 28.3, 45.5, Inf),
            .labels = c("low", "medium", "high"))
