bf_analyze: Analyze encoding options for data

Description

This function helps you choose appropriate bit allocations for encoding data. It auto-detects the data type and provides relevant analysis:

Numeric with decimals: floating-point encoding; shows the trade-offs and which exponent/significand combinations are adequate for your range and precision requirements.
Integer: (signed) integer encoding; shows how many bits are required.
Factor/character: category/enumeration encoding; shows which levels are in the data and how many bits are required.
Logical: boolean encoding; shows whether NA values require a second bit.
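The bit counts for the non-float cases follow from counting the distinct states that must be told apart. The base R sketch below illustrates that arithmetic only; bits_for_states is a hypothetical helper, not part of the package, and the integer case assumes non-negative values (how bf_analyze accounts for a sign is not shown here).

```r
# Hypothetical helper (not a package function): number of bits needed
# to distinguish n states, with a minimum of one bit.
bits_for_states <- function(n) max(1L, as.integer(ceiling(log2(n))))

# Factor/character: one state per level (3 levels here)
commodity <- factor(c("wheat", "maize", "soy", "maize", "wheat"))
bits_levels <- bits_for_states(nlevels(commodity))

# Logical: TRUE/FALSE fit in one bit; an NA adds a third state
flags <- c(TRUE, FALSE, TRUE, NA)
bits_flags <- bits_for_states(2L + anyNA(flags))

# Non-negative integer (assumption): states 0..max(counts)
counts <- c(0L, 5L, 10L, 100L)
bits_counts <- bits_for_states(max(counts) + 1L)

c(levels = bits_levels, flags = bits_flags, counts = bits_counts)
```

Without the NA, the logical vector would fit in a single bit, which is the question the logical analysis answers.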

Usage

bf_analyze(
  x,
  range = NULL,
  decimals = NULL,
  min_bits = NULL,
  max_bits = 16L,
  fields = NULL,
  plot = FALSE
)

Value

An object of class bf_analysis with analysis results.

Arguments

x: A numeric, integer, logical, factor, or character vector, or a single-layer SpatRaster to analyze. The type is auto-detected.
range: numeric(2)
optional target range c(min, max) to design for (float analysis only). Defaults to the actual data range.
decimals: integer(1)
optional decimal places of precision required (float analysis only).
min_bits: integer(1)
minimum total bits to display in the Pareto table output. Configurations with fewer bits are hidden. Default is NULL (show all).
max_bits: integer(1)
maximum total bits to consider. Defaults to 16.
fields: list
optional list specifying which configurations to analyze (float analysis only). See the section on specifying configurations with fields.
plot: logical(1)
whether to generate a plot (float analysis only). Default is FALSE.

Float analysis output

For numeric (float) data, the output table shows Pareto-optimal exponent/significand configurations. The columns are:

Exp, Sig, Total: Number of exponent bits, significand bits, and their sum. More exponent bits extend the representable range (at the cost of coarser resolution), while more significand bits improve resolution within each exponent band.
Underflow: Percentage of data values that fall below the smallest representable positive value. These values are rounded to zero.
Overflow: Percentage of data values that exceed the largest representable value. These values are clipped to the maximum.
Changed: Percentage of data values that change when encoded and decoded (i.e., that do not survive the round-trip exactly).
Min Res, Max Res: Smallest and largest step size between adjacent representable values. In minifloat encoding, resolution varies across the range: small values near zero have fine resolution (small steps), while large values have coarse resolution (large steps). A Max Res of 1.0 means that in the coarsest region, only integer values can be represented -- continuous input will be rounded to whole numbers.
RMSE: Root mean squared error between original and decoded values, computed over all non-NA data points.
Max Err: Largest absolute difference between any original value and its decoded counterpart.
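To make the columns concrete, the following base R sketch computes the same kinds of metrics for one configuration. It is not the package's implementation: it assumes an unsigned, IEEE-like layout (bias 2^(exp_bits - 1) - 1, subnormals, no codes reserved for Inf/NaN) and round-to-nearest, and the helper names minifloat_values and round_trip are hypothetical.

```r
# All values representable with the assumed layout (hypothetical helper)
minifloat_values <- function(exp_bits, sig_bits) {
  bias <- 2^(exp_bits - 1) - 1
  grid <- expand.grid(m = 0:(2^sig_bits - 1), e = 0:(2^exp_bits - 1))
  vals <- ifelse(
    grid$e == 0,
    grid$m / 2^sig_bits * 2^(1 - bias),             # subnormal numbers
    (1 + grid$m / 2^sig_bits) * 2^(grid$e - bias)   # normal numbers
  )
  sort(unique(vals))
}

# Encode/decode by clipping to the representable range and
# rounding to the nearest representable value (hypothetical helper)
round_trip <- function(x, vals) {
  x <- pmin(pmax(x, min(vals)), max(vals))
  idx <- findInterval(x, vals)
  lower <- vals[idx]
  upper <- vals[pmin(idx + 1L, length(vals))]
  ifelse(x - lower <= upper - x, lower, upper)
}

x <- c(0.5, 1.25, 3.1, 7.9, 12.4)      # made-up data, no NA values
vals <- minifloat_values(exp_bits = 3, sig_bits = 4)
decoded <- round_trip(x, vals)

smallest_pos <- min(vals[vals > 0])
underflow <- 100 * mean(x > 0 & x < smallest_pos)  # % below smallest positive
overflow  <- 100 * mean(x > max(vals))             # % above largest value
changed   <- 100 * mean(x != decoded)              # % not surviving round-trip
min_res   <- min(diff(vals))                       # finest step
max_res   <- max(diff(vals))                       # coarsest step
rmse      <- sqrt(mean((x - decoded)^2))
max_err   <- max(abs(x - decoded))
```

With 3 exponent and 4 significand bits under these assumptions, the coarsest step is 1 and the finest is 1/64, so the same format resolves values near zero far more finely than values near its upper limit.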

Choosing a configuration

The table only shows Pareto-optimal configurations, i.e., those where no other configuration is strictly better on all quality metrics for the same or fewer total bits. To choose between them:

Check Underflow and Overflow first. Non-zero values indicate data loss at the extremes of your range. Adding exponent bits or using the range argument to widen the target range can help.
Compare RMSE and Max Err to your acceptable precision. If you specified decimals, look for configurations where Max Res is at most 10^(-decimals).
If Max Res is >= 1, decoded values in the upper range will appear as integers even if the input was continuous. This may or may not be acceptable depending on your application.
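The decimals rule can be checked by hand. For normalized binary floating-point values, the step between neighbours in the band [2^k, 2^(k+1)) is 2^(k - sig_bits). The base R sketch below uses this to find the fewest significand bits whose step at the top of a target range is at most 10^(-decimals). The helper step_at is hypothetical, and the step at the upper end of your data range may differ from the Max Res column if that column covers the whole representable range of the format.

```r
# Hypothetical helper: step size between adjacent representable values
# in the power-of-two band that contains `value`
step_at <- function(value, sig_bits) 2^(floor(log2(value)) - sig_bits)

upper    <- 20              # largest value to encode, e.g. range = c(0, 20)
decimals <- 2
target   <- 10^(-decimals)  # required step size: 0.01

sig_bits <- 1:12
steps    <- step_at(upper, sig_bits)

# Fewest significand bits whose step at `upper` meets the target
min_sig <- sig_bits[which(steps <= target)[1]]

data.frame(sig_bits = sig_bits, step = steps, ok = steps <= target)
```

Here 10 significand bits give a step of 2^-6 = 0.015625, which is too coarse, while 11 bits give 2^-7 = 0.0078125, which meets the two-decimal target.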

Specifying configurations with fields

By default, all combinations up to max_bits are evaluated and only the Pareto front is shown. Use the fields argument to instead compare specific configurations:

fields = list(exponent = 4) shows all significand values paired with 4 exponent bits.
fields = list(exponent = c(3, 4), significand = c(5, 4)) compares exp=3/sig=5 and exp=4/sig=4.

Details

All of this can be applied to columns in a table as well as to layers in a SpatRaster. Use bf_analyze before bf_map to understand your encoding options.

Examples


# load the package that provides bf_analyze and the bf_tbl example data
library(bitfield)

# float analysis (numeric with decimals)
bf_analyze(bf_tbl$yield)

# with specific decimal precision requirement
bf_analyze(bf_tbl$yield, decimals = 2)

# design for a larger range than current data
bf_analyze(bf_tbl$yield, range = c(0, 20))

# with visualization
bf_analyze(bf_tbl$yield, decimals = 2, plot = TRUE)

# compare specific configurations
bf_analyze(bf_tbl$yield, fields = list(exponent = c(2, 3, 4), significand = c(5, 4, 3)))

# show all combinations for a specific exponent
bf_analyze(bf_tbl$yield, fields = list(exponent = 4))

# integer analysis
bf_analyze(as.integer(c(0, 5, 10, 100)))

# category/enum analysis
bf_analyze(bf_tbl$commodity)

# boolean analysis
bf_analyze(c(TRUE, FALSE, TRUE, NA))

# raster with attribute table
library(terra)
r <- rast(nrows = 3, ncols = 3, vals = c(0, 1, 2, 0, 1, 2, 0, 1, 2))
levels(r) <- data.frame(id = 0:2, label = c("low", "medium", "high"))
bf_analyze(r)
