extracting_distribution: Build the `feat_dist` data frame for AutoTab

Description

Creates one row per original variable with columns:

column_name: variable name
distribution: one of "gaussian", "bernoulli", or "categorical"
num_params: number of decoder outputs the VAE should produce for that variable

Usage

extracting_distribution(data)

Value

A data frame with one row per variable and the columns column_name, distribution, and num_params. See also feat_reorder().

Arguments

data: Data frame of the original (not preprocessed) variables.

Details

A variable is classified as:

bernoulli if it has exactly 2 unique values (any type)
categorical if it is a character/factor with more than 2 unique values
gaussian otherwise (e.g., numeric with >2 distinct values)
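
The rules above can be sketched in a few lines of base R. This is only an illustration of the documented behaviour, not AutoTab's source code: the helper names classify_variable() and sketch_feat_dist() are hypothetical, and counting categorical levels with unique() is an assumption.

```r
# Hypothetical re-implementation of the documented rules; not AutoTab's code.
classify_variable <- function(x) {
  n_unique <- length(unique(x))
  if (n_unique == 2) {
    "bernoulli"                       # exactly 2 unique values, any type
  } else if ((is.character(x) || is.factor(x)) && n_unique > 2) {
    "categorical"                     # character/factor with more than 2 values
  } else {
    "gaussian"                        # everything else
  }
}

sketch_feat_dist <- function(data) {
  distribution <- vapply(data, classify_variable, character(1))
  # Parameter counts follow the Details text: 2 for gaussian, 1 for bernoulli,
  # one per level for categorical (assumed here to be the unique-value count).
  num_params <- mapply(function(x, d) {
    switch(d,
      gaussian    = 2L,
      bernoulli   = 1L,
      categorical = length(unique(x))
    )
  }, data, distribution)
  data.frame(
    column_name  = names(data),
    distribution = unname(distribution),
    num_params   = as.integer(unname(num_params)),
    stringsAsFactors = FALSE
  )
}
```

Taken literally, the "otherwise" rule also sends a constant column (one unique value) to gaussian; whether extracting_distribution() treats that case specially is not stated here.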

AutoTab does not handle missing data. If data contains NA values, a message alerts the user.
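
Because missing values are not supported, it can help to check for them before calling the function. The following is a minimal sketch in base R; report_missing() is a hypothetical helper, and its message wording is not the one AutoTab emits.

```r
# Hypothetical pre-check; AutoTab's own message and behaviour may differ.
report_missing <- function(data) {
  na_counts <- colSums(is.na(data))        # number of NA values per column
  na_counts <- na_counts[na_counts > 0]    # keep only columns with NA values
  if (length(na_counts) > 0) {
    message("Columns with NA values: ",
            paste(names(na_counts), collapse = ", "))
  }
  invisible(na_counts)
}
```

How to resolve the missing values (dropping rows, imputing, and so on) is left to the user.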

In AutoTab, the decoder outputs distribution-specific parameters for each variable, not reconstructed values directly. Therefore:

For each continuous (Gaussian) variable, the decoder outputs two parameters: the mean (\(\mu\)) and the standard deviation (\(\sigma\)).
For each binary (Bernoulli) variable, the decoder outputs one parameter: the probability (\(p\)) of observing a 1.
For each categorical variable, the decoder outputs one parameter per category level: the probability of each class.

As a result, the decoder output matrix will typically have more columns than the original training data.

For example, if your original dataset has:

1 continuous variable   →  2 decoder parameters
1 binary variable       →  1 decoder parameter
1 categorical variable with 3 levels → 3 decoder parameters

The total number of decoder outputs will be 2 + 1 + 3 = 6, even though the input data has only 3 original variables.

AutoTab keeps track of this mapping internally through the feat_dist object, ensuring that the reconstruction loss and sampling functions correctly handle each distributional head.
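
To make the mapping concrete, the sketch below derives which decoder output columns would belong to each variable in the three-variable example. It assumes the heads are laid out left to right in feat_dist row order; that layout, and the names start_col, end_col, and head_map, are illustrative assumptions rather than documented AutoTab internals.

```r
# feat_dist written by hand to mirror the three-variable example above.
feat_dist <- data.frame(
  column_name  = c("cont", "bin", "cat"),
  distribution = c("gaussian", "bernoulli", "categorical"),
  num_params   = c(2L, 1L, 3L)
)

# Assumption: heads occupy consecutive columns in feat_dist row order.
end_col   <- cumsum(feat_dist$num_params)          # last column of each head
start_col <- end_col - feat_dist$num_params + 1L   # first column of each head
head_map  <- data.frame(feat_dist, start_col = start_col, end_col = end_col)

# Total decoder width: 2 + 1 + 3
total_outputs <- sum(feat_dist$num_params)
```

Under this assumed layout, cont would occupy columns 1 to 2, bin column 3, and cat columns 4 to 6.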

Examples

data_example <- data.frame(
  cont = rnorm(5),
  bin  = c(0,1,0,1,1),
  cat  = factor(c("A","B","C","A","C"))
)

feat_dist <- extracting_distribution(data_example)
print(feat_dist)
#   column_name distribution num_params
# 1        cont     gaussian          2
# 2         bin    bernoulli          1
# 3         cat  categorical          3

# The decoder will therefore output 6 total columns (2+1+3)