A variable is classified as:
bernoulli if it has exactly 2 unique values (any type)
categorical if it is a character/factor with more than 2 unique values
gaussian otherwise (e.g., numeric with >2 distinct values)
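The three rules above can be sketched in a few lines of base R. This is only an illustration of the stated rules, not AutoTab's actual implementation: the helper `classify_variable()` and the `toy` data frame are hypothetical names introduced here, and the sketch assumes the column contains no NA values.

```r
# Hypothetical helper (not part of AutoTab): applies the three rules to one column.
classify_variable <- function(x) {
  n_unique <- length(unique(x))
  if (n_unique == 2) {
    # Exactly 2 unique values, regardless of type
    "bernoulli"
  } else if ((is.character(x) || is.factor(x)) && n_unique > 2) {
    # Character/factor with more than 2 unique values
    "categorical"
  } else {
    # Everything else falls through, mirroring the "otherwise" rule
    "gaussian"
  }
}

# Made-up example data with one column of each kind
toy <- data.frame(
  age    = c(23.5, 41.0, 35.2, 58.9),
  smoker = c(0, 1, 0, 1),
  region = c("north", "south", "east", "north"),
  stringsAsFactors = FALSE
)

types <- vapply(toy, classify_variable, character(1))
print(types)
```

Note that under these rules a numeric column coded 0/1 is treated as bernoulli even though it is stored as a number, because only the count of unique values matters for that rule.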
AutoTab is not built to handle missing data. If the data contain NA values, a message notifies the user.
In AutoTab, the decoder outputs distribution-specific parameters for each variable,
not reconstructed values directly. Therefore:
Continuous (Gaussian) variables output two parameters per feature:
the mean (\(\mu\)) and the standard deviation (\(\sigma\)).
Binary (Bernoulli) variables output one parameter:
the probability \(p\) of observing a 1.
Categorical variables output one parameter per category level:
the probabilities corresponding to each possible class.
As a result, the decoder output matrix will typically have more columns
than the original training data.
For example, if your original dataset has:
1 continuous variable → 2 decoder parameters
1 binary variable → 1 decoder parameter
1 categorical variable with 3 levels → 3 decoder parameters
The total number of decoder outputs will be 2 + 1 + 3 = 6, even though the
input data has only 3 original variables.
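The arithmetic in this example can be reproduced with a small base R sketch. The helper `n_decoder_params()` and the `spec` table are hypothetical and exist only for illustration; they are not AutoTab functions or objects.

```r
# Hypothetical helper: number of decoder parameters for one variable
n_decoder_params <- function(type, n_levels = NA_integer_) {
  switch(type,
    gaussian    = 2L,                    # mean and standard deviation
    bernoulli   = 1L,                    # probability of observing a 1
    categorical = as.integer(n_levels),  # one probability per level
    stop("unknown distribution: ", type)
  )
}

# The three variables from the example above (names are made up)
spec <- data.frame(
  variable = c("x_cont", "x_bin", "x_cat"),
  type     = c("gaussian", "bernoulli", "categorical"),
  n_levels = c(NA, NA, 3L),
  stringsAsFactors = FALSE
)

spec$n_params <- unname(mapply(n_decoder_params, spec$type, spec$n_levels))
total <- sum(spec$n_params)
print(spec)
print(total)
```

Adding a categorical variable with many levels therefore widens the decoder output far more than adding a continuous or binary variable does.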
AutoTab keeps track of this mapping internally through the feat_dist object,
ensuring that the reconstruction loss and sampling functions correctly handle
each distributional head.
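To make the idea of per-head sampling concrete, the sketch below draws one value from each head of the 6-column example using base R only. The parameter names, their ordering, and the category labels are assumptions made for illustration; the sketch does not reproduce the structure of feat_dist or AutoTab's own sampling functions.

```r
set.seed(1)

# One row of hypothetical decoder output, laid out as
# [mu, sigma | p | three class probabilities]
params <- c(mu = 0.5, sigma = 1.2,
            p = 0.7,
            cat_a = 0.2, cat_b = 0.5, cat_c = 0.3)

# Gaussian head: draw from Normal(mu, sigma)
x_cont <- rnorm(1, mean = params[["mu"]], sd = params[["sigma"]])

# Bernoulli head: draw a 0/1 value with success probability p
x_bin <- rbinom(1, size = 1, prob = params[["p"]])

# Categorical head: draw one level according to the class probabilities
x_cat <- sample(c("a", "b", "c"), size = 1,
                prob = params[c("cat_a", "cat_b", "cat_c")])

print(list(x_cont = x_cont, x_bin = x_bin, x_cat = x_cat))
```

Each head consumes a different number of columns, which is why a mapping from decoder columns back to the original variables is needed before either the loss or a sample can be computed.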