Synthetic dataset generated from a multivariate normal distribution with a strong correlation structure (\(\rho = 0.8\)). It contains 500 observations and 10 variables of mixed type (continuous, categorical, binary, and a weights column). No contaminated cases were added in this version, so the dataset represents a clean scenario with 0% contamination. The data follow the design in boj2024robustificationdbrobust.
Data_HC_no_contamination
A data frame with 500 rows and 10 variables:
Continuous variable 1
Continuous variable 2
Continuous variable 3
Continuous variable 4
Categorical variable 1 (3 categories, approx. balanced)
Categorical variable 2 (3 categories, approx. balanced)
Categorical variable 3 (4 categories, uniform distribution)
Binary variable 1 (40% zeros, 60% ones)
Binary variable 2 (60% zeros, 40% ones)
Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.
Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Unlike the other datasets in this collection, this one contains no artificial contamination.
The weighting scheme gives larger weights to more frequent category combinations.
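The construction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code that produced the dataset: it assumes every pair of latent normal variables shares the same correlation \(\rho = 0.8\), that "approx. balanced" means roughly equal thirds, that V5 is the first categorical variable and V8 the first binary variable (as the order of the variable list suggests), and that the weights are normalized to sum to 1. The function names `equicorrelated_normal`, `discretize`, and `frequency_weights` are hypothetical.

```python
import random
from collections import Counter


def equicorrelated_normal(n, p, rho, seed=0):
    """Draw n rows of p standard normals with pairwise correlation rho.

    Uses a one-factor construction: each variable mixes a shared normal
    factor with its own independent normal noise.
    """
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        rows.append([a * shared + b * rng.gauss(0.0, 1.0) for _ in range(p)])
    return rows


def discretize(values, cum_probs):
    """Map values to codes 0..k-1 using empirical percentile cut points."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[int(round(q * n)) - 1] for q in cum_probs]
    # The code is the number of cut points lying strictly below the value.
    return [sum(v > c for c in cuts) for v in values]


def frequency_weights(cat, binary):
    """Weight each row by the frequency of its (cat, binary) combination."""
    combos = list(zip(cat, binary))
    counts = Counter(combos)
    raw = [counts[c] for c in combos]
    total = sum(raw)
    return [r / total for r in raw]  # assumed normalization: weights sum to 1


# Nine latent normals: four stay continuous, five are discretized.
rows = equicorrelated_normal(n=500, p=9, rho=0.8)
cols = [list(c) for c in zip(*rows)]

continuous = cols[:4]                                # continuous variables 1-4
cat1 = discretize(cols[4], [1 / 3, 2 / 3])           # 3 categories, approx. balanced
cat2 = discretize(cols[5], [1 / 3, 2 / 3])           # 3 categories, approx. balanced
cat3 = discretize(cols[6], [0.25, 0.50, 0.75])       # 4 categories, uniform
bin1 = discretize(cols[7], [0.4])                    # 40% zeros, 60% ones
bin2 = discretize(cols[8], [0.6])                    # 60% zeros, 40% ones
weights = frequency_weights(cat1, bin1)              # assumed to play the role of V5 and V8
```

Because the thresholds are empirical percentiles of the sample itself, the category proportions match the targets by construction, while the discretized variables still carry the dependence of the underlying normal margins.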
boj2024robustificationdbrobust