Synthetic dataset generated from a multivariate normal distribution with a strong correlation structure (\(\rho = 0.8\)). It contains 550 observations and 10 variables of mixed type (continuous, categorical, binary, and observation weights). The last 50 rows are contaminated observations, created by taking a subset of the original units and adding a perturbation of three times the standard deviation to each quantitative variable. This gives a controlled contamination level of 10% relative to the 500 original observations (50 of the 550 rows, about 9.1% of the full data set). The data follow the design in boj2024robustificationdbrobust.
Data_HC_contamination
A data frame with 550 rows and 10 variables:
Continuous variable 1
Continuous variable 2
Continuous variable 3
Continuous variable 4
Categorical variable 1 (3 categories, approx. balanced)
Categorical variable 2 (3 categories, approx. balanced)
Categorical variable 3 (4 categories, uniform distribution)
Binary variable 1 (40% zeros, 60% ones)
Binary variable 2 (60% zeros, 40% ones)
Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.
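The exact weighting formula is not given here. As a minimal sketch of one possible "proportional frequency-based scheme", the Python snippet below gives each row a weight proportional to the relative frequency of its combination of a categorical and a binary variable, then normalises the weights to sum to 1. The function name `frequency_weights`, the toy values, and the normalisation step are assumptions made for illustration, not the package's implementation.

```python
from collections import Counter

def frequency_weights(cat, binary):
    """Weight each row by the relative frequency of its (cat, binary) pair."""
    pairs = list(zip(cat, binary))
    counts = Counter(pairs)               # how often each combination occurs
    n = len(pairs)
    raw = [counts[p] / n for p in pairs]  # relative frequency of the row's combination
    total = sum(raw)
    return [r / total for r in raw]       # normalise so the weights sum to 1 (assumption)

# Toy example with hypothetical values standing in for the two variables
v5 = ["a", "a", "b", "a"]
v8 = [1, 1, 0, 0]
weights = frequency_weights(v5, v8)
```

Under this scheme, rows in frequent category combinations receive larger weights than rows in rare ones.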
Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Contaminated observations (rows 501–550) were generated by perturbing original cases with fluctuations of 3 SD.
The weighting scheme prioritizes frequent category combinations.
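The construction described above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the code that produced the dataset: the equicorrelated one-factor construction, the seed, the mapping of latent columns to variables, the use of clean-sample standard deviations, and the choice to copy categorical and binary values unchanged into the contaminated rows are all assumptions.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed, for reproducibility of the sketch only
RHO, N_CLEAN, N_CONT, SHIFT = 0.8, 500, 50, 3.0

def equicorrelated_row(k, rho):
    """k standard normals with pairwise correlation rho (one-factor construction)."""
    common = random.gauss(0, 1)
    return [math.sqrt(rho) * common + math.sqrt(1 - rho) * random.gauss(0, 1)
            for _ in range(k)]

def discretize(values, cum_probs):
    """Cut a margin at empirical percentiles; returns category codes 0, 1, ..."""
    ordered = sorted(values)
    cuts = [ordered[min(int(p * len(ordered)), len(ordered) - 1)] for p in cum_probs]
    return [sum(v >= c for c in cuts) for v in values]

# 500 clean draws of 9 latent normal margins, transposed into columns
latent = [equicorrelated_row(9, RHO) for _ in range(N_CLEAN)]
cols = [list(c) for c in zip(*latent)]

continuous = cols[:4]                                   # used as drawn
categorical = [discretize(cols[4], [1/3, 2/3]),         # 3 balanced categories
               discretize(cols[5], [1/3, 2/3]),         # 3 balanced categories
               discretize(cols[6], [0.25, 0.5, 0.75])]  # 4 uniform categories
binary = [discretize(cols[7], [0.4]),                   # 40% zeros, 60% ones
          discretize(cols[8], [0.6])]                   # 60% zeros, 40% ones

# Contamination: copy a random subset of clean units and shift each
# continuous value by 3 SD; the copies become rows 501-550.
sds = [statistics.stdev(c) for c in continuous]
picked = random.sample(range(N_CLEAN), N_CONT)
for j, col in enumerate(continuous):
    col.extend([col[i] + SHIFT * sds[j] for i in picked])
for col in categorical + binary:
    col.extend([col[i] for i in picked])  # non-quantitative values copied unchanged
```

The one-factor construction is used because it yields a common pairwise correlation of 0.8 without a matrix library. Any multivariate normal sampler with the intended covariance matrix would serve equally well.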
boj2024robustificationdbrobust