This dataset, included in the MLwrap package, is a simulated dataset (Martínez-García et al., 2025) designed to capture relationships among psychological and demographic variables influencing psychological wellbeing, the primary outcome variable. It comprises data for 1,000 individuals.
data(sim_data)
A data frame with 1,000 rows and 10 columns:
Psychological Wellbeing Indicator. Continuous (0, 100)
Psychological Wellbeing Binary Indicator. Factor ("Low", "High")
Psychological Wellbeing Polytomous Indicator. Factor ("Low", "Somewhat", "Quite a bit", "Very Much")
Patient Gender. Factor ("Female", "Male")
Patient Age. Continuous (18, 85)
Socioeconomic Status Indicator. Factor ("Low", "Medium", "High")
Emotional Intelligence Indicator. Continuous (24, 120)
Resilience Indicator. Continuous (4, 20)
Depression Indicator. Continuous (0, 63)
Life Satisfaction Indicator. Continuous (5, 35)
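The column types listed above can be checked after loading the data. The following is a minimal sketch: `describe_columns()` is a hypothetical helper written for this illustration (it is not part of MLwrap), and the packaged data is only touched when MLwrap is installed.

```r
# Hypothetical helper: summarise each column's class and, for factors, its levels.
describe_columns <- function(df) {
  data.frame(
    column = names(df),
    class  = unname(vapply(df, function(x) class(x)[1], character(1))),
    levels = unname(vapply(df, function(x) {
      if (is.factor(x)) paste(levels(x), collapse = ", ") else NA_character_
    }, character(1))),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Only load the packaged data when MLwrap is actually installed.
if (requireNamespace("MLwrap", quietly = TRUE)) {
  data("sim_data", package = "MLwrap")
  print(dim(sim_data))               # documented as 1,000 rows and 10 columns
  print(describe_columns(sim_data))
}
```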
If machine learning models, including SVMs, show better evaluation metrics on the test set than the training set, this anomaly usually signals methodological issues rather than genuine model quality. Typical causes reported in the literature (Hastie et al., 2017) include:
Statistical variance in small samples: Random train-test splits may produce partitions where the test set contains easier-to-classify examples by chance, especially with small sample sizes or difficult tasks (Vabalas et al., 2019; An et al., 2021).
Synthetic data characteristics: Simulated data may contain artificial patterns or non-uniform distributions that create easier test sets compared to training sets.
Excessive regularization: High regularization parameters may limit model capacity to fit training data while paradoxically generalizing better to simpler test patterns, indicating underfitting.
Train-test contamination: Preprocessing (scaling, normalization) performed before train-test split leaks statistical information from test to train, producing overoptimistic performance estimates (Kapoor & Narayanan, 2023).
Kernel-data interaction: Inappropriate kernel parameters may create decision boundaries that better fit test distribution than training distribution.
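The train-test contamination item in the list above can be illustrated with a short base R sketch. It uses hypothetical stand-in data rather than sim_data, and it shows the general pattern (split first, estimate scaling parameters on the training partition only), not MLwrap's internal preprocessing.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Hypothetical stand-in data: one numeric predictor, 200 cases.
x <- rnorm(200, mean = 50, sd = 15)

# 1. Split first.
train_idx <- sample(seq_along(x), size = 150)
x_train <- x[train_idx]
x_test  <- x[-train_idx]

# 2. Estimate the scaling parameters on the training partition only.
train_mean <- mean(x_train)
train_sd   <- sd(x_train)

# 3. Apply those same parameters to both partitions.
x_train_scaled <- (x_train - train_mean) / train_sd
x_test_scaled  <- (x_test  - train_mean) / train_sd
```

Scaling the full vector before splitting would let the test cases influence `train_mean` and `train_sd`, which is the leakage described above.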
MLwrap implementation:
MLwrap's hyperparameter optimization (via Bayesian Optimization or Grid
Search CV) implements 5-fold cross-validation during the tuning process,
which provides more robust parameter selection than single train-test
splits. Users should examine evaluation metrics across both training and
test sets, and review diagnostic plots (residuals, predictions) to identify
potential distribution differences between partitions. When working with
small datasets where partition variability may be substantial, running the
complete workflow with different random seeds can help assess the stability
of results and conclusions. The sim_data dataset included in MLwrap
is a simulated data frame provided for demonstration purposes only. As
synthetic data, it may occasionally exhibit some of these anomalous
phenomena (e.g., better test than training performance) due to artificial
patterns in the data generation process. Users working with real-world data
should always verify results through careful examination of evaluation
metrics and diagnostic plots across multiple runs.
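The advice about repeating the workflow with different random seeds can be sketched as follows. This is an illustration under stated assumptions: the data is a hypothetical toy data frame, the model is a plain `lm()` fit rather than an MLwrap workflow, and `split_rmse()` is a helper invented for this example.

```r
# Hypothetical stand-in data with one predictor and a continuous outcome.
set.seed(1)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)

# Fit on a random split and return train and test RMSE for one seed.
split_rmse <- function(seed, data, train_frac = 0.75) {
  set.seed(seed)
  idx   <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  train <- data[idx, , drop = FALSE]
  test  <- data[-idx, , drop = FALSE]
  fit   <- lm(y ~ x, data = train)
  rmse  <- function(d) sqrt(mean((d$y - predict(fit, newdata = d))^2))
  c(seed = seed, train = rmse(train), test = rmse(test))
}

seeds   <- 1:20
results <- as.data.frame(t(sapply(seeds, split_rmse, data = toy)))

# Spread of the test metric across seeds, and how often test beats train.
print(summary(results$test))
print(mean(results$test < results$train))
```

A wide spread in `results$test`, or a test error that is frequently lower than the training error, suggests that partition variability rather than model quality is driving the comparison.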
The predictor variables include gender (50.7% female), age (range: 18-85 years, mean = 51.63, median = 52, SD = 17.11), and socioeconomic status, categorized as Low (n = 343), Medium (n = 347), and High (n = 310). Additional predictors (features) are emotional intelligence (range: 24-120, mean = 71.97, median = 71, SD = 23.79), resilience (range: 4-20, mean = 11.93, median = 12, SD = 4.46), life satisfaction (range: 5-35, mean = 20.09, median = 20, SD = 7.42), and depression (range: 0-63, mean = 31.45, median = 32, SD = 14.85). The primary outcome variable is psychological wellbeing, measured on a scale from 0 to 100 (mean = 50.22, median = 49, SD = 24.45).
The simulation was conditioned on a set of target correlations. Psychological wellbeing is positively correlated with emotional intelligence (r = 0.50), resilience (r = 0.40), and life satisfaction (r = 0.60), so higher levels of these factors are associated with higher wellbeing. Conversely, depression has a strong negative correlation with psychological wellbeing (r = -0.80): higher depression scores are linked to lower wellbeing. Age shows a slight positive correlation with psychological wellbeing (r = 0.15), reflecting the expectation that older individuals might experience greater emotional stability. Gender and socioeconomic status are included as potential predictors, but the simulation assumes no statistically significant differences in psychological wellbeing across these categories.
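The correlations described above can be compared against the sample correlations in the data. The sketch below selects numeric columns by type, so it does not assume any column names. `numeric_correlations()` is a hypothetical helper, not an MLwrap function, and the packaged data is only used when MLwrap is installed.

```r
# Hypothetical helper: rounded correlation matrix of the numeric columns.
numeric_correlations <- function(df, digits = 2) {
  num <- df[vapply(df, is.numeric, logical(1))]
  round(cor(num), digits)
}

if (requireNamespace("MLwrap", quietly = TRUE)) {
  data("sim_data", package = "MLwrap")
  print(numeric_correlations(sim_data))
}
```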
Additionally, the dataset includes categorical transformations of psychological wellbeing into binary and polytomous formats: a binary version with two levels, "Low" (n = 477) and "High" (n = 523), and a polytomous version with four levels: "Low" (n = 161), "Somewhat" (n = 351), "Quite a bit" (n = 330), and "Very much" (n = 158). The polytomous transformation uses the 25th, 50th, and 75th percentiles as thresholds for categorizing psychological wellbeing scores. These transformations enable analyses using machine learning models for regression (continuous outcome) and classification (binary or polytomous outcomes) tasks.
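A percentile-based categorization of a continuous score can be sketched in base R with `quantile()` and `cut()`. This is only an illustration on hypothetical stand-in scores; it is not the authors' generation code and is not expected to reproduce the group sizes reported above. The labels are taken from the description.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical stand-in scores, clamped to the 0-100 scale.
score <- pmin(pmax(rnorm(1000, mean = 50, sd = 24), 0), 100)

# Thresholds at the 25th, 50th and 75th percentiles.
cuts <- quantile(score, probs = c(0.25, 0.50, 0.75))

# Four ordered categories, split at those thresholds.
score_poly <- cut(
  score,
  breaks = c(-Inf, cuts, Inf),
  labels = c("Low", "Somewhat", "Quite a bit", "Very much"),
  right  = TRUE
)

print(table(score_poly))
```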
An, C., Park, Y. W., Ahn, S. S., Han, K., Kim, H., & Lee, S. K. (2021). Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results. PLOS ONE, 16(8), e0256152. doi:10.1371/journal.pone.0256152
Hastie, T., Tibshirani, R., & Friedman, J. (2017). The elements of statistical learning: Data mining, inference, and prediction (2nd ed., corrected 12th printing, Chapter 7). Springer. doi:10.1007/978-0-387-84858-7
Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804. doi:10.1016/j.patter.2023.100804
Martínez-García, J., Montaño, J. J., Jiménez, R., Gervilla, E., Cajal, B., Núñez, A., Leguizamo, F., & Sesé, A. (2025). Decoding Artificial Intelligence: A Tutorial on Neural Networks in Behavioral Research. Clinical and Health, 36(2), 77-95. doi:10.5093/clh2025a13
Vabalas, A., Gowen, E., Poliakoff, E., & Casson, A. J. (2019). Machine learning algorithm validation with a limited sample size. PLOS ONE, 14(11), e0224365. doi:10.1371/journal.pone.0224365