performance_nan_imputation: Evaluate the performance of a missing-value imputation method

Description

This function evaluates the performance of missing-value imputation methods on a dataframe of quantitative variables. It is designed to examine and compare five different imputation methods using standard performance measures.

Usage

performance_nan_imputation(data, to_impute, regressors, method = 1)

Value

The function returns a dataframe that contains a row for each imputation method and columns with performance measures. The performance measures included are:

R^2: Coefficient of Determination, which measures how well the imputed values fit the observed values

RMSE: Root Mean Squared Error, the square root of the mean squared deviation between imputed and observed values

MAE: Mean Absolute Error, which represents the mean absolute deviation between the imputed and observed values
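
As a rough illustration of the three measures, the sketch below computes them in base R for a pair of toy vectors. The helper name imputation_metrics and the sample values are hypothetical and are not part of the package. It also assumes R^2 is defined as 1 - SS_res / SS_tot; the function documented here may use a different definition (for example, a squared correlation).

```r
# Hypothetical helper for illustration only; not the package's implementation.
imputation_metrics <- function(observed, imputed) {
  stopifnot(length(observed) == length(imputed))
  residuals <- observed - imputed
  ss_res <- sum(residuals^2)                    # residual sum of squares
  ss_tot <- sum((observed - mean(observed))^2)  # total sum of squares
  data.frame(
    R2   = 1 - ss_res / ss_tot,      # assumed definition of R^2
    RMSE = sqrt(mean(residuals^2)),  # root of the mean squared deviation
    MAE  = mean(abs(residuals))      # mean absolute deviation
  )
}

# Toy values, made up for this example
observed <- c(10, 20, 30, 40)
imputed  <- c(12, 18, 33, 39)
metrics <- imputation_metrics(observed, imputed)
print(metrics)
```

Because RMSE squares the deviations before averaging, it penalises a few large errors more heavily than MAE does, so reading the two together can hint at whether the errors are evenly spread.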

Arguments

data

A dataframe with observations in rows and quantitative variables in columns. It includes the variable whose missing values are to be imputed.

to_impute

A string specifying the name of the variable in the dataframe that contains the missing values to be imputed

regressors

A vector of strings indicating the names of the variables to be used as regressors for imputation in the case of methods 1 (lm_imputation) and 4 (hot deck imputation)

method

An integer between 1 and 5 that specifies the imputation method to be used (default: 1). The supported methods are:

1: lm_imputation (imputation by linear model)

2: median imputation (imputation by median)

3: mean imputation (imputation by mean)

4: hot deck imputation (imputation via hot deck)

5: EM imputation (imputation via Expectation-Maximization)
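
To make the simpler methods concrete, here is a minimal base-R sketch of what imputation by mean, by median, and by linear model could look like on the airquality dataset. It is an assumption-laden illustration, not the package's internal code: the variable names (ozone_mean, ozone_median, ozone_lm) are hypothetical, and Wind and Temp are used as regressors only because they are the columns selected in the Examples section.

```r
data("airquality")
missing_rows <- is.na(airquality$Ozone)

# Mean imputation: fill every gap with the mean of the observed values
ozone_mean <- airquality$Ozone
ozone_mean[missing_rows] <- mean(airquality$Ozone, na.rm = TRUE)

# Median imputation: same idea, using the median instead
ozone_median <- airquality$Ozone
ozone_median[missing_rows] <- median(airquality$Ozone, na.rm = TRUE)

# Linear-model imputation: fit on the rows where Ozone is observed
# (lm drops incomplete rows by default), then predict the missing rows
fit <- lm(Ozone ~ Wind + Temp, data = airquality)
ozone_lm <- airquality$Ozone
ozone_lm[missing_rows] <- predict(fit, newdata = airquality[missing_rows, ])

# Count of values that were filled in
print(sum(missing_rows))
```

Mean and median imputation assign the same value to every gap, whereas the linear model produces a different value per row depending on the regressors.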

Details

This function is useful for comparing the effectiveness of different methods of imputing missing values, allowing the most appropriate method to be chosen based on measured performance
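
This page does not describe how the performance measures are obtained. One common way to compare imputation methods, sketched below purely as an assumption, is to hide a sample of values that are actually observed, impute them, and score the imputations against the hidden truth. The seed, the hold-out size of 20, and all object names are hypothetical choices for this example, and R^2 is again taken as 1 - SS_res / SS_tot.

```r
data("airquality")
set.seed(1)  # arbitrary seed so the hold-out sample is reproducible

# Keep rows where Ozone is observed, then hide 20 of those values
complete <- airquality[!is.na(airquality$Ozone), ]
held_out <- sample(nrow(complete), 20)
masked <- complete
masked$Ozone[held_out] <- NA
truth <- complete$Ozone[held_out]

# Candidate imputations for the hidden values
fit <- lm(Ozone ~ Wind + Temp, data = masked)
candidates <- list(
  linear_model = unname(predict(fit, newdata = masked[held_out, ])),
  median       = rep(median(masked$Ozone, na.rm = TRUE), length(held_out)),
  mean         = rep(mean(masked$Ozone, na.rm = TRUE), length(held_out))
)

# One row of measures per candidate method
comparison <- do.call(rbind, lapply(names(candidates), function(name) {
  err <- truth - candidates[[name]]
  data.frame(
    method = name,
    R2     = 1 - sum(err^2) / sum((truth - mean(truth))^2),
    RMSE   = sqrt(mean(err^2)),
    MAE    = mean(abs(err))
  )
}))
print(comparison)
```

With a table like this, a lower RMSE and MAE and a higher R^2 would point to the better-performing method on the held-out values, although results depend on which values happen to be hidden.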

References

OECD/European Union/EC-JRC (2008), Handbook on Constructing Composite Indicators: Methodology and User Guide, OECD Publishing, Paris, <https://doi.org/10.1787/9789264043466-en>

Examples


data("airquality")
# Use columns 3 and 4 of airquality (Wind and Temp) as regressors
regressors <- colnames(airquality[, c(3, 4)])

#--- Method 1 = Imputation by linear model
performance_nan_imputation(data = airquality, to_impute = "Ozone", regressors = regressors, method = 1)

#--- Method 2 = Imputation by median
suppressWarnings(performance_nan_imputation(data = airquality, to_impute = "Ozone", method = 2))

#--- Method 3 = Imputation by mean
suppressWarnings(performance_nan_imputation(data = airquality, to_impute = "Ozone", method = 3))

#--- Method 4 = Hot deck imputation
performance_nan_imputation(data = airquality, to_impute = "Ozone", regressors = regressors, method = 4)

#--- Method 5 = Expectation-Maximization imputation
performance_nan_imputation(data = airquality, to_impute = "Ozone", regressors = regressors, method = 5)
