Towards improving machine learning algorithms accuracy by benefiting from similarities between cases

Tracking #: 631-1611

Samih M. Mostafa

Submission Type: 

Research Paper


Preprocessing is a core step in data mining. It involves handling missing values, removing outliers and noise, normalizing data, and so on. The problem with existing methods for handling missing values is that they operate on the whole dataset and ignore its characteristics (e.g., whether the data are clustered). This paper focuses on handling missing values with machine learning methods that take the characteristics of the data into account. The proposed preprocessing method first clusters the data and then imputes the missing values in each cluster using only the data that belong to that cluster rather than the whole dataset. The author performed a comparative study of the proposed method and ten popular imputation methods, namely mean, median, mode, KNN, IterativeImputer, IterativeSVD, Softimpute, Mice, Forimp, and Missforest. In experiments on four datasets with different numbers of clusters, sizes, and shapes, the empirical study showed that the proposed method was more effective in terms of imputation time and of imputation accuracy, measured by Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R² score) (i.e., how closely the imputed values match the original removed values).
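The cluster-then-impute idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or which per-cluster imputer the author uses, nor how rows with missing values are assigned to clusters, so everything below is an assumption made for illustration: plain k-means with a deterministic start, a preliminary global-mean fill so that incomplete rows can be clustered, and a per-cluster column mean as the imputer. All function names (column_means, fill, kmeans_labels, cluster_impute, rmse, mae) and the toy data are hypothetical and are not taken from the paper.

```python
import math

MISSING = None  # marker for a missing entry in this toy example


def column_means(rows):
    """Per-column mean over observed entries; None if a column has none."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if v is not MISSING]
        means.append(sum(observed) / len(observed) if observed else None)
    return means


def fill(rows, means, fallback=None):
    """Replace each missing entry with its column mean (or the fallback mean)."""
    filled = []
    for row in rows:
        new_row = []
        for j, v in enumerate(row):
            if v is MISSING:
                v = means[j] if means[j] is not None else fallback[j]
            new_row.append(v)
        filled.append(new_row)
    return filled


def kmeans_labels(points, k, n_iter=20):
    """Lloyd's k-means; the first k points are the initial centers (a simplification)."""
    centers = [list(p) for p in points[:k]]
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_impute(rows, k):
    """Cluster on globally mean-filled data, then impute from each cluster's own means.

    Assumes every column has at least one observed value in the whole dataset.
    """
    global_means = column_means(rows)
    labels = kmeans_labels(fill(rows, global_means), k)
    result = [None] * len(rows)
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        members = [rows[i] for i in idx]
        if not members:
            continue
        # Fall back to the global mean if a cluster has no observed value in a column.
        filled = fill(members, column_means(members), fallback=global_means)
        for i, row in zip(idx, filled):
            result[i] = row
    return result, labels


def rmse(true, pred):
    """Root mean square error between true and imputed values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mae(true, pred):
    """Mean absolute error between true and imputed values."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


# Toy data with two well-separated groups; the held-out "true" values are made up.
rows = [[1.0, 1.1], [10.0, 10.2], [1.2, MISSING], [9.8, MISSING], [0.9, 1.0], [10.1, 9.9]]
held_out = {(2, 1): 1.15, (3, 1): 10.0}

imputed, labels = cluster_impute(rows, k=2)
baseline = fill(rows, column_means(rows))  # whole-dataset mean imputation
truth = list(held_out.values())
for name, data in [("global mean", baseline), ("cluster mean", imputed)]:
    pred = [data[i][j] for i, j in held_out]
    print(name, "RMSE:", rmse(truth, pred), "MAE:", mae(truth, pred))
```

On clustered data such as this toy example, the whole-dataset mean falls between the two groups, whereas the per-cluster mean stays close to the values typical of each group. Whether this carries over to the ten compared methods and the four datasets is a matter for the paper's experiments, not this sketch.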



  • Reviewed

Data repository URLs: 

Date of Submission: 

Tuesday, April 28, 2020

Date of Decision: 

Wednesday, April 29, 2020


Reject (Pre-Screening)