Estimating missing data with Machine Learning when Know correlation and Small Sample Sizes

Tracking #: 885-1865

Authors:
Name: Noppakun Thongmual
ORCID: https://orcid.org/0000-0002-0273-7634


Submission Type: 

Research Paper

Abstract: 

This paper compares K-Nearest Neighbors (K-NN), Support Vector Regression (SVR), Decision Trees (DT), and Random Forests (RF) for estimating loss values under varying rates of missing data (5%, 10%, 15%) and varying correlation coefficients (ρ). The aim is to determine which method performs best under each combination of data sparsity and correlation. For each method, we calculate the average absolute error (AAE) at every missing-data rate and ρ value. SVR achieves the lowest AAE at lower missing-data rates and lower ρ values, whereas RF achieves the lowest AAE as the missing-data rate and ρ increase, making it the most reliable method overall. The discussion highlights the robustness of RF on incomplete and correlated datasets and its consistent performance relative to the other methods. The study concludes by suggesting directions for future research, including hybrid models that combine the strengths of SVR and RF and the exploration of various imputation techniques to enhance model performance. These findings are relevant to loss estimation and decision-making in fields such as finance, healthcare, and engineering.
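
The evaluation loop described in the abstract (hide a fraction of the values, estimate them from a correlated variable, score the estimates by AAE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it covers only a hand-written K-NN estimator, the data are synthetic bivariate normal pairs, and the sample size (30), the ρ grid (0.2, 0.5, 0.8), k = 3, and all function names are hypothetical choices that do not come from the paper.

```python
import math
import random


def make_correlated_pairs(n, rho, rng):
    """Draw n (x, y) pairs from a standard bivariate normal with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # y has unit variance and correlation rho with x.
        y = rho * x + math.sqrt(1.0 - rho ** 2) * noise
        pairs.append((x, y))
    return pairs


def knn_estimate(x_query, observed, k):
    """Average the y values of the k observed points whose x is closest to x_query."""
    nearest = sorted(observed, key=lambda p: abs(p[0] - x_query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def average_absolute_error(true_values, estimates):
    """Mean of |true - estimate| over the entries that were hidden."""
    total = sum(abs(t - e) for t, e in zip(true_values, estimates))
    return total / len(true_values)


def run_trial(n=30, rho=0.8, missing_rate=0.10, k=3, seed=0):
    """Hide a fraction of the y values, estimate them with K-NN, return the AAE."""
    rng = random.Random(seed)
    pairs = make_correlated_pairs(n, rho, rng)
    n_missing = max(1, round(missing_rate * n))
    missing_idx = set(rng.sample(range(n), n_missing))
    observed = [p for i, p in enumerate(pairs) if i not in missing_idx]
    hidden = [p for i, p in enumerate(pairs) if i in missing_idx]
    estimates = [knn_estimate(x, observed, k) for x, _ in hidden]
    return average_absolute_error([y for _, y in hidden], estimates)


if __name__ == "__main__":
    # Missing-data rates are from the abstract; the rho grid is an assumption.
    for rate in (0.05, 0.10, 0.15):
        for rho in (0.2, 0.5, 0.8):
            aae = run_trial(rho=rho, missing_rate=rate)
            print(f"missing={rate:.0%} rho={rho}: AAE={aae:.3f}")
```

A single seed gives a noisy AAE at this sample size, so a fair comparison would average over many seeds. The other three methods could be compared by swapping a different estimator in for knn_estimate.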

Manuscript: 

Tags: 

  • Reviewed

Data repository URLs: 

None

Date of Submission: 

Wednesday, October 9, 2024

Date of Decision: 

Wednesday, October 16, 2024


Nanopublication URLs:

Decision: 

Reject (Pre-Screening)