Abstract:
Timely graduation, commonly referred to as Graduate on Time (GOT), is a critical academic and institutional benchmark in higher education. Although machine learning (ML) models have shown promise in identifying students at risk of delayed graduation, spatial features such as Point of Interest (POI) remain underexplored because of their inherent skewness and highly imbalanced distributions. This study introduces an unsupervised discretization method, Unsupervised Tree-based Discretization with Maximum Cumulative Frequency (UTDMCF), designed to address the limitations of existing algorithms when applied to highly skewed POI features. Unlike equal-width, equal-frequency, or K-Means discretization, UTDMCF adaptively segments numeric POI values based on local data density and cumulative frequency, preserving the intrinsic distribution of the data without requiring class labels. Evaluations were conducted on a dataset of 4007 Malaysian students comprising academic records, English test scores, and 420 POI categories. The results show that UTDMCF reduces skewness from 13.68 to -0.63 in highly skewed features and improves the class label distribution. A comparative analysis across twelve academic semesters shows that UTDMCF attains superior predictive performance, with accuracy up to 86.7%, recall up to 93.6%, F1-score up to 90.5%, G-Mean up to 82.4%, and AUC up to 92.3%. UTDMCF consistently outperforms equal-width, equal-frequency, K-Means, and non-discretized baselines, particularly in the early and middle academic phases, where skewed data poses greater challenges. These findings establish UTDMCF as an effective discretization strategy for educational data mining and a valuable tool for supporting proactive student interventions.
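To make the skewness measure and two of the baseline discretizers concrete, the following is a minimal sketch in standard-library Python. It is not UTDMCF, whose procedure is not specified in this abstract. The `counts` list is a synthetic stand-in for a skewed POI feature, and the function names `skewness`, `equal_width_bins`, and `equal_frequency_bins` are hypothetical helpers introduced only for illustration. Skewness is computed here as the population moment coefficient, which may differ from the estimator used in the study.

```python
from bisect import bisect_left
from collections import Counter


def skewness(values):
    """Population moment skewness: m3 / m2**1.5 (0.0 if all values are equal)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Place cut points at empirical quantiles (assumes len(values) >= k).

    Equal values always share a bin, so heavy ties can leave bins empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Each cut is the largest value of the j-th of k equally sized groups.
    cuts = [ordered[max(n * j // k - 1, 0)] for j in range(1, k)]
    return [bisect_left(cuts, v) for v in values]


# Synthetic, zero-heavy counts standing in for one skewed POI feature.
counts = [0] * 900 + [1] * 60 + [5] * 30 + [50] * 9 + [400]

print("raw skewness:", round(skewness(counts), 2))
for name, binner in [("equal-width", equal_width_bins),
                     ("equal-frequency", equal_frequency_bins)]:
    bins = binner(counts, 5)
    print(name, "bin sizes:", sorted(Counter(bins).items()),
          "skewness:", round(skewness(bins), 2))
```

On data like this, equal-width binning puts nearly every observation into the first interval, and equal-frequency binning collapses because most quantile cut points coincide at zero. These are the kinds of limitation on skewed features that the abstract attributes to the existing algorithms.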