Comparison of Different Machine Learning Models with Data Balancing for Prediction of Cardiovascular Disease Risks Based on Big Data



Özsezer G.

Journal of the Institute of Science and Technology, vol. 16, no. 2, pp. 461-487, 2026 (TRDizin)

Abstract

Cardiovascular diseases (CVD) are a leading cause of death and morbidity worldwide. This study evaluates data balancing techniques (SMOTE, ENN, SMOTE-ENN, SMOTE-Tomek) in combination with machine learning (ML) algorithms for predicting CVD risk from big data. The 2021 CDC BRFSS dataset, comprising 308,854 records, was preprocessed by removing missing and irrelevant data and then split into training (80%) and testing (20%) subsets. Five ML models (logistic regression, random forest, LightGBM, XGBoost, and CatBoost) were trained on the balanced data and evaluated using accuracy, precision, recall, F1 score, the ROC curve, and AUC. SMOTE-ENN and SMOTE-Tomek improved model performance, with LightGBM and CatBoost achieving the highest AUC and F1 scores. The results demonstrate that data balancing, especially SMOTE-ENN, enhances model sensitivity and thereby aids the identification of CVD risk. These findings underscore the potential of ML in nursing to support targeted interventions and improve outcomes.
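
To make the oversampling idea behind SMOTE concrete, the following is a minimal sketch in plain Python, not the implementation used in the study. It creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy feature vectors are assumptions introduced purely for illustration. The ENN and Tomek-link cleaning steps that SMOTE-ENN and SMOTE-Tomek add are not shown.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; it mirrors the core idea of
    SMOTE but is not the study's code or any library's API.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy, made-up 2-D feature vectors (not taken from the BRFSS data).
majority = [(float(x), float(y)) for x in range(5) for y in range(4)]  # 20 points
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5), (12.0, 11.0)]    # 4 points

# Oversample the minority class until both classes have the same size.
new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a line segment between two real minority samples, the oversampled class stays inside the region the minority class already occupies. In practice, balancing of this kind is applied to the training subset only, so that the test subset keeps the original class distribution.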