Comparison of Different Machine Learning Models with Data Balancing for Prediction of Cardiovascular Disease Risks Based on Big Data

Özsezer, GÖZDE

doi:10.21597/jist.1800624

Özsezer G.

Journal of the Institute of Science and Technology, vol. 16, no. 2, pp. 461-487, 2026 (TRDizin)

Publication Type: Article / Full Article
Volume: 16 Issue: 2
Publication Date: 2026
DOI Number: 10.21597/jist.1800624
Journal Name: Journal of the Institute of Science and Technology
Journal Indexes: TR DİZİN (ULAKBİM)
Pages: pp. 461-487
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Çanakkale Onsekiz Mart University: Yes

Abstract

Cardiovascular diseases (CVD) are a leading global cause of death and morbidity. This study evaluates data balancing techniques (SMOTE, ENN, SMOTE-ENN, SMOTE-Tomek) and machine learning (ML) algorithms for predicting CVD risk using big data. The 2021 CDC BRFSS dataset, with 308,854 records, was preprocessed by removing missing and irrelevant data. The dataset was split into 80% training and 20% testing subsets. ML models, including logistic regression, random forest, LightGBM, XGBoost, and CatBoost, were trained on balanced data. Performance metrics such as accuracy, precision, recall, F1 score, ROC curve, and AUC were used for evaluation. SMOTE-ENN and SMOTE-Tomek improved model performance, with LightGBM and CatBoost achieving the highest AUC and F1 scores. Results demonstrate that data balancing, especially SMOTE-ENN, enhances model sensitivity, aiding CVD risk identification. These findings underscore the potential for ML in nursing to develop targeted interventions and improve outcomes.
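
As a rough illustration of the balancing step named in the abstract, the sketch below shows the interpolation idea behind SMOTE on a toy imbalanced dataset after an 80/20 split. It is a minimal standard-library sketch, not the study's implementation: the function names (`smote`, `split_train_test`), the toy data, the choice of k = 5, and the decision to oversample only the training split are all assumptions made for illustration. The ENN and Tomek-link cleaning steps used in SMOTE-ENN and SMOTE-Tomek are not shown.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle and split rows into (train, test); hypothetical helper."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy imbalanced data (invented): 90 negatives, 10 positives, 2 features.
data_rng = random.Random(42)
data = [((data_rng.gauss(0, 1), data_rng.gauss(0, 1)), 0) for _ in range(90)]
data += [((data_rng.gauss(3, 1), data_rng.gauss(3, 1)), 1) for _ in range(10)]

train, test = split_train_test(data)

# Assumption: only the training split is balanced; the test split keeps
# its original class ratio.
train_minority = [x for x, y in train if y == 1]
train_majority = [x for x, y in train if y == 0]
new_points = smote(train_minority, len(train_majority) - len(train_minority))
balanced_train = train + [(x, 1) for x in new_points]

print(len(train_majority), len(train_minority), len(new_points))
```

Each synthetic point lies on the line segment between two existing minority samples, so the oversampled class stays within the region already covered by real minority records. In practice a library implementation would normally be used instead of hand-written code like this.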