Diabetes Risk Prediction with Machine Learning Models



Özsezer G., Mermer G.

Artificial Intelligence Theory and Applications, vol.2, no.2, pp.1-9, 2022 (Peer-Reviewed Journal)

Abstract

Diabetes mellitus (DM) is one of the most common chronic diseases worldwide and a major public health problem. The aim of this study is to predict DM risk with machine learning (ML) models using available data. In this analytical study, the “Diabetes Health Indicators Dataset”, collected annually by the CDC and consisting of 253,680 records and 21 variables, was used. The open-access dataset was retrieved from Kaggle on March 5, 2022. Data analysis was performed in the Python 3.0 programming language using the numpy, pandas, matplotlib, seaborn, scikit-learn and imblearn libraries. During data pre-processing, outliers and missing data were removed. Five ML algorithms were used for predictive modeling: KNN, logistic regression, decision tree, random forest and naive Bayes. The predictive performance of the algorithms was evaluated with accuracy, precision, recall and F1 score. No permission was required because the data were open access. KNN achieved an accuracy of 0.74, precision of 0.31, recall of 0.55 and F1 score of 0.39; logistic regression achieved an accuracy of 0.72, precision of 0.33, recall of 0.74 and F1 score of 0.46; decision tree achieved an accuracy of 0.84, precision of 0.54, recall of 0.15 and F1 score of 0.24; random forest achieved an accuracy of 0.84, precision of 0.56, recall of 0.16 and F1 score of 0.25; naive Bayes achieved an accuracy of 0.84, precision of 0.52, recall of 0.19 and F1 score of 0.28. In this study, ML algorithms were used for DM risk estimation. According to the experimental results, when the dataset was randomly divided into training (80%) and testing (20%) sets, the accuracy values of the random forest and decision tree algorithms were very close to each other (RF: 0.848, DT: 0.847). Therefore, on the basis of accuracy, it can be said that the two best algorithms for diabetes risk estimation are random forest and decision tree.
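
The workflow described in the abstract (an 80%/20% random split, five classifiers, four metrics) could be sketched roughly as follows with scikit-learn. This is a minimal illustration and not the authors' code: the data are synthetic and generated with make_classification as a stand-in for the Kaggle dataset, the class ratio is an assumption, all hyperparameters are library defaults, and the function name evaluate_models is hypothetical. The pre-processing and any imblearn resampling step mentioned in the abstract are omitted because the abstract does not describe them in detail.

```python
# Illustrative sketch only: synthetic data stands in for the Kaggle dataset,
# and all hyperparameters are library defaults, not the authors' settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(random_state=0):
    """Fit five classifiers on synthetic data and return their test metrics."""
    # Assumed stand-in data: 21 features and an imbalanced binary target.
    X, y = make_classification(
        n_samples=2000, n_features=21, n_informative=8,
        weights=[0.85, 0.15], random_state=random_state,
    )
    # Random 80% training / 20% testing split, as stated in the abstract.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = {
        "KNN": KNeighborsClassifier(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Decision tree": DecisionTreeClassifier(random_state=random_state),
        "Random forest": RandomForestClassifier(random_state=random_state),
        "Naive Bayes": GaussianNB(),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # zero_division=0 avoids a warning if a model predicts no positives.
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
            "f1": f1_score(y_test, y_pred, zero_division=0),
        }
    return results


if __name__ == "__main__":
    for name, scores in evaluate_models().items():
        print(name, {metric: round(value, 2) for metric, value in scores.items()})
```

When the positive class is rare, a model can reach high accuracy while missing most positive cases, which is why recall and F1 score are worth reading alongside accuracy when comparing the algorithms.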