On classification of abstracts obtained from medical journals

Parlak, Bekir; Uysal, Alper

doi:10.1177/0165551519860982

Parlak B., Uysal A. K.

JOURNAL OF INFORMATION SCIENCE, 2019 (SCI-Expanded)

Publication Type: Article / Full Article
Volume number:
Publication Date: 2019
DOI: 10.1177/0165551519860982
Journal Name: JOURNAL OF INFORMATION SCIENCE
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus
Affiliated with Çanakkale Onsekiz Mart University: No

Abstract

Classification of medical documents has mostly been carried out on English data sets, and these studies have used hospital records rather than academic texts. The main reasons behind this situation are the lack of publicly available data sets and the fact that the tasks involved are costly and time-consuming. As the first contribution of this study, two data sets including Turkish and English counterparts of the same abstracts published in Turkish medical journals were constructed. Turkish is one of the widely used agglutinative languages worldwide and English is a good example of non-agglutinative languages. While the English abstracts were obtained automatically from the MEDLINE database with a computer program, the Turkish counterparts of these documents were collected manually from the Internet. As the second contribution of this study, an extensive comparison on classification of abstracts obtained from Turkish medical journals was made by using these two equivalent data sets. Features were extracted from text documents with three different approaches: unigram, bigram and hybrid. The hybrid approach combines unigram and bigram features. In the experiments, three different feature selection methods and seven different classifiers were utilised. According to the results on both data sets, classification performance on the English abstracts was higher than on their Turkish counterparts. The highest accuracies were obtained with the combination of unigram features, the distinguishing feature selector (DFS) and the multinomial naive Bayes (MNB) classifier on both data sets. Unigram features were generally more efficient than bigram and hybrid features. However, analysis of the top-10 features indicated that nearly half of them were translations of each other across the Turkish and English data sets.
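The three feature extraction approaches and the DFS ranking step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names `extract_features` and `dfs_scores`, the whitespace tokenisation, the `_` separator for bigrams, and the four toy documents are all assumptions made for illustration. The DFS score follows the commonly cited definition, DFS(t) = Σ_i P(C_i|t) / (P(t̄|C_i) + P(t|C̄_i) + 1), estimated here from document-level presence and absence; the abstract itself does not state the formula.

```python
from collections import Counter


def extract_features(text, mode="unigram"):
    """Return 'unigram', 'bigram' or 'hybrid' (unigram + bigram) features."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenisation
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    if mode == "unigram":
        return tokens
    if mode == "bigram":
        return bigrams
    if mode == "hybrid":
        return tokens + bigrams
    raise ValueError(f"unknown mode: {mode}")


def dfs_scores(docs, labels, mode="unigram"):
    """Score every feature with DFS from document-level presence counts."""
    classes = sorted(set(labels))
    n_docs = len(docs)
    class_size = Counter(labels)
    # doc_freq[c][t] = number of documents of class c that contain feature t
    doc_freq = {c: Counter() for c in classes}
    for text, label in zip(docs, labels):
        doc_freq[label].update(set(extract_features(text, mode)))
    total_freq = Counter()
    for c in classes:
        total_freq.update(doc_freq[c])

    scores = {}
    for t, df_t in total_freq.items():
        score = 0.0
        for c in classes:
            in_c = doc_freq[c][t]
            p_c_given_t = in_c / df_t                # P(C_i | t)
            p_not_t_given_c = 1 - in_c / class_size[c]  # P(not t | C_i)
            n_other = n_docs - class_size[c]
            # P(t | not C_i); defined as 0 if there are no other documents
            p_t_given_not_c = (df_t - in_c) / n_other if n_other else 0.0
            score += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1)
        scores[t] = score
    return scores


# Toy data, invented purely for illustration.
docs = [
    "heart attack risk",
    "heart surgery outcome",
    "skin rash treatment",
    "skin cancer treatment",
]
labels = ["cardiology", "cardiology", "dermatology", "dermatology"]

scores = dfs_scores(docs, labels, mode="unigram")
top_features = sorted(scores, key=scores.get, reverse=True)[:3]
```

Under this definition, a feature that occurs in every document of exactly one class receives the maximum score of 1.0, and a feature spread evenly over all classes approaches the minimum of 0.5. Keeping only the highest-ranked features and passing their counts to a multinomial naive Bayes classifier would correspond to the best-performing combination reported in the abstract.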