The effects of globalisation techniques on feature selection for text classification


Parlak B., Uysal A. K.

JOURNAL OF INFORMATION SCIENCE, 2020 (Journal Indexed in SCI)

  • Volume number:
  • Publication date: 2020
  • DOI: 10.1177/0165551520930897
  • Journal name: JOURNAL OF INFORMATION SCIENCE

Abstract

Text classification (TC) is a very important and critical task in the 21st century, as there is a high volume of electronic data on the Internet. In TC, textual data are characterised by a huge number of highly sparse features/terms. A typical TC pipeline consists of many steps, and one of the most important is undoubtedly feature selection (FS). In this study, we have comprehensively investigated the effects of various globalisation techniques on local feature selection (LFS) methods using datasets with different characteristics: multi-class unbalanced (MCU), multi-class balanced (MCB), binary-class unbalanced (BCU) and binary-class balanced (BCB). The globalisation techniques used in this study are summation (SUM), weighted-sum (AVG) and maximum (MAX). To investigate the effect of globalisation techniques, we used three LFS methods: Discriminative Feature Selection (DFSS), odds ratio (OR) and chi-square (CHI2). In the experiments, we utilised four benchmark datasets, namely Reuters-21578, 20 Newsgroups, Enron1 and Polarity, together with Support Vector Machine (SVM) and Decision Tree (DT) classifiers. According to the experimental results, AVG is the most successful globalisation technique when all situations are taken into account. The results indicate that the DFSS method is more successful than the OR and CHI2 methods on datasets with MCU and MCB characteristics. However, the CHI2 method appears more accurate than the OR and DFSS methods on datasets with BCU and BCB characteristics. In addition, the SVM classifier performed better than the DT classifier in most cases.
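
The three globalisation techniques named in the abstract can be illustrated with a minimal Python sketch. It assumes that an LFS method yields one score per term per class, and that AVG weights each class score by the class prior probability; the abstract does not give the formulas, so this weighting is an assumption. The function names, term names, scores and priors below are hypothetical and chosen only for illustration.

```python
def globalise(local_scores, class_priors, technique):
    """Collapse the per-class (local) scores of one term into one global score."""
    if technique == "SUM":
        # Summation: add up the local scores over all classes.
        return sum(local_scores)
    if technique == "AVG":
        # Weighted sum: weight each local score by its class prior (assumed definition).
        return sum(p * s for p, s in zip(class_priors, local_scores))
    if technique == "MAX":
        # Maximum: keep the largest local score.
        return max(local_scores)
    raise ValueError(f"unknown globalisation technique: {technique}")


def rank_terms(local, class_priors, technique, top_k):
    """Return the top_k terms ordered by their global score."""
    scored = {t: globalise(s, class_priors, technique) for t, s in local.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]


# Hypothetical local scores for two terms over three classes.
local = {
    "goal": [0.9, 0.1, 0.2],
    "market": [0.3, 0.4, 0.3],
}
# Hypothetical class priors for an unbalanced three-class dataset.
priors = [0.2, 0.5, 0.3]

for technique in ("SUM", "AVG", "MAX"):
    print(technique, rank_terms(local, priors, technique, top_k=1))
```

With these made-up numbers, SUM and MAX favour the term that scores highly in a single rare class, while AVG favours the term that scores moderately in the most frequent class. This is one way the choice of globalisation technique can change which features are selected.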