On Two-Stage Feature Selection Methods for Text Classification


Uysal A. K.

IEEE ACCESS, vol. 6, pp. 43233-43251, 2018 (Journal Indexed in SCI)

  • Volume: 6
  • Publication Date: 2018
  • DOI: 10.1109/access.2018.2863547
  • Journal Name: IEEE ACCESS
  • Pages: pp. 43233-43251

Abstract

Text classification is a high-dimensional pattern recognition problem in which feature selection is an important step. Although researchers still propose new feature selection methods, many two-stage feature selection methods already exist that combine filter-based feature selection methods with feature transformation and wrapper-based feature selection methods in different ways. The main focus of the study is to extensively analyze two-stage feature selection methods for text classification from a different point of view. This paper investigates two-stage methods that combine filter-based local feature selection methods with feature transformation and wrapper-based feature selection methods. In the first stage, four different filter-based local feature selection methods and three different feature set construction methods were employed. Feature sets were constructed by using the maximum globalization policy (MAX), by using the weighted averaging globalization policy (AVG), or by selecting an equal number of features for each class (EQ). In the second stage, principal component analysis (PCA), latent semantic indexing (LSI), or genetic algorithms were utilized. Various settings were evaluated with a linear support vector machine classifier on two benchmark data sets, namely Reuters and Ohsumed, using Micro-F1 and Macro-F1 scores. According to the findings, the AVG and EQ feature set construction methods are usually more successful than the MAX method for two-stage feature selection. Most of the highest accuracies were obtained by employing PCA feature transformation in the second stage. However, there is a strong linear correlation between the PCA and LSI results for all settings, and the degree of correlation is slightly higher for the Ohsumed data set than for the Reuters data set.
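
The three feature set construction policies and a PCA second stage can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy score matrix, the use of class priors as the AVG weights, and the round-robin handling of duplicates in EQ are all assumptions made for illustration. The input is taken to be a matrix of local (per-class) filter scores with one row per feature and one column per class.

```python
import numpy as np


def select_max(scores, k):
    """MAX policy: a feature's global score is its best per-class score."""
    global_scores = scores.max(axis=1)
    return np.argsort(-global_scores, kind="stable")[:k]


def select_avg(scores, class_priors, k):
    """AVG policy: per-class scores averaged with class weights.

    Using class priors as the weights is an assumption of this sketch.
    """
    global_scores = scores @ np.asarray(class_priors)
    return np.argsort(-global_scores, kind="stable")[:k]


def select_eq(scores, k):
    """EQ policy: classes take turns contributing their next-best feature.

    Duplicates are skipped, so the per-class counts are only approximately
    equal; this tie-handling rule is an assumption of this sketch.
    """
    n_features, n_classes = scores.shape
    # Column c lists feature indices by descending score for class c.
    rankings = np.argsort(-scores, axis=0, kind="stable")
    selected, seen = [], set()
    for rank in range(n_features):
        for c in range(n_classes):
            f = int(rankings[rank, c])
            if f not in seen:
                seen.add(f)
                selected.append(f)
                if len(selected) == k:
                    return np.array(selected)
    return np.array(selected)


def pca_transform(X, n_components):
    """Second stage: project centered data onto the top principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


# Toy local scores: 4 features (rows) x 2 classes (columns).
scores = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.5],
    [0.2, 0.4],
])
priors = [0.75, 0.25]

print("MAX:", select_max(scores, 2))
print("AVG:", select_avg(scores, priors, 2))
print("EQ: ", select_eq(scores, 2))

# Toy document-term matrix (6 documents x 4 features), reduced in two stages.
X = np.random.default_rng(0).random((6, 4))
stage_one = X[:, select_eq(scores, 2)]
stage_two = pca_transform(stage_one, 1)
print("Reduced shape:", stage_two.shape)
```

In this toy example MAX and AVG both keep the two features that score highly for the first class, while EQ keeps one feature per class. Dropping the centering step in `pca_transform` gives a plain truncated SVD projection, which is the usual formulation of LSI.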