On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification


Dogan T., Uysal A. K.

ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, vol.44, pp.9545-9560, 2019 (Journal Indexed in SCI)

  • Volume: 44 Issue: 11
  • Publication Date: 2019
  • DOI: 10.1007/s13369-019-03920-9
  • Journal Name: ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING
  • Pages: pp.9545-9560

Abstract

The performance of text classification depends on the choice of an appropriate term weighting scheme, among other parameters. Supervised term weighting has become popular in recent years, as it may provide a discriminative vector space representation for text documents belonging to different classes. A term weighting scheme generally consists of three factors: the term frequency factor, the collection frequency factor, and the length normalization factor. Researchers have mostly focused on developing new collection frequency factors in term weighting studies. However, the term frequency factor also plays an important role, especially in supervised term weighting. In this study, we extensively analyzed the effects of using different term frequency factors on seven supervised term weighting schemes. Six of these schemes were applied in previous studies in the literature; we derived the seventh from an existing feature selection method that had not been used as a weighting method before. The analysis was performed using SVM and Rocchio classifiers on two widely known benchmark datasets with different characteristics. Experimental results showed that modifying the term frequency factor increased the performance of almost all of the supervised term weighting schemes. In addition, schemes using the square root function-based term frequency factor (SQRT_TF) were more successful than those using the raw term frequency (TF) and logarithmic function-based term frequency (LOG_TF) factors. According to the experimental results and statistical analysis, TF appears to be the least effective of the three term frequency factors.
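To make the three-factor structure concrete, here is a minimal Python sketch of how the term frequency factor could be swapped inside a weighting scheme. The abstract does not give the formulas, so the forms used here are assumptions: LOG_TF is taken as log(1 + tf) and SQRT_TF as sqrt(tf), both common choices. The function names `tf_factor` and `weight_document`, the cosine length normalization, and the collection frequency values in the usage note are hypothetical placeholders, not the paper's definitions.

```python
import math


def tf_factor(tf, kind="TF"):
    """Return a term frequency factor for a raw count tf >= 0."""
    if kind == "TF":
        return float(tf)           # raw term frequency
    if kind == "LOG_TF":
        return math.log(1.0 + tf)  # assumed form: log(1 + tf)
    if kind == "SQRT_TF":
        return math.sqrt(tf)       # assumed form: sqrt(tf)
    raise ValueError(f"unknown term frequency factor: {kind}")


def weight_document(term_counts, collection_factor, kind="TF"):
    """Combine the three factors for one document.

    term_counts:       dict mapping term -> raw count in the document
    collection_factor: dict mapping term -> collection frequency factor
                       (any supervised or unsupervised score; hypothetical here)
    The result is cosine-normalized (assumed length normalization).
    """
    raw = {
        term: tf_factor(count, kind) * collection_factor.get(term, 0.0)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}
```

For example, `weight_document({"goal": 9, "match": 1}, {"goal": 1.0, "match": 1.0}, kind="SQRT_TF")` gives "goal" three times the pre-normalization weight of "match" rather than nine times, which illustrates how a sublinear factor dampens the influence of highly repeated terms.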