A novel filter feature selection method for text classification: Extensive Feature Selector


Parlak B., UYSAL A. K.

JOURNAL OF INFORMATION SCIENCE, vol.49, no.1, pp.59-78, 2023 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 49 Issue: 1
  • Publication Date: 2023
  • Doi Number: 10.1177/0165551521991037
  • Journal Name: JOURNAL OF INFORMATION SCIENCE
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, FRANCIS, IBZ Online, Periodicals Index Online, ABI/INFORM, Aerospace Database, Analytical Abstracts, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Compendex, Computer & Applied Sciences, EBSCO Education Source, Education Abstracts, Index Islamicus, Information Science and Technology Abstracts, INSPEC, Library and Information Science Abstracts, Library Literature and Information Science, Library, Information Science & Technology Abstracts (LISTA), Metadex, Civil Engineering Abstracts
  • Page Numbers: pp.59-78
  • Keywords: Dimension reduction, feature selection, text classification
  • Çanakkale Onsekiz Mart University Affiliated: Yes

Abstract

As the huge dimensionality of textual data restricts classification accuracy, it is essential to apply feature selection (FS) methods as a dimension reduction step in the text classification (TC) domain. Most FS methods for TC rely on several probabilities in their calculations. In this study, we propose a new FS method named Extensive Feature Selector (EFS), which benefits from both corpus-based and class-based probabilities in its calculations. The performance of EFS is compared with nine well-known FS methods, namely, Chi-Squared (CHI2), Class Discriminating Measure (CDM), Discriminative Power Measure (DPM), Odds Ratio (OR), Distinguishing Feature Selector (DFS), Comprehensively Measure Feature Selection (CMFS), Discriminative Feature Selection (DFSS), Normalised Difference Measure (NDM) and Max-Min Ratio (MMR), using Multinomial Naive Bayes (MNB), Support Vector Machines (SVMs) and k-Nearest Neighbour (KNN) classifiers on four benchmark data sets: Reuters-21578, 20-Newsgroup, Mini 20-Newsgroup and Polarity. The experiments were carried out for six different feature sizes: 10, 30, 50, 100, 300 and 500. Experimental results show that EFS outperforms the other nine methods in most cases according to micro-F1 and macro-F1 scores.
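The abstract does not give the EFS formula, but it names CHI2 among the compared filter methods, and all of these methods share the same workflow: score every term from class-based document counts, then keep the top-k terms. A minimal sketch of that workflow using the standard chi-squared statistic (not the authors' EFS scoring, which is defined in the paper itself) might look like this; the function and variable names are illustrative, not from the paper:

```python
from collections import Counter

def chi2_scores(docs, labels):
    """Score each term with the chi-squared filter FS statistic,
    taking the maximum over classes (a common convention).
    docs: list of token lists; labels: parallel list of class labels."""
    N = len(docs)
    classes = set(labels)
    vocab = {t for d in docs for t in d}
    class_size = Counter(labels)
    # document frequency of each term within each class
    df = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        for t in set(d):
            df[y][t] += 1
    scores = {}
    for t in vocab:
        total_df = sum(df[c][t] for c in classes)
        best = 0.0
        for c in classes:
            A = df[c][t]               # docs in c containing t
            B = total_df - A           # docs outside c containing t
            C = class_size[c] - A      # docs in c lacking t
            D = N - class_size[c] - B  # docs outside c lacking t
            denom = (A + C) * (B + D) * (A + B) * (C + D)
            if denom:
                best = max(best, N * (A * D - C * B) ** 2 / denom)
        scores[t] = best
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms (the feature sizes in the
    paper are 10, 30, 50, 100, 300 and 500)."""
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

# Toy usage: terms that occur in only one class score highest.
docs = [["buy", "now"], ["buy", "cheap"], ["hello", "friend"], ["hi", "friend"]]
labels = [1, 1, 0, 0]
scores = chi2_scores(docs, labels)
selected = select_top_k(scores, 2)  # e.g. the perfectly discriminative terms
```

The selected terms would then feed a bag-of-words representation for an MNB, SVM or KNN classifier, evaluated with micro-F1 and macro-F1 as in the paper; EFS replaces only the scoring function in this pipeline.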