The impact of preprocessing on text classification

Uysal, Alper; Gunal, Serkan

doi:10.1016/j.ipm.2013.08.006

Uysal A. K., Gunal S.

INFORMATION PROCESSING & MANAGEMENT, vol.50, no.1, pp.104-112, 2014 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 50 Issue: 1
Publication Date: 2014
DOI: 10.1016/j.ipm.2013.08.006
Journal Name: INFORMATION PROCESSING & MANAGEMENT
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus
Page Numbers: pp.104-112
Affiliated with Çanakkale Onsekiz Mart University: No

Abstract

Preprocessing is one of the key components of a typical text classification framework. This paper extensively examines the impact of preprocessing on text classification in terms of several aspects: classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two domains, namely e-mail and news, and in two languages, namely Turkish and English. In this way, the contribution of the preprocessing tasks to classification success at various feature dimensions, the possible interactions among these tasks, and the dependency of these tasks on the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide significant improvement in classification accuracy, depending on the domain and language studied. (C) 2013 Elsevier Ltd. All rights reserved.
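
The idea of evaluating "all possible combinations" of preprocessing tasks can be illustrated with a minimal sketch. The abstract does not list the tasks, so the three on/off toggles below (lowercase, remove_stopwords, stem), the tiny stop-word set, and the toy suffix stripper are all hypothetical stand-ins chosen for illustration, not the authors' setup. With n binary tasks there are 2^n configurations to compare.

```python
from itertools import product

# Hypothetical task names: the abstract does not say which tasks were studied.
TASKS = ("lowercase", "remove_stopwords", "stem")

# Tiny illustrative stop-word list (an assumption, not a real resource).
STOPWORDS = {"the", "a", "of", "is"}


def crude_stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, lowercase, remove_stopwords, stem):
    """Apply the enabled tasks, in a fixed order, to whitespace-split tokens."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        # Compare case-insensitively so the filter works without lowercasing.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def all_configurations():
    """Enumerate every on/off combination of the tasks (2**len(TASKS) dicts)."""
    return [
        dict(zip(TASKS, flags))
        for flags in product((False, True), repeat=len(TASKS))
    ]


if __name__ == "__main__":
    sample = "The classifier is learning the features of emails"
    for config in all_configurations():
        print(config, preprocess(sample, **config))
```

In an actual experiment, each configuration would feed a feature extraction and classification step, and the resulting accuracies would be compared per dataset; that part is omitted here because the abstract gives no details on it.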