Adam Idczak, Jerzy Korzeniewski

© Główny Urząd Statystyczny. Article available under the CC BY-SA 4.0 licence.




Sentiment analysis of text documents is an important part of contemporary text mining. The purpose of this article is to present a new technique of text sentiment analysis which can be used with any type of document sentiment classification method. The proposed technique selects features independently of the classifier, which reduces the size of the feature space. Its advantages include intuitiveness and low computational complexity. The most important element of the proposed technique is a novel algorithm for determining the number of selected features that is sufficient for effective classification. The algorithm is based on the analysis of the correlation between single features and document labels. A statistical approach, featuring a naive Bayes classifier and logistic regression, was employed to verify the usefulness of the proposed technique. Both classifiers were applied to three document sets composed of 1,169 opinions of bank clients, obtained in 2020 from a Poland-based bank. The documents were written in Polish. The research demonstrated that reducing the number of terms over 10-fold by means of the proposed algorithm improves the effectiveness of classification in most cases.
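The idea of ranking single features by their correlation with document labels, independently of any classifier, can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify how the number of features is determined, so a fixed cut-off `k` is assumed here in its place. The function names (`pearson`, `rank_terms`, `select_terms`), the binary term-presence encoding, and the toy English opinions are all hypothetical and serve only as an example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant feature carries no information about the label
    return cov / math.sqrt(vx * vy)


def rank_terms(docs, labels):
    """Rank terms by |correlation| between term presence and the label."""
    tokenized = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*tokenized))
    scores = {}
    for term in vocab:
        presence = [1 if term in toks else 0 for toks in tokenized]
        scores[term] = abs(pearson(presence, labels))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def select_terms(docs, labels, k):
    """Keep the k top-ranked terms (k is an assumed, fixed cut-off)."""
    return [term for term, _ in rank_terms(docs, labels)[:k]]


# Hypothetical toy data: 1 = positive opinion, 0 = negative opinion.
docs = [
    "great service friendly staff",
    "great app fast transfers",
    "terrible service long queue",
    "terrible app hidden fees",
]
labels = [1, 1, 0, 0]

selected = select_terms(docs, labels, k=2)
print(selected)
```

Because the selection step uses only the features and the labels, the reduced vocabulary could then be passed to any classifier, such as naive Bayes or logistic regression.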


sentiment analysis, document sentiment classification, text mining, logistic regression, naive Bayes classifier, feature selection, correlation


C52, C81, M31

