New algorithm for determining the number of features for the effective sentiment-classification of text documents

Adam Idczak; Jerzy Korzeniewski

doi:10.59139/ws.2023.05.3


Adam Idczak, University of Lodz, Faculty of Economics and Sociology, Poland. ORCID: https://orcid.org/0000-0001-9676-2410

Jerzy Korzeniewski, University of Lodz, Faculty of Economics and Sociology, Poland. ORCID: https://orcid.org/0000-0001-6526-5921

Wiadomości Statystyczne. The Polish Statistician, vol. 68, 2023, no. 5, pp. 40–57

Published online: 31 May 2023

How to cite: Idczak, A., Korzeniewski, J. (2023). New algorithm for determining the number of features for the effective sentiment-classification of text documents. Wiadomości Statystyczne. The Polish Statistician, 68(5), 40–57. https://doi.org/10.59139/ws.2023.05.3.



ABSTRACT

Sentiment analysis of text documents is an important part of contemporary text mining. The purpose of this article is to present a new technique of text sentiment analysis which can be used with any document-sentiment-classification method. The proposed technique selects features independently of the classifier, which reduces the size of the feature space. Its advantages include intuitiveness and low computational complexity. The most important element of the technique is a novel algorithm for determining how many features must be selected for classification to be effective. The algorithm is based on the analysis of the correlation between single features and document labels. Two statistical classifiers, naive Bayes and logistic regression, were used to verify the usefulness of the proposed technique. Both were applied to three document sets composed of 1,169 opinions of bank clients, obtained in 2020 from a bank based in Poland. The documents were written in Polish. The study showed that reducing the number of terms more than tenfold by means of the proposed algorithm improved the effectiveness of classification in most cases.
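
The idea of ranking single features by their correlation with document labels can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not state which correlation measure is used or how the number of features is determined, so the Pearson correlation on binary term presence, the toy English documents, the helper names `pearson` and `rank_terms`, and the cut-off `k = 4` are all assumptions made for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_terms(docs, labels):
    """Rank terms by |correlation| between term presence (0/1) and the label.

    Assumption: whitespace tokenization and binary presence features;
    the paper may use a different representation.
    """
    tokenized = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*tokenized))
    scores = {}
    for term in vocab:
        presence = [1 if term in tokens else 0 for tokens in tokenized]
        scores[term] = abs(pearson(presence, labels))
    ranked = sorted(vocab, key=lambda term: scores[term], reverse=True)
    return ranked, scores


# Toy data (hypothetical): 1 = positive opinion, 0 = negative opinion.
docs = [
    "great service friendly staff",
    "terrible service long queue",
    "great app friendly support",
    "terrible app long wait",
]
labels = [1, 0, 1, 0]

ranked, scores = rank_terms(docs, labels)

# k is chosen by hand here; determining it automatically is the
# contribution of the paper and is not reproduced in this sketch.
k = 4
top_k = ranked[:k]
print(top_k)
```

In this toy set the terms that appear only in positive or only in negative opinions receive the highest scores, while terms spread evenly across both classes (such as "service") score zero and would be dropped before any classifier is trained.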

KEYWORDS

sentiment analysis, document sentiment classification, text mining, logistic regression, naive Bayes classifier, feature selection, correlation

JEL

C52, C81, M31

