Enhanced Category-Feature Association Measure

A Robust Approach for Text Classification through Feature Selection

Authors

DOI:

https://doi.org/10.14500/aro.12034

Keywords:

Dimension reduction, Feature selection, Long short-term memory, Multinomial Naive Bayes, Support vector machines, Text classification

Abstract

Categorizing large, high-dimensional text data accurately and efficiently is one of the major challenges in text classification. A large number of features can confuse the classification process, so feature selection (FS) strategies are needed to deal with high dimensionality. This paper proposes a novel FS technique based on an enhanced category-feature association measure (ECFAM). ECFAM uses both the presence and the absence of terms, together with the complicated relationships among terms across different categories. This approach emphasizes the key role of ancillary terms in classifying and differentiating categories. ECFAM is compared with other FS methods on two benchmark datasets, Reuters-21578 and 20-Newsgroups, using two widely employed supervised machine learning classifiers and one deep learning algorithm. The experiments cover nine feature-set sizes, ranging from 50 to 4000 features. The experimental results show that ECFAM consistently outperforms the other methods in both accuracy and computational cost.
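The abstract does not give the ECFAM formula, so the following is only a minimal sketch of the general filter-style FS idea it builds on: scoring each term from counts of its presence and absence inside and outside each category, then keeping the top-k terms. The classical chi-square statistic is used here as a stand-in, not ECFAM itself, and the function names, the toy corpus, and the value of k are illustrative assumptions.

```python
from collections import Counter, defaultdict


def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square value over all categories.

    NOTE: chi-square is a stand-in for illustration; it is NOT the ECFAM measure.
    """
    n = len(docs)
    doc_terms = [set(d.lower().split()) for d in docs]  # term presence per document
    cat_size = Counter(labels)                           # documents per category
    df = Counter()                                       # documents containing each term
    df_cat = defaultdict(Counter)                        # same, split by category
    for terms, label in zip(doc_terms, labels):
        for t in terms:
            df[t] += 1
            df_cat[label][t] += 1

    scores = {}
    for t in df:
        best = 0.0
        for cat, n_cat in cat_size.items():
            a = df_cat[cat][t]       # in category, term present
            b = df[t] - a            # outside category, term present
            c = n_cat - a            # in category, term absent
            d = n - n_cat - b        # outside category, term absent
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[t] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus; the paper uses Reuters-21578 and 20-Newsgroups.
docs = ["oil prices rise", "oil exports fall", "team wins match", "team loses match"]
labels = ["trade", "trade", "sport", "sport"]

scores = chi_square_scores(docs, labels)
selected = select_top_k(scores, k=3)
print(selected)  # ['match', 'oil', 'team']
```

In an experiment of the kind the abstract describes, k would be varied over the feature-set sizes (50 to 4000) and the selected terms passed to each classifier.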

Author Biographies

Soran S. Badawi, Language Center, Charmo University, Chamchamal, Kurdistan Region – F.R. Iraq

Soran S. Badawi is a Lecturer at the Language Center, Charmo Researcher Center for research, training, and consultancy, Charmo University. He received the B.Sc. degree in English Language and Literature from the University of Sulaimani, Iraq, and the M.Sc. degree in Computational Linguistics from Isfahan University, Iran. His research interests include natural language processing (NLP), machine translation, and sentiment analysis.

Ari M. Saeed, Department of Computer Science, University of Halabja, Halabja, Kurdistan Region – F.R. Iraq

Ari M. Saeed is an Assistant Professor at the Department of Computer Science, College of Science, University of Halabja. He received the B.Sc. degree in computer science and the M.Sc. degree in computer engineering. His research interests include machine learning, natural language processing (NLP), and text classification.

Sara A. Ahmed, Department of Computer Engineering, Komar University of Science and Technology, Sulaimaniyah, Kurdistan Region – F.R. Iraq

Sara A. Ahmed is a Lecturer at the Department of Computer Engineering, Faculty of Engineering, Komar University of Science and Technology. She received the B.Sc. degree in Computer Science and the M.Sc. degree in Computer Systems Engineering. Her research interests include text classification, robotics, and artificial intelligence.

Diyari A. Hassan, Department of Biomedical Engineering, Faculty of Engineering and Computer Science, Qaiwan International University, Sulaimaniyah, Kurdistan Region – F.R. Iraq

Diyari A. Hassan is an Assistant Professor at the Department of Biomedical Engineering, Faculty of Engineering and Computer Science, Qaiwan International University. He received the B.Sc. degree in Telecommunication, the M.Sc. degree in Electrical and Electronic Engineering, and the Ph.D. degree in Computer Engineering. His research interests include signal processing, polynomial matrix decomposition, and artificial intelligence.

Published

2025-08-21

How to Cite

Badawi, S. S. (2025) “Enhanced Category-Feature Association Measure: A Robust Approach for Text Classification through Feature Selection”, ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 13(2), pp. 114–123. doi: 10.14500/aro.12034.
Received 2025-02-02
Accepted 2025-07-29