Using Multilingual Bidirectional Encoder Representations from Transformers on Medical Corpus for Kurdish Text Classification

  • Soran S. Badawi, Charmo Center for Scientific Research and Consulting – Language and Linguistic Center, Charmo University Chamchamal, Sulaimani, Kurdistan region - F.R. Iraq https://orcid.org/0000-0001-9117-3078
Keywords: Bidirectional Encoder Representations from Transformers, Deep learning, Machine learning, Natural language processing, Sentiment analysis, Transformers

Abstract

Technology now shapes a large part of human life, and its users continuously use language to express feelings and sentiments about things. The science of identifying human attitudes toward a particular product, service, or topic is called sentiment analysis, and it is one of the most active fields of research. While sentiment analysis for English makes steady progress, less-resourced languages such as Kurdish still face fundamental challenges in Natural Language Processing (NLP). This paper experiments with a recently published medical corpus using classical machine learning methods and the latest deep learning tool in NLP, Bidirectional Encoder Representations from Transformers (BERT). We evaluated the findings of both approaches. The results indicate that BERT outperforms all the machine learning classifiers, scoring 92% in accuracy, two points higher than the machine learning classifiers.
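The comparison reported in the abstract rests on accuracy, the share of test items a classifier labels correctly, and on the gap between two accuracies expressed in percentage points. A minimal sketch of that arithmetic follows. The function names `accuracy` and `point_gap` and the toy labels are hypothetical illustrations; they are not taken from the paper, its corpus, or its code.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def point_gap(acc_a: float, acc_b: float) -> float:
    """Return acc_a - acc_b in percentage points, rounded to 2 decimals."""
    return round((acc_a - acc_b) * 100, 2)


# Hypothetical toy data (assumption): ten gold labels and two sets of
# predictions standing in for two classifiers. Not the medical corpus.
gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
pred_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # one mistake (last item)
pred_b = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # two mistakes (last two items)

acc_a = accuracy(gold, pred_a)
acc_b = accuracy(gold, pred_b)
gap = point_gap(acc_a, acc_b)
print(f"A: {acc_a:.0%}  B: {acc_b:.0%}  gap: {gap} points")
```

Read this way, "two points higher" in the abstract refers to a difference in percentage points between accuracies, not to a relative improvement of 2%.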

Author Biography

Soran S. Badawi, Charmo Center for Scientific Research and Consulting – Language and Linguistic Center, Charmo University Chamchamal, Sulaimani, Kurdistan region - F.R. Iraq

Soran S. Badawi is an Assistant Lecturer at the Language Center, Charmo Researcher Center for research, training and consultancy, Charmo University. He received a B.Sc. degree in English Language and Literature from the University of Sulaimani, Iraq, and an M.Sc. degree in Computational Linguistics from Isfahan University, Iran. His research interests are Natural Language Processing (NLP), Machine Translation, and Sentiment Analysis.

Published
2023-01-15
How to Cite
Badawi, S. S. (2023) “Using Multilingual Bidirectional Encoder Representations from Transformers on Medical Corpus for Kurdish Text Classification”, ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 11(1), pp. 10-15. doi: 10.14500/aro.11088.