Machine Learning Algorithms for Detecting and Analyzing Social Bots Using a Novel Dataset

  • Niyaz Jalal (1) Department of Software and Informatics Engineering, College of Engineering, Salahaddin University-Erbil, Erbil 44001, Iraq. (2) Department of Computer Science, Knowledge University, Erbil 44001, Iraq https://orcid.org/0000-0002-2575-5878
  • Kayhan Z. Ghafoor (1) Department of Software and Informatics Engineering, College of Engineering, Salahaddin University-Erbil, Erbil 44001, Iraq. (2) Department of Computer Science, Knowledge University, Erbil 44001, Iraq. https://orcid.org/0000-0001-9046-9475
Keywords: Machine Learning, Misinformation detection, Twitter bot detection, Twitter profile Metadata

Abstract

Social media is an internet-based technology and an electronic form of communication that facilitates the sharing of ideas, documents, and personal information. Twitter is a microblogging platform and the most effective social service for posting microblogs, liking, commenting, sharing, and communicating with others. This paper addresses the misuse of bots on Twitter. Bots are meant to automate specific repetitive tasks that would otherwise require human interaction. However, they are misused to influence people's opinions by spreading rumors and conspiracy theories about controversial topics. In this paper, we introduce a new benchmark built from 1.5M Twitter profiles. We train several supervised machine learning models on our benchmark to detect bots on Twitter. In addition to increasing the benchmark's scalability, various automatic feature selection methods are used to identify the most influential features and remove the less influential ones. Furthermore, over- and under-sampling are applied to reduce the effect of class imbalance in the benchmark. Finally, we compare our benchmark with other state-of-the-art benchmarks: it achieves a 6% higher area under the curve (AUC) than the other datasets in terms of generalization, and applying over-/under-sampling improves model performance by at least 2%.
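
As a rough illustration of two ideas named in the abstract, the sketch below shows class rebalancing by resampling and the area-under-the-curve metric. It is not the authors' pipeline: plain random over-/under-sampling stands in for whatever resampling methods the paper actually uses, and the function names `auc` and `resample`, the `target` parameter, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def auc(labels, scores):
    """AUC as the fraction of (bot, human) pairs ranked correctly.

    Labels are 1 for bot and 0 for human; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def resample(rows, labels, target, seed=0):
    """Random over-/under-sampling (an assumed, simplified stand-in).

    Every class ends up with exactly `target` rows: larger classes are
    sampled without replacement, smaller ones are padded with duplicates.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        if len(members) >= target:
            chosen = rng.sample(members, target)  # under-sample
        else:
            extra = rng.choices(members, k=target - len(members))
            chosen = members + extra  # over-sample with replacement
        out_rows.extend(chosen)
        out_labels.extend([y] * target)
    return out_rows, out_labels


# Toy data: 2 bots and 8 humans, rebalanced to 5 of each.
toy_rows = [[i] for i in range(10)]
toy_labels = [1, 1] + [0] * 8
balanced_rows, balanced_labels = resample(toy_rows, toy_labels, target=5)
print(Counter(balanced_labels))

# Three of the four (bot, human) pairs are ranked correctly.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

Resampling should be applied to the training split only, so that the AUC is measured on data with the original class balance.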

Author Biographies

Niyaz Jalal, (1) Department of Software and Informatics Engineering, College of Engineering, Salahaddin University-Erbil, Erbil 44001, Iraq. (2) Department of Computer Science, Knowledge University, Erbil 44001, Iraq

Niyaz Jalal received the B.Sc. degree in Software Engineering from the Software and Informatics Engineering Department, College of Engineering, Salahaddin University-Erbil, in 2017, where he is currently pursuing the master's degree in Software Engineering. His thesis concerns bot detection on social platforms. His research interests include cybersecurity, artificial intelligence, and nature-inspired algorithms.

Kayhan Z. Ghafoor, (1) Department of Software and Informatics Engineering, College of Engineering, Salahaddin University-Erbil, Erbil 44001, Iraq. (2) Department of Computer Science, Knowledge University, Erbil 44001, Iraq.

Kayhan Z. Ghafoor is an associate professor at Salahaddin University-Erbil and a visiting scholar at the University of Wolverhampton. Before that, he was a postdoctoral research fellow at Shanghai Jiao Tong University, where he contributed to two research projects funded by the National Natural Science Foundation of China and the National Key Research and Development Program. He also served as a visiting researcher at University Technology Malaysia. He received the B.Sc. degree in electrical engineering, the M.Sc. degree in remote weather monitoring, and the Ph.D. degree in wireless networks in 2003, 2006, and 2011, respectively. He is the recipient of the UTM Chancellor Award at the 48th UTM convocation in 2012.

Published
2022-09-10
How to Cite
Jalal, N. and Ghafoor, K. Z. (2022) “Machine Learning Algorithms for Detecting and Analyzing Social Bots Using a Novel Dataset”, ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 10(2), pp. 11-21. doi: 10.14500/aro.11032.