Machine Learning Algorithms for Detecting and Analyzing Social Bots Using a Novel Dataset
Social media is an internet-based technology and an electronic form of communication that facilitates the sharing of ideas, documents, and personal information. Twitter is a microblogging platform and is among the most effective social services for posting microblogs, liking, commenting, sharing, and communicating with others. The problem this paper sheds light on is the misuse of bots on Twitter. Bots are meant to automate specific repetitive tasks that would otherwise require human interaction. However, they are misused to influence people's opinions by spreading rumors and conspiracy theories about controversial topics. In this paper, we introduce a new benchmark built from 1.5M Twitter profiles. We train several supervised machine learning models on our benchmark to detect bots on Twitter. In addition to increasing the benchmark's scalability, we apply several automatic feature selection methods to identify the most influential features and remove the less influential ones. Furthermore, over-/under-sampling is applied to reduce the effect of class imbalance in the benchmark. Finally, we compare our benchmark with other state-of-the-art benchmarks: it achieves a 6% higher area under the curve (AUC) than the other datasets in the generalization setting, and over-/under-sampling improves model performance by at least 2%.
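To make the resampling and evaluation steps mentioned above concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not name the resampling algorithm, so the sketch assumes simple random over-/under-sampling of a binary (bot vs. human) dataset, and the function names `over_under_sample` and `roc_auc` are hypothetical.

```python
import random
from collections import Counter


def over_under_sample(rows, labels, seed=0):
    """Resample a binary-labelled dataset so both classes have equal size.

    Assumption: plain random resampling. The target size per class is half
    the total number of rows, so the majority class is under-sampled without
    replacement and the minority class is over-sampled with replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    if len(by_class) != 2:
        raise ValueError("expected exactly two classes")

    target = len(rows) // 2
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target:
            # Majority class: keep a random subset.
            chosen = rng.sample(members, target)
        else:
            # Minority class: keep everything, then add random duplicates.
            chosen = members + rng.choices(members, k=target - len(members))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def roc_auc(scores, labels):
    """Area under the ROC curve for labels in {0, 1}.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy usage: 2 "bot" rows (label 1) and 8 "human" rows (label 0).
rows = list(range(10))
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced_rows, balanced_labels = over_under_sample(rows, labels)
class_counts = Counter(balanced_labels)

example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In practice, resampling would be applied to the training split only, so that the AUC is measured on data with the original class distribution.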
Copyright (c) 2022 Niyaz Jalal, Kayhan Z. Ghafoor
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.