An Investigation on Disparity Responds of Machine Learning Algorithms to Data Normalization Method

  • Haval A. Ahmed Department of Software Engineering, Faculty of Engineering, Koya University, Koya KOY45, Kurdistan Region, F.R. Iraq https://orcid.org/0000-0002-0238-0874
  • Peshawa J. Muhammad Ali Department of Software Engineering, Faculty of Engineering, Koya University, Koya KOY45, Kurdistan Region, F.R. Iraq https://orcid.org/0000-0002-0471-5172
  • Abdulbasit K. Faeq Department of Software Engineering, Faculty of Engineering, Koya University, Koya KOY45, Kurdistan Region, F.R. Iraq https://orcid.org/0000-0001-6328-204X
  • Saman M. Abdullah (1) Department of Software Engineering, Faculty of Engineering, Koya University, Koya KOY45, Kurdistan Region, F.R. Iraq. (2) Department of Computer Engineering, Faculty of Engineering, Tishk International University, Erbil, Kurdistan Region, F.R. Iraq
Keywords: Min-max normalization, Support vector machine, Artificial neural network, Euclidean-based K-nearest neighbor, Mean squared error

Abstract

Data normalization can eliminate the effect of inconsistent feature ranges in some machine learning (ML) techniques and speed up the optimization process in others. Many studies apply different data normalization methods with the aim of reducing or eliminating the impact of data variance on the accuracy of ML-based models. However, how the significance of this impact relates to the mathematical basis of each ML algorithm still needs further investigation and testing. To address this, this work proposes an investigation methodology involving three ML algorithms: support vector machine (SVM), artificial neural network (ANN), and Euclidean-based K-nearest neighbor (E-KNN). Five datasets are used, each taken from a different application field and with different statistical properties. Although many data normalization methods are available, this work focuses on the min-max method because it directly eliminates the effect of inconsistent ranges in the datasets. Other factors that complicate min-max normalization, such as including or excluding outliers or the least significant feature, are also considered. The findings show that each ML technique responds differently to min-max normalization. The performance of SVM models improved, while the performance of ANN models showed no significant improvement. The performance of E-KNN models may improve or degrade with min-max normalization, depending on the statistical properties of the dataset.
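
As an illustration of why a Euclidean-based method can be sensitive to feature ranges, the following is a minimal sketch of min-max normalization, x' = (x - min) / (max - min), applied per feature, followed by a nearest-neighbor lookup before and after scaling. It is not the authors' experimental code: the toy data, the function names `min_max_normalize` and `nearest_index`, and the choice to map a constant column to 0 are all assumptions made for this example.

```python
import math

def min_max_normalize(rows):
    """Rescale each column of `rows` to [0, 1] via x' = (x - min) / (max - min)."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (max == min) is mapped to 0.0.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

def nearest_index(rows, query_idx):
    """Index of the row closest (Euclidean) to rows[query_idx], excluding itself."""
    query = rows[query_idx]
    best, best_dist = None, math.inf
    for i, row in enumerate(rows):
        if i == query_idx:
            continue
        d = math.dist(query, row)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best, best_dist = i, d
    return best

# Hypothetical toy data: feature 0 spans 0..1, feature 1 spans 0..1000.
data = [
    [0.10, 500.0],    # query point
    [0.90, 510.0],    # close in feature 1, far in feature 0
    [0.12, 600.0],    # close in feature 0, farther in feature 1
    [0.00, 0.0],
    [1.00, 1000.0],
]

raw_neighbor = nearest_index(data, 0)
scaled_neighbor = nearest_index(min_max_normalize(data), 0)
print(raw_neighbor, scaled_neighbor)
```

On the raw data the wide-range feature dominates the distance, so the nearest neighbor of the query is row 1; after min-max scaling both features contribute comparably and the nearest neighbor becomes row 2. Whether such a change helps or hurts a model depends on which feature actually carries the signal, which is consistent with the abstract's observation that E-KNN performance may improve or degrade.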

Author Biographies

Haval A. Ahmed, Department of Software Engineering, Faculty of Engineering, Koya University, Koya KOY45, Kurdistan Region, F.R. Iraq

Haval A. Ahmed is an Assistant Lecturer at the Department of Software Engineering, Koya University. He received B.Sc. and M.Sc. degrees in Software Engineering from Salahaddin University-Erbil in 2007 and 2014, respectively. His research interests include neural networks, fuzzy systems, computer vision, face detection and recognition, and open source technology. Mr. Haval is a practitioner Software Engineer at the Kurdistan Engineers Union, Iraq.

Peshawa J. Muhammad Ali, Department of Software Engineering, Faculty of Engineering, Koya University, Koya KOY45, Kurdistan Region, F.R. Iraq

Peshawa Jamal Muhammad Ali is an Assistant Professor of Computer Science at the Faculty of Engineering, Koya University. His research interests focus on machine learning, data science, and data mining.

Abdulbasit K. Faeq, Department of Software Engineering, Faculty of Engineering, Koya University, Koya KOY45, Kurdistan Region, F.R. Iraq

Abdulbasit Al-Talabani is an Assistant Professor at the Department of Software Engineering, Faculty of Engineering, Koya University. He holds a B.Sc. in mathematics from Salahaddin University, Iraq; an M.Sc. in Computer Science from Koya University, Iraq; and a Ph.D. in applied computing from Buckingham University, UK. His research interests are machine learning, speech processing, and computer vision.

Saman M. Abdullah, (1) Department of Software Engineering, Faculty of Engineering, Koya University, Koya KOY45, Kurdistan Region, F.R. Iraq. (2) Department of Computer Engineering, Faculty of Engineering, Tishk International University, Erbil, Kurdistan Region, F.R. Iraq

Saman M. Abdullah is an Assistant Professor at the Department of Software Engineering, Faculty of Engineering, Koya University. He received a B.Sc. degree in Electronic Engineering, an M.Sc. degree in Computer Security, and a Ph.D. degree in Malware Detection Systems. His research interests are IoT security, machine learning, and data science. Dr. Saman is a member of the IEEE and the ACM.

 

Published
2022-09-19
How to Cite
Ahmed, H. A., Muhammad Ali, P. J., Faeq, A. K. and Abdullah, S. M. (2022) “An Investigation on Disparity Responds of Machine Learning Algorithms to Data Normalization Method”, ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 10(2), pp. 29-37. doi: 10.14500/aro.10970.
Section
Review Articles