An Investigation on Disparity Responds of Machine Learning Algorithms to Data Normalization Method
Data normalization can eliminate the effect of inconsistent feature ranges in some machine learning (ML) techniques and speed up optimization in others. Many studies apply different normalization methods to reduce or eliminate the impact of data variance on the accuracy of ML-based models. However, how the significance of this impact relates to the mathematical basis of each ML algorithm still needs further investigation and testing. To address this, this work proposes an investigation methodology involving three ML algorithms: support vector machine (SVM), artificial neural network (ANN), and Euclidean-based K-nearest neighbor (E-KNN). Five datasets are used, each taken from a different application field and with different statistical properties. Although many data normalization methods are available, this work focuses on the min-max method because it directly eliminates the effect of inconsistent ranges in the datasets. Other factors that complicate min-max normalization, such as including or excluding outliers or the least significant feature, are also considered. The findings show that each ML technique responds differently to min-max normalization. The performance of SVM models improved, while the performance of ANN models showed no significant improvement. The performance of E-KNN models may improve or degrade with min-max normalization, depending on the statistical properties of the dataset.
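To make the abstract's two central points concrete, the following minimal sketch shows how min-max rescaling, x' = (x - min) / (max - min), can change which point a Euclidean nearest-neighbor search selects, and how a single outlier compresses the remaining values. It is an illustration only, not the authors' experimental code: the toy data and the helper names `min_max_normalize` and `nearest_neighbor` are assumptions introduced here.

```python
import math

def min_max_normalize(rows):
    """Rescale each column to [0, 1] using x' = (x - min) / (max - min)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return [
        [
            (x - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for x, lo, hi in zip(row, mins, maxs)
        ]
        for row in rows
    ]

def nearest_neighbor(rows, query_index):
    """Index of the row closest to rows[query_index] by Euclidean distance."""
    query = rows[query_index]
    candidates = [i for i in range(len(rows)) if i != query_index]
    return min(candidates, key=lambda i: math.dist(query, rows[i]))

# Hypothetical toy data: feature 0 spans thousands, feature 1 spans about [0, 1].
data = [
    [1000.0, 0.10],  # query point
    [1100.0, 0.90],  # close on feature 0, far on feature 1
    [3000.0, 0.12],  # far on feature 0, close on feature 1
    [5000.0, 1.00],
]

# On raw data the wide-range feature dominates the distance.
raw_neighbor = nearest_neighbor(data, 0)

# After min-max scaling both features contribute on the same [0, 1] scale.
scaled_neighbor = nearest_neighbor(min_max_normalize(data), 0)

# Outlier sensitivity: one extreme value squeezes the rest toward 0.
with_outlier = min_max_normalize([[1.0], [2.0], [3.0], [100.0]])
without_outlier = min_max_normalize([[1.0], [2.0], [3.0]])

print(raw_neighbor, scaled_neighbor)
print(with_outlier[2][0], without_outlier[2][0])
```

In this toy case the nearest neighbor of the query switches from row 1 to row 2 once both features share a common range. Whether such a switch helps or hurts accuracy depends on how informative each feature is, which is consistent with the abstract's observation that E-KNN may either improve or degrade under min-max normalization.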
Copyright (c) 2022 Haval A. Ahmed, Peshawa J. Muhammad Ali, Abdulbasit K. Faeq, Saman M. Abdullah
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.