Web Page Ranking Based on Text Content and Link Information Using Data Mining Techniques

Keywords: Information retrieval, JSON API, Programmable (CSE), Search engine, World Wide Web, Web page ranking

Abstract

The rapid expansion of the Internet has given anyone access to a vast array of information online. However, as the volume of web content continues to grow exponentially, search engines face challenges in delivering relevant results. Early search engines relied primarily on the words or phrases found within web pages to index and rank them. While this approach had its merits, it often produced irrelevant or inaccurate results. To address this issue, more advanced search engines began incorporating the hyperlink structure of web pages to help determine their relevance. This method improved retrieval accuracy to some extent, but it remained limited because it did not consider the actual content of web pages. The objective of this work is to enhance Web Information Retrieval methods by combining three key components: text content analysis, link analysis, and log file analysis. By integrating insights from these data sources, the goal is to rank the relevant web pages in the retrieved document set more accurately and effectively, ultimately improving the user experience and delivering more precise search results. The proposed system was tested with both multi-word and single-word queries, and the results were evaluated using relative recall, precision, and F-measure. When compared to Google’s PageRank algorithm, the proposed system demonstrated superior performance, achieving 81% mean average precision, 56% average relative recall, and a 66% F-measure.
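The reported figures can be related to one another with a short sketch. It assumes the F-measure is the balanced harmonic mean of precision and recall, and that relative recall is the number of relevant pages one system retrieves divided by the relevant pages retrieved by all compared systems; the abstract does not state either formula, so both are assumptions here. The function names and the page counts are hypothetical and chosen only for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (assumed formula)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_recall(relevant_by_system: int, relevant_by_all_systems: int) -> float:
    """Relevant pages found by one system over those found by all compared systems."""
    if relevant_by_all_systems == 0:
        return 0.0
    return relevant_by_system / relevant_by_all_systems


# Figures taken from the abstract: 81% mean average precision, 56% relative recall.
score = f_measure(0.81, 0.56)
print(f"F-measure: {score:.2f}")

# Hypothetical counts, for illustration only: 14 relevant pages out of a pool of 25.
print(f"Relative recall: {relative_recall(14, 25):.2f}")
```

Under these assumptions, combining the reported precision and relative recall gives a value that rounds to the 66% F-measure quoted above.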

Published
2024-02-16
How to Cite
Naamha, E. Q. and Abdulmunim, M. E. (2024) “Web Page Ranking Based on Text Content and Link Information Using Data Mining Techniques”, ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 12(1), pp. 29-40. doi: 10.14500/aro.11397.