A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures
Abstract
This paper aims to introduce an optimized Damerau–Levenshtein and dice-coefficients using enumeration operations (ODADNEN) for providing fast string similarity measure with maintaining the results accuracy; searching to find specific words within a large text is a hard job which takes a lot of time and efforts. The string similarity measure plays a critical role in many searching problems. In this paper, different experiments were conducted to handle some spelling mistakes. An enhanced algorithm for string similarity assessment was proposed. This algorithm is a combined set of well-known algorithms with some improvements (e.g. the dice-coefficient was modified to deal with numbers instead of characters using certain conditions). These algorithms were adopted after conducting on a number of experimental tests to check its suitability. The ODADNN algorithm was tested using real data; its performance was compared with the original similarity measure. The results indicated that the most convincing measure is the proposed hybrid measure, which uses the Damerau–Levenshtein and dicedistance based on n-gram of each word to handle; also, it requires less processing time in comparison with the standard algorithms. Furthermore, it provides efficient results to assess the similarity between two words without the need to restrict the word length.
Downloads
References
Alberto, B., Paolo, R., Eneko, A. and Gorka, L., 2010. Plagiarism detection across distant language Pairs. In: Proceedings of the 23rd International Conference on Computational Linguistics. pp.37-45.
Bunke, H. and Bühler, U., 1992. Invariant shape recognition using string matching. In: Proceedings of 2nd International Conference on Automation. Robotics and Computer Vision, Singapore.
Burnard, L., 1976. The University of Oxford Text Archive University of Oxford. Available from: http://www.ota.ox.ac.uk/catalogue/index.html.
Cortelazzo, G., Deretta, G., Mian, G.A. and Zamperoni, P., 1996. Normalized weighted Levensthein distance and triangle inequality in the context of similarity discrimination of bilevel images. Pattern Recognition Letters, 17(5), pp.431-436.
Cortelazzo, G., Mian, G.A., Vezzi, G. and Zamperoni, P., 1994. Trademark shapes description by string-matching techniques. Pattern Recognition, 27(8), pp.1005-1018.
Damerau, F.J., 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), pp.171-176.
Dice, L.R., 1945. Measures of the amount of ecologic association between species. Ecology, 26(3), pp.297-302.
Fenz, D., Lange, D., Rheinländer, A., Naumann, F. and Leser, U., 2012. Efficient similarity search in very large string sets. In: International Conference on Scientific and Statistical Database Management. Springer Berlin, Heidelberg, pp262-279.
Gomaa, W.H. and Fahmy, A.A., 2013. A survey of text similarity approaches. International Journal of Computer Applications, 68(13), 13-18.
Hall, P.A. and Dowling, G.R., 1980. Approximate string matching. ACM Computing Surveys (CSUR), 12(4), pp.381-402.
Kashiap, R.L. and Oommen, B.J., 1984. String correction using probabilistic models. Pattern Recognition Letters, 2, pp.147-154.
Leusch, G., Ueffing, N. and Ney, H., 2003. A novel string-to-string distance measure with applications to machine translation evaluation. In: Proceedings of MT Summit IX, pp.240-247.
Martins, B., 2011. A supervised machine learning approach for duplicate detection over gazetteer records. In: Proceedings of the 4th International Conference on Geospatial Semantics. Springer, Berlin Heidelberg, pp.34-51.
Marzal, A. and Vidal, E., 1993. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9), pp.926-932.
Mohri, M., 2003. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(6), pp.957-982.
Oommen, B.J., 1987. Recognition of noisy subsequences using constrained edit distances. IEEE Transactions on Pattern Analysis and Machine Intelligence,5, pp.676-685.
Pande, B.P., Pawan, T. and Dhami, H.S., 2013. Generation, implementation and appraisal of an N-gram based stemming algorithm. ArXivpreprint arXiv 1312- 4824. Available from: https://arxiv.org/ftp/arxiv/papers/1312/1312.4824.pdf.
Patel, D., 2016. Study of distance measurement techniques in context to prediction model of web caching and web prefetching. International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), 5(1), 1-8.
Peng, H.L. and Chen, S.Y., 1997. Trademark shape recognition using closed contours. Pattern Recognition Letters, 18(8), pp.791-803.
Pradhan, N., Gyanchandan, M. and Wadhvani, R., 2015. A review on text similarity technique used in IR and its application. International Journal of Computer Applications, 120(9), pp.29-34.
Rieck, K. and Wressnegger, C., 2016. Harry: A tool for measuring string similarity. Journal of Machine Learning Research, 17(9), pp.1-5. Sehgal, V., Getoor, L. and Viechnicki, P.D., 2006. Entity resolution in geospatial
data integration. In: Proceedings of the 14th Annual ACM International Symposium on Advances in Geographic Information Systems. ACM, New York, NY, pp.83-90.
Sellers, P.H., 1980. The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms, 1(4), pp.359-373.
Wang, J., Li, G. and Feng, J., 2014. Extending string similarity join to tolerant fuzzy token matching. ACM Transaction Database Systems, 39(1), pp.1-45.
Winkler, W.E., 1999. The state of record linkage and current research problems. In: Statistical Research Division, US Census Bureau. Available from: http://www.census.gov/srd/www/byname.html.
Copyright (c) 2017 Safa S. Abdul-Jabbar, Loay E. George
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors who choose to publish their work with Aro agree to the following terms:
-
Authors retain the copyright to their work and grant the journal the right of first publication. The work is simultaneously licensed under a Creative Commons Attribution License [CC BY-NC-SA 4.0]. This license allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
-
Authors have the freedom to enter into separate agreements for the non-exclusive distribution of the journal's published version of the work. This includes options such as posting it to an institutional repository or publishing it in a book, as long as proper acknowledgement is given to its initial publication in this journal.
-
Authors are encouraged to share and post their work online, including in institutional repositories or on their personal websites, both prior to and during the submission process. This practice can lead to productive exchanges and increase the visibility and citation of the published work.
By agreeing to these terms, authors acknowledge the importance of open access and the benefits it brings to the scholarly community.