A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures

Keywords: Word Classification, Word Clustering, String Distance, String matching operation, String Similarity metric

Abstract

This paper aims to introduce an optimized Damerau–Levenshtein and dice-coefficients using enumeration operations (ODADNEN) for providing fast string similarity measure with maintaining the results accuracy; searching to find specific words within a large text is a hard job which takes a lot of time and efforts. The string similarity measure plays a critical role in many searching problems. In this paper, different experiments were conducted to handle some spelling mistakes. An enhanced algorithm for string similarity assessment was proposed. This algorithm is a combined set of well-known algorithms with some improvements (e.g. the dice-coefficient was modified to deal with numbers instead of characters using certain conditions). These algorithms were adopted after conducting on a number of experimental tests to check its suitability. The ODADNN algorithm was tested using real data; its performance was compared with the original similarity measure. The results indicated that the most convincing measure is the proposed hybrid measure, which uses the Damerau–Levenshtein and dicedistance based on n-gram of each word to handle; also, it requires less processing time in comparison with the standard algorithms. Furthermore, it provides efficient results to assess the similarity between two words without the need to restrict the word length.

Downloads

Download data is not yet available.

Author Biographies

Safa S. Abdul-Jabbar, Computer Department, College of Science for Women, University of Baghdad, Baghdad
She received her B.Sc degree in the field of computer science from University of Baghdad/ College of Science for Women, in 2009. Then, she received her master's degree from same University, at 2017. Now, she is working as assistant lecturer in computer science department at Baghdad University.
Loay E. George, Computer Department, College of Science, University of Baghdad, Baghdad
He has a Ph.D. degree in Physics/Digital Image Processing from University of Baghdad-College of Science. His main research interests are in multimedia, compression, artificial intelligence. Currently, he is working as Assistant Professor at the same University.

References

Alberto, B., Paolo, R., Eneko, A. and Gorka, L., 2010. Plagiarism detection across distant language Pairs. In: Proceedings of the 23rd International Conference on Computational Linguistics. pp.37-45.

Bunke, H. and Bühler, U., 1992. Invariant shape recognition using string matching. In: Proceedings of 2nd International Conference on Automation. Robotics and Computer Vision, Singapore.

Burnard, L., 1976. The University of Oxford Text Archive University of Oxford. Available from: http://www.ota.ox.ac.uk/catalogue/index.html.

Cortelazzo, G., Deretta, G., Mian, G.A. and Zamperoni, P., 1996. Normalized weighted Levensthein distance and triangle inequality in the context of similarity discrimination of bilevel images. Pattern Recognition Letters, 17(5), pp.431-436.

Cortelazzo, G., Mian, G.A., Vezzi, G. and Zamperoni, P., 1994. Trademark shapes description by string-matching techniques. Pattern Recognition, 27(8), pp.1005-1018.

Damerau, F.J., 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), pp.171-176.

Dice, L.R., 1945. Measures of the amount of ecologic association between species. Ecology, 26(3), pp.297-302.

Fenz, D., Lange, D., Rheinländer, A., Naumann, F. and Leser, U., 2012. Efficient similarity search in very large string sets. In: International Conference on Scientific and Statistical Database Management. Springer Berlin, Heidelberg, pp262-279.

Gomaa, W.H. and Fahmy, A.A., 2013. A survey of text similarity approaches. International Journal of Computer Applications, 68(13), 13-18.

Hall, P.A. and Dowling, G.R., 1980. Approximate string matching. ACM Computing Surveys (CSUR), 12(4), pp.381-402.

Kashiap, R.L. and Oommen, B.J., 1984. String correction using probabilistic models. Pattern Recognition Letters, 2, pp.147-154.

Leusch, G., Ueffing, N. and Ney, H., 2003. A novel string-to-string distance measure with applications to machine translation evaluation. In: Proceedings of MT Summit IX, pp.240-247.

Martins, B., 2011. A supervised machine learning approach for duplicate detection over gazetteer records. In: Proceedings of the 4th International Conference on Geospatial Semantics. Springer, Berlin Heidelberg, pp.34-51.

Marzal, A. and Vidal, E., 1993. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9), pp.926-932.

Mohri, M., 2003. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(6), pp.957-982.

Oommen, B.J., 1987. Recognition of noisy subsequences using constrained edit distances. IEEE Transactions on Pattern Analysis and Machine Intelligence,5, pp.676-685.

Pande, B.P., Pawan, T. and Dhami, H.S., 2013. Generation, implementation and appraisal of an N-gram based stemming algorithm. ArXivpreprint arXiv 1312- 4824. Available from: https://arxiv.org/ftp/arxiv/papers/1312/1312.4824.pdf.

Patel, D., 2016. Study of distance measurement techniques in context to prediction model of web caching and web prefetching. International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), 5(1), 1-8.

Peng, H.L. and Chen, S.Y., 1997. Trademark shape recognition using closed contours. Pattern Recognition Letters, 18(8), pp.791-803.

Pradhan, N., Gyanchandan, M. and Wadhvani, R., 2015. A review on text similarity technique used in IR and its application. International Journal of Computer Applications, 120(9), pp.29-34.

Rieck, K. and Wressnegger, C., 2016. Harry: A tool for measuring string similarity. Journal of Machine Learning Research, 17(9), pp.1-5. Sehgal, V., Getoor, L. and Viechnicki, P.D., 2006. Entity resolution in geospatial

data integration. In: Proceedings of the 14th Annual ACM International Symposium on Advances in Geographic Information Systems. ACM, New York, NY, pp.83-90.

Sellers, P.H., 1980. The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms, 1(4), pp.359-373.

Wang, J., Li, G. and Feng, J., 2014. Extending string similarity join to tolerant fuzzy token matching. ACM Transaction Database Systems, 39(1), pp.1-45.

Winkler, W.E., 1999. The state of record linkage and current research problems. In: Statistical Research Division, US Census Bureau. Available from: http://www.census.gov/srd/www/byname.html.

Published
2017-10-21
How to Cite
Abdul-Jabbar, S. S. and George, L. E. (2017) “A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures”, ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 5(2), pp. 6-18. doi: 10.14500/aro.10180.
Section
Articles