The tree is one of the most common and well-studied data structures in computer science. Measuring the similarity of such structures is key to analyzing this type of data. However, measuring tree similarity is not trivial because of the inherent complexity of trees and the resulting large search space. In this paper, trees are represented as multi-dimensional sequences, and their similarity is measured on the basis of these sequence representations. Multi-dimensional sequences have both sequential and spatial dimensions. We measure sequential similarity with either the all common subsequences measure or the longest common subsequence measure, and we measure spatial similarity with dynamic time warping. The two are then combined to give a measure of tree similarity. A brute-force algorithm for calculating this similarity has a high computational cost. In the spirit of dynamic programming, two efficient algorithms with quadratic time complexity are designed for calculating it. The new measures are evaluated in terms of classification accuracy with two popular classifiers (k-nearest neighbor and support vector machine) and in terms of search effectiveness and efficiency in kNN similarity search, using three different datasets from natural language processing and information retrieval. Experimental results show that the new measures outperform the benchmark measures consistently and significantly.
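To make the two building blocks named in the abstract concrete, the following is a minimal sketch of the standard textbook dynamic programs for longest common subsequence length and dynamic time warping distance, each running in quadratic time. It is not the paper's algorithm: the abstract does not say how trees are encoded as multi-dimensional sequences or how the two scores are combined, so the function names `lcs_length` and `dtw_distance`, the one-dimensional inputs, and the absolute-difference cost are all assumptions made for illustration.

```python
from math import inf


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (O(len(a) * len(b)))."""
    prev = [0] * (len(b) + 1)  # DP row for the previous prefix of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip x or skip y
        prev = curr
    return prev[-1]


def dtw_distance(s, t, cost=lambda p, q: abs(p - q)):
    """Classic dynamic time warping distance between numeric sequences s and t.

    The absolute-difference cost is an illustrative assumption, not taken
    from the paper.
    """
    n, m = len(s), len(t)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(s[i - 1], t[j - 1])
            # Best of: advance in s only, advance in t only, advance in both.
            d[i][j] = c + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` returns 4, and `dtw_distance([1, 2, 3], [1, 2, 2, 3])` returns 0.0 because warping lets the repeated 2 align at no cost.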
|Number of pages||13|
|Journal||IEEE Transactions on Knowledge and Data Engineering|
|Publication status||Published (in print/issue) - Feb 2012|