A multi-dimensional sequence approach to measuring tree similarity

Research output: Research - peer-reviewArticle

  • 11 Citations

Abstract

Tree is one of the most common and well-studied data structures in computer science. Measuring the similarity of such structures is key to analyzing this type of data. However, measuring tree similarity is not trivial due to the inherent complexity of trees and the ensuing large search space. In this paper, trees are represented as multi-dimensional sequences and their similarity is measured on the basis of their sequence representations. Multidimensional sequences have their sequential dimensions and spatial dimensions. We measure the sequential similarity by the all common subsequences sequence similarity measurement or longest common subsequence measurement, and measure the spatial similarity by dynamic time warping. Then we combine them to give a measure of tree similarity. A brute force algorithm to calculate this similarity will have high computational cost. In the spirit of dynamic programming two efficient algorithms are designed for calculating this similarity, which have quadratic time complexity. The new measurements are evaluated in terms of classification accuracy in two popular classifiers (k-nearest neighbor and support vector machine) and in terms of search effectiveness and efficiency in kNN similarity search, using 3 different datasets from natural language processing and information retrieval. Experimental results show that the new measurements outperform the benchmark measures consistently and significantly.
LanguageEnglish
Pages197-208
Number of pages13
JournalIEEE Transactions on Knowledge and Data Engineering
Volume24
Issue number2
DOIs
StatePublished - Feb 2012

Fingerprint

Information retrieval
Dynamic programming
Computer science
Support vector machines
Data structures
Classifiers
Processing
Costs

Cite this

@article{f651b850604341b095f8875c7b7c8430,
title = "A multi-dimensional sequence approach to measuring tree similarity",
abstract = "Tree is one of the most common and well-studied data structures in computer science. Measuring the similarity of such structures is key to analyzing this type of data. However, measuring tree similarity is not trivial due to the inherent complexity of trees and the ensuing large search space. In this paper, trees are represented as multi-dimensional sequences and their similarity is measured on the basis of their sequence representations. Multidimensional sequences have their sequential dimensions and spatial dimensions. We measure the sequential similarity by the all common subsequences sequence similarity measurement or longest common subsequence measurement, and measure the spatial similarity by dynamic time warping. Then we combine them to give a measure of tree similarity. A brute force algorithm to calculate this similarity will have high computational cost. In the spirit of dynamic programming two efficient algorithms are designed for calculating this similarity, which have quadratic time complexity. The new measurements are evaluated in terms of classification accuracy in two popular classifiers (k-nearest neighbor and support vector machine) and in terms of search effectiveness and efficiency in kNN similarity search, using 3 different datasets from natural language processing and information retrieval. Experimental results show that the new measurements outperform the benchmark measures consistently and significantly.",
author = "Zhiwei Lin and Hui Wang and Sally McClean",
year = "2012",
month = "2",
doi = "10.1109/TKDE.2010.239",
volume = "24",
pages = "197--208",
number = "2",

}

A multi-dimensional sequence approach to measuring tree similarity. / Lin, Zhiwei; Wang, Hui; McClean, Sally.

Vol. 24, No. 2, 02.2012, p. 197-208.

Research output: Research - peer-reviewArticle

TY - JOUR

T1 - A multi-dimensional sequence approach to measuring tree similarity

AU - Lin,Zhiwei

AU - Wang,Hui

AU - McClean,Sally

PY - 2012/2

Y1 - 2012/2

N2 - Tree is one of the most common and well-studied data structures in computer science. Measuring the similarity of such structures is key to analyzing this type of data. However, measuring tree similarity is not trivial due to the inherent complexity of trees and the ensuing large search space. In this paper, trees are represented as multi-dimensional sequences and their similarity is measured on the basis of their sequence representations. Multidimensional sequences have their sequential dimensions and spatial dimensions. We measure the sequential similarity by the all common subsequences sequence similarity measurement or longest common subsequence measurement, and measure the spatial similarity by dynamic time warping. Then we combine them to give a measure of tree similarity. A brute force algorithm to calculate this similarity will have high computational cost. In the spirit of dynamic programming two efficient algorithms are designed for calculating this similarity, which have quadratic time complexity. The new measurements are evaluated in terms of classification accuracy in two popular classifiers (k-nearest neighbor and support vector machine) and in terms of search effectiveness and efficiency in kNN similarity search, using 3 different datasets from natural language processing and information retrieval. Experimental results show that the new measurements outperform the benchmark measures consistently and significantly.

AB - Tree is one of the most common and well-studied data structures in computer science. Measuring the similarity of such structures is key to analyzing this type of data. However, measuring tree similarity is not trivial due to the inherent complexity of trees and the ensuing large search space. In this paper, trees are represented as multi-dimensional sequences and their similarity is measured on the basis of their sequence representations. Multidimensional sequences have their sequential dimensions and spatial dimensions. We measure the sequential similarity by the all common subsequences sequence similarity measurement or longest common subsequence measurement, and measure the spatial similarity by dynamic time warping. Then we combine them to give a measure of tree similarity. A brute force algorithm to calculate this similarity will have high computational cost. In the spirit of dynamic programming two efficient algorithms are designed for calculating this similarity, which have quadratic time complexity. The new measurements are evaluated in terms of classification accuracy in two popular classifiers (k-nearest neighbor and support vector machine) and in terms of search effectiveness and efficiency in kNN similarity search, using 3 different datasets from natural language processing and information retrieval. Experimental results show that the new measurements outperform the benchmark measures consistently and significantly.

U2 - 10.1109/TKDE.2010.239

DO - 10.1109/TKDE.2010.239

M3 - Article

VL - 24

SP - 197

EP - 208

IS - 2

ER -