Clustering semantically heterogeneous distributed aggregate databases

Research output: Contribution to journalArticle

Abstract

Databases developed independently in a common open distributed environment may be heterogeneous with respect to both data schema and the embedded semantics. Managing schema and semantic heterogeneities brings considerable challenges to learning from distributed data and to support applications involving cooperation between different organisations. In this paper, we are concerned mainly with heterogeneous databases that hold aggregates on a set of attributes, which are often the result of materialised views of native large-scale distributed databases. A model-based clustering algorithm is proposed to construct a mixture model where each component corresponds to a cluster which is used to capture the contextual heterogeneity among databases from different populations. Schema heterogeneity, which can be recast as incomplete information, is handled within the clustering process using Expectation-Maximisation estimation and integration is carried out within a clustering iteration. Our proposed algorithm resolves the schema heterogeneity as part of the clustering process, thus avoiding transformation of the data into a unified schema. Results of algorithm evaluation on classification, scalability and reliability, using both real and synthetic data, demonstrate that our algorithm can achieve good performance by incorporating all of the information from available heterogeneous data. Our clustering approach has great potential for scalable knowledge discovery from semantically heterogeneous databases and for applications in an open distributed environment, such as the Semantic Web.
LanguageEnglish
Pages331-364
JournalKnowledge and Information Systems
Volume38
Issue number2
DOIs
Publication statusPublished - Feb 2014

Fingerprint

Semantics
Clustering algorithms
Data mining
Scalability

Cite this

@article{669b57b966a840ac8c3d8fab9f1971bf,
title = "Clustering semantically heterogeneous distributed aggregate databases",
abstract = "Databases developed independently in a common open distributed environment may be heterogeneous with respect to both data schema and the embedded semantics. Managing schema and semantic heterogeneities brings considerable challenges to learning from distributed data and to support applications involving cooperation between different organisations. In this paper, we are concerned mainly with heterogeneous databases that hold aggregates on a set of attributes, which are often the result of materialised views of native large-scale distributed databases. A model-based clustering algorithm is proposed to construct a mixture model where each component corresponds to a cluster which is used to capture the contextual heterogeneity among databases from different populations. Schema heterogeneity, which can be recast as incomplete information, is handled within the clustering process using Expectation-Maximisation estimation and integration is carried out within a clustering iteration. Our proposed algorithm resolves the schema heterogeneity as part of the clustering process, thus avoiding transformation of the data into a unified schema. Results of algorithm evaluation on classification, scalability and reliability, using both real and synthetic data, demonstrate that our algorithm can achieve good performance by incorporating all of the information from available heterogeneous data. Our clustering approach has great potential for scalable knowledge discovery from semantically heterogeneous databases and for applications in an open distributed environment, such as the Semantic Web.",
author = "Shuai Zhang and Sally McClean and Bryan Scotney",
year = "2014",
month = "2",
doi = "10.1007/s10115-012-0588-4",
language = "English",
volume = "38",
pages = "331--364",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
number = "2",

}

Clustering semantically heterogeneous distributed aggregate databases. / Zhang, Shuai; McClean, Sally; Scotney, Bryan.

In: Knowledge and Information Systems, Vol. 38, No. 2, 02.2014, p. 331-364.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Clustering semantically heterogeneous distributed aggregate databases

AU - Zhang, Shuai

AU - McClean, Sally

AU - Scotney, Bryan

PY - 2014/2

Y1 - 2014/2

N2 - Databases developed independently in a common open distributed environment may be heterogeneous with respect to both data schema and the embedded semantics. Managing schema and semantic heterogeneities brings considerable challenges to learning from distributed data and to support applications involving cooperation between different organisations. In this paper, we are concerned mainly with heterogeneous databases that hold aggregates on a set of attributes, which are often the result of materialised views of native large-scale distributed databases. A model-based clustering algorithm is proposed to construct a mixture model where each component corresponds to a cluster which is used to capture the contextual heterogeneity among databases from different populations. Schema heterogeneity, which can be recast as incomplete information, is handled within the clustering process using Expectation-Maximisation estimation and integration is carried out within a clustering iteration. Our proposed algorithm resolves the schema heterogeneity as part of the clustering process, thus avoiding transformation of the data into a unified schema. Results of algorithm evaluation on classification, scalability and reliability, using both real and synthetic data, demonstrate that our algorithm can achieve good performance by incorporating all of the information from available heterogeneous data. Our clustering approach has great potential for scalable knowledge discovery from semantically heterogeneous databases and for applications in an open distributed environment, such as the Semantic Web.

AB - Databases developed independently in a common open distributed environment may be heterogeneous with respect to both data schema and the embedded semantics. Managing schema and semantic heterogeneities brings considerable challenges to learning from distributed data and to support applications involving cooperation between different organisations. In this paper, we are concerned mainly with heterogeneous databases that hold aggregates on a set of attributes, which are often the result of materialised views of native large-scale distributed databases. A model-based clustering algorithm is proposed to construct a mixture model where each component corresponds to a cluster which is used to capture the contextual heterogeneity among databases from different populations. Schema heterogeneity, which can be recast as incomplete information, is handled within the clustering process using Expectation-Maximisation estimation and integration is carried out within a clustering iteration. Our proposed algorithm resolves the schema heterogeneity as part of the clustering process, thus avoiding transformation of the data into a unified schema. Results of algorithm evaluation on classification, scalability and reliability, using both real and synthetic data, demonstrate that our algorithm can achieve good performance by incorporating all of the information from available heterogeneous data. Our clustering approach has great potential for scalable knowledge discovery from semantically heterogeneous databases and for applications in an open distributed environment, such as the Semantic Web.

U2 - 10.1007/s10115-012-0588-4

DO - 10.1007/s10115-012-0588-4

M3 - Article

VL - 38

SP - 331

EP - 364

JO - Knowledge and Information Systems

T2 - Knowledge and Information Systems

JF - Knowledge and Information Systems

SN - 0219-1377

IS - 2

ER -