Knowledge Discovery from Semantically Heterogeneous Aggregate Databases Using Model-Based Clustering

Research output: Chapter in Book/Report/Conference proceeding › Chapter

1 Citation (Scopus)

Abstract

When distributed databases are developed independently, they may be semantically heterogeneous with respect to data granularity, scheme information and the embedded semantics. However, most traditional distributed knowledge discovery (DKD) methods assume that the distributed databases derive from a single virtual global table, where they share the same semantics and data structures. This data heterogeneity and the underlying semantics bring a considerable challenge for DKD. In this paper, we propose a model-based clustering method for aggregate databases, where the heterogeneous schema structure is due to the heterogeneous classification schema. The underlying semantics can be captured by different clusters. The clustering is carried out via a mixture model, where each component of the mixture corresponds to a different virtual global table. An advantage of our approach is that the algorithm resolves the heterogeneity as part of the clustering process without previously having to homogenise the heterogeneous local schema to a shared schema. Evaluation of the algorithm is carried out using both real and synthetic data. Scalability of the algorithm is tested against the number of databases to be clustered; the number of clusters; and the size of the databases. The relationship between performance and complexity is also evaluated. Our experiments show that this approach has good potential for scalable integration of semantically heterogeneous databases.
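The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each database is a table of counts over its own local classes, that each local class covers a known set of shared base categories, and that each cluster is a multinomial over those base categories fitted by EM. The function name `cluster_aggregates`, the input layout, the deterministic initialisation and the toy counts are all hypothetical.

```python
import math


def cluster_aggregates(databases, n_base, n_clusters, n_iter=30):
    """Cluster aggregate count tables with a mixture of multinomials via EM.

    databases: list of tables; each table is a list of (categories, count)
    pairs, where `categories` is a tuple of base-category indices covered by
    one local class. Returns (labels, probs, weights).
    """
    # Deterministic, distinct starting points (an arbitrary illustrative
    # choice; the rotations are distinct only while n_clusters <= n_base).
    probs = []
    for k in range(n_clusters):
        raw = [1.0 + (j + k) % n_base for j in range(n_base)]
        probs.append([v / sum(raw) for v in raw])
    weights = [1.0 / n_clusters] * n_clusters
    resp = []

    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each database. A local
        # class has probability equal to the sum over its base categories.
        resp = []
        for table in databases:
            logp = []
            for k in range(n_clusters):
                ll = math.log(weights[k])
                for cats, count in table:
                    ll += count * math.log(sum(probs[k][j] for j in cats))
                logp.append(ll)
            top = max(logp)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - top) for v in logp]
            resp.append([u / sum(unnorm) for u in unnorm])

        # M-step: split each aggregated count across its base categories in
        # proportion to the current cluster probabilities, then renormalise.
        for k in range(n_clusters):
            expected = [1e-9] * n_base  # tiny smoothing keeps logs finite
            for r, table in zip(resp, databases):
                for cats, count in table:
                    mass = sum(probs[k][j] for j in cats)
                    for j in cats:
                        expected[j] += r[k] * count * probs[k][j] / mass
            total = sum(expected)
            probs[k] = [e / total for e in expected]
            weights[k] = (sum(r[k] for r in resp) + 1e-9) / (
                len(databases) + n_clusters * 1e-9
            )

    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return labels, probs, weights


# Toy data: three base categories (0, 1, 2), four databases, two partitions.
databases = [
    [((0,), 80), ((1, 2), 20)],  # fine on category 0, coarse on {1, 2}
    [((0, 1), 90), ((2,), 10)],  # coarse on {0, 1}, fine on category 2
    [((0,), 10), ((1, 2), 90)],
    [((0, 1), 20), ((2,), 80)],
]
labels, probs, weights = cluster_aggregates(databases, n_base=3, n_clusters=2)
print(labels)
```

In this toy setup the first two tables concentrate their counts around base category 0 and the last two around category 2, so they fall into separate clusters even though no table is first converted to a shared schema; the local partitions are used directly inside the likelihood.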
Language: English
Title of host publication: Data Management. Data, Data Everywhere
Pages: 190-202
Volume: 4587
DOI: 10.1007/978-3-540-73390-4_22
Publication status: Published - 19 Aug 2007

Fingerprint

Data mining
Semantics
Data structures
Scalability
Experiments

Cite this

@incollection{df89ca2f674a43299d817b34fefdcefb,
title = "Knowledge Discovery from Semantically Heterogeneous Aggregate Databases Using Model-Based Clustering",
abstract = "When distributed databases are developed independently, they may be semantically heterogeneous with respect to data granularity, scheme information and the embedded semantics. However, most traditional distributed knowledge discovery (DKD) methods assume that the distributed databases derive from a single virtual global table, where they share the same semantics and data structures. This data heterogeneity and the underlying semantics bring a considerable challenge for DKD. In this paper, we propose a model-based clustering method for aggregate databases, where the heterogeneous schema structure is due to the heterogeneous classification schema. The underlying semantics can be captured by different clusters. The clustering is carried out via a mixture model, where each component of the mixture corresponds to a different virtual global table. An advantage of our approach is that the algorithm resolves the heterogeneity as part of the clustering process without previously having to homogenise the heterogeneous local schema to a shared schema. Evaluation of the algorithm is carried out using both real and synthetic data. Scalability of the algorithm is tested against the number of databases to be clustered; the number of clusters; and the size of the databases. The relationship between performance and complexity is also evaluated. Our experiments show that this approach has good potential for scalable integration of semantically heterogeneous databases.",
author = "Shuai Zhang and Sally McClean and Bryan Scotney",
year = "2007",
month = "8",
day = "19",
doi = "10.1007/978-3-540-73390-4_22",
language = "English",
isbn = "978-3-540-73389-8",
volume = "4587",
pages = "190--202",
booktitle = "Data Management. Data, Data Everywhere",
}

Knowledge Discovery from Semantically Heterogeneous Aggregate Databases Using Model-Based Clustering. / Zhang, Shuai; McClean, Sally; Scotney, Bryan.

Data Management. Data, Data Everywhere. Vol. 4587 2007. p. 190-202.


TY - CHAP

T1 - Knowledge Discovery from Semantically Heterogeneous Aggregate Databases Using Model-Based Clustering

AU - Zhang, Shuai

AU - McClean, Sally

AU - Scotney, Bryan

PY - 2007/8/19

Y1 - 2007/8/19

N2 - When distributed databases are developed independently, they may be semantically heterogeneous with respect to data granularity, scheme information and the embedded semantics. However, most traditional distributed knowledge discovery (DKD) methods assume that the distributed databases derive from a single virtual global table, where they share the same semantics and data structures. This data heterogeneity and the underlying semantics bring a considerable challenge for DKD. In this paper, we propose a model-based clustering method for aggregate databases, where the heterogeneous schema structure is due to the heterogeneous classification schema. The underlying semantics can be captured by different clusters. The clustering is carried out via a mixture model, where each component of the mixture corresponds to a different virtual global table. An advantage of our approach is that the algorithm resolves the heterogeneity as part of the clustering process without previously having to homogenise the heterogeneous local schema to a shared schema. Evaluation of the algorithm is carried out using both real and synthetic data. Scalability of the algorithm is tested against the number of databases to be clustered; the number of clusters; and the size of the databases. The relationship between performance and complexity is also evaluated. Our experiments show that this approach has good potential for scalable integration of semantically heterogeneous databases.

AB - When distributed databases are developed independently, they may be semantically heterogeneous with respect to data granularity, scheme information and the embedded semantics. However, most traditional distributed knowledge discovery (DKD) methods assume that the distributed databases derive from a single virtual global table, where they share the same semantics and data structures. This data heterogeneity and the underlying semantics bring a considerable challenge for DKD. In this paper, we propose a model-based clustering method for aggregate databases, where the heterogeneous schema structure is due to the heterogeneous classification schema. The underlying semantics can be captured by different clusters. The clustering is carried out via a mixture model, where each component of the mixture corresponds to a different virtual global table. An advantage of our approach is that the algorithm resolves the heterogeneity as part of the clustering process without previously having to homogenise the heterogeneous local schema to a shared schema. Evaluation of the algorithm is carried out using both real and synthetic data. Scalability of the algorithm is tested against the number of databases to be clustered; the number of clusters; and the size of the databases. The relationship between performance and complexity is also evaluated. Our experiments show that this approach has good potential for scalable integration of semantically heterogeneous databases.

U2 - 10.1007/978-3-540-73390-4_22

DO - 10.1007/978-3-540-73390-4_22

M3 - Chapter

SN - 978-3-540-73389-8

VL - 4587

SP - 190

EP - 202

BT - Data Management. Data, Data Everywhere

ER -