Conceptual Clustering of Heterogeneous Gene Expression Sequences

SI McClean, BW Scotney, S Robinson

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

We are concerned with clustering and characterising gene expression sequences that have been classified according to heterogeneous classification schemes. We adopt a model-based approach that uses a Hidden Markov Model (HMM) that has as states the stages of the underlying process that generates the gene sequences, thus allowing us to handle complex and heterogeneous data. Each cluster is described in terms of an HMM where we seek to find schema mappings between the states of the original sequences and the states of the HMM.

The general solution that we propose involves several distinct tasks. Firstly, there is a clustering problem where we seek to group similar sequences; for this we use mutual entropy to identify associations between sequence states. Secondly, because we are concerned with clustering heterogeneous sequences, we must determine the mappings between the states of each sequence in a cluster and the states of an underlying hidden process; for this we compute the most probable mapping. Thirdly, using these mappings we employ maximum likelihood techniques to learn the probabilistic description of the hidden Markov process for each cluster. Fourthly, we use these descriptions to characterise the clusters using Dynamic Programming to determine the most probable pathway for each cluster. Finally, we derive linguistic labels to describe the clusters in a user-friendly manner. Such an approach provides an intuitive way of describing the underlying shape of the process by explicitly modelling the temporal aspects of the data. Non-time-homogeneous HMMs are used to capture the full temporal semantics.
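As an illustration only, the third and fourth tasks above (maximum likelihood learning and a dynamic-programming search for the most probable pathway) might be sketched as follows. This is not the authors' implementation. It assumes the sequences in a cluster have already been mapped onto a common set of hidden states and all have the same length, so the model reduces to a first-order non-time-homogeneous Markov chain. The function names `fit_chain` and `most_probable_pathway`, the state labels, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter


def fit_chain(sequences, states):
    """Maximum likelihood estimates for a non-time-homogeneous Markov chain.

    Assumption: every sequence is already expressed in the common state
    space `states` and all sequences have equal length.
    """
    n = len(sequences)
    length = len(sequences[0])
    init_counts = Counter(seq[0] for seq in sequences)
    initial = {s: init_counts[s] / n for s in states}

    transitions = []  # one transition table per time step t -> t + 1
    for t in range(length - 1):
        pair_counts = Counter((seq[t], seq[t + 1]) for seq in sequences)
        from_counts = Counter(seq[t] for seq in sequences)
        step = {}
        for a in states:
            for b in states:
                # Relative frequency; 0.0 when state `a` is never seen at time t.
                step[(a, b)] = (
                    pair_counts[(a, b)] / from_counts[a] if from_counts[a] else 0.0
                )
        transitions.append(step)
    return initial, transitions


def most_probable_pathway(initial, transitions, states):
    """Dynamic programming (max-product) search for the likeliest state path."""
    # best[s] = (log-probability, path) of the best path that ends in state s
    best = {s: (math.log(initial[s]), [s]) for s in states if initial[s] > 0}
    for step in transitions:
        new_best = {}
        for b in states:
            candidates = [
                (lp + math.log(step[(a, b)]), path + [b])
                for a, (lp, path) in best.items()
                if step[(a, b)] > 0
            ]
            if candidates:
                new_best[b] = max(candidates, key=lambda c: c[0])
        best = new_best
    log_prob, path = max(best.values(), key=lambda c: c[0])
    return path, math.exp(log_prob)


# Hypothetical toy cluster of four sequences over three expression levels.
STATES = ["low", "mid", "high"]
CLUSTER = [
    ["low", "mid", "high"],
    ["low", "mid", "high"],
    ["low", "low", "mid"],
    ["mid", "high", "high"],
]

if __name__ == "__main__":
    initial, transitions = fit_chain(CLUSTER, STATES)
    path, prob = most_probable_pathway(initial, transitions, STATES)
    print(path, round(prob, 3))
```

Keeping a separate transition table for each time step is what makes the chain non-time-homogeneous in this sketch; a time-homogeneous variant would pool the counts across all steps into a single table.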
Language: English
Pages: 53-73
Journal: Artificial Intelligence Review (Special Issue on Life Science and AI)
Volume: 20
Issue number: 1-2
DOI: 10.1023/A:1026036631075
Publication status: Published - 1 Oct 2003

Fingerprint

Gene expression
Hidden Markov models
Entropy
Dynamic programming
Linguistics
Markov processes
Maximum likelihood
Labels
Genes
Semantics
Group

Cite this

@article{3763f80bd4d240c1a505121a4252db8b,
title = "Conceptual Clustering of Heterogeneous Gene Expression Sequences",
abstract = "We are concerned with clustering and characterising gene expression sequences that have been classified according to heterogeneous classification schemes. We adopt a model-based approach that uses a Hidden Markov Model (HMM) that has as states the stages of the underlying process that generates the gene sequences, thus allowing us to handle complex and heterogeneous data. Each cluster is described in terms of an HMM where we seek to find schema mappings between the states of the original sequences and the states of the HMM. The general solution that we propose involves several distinct tasks. Firstly, there is a clustering problem where we seek to group similar sequences; for this we use mutual entropy to identify associations between sequence states. Secondly, because we are concerned with clustering heterogeneous sequences, we must determine the mappings between the states of each sequence in a cluster and the states of an underlying hidden process; for this we compute the most probable mapping. Thirdly, using these mappings we employ maximum likelihood techniques to learn the probabilistic description of the hidden Markov process for each cluster. Fourthly, we use these descriptions to characterise the clusters using Dynamic Programming to determine the most probable pathway for each cluster. Finally, we derive linguistic labels to describe the clusters in a user-friendly manner. Such an approach provides an intuitive way of describing the underlying shape of the process by explicitly modelling the temporal aspects of the data. Non-time-homogeneous HMMs are used to capture the full temporal semantics.",
author = "SI McClean and BW Scotney and S Robinson",
note = "This paper describes a method of clustering and characterising gene expression sequences that have been classified according to heterogeneous classification schemes. We adopt a model-based approach using a Hidden Markov Model whose states are the stages of the underlying process, thus allowing us to handle complex and heterogeneous sequences. The approach was developed as part of the EU-IST MISSION project. It is published in a special issue of the AI Review that includes data mining methods for bioinformatics. The concepts are being used and further developed in the EPSRC and HPSSNI-funded RIGHT project, which is developing model-based clustering algorithms for patient pathway sequences.",
year = "2003",
month = "10",
day = "1",
doi = "10.1023/A:1026036631075",
language = "English",
volume = "20",
pages = "53--73",
journal = "Artificial Intelligence Review",
issn = "0269-2821",
number = "1--2",
}

Conceptual Clustering of Heterogeneous Gene Expression Sequences. / McClean, SI; Scotney, BW; Robinson, S.

In: Artificial Intelligence Review (Special Issue on Life Science and AI), Vol. 20, No. 1-2, 01.10.2003, p. 53-73.

TY - JOUR

T1 - Conceptual Clustering of Heterogeneous Gene Expression Sequences

AU - McClean, SI

AU - Scotney, BW

AU - Robinson, S

N1 - This paper describes a method of clustering and characterising gene expression sequences that have been classified according to heterogeneous classification schemes. We adopt a model-based approach using a Hidden Markov Model whose states are the stages of the underlying process, thus allowing us to handle complex and heterogeneous sequences. The approach was developed as part of the EU-IST MISSION project. It is published in a special issue of the AI Review that includes data mining methods for bioinformatics. The concepts are being used and further developed in the EPSRC and HPSSNI-funded RIGHT project, which is developing model-based clustering algorithms for patient pathway sequences.

PY - 2003/10/1

Y1 - 2003/10/1

N2 - We are concerned with clustering and characterising gene expression sequences that have been classified according to heterogeneous classification schemes. We adopt a model-based approach that uses a Hidden Markov Model (HMM) that has as states the stages of the underlying process that generates the gene sequences, thus allowing us to handle complex and heterogeneous data. Each cluster is described in terms of an HMM where we seek to find schema mappings between the states of the original sequences and the states of the HMM. The general solution that we propose involves several distinct tasks. Firstly, there is a clustering problem where we seek to group similar sequences; for this we use mutual entropy to identify associations between sequence states. Secondly, because we are concerned with clustering heterogeneous sequences, we must determine the mappings between the states of each sequence in a cluster and the states of an underlying hidden process; for this we compute the most probable mapping. Thirdly, using these mappings we employ maximum likelihood techniques to learn the probabilistic description of the hidden Markov process for each cluster. Fourthly, we use these descriptions to characterise the clusters using Dynamic Programming to determine the most probable pathway for each cluster. Finally, we derive linguistic labels to describe the clusters in a user-friendly manner. Such an approach provides an intuitive way of describing the underlying shape of the process by explicitly modelling the temporal aspects of the data. Non-time-homogeneous HMMs are used to capture the full temporal semantics.

AB - We are concerned with clustering and characterising gene expression sequences that have been classified according to heterogeneous classification schemes. We adopt a model-based approach that uses a Hidden Markov Model (HMM) that has as states the stages of the underlying process that generates the gene sequences, thus allowing us to handle complex and heterogeneous data. Each cluster is described in terms of an HMM where we seek to find schema mappings between the states of the original sequences and the states of the HMM. The general solution that we propose involves several distinct tasks. Firstly, there is a clustering problem where we seek to group similar sequences; for this we use mutual entropy to identify associations between sequence states. Secondly, because we are concerned with clustering heterogeneous sequences, we must determine the mappings between the states of each sequence in a cluster and the states of an underlying hidden process; for this we compute the most probable mapping. Thirdly, using these mappings we employ maximum likelihood techniques to learn the probabilistic description of the hidden Markov process for each cluster. Fourthly, we use these descriptions to characterise the clusters using Dynamic Programming to determine the most probable pathway for each cluster. Finally, we derive linguistic labels to describe the clusters in a user-friendly manner. Such an approach provides an intuitive way of describing the underlying shape of the process by explicitly modelling the temporal aspects of the data. Non-time-homogeneous HMMs are used to capture the full temporal semantics.

U2 - 10.1023/A:1026036631075

DO - 10.1023/A:1026036631075

M3 - Article

VL - 20

SP - 53

EP - 73

JO - Artificial Intelligence Review

T2 - Artificial Intelligence Review

JF - Artificial Intelligence Review

SN - 0269-2821

IS - 1-2

ER -