Prediction of the bonding state of cysteine residues in proteins with machine-learning methods

Castrense Savojardo, Piero Fariselli, Pier Luigi Martelli, Priyank Shukla, Rita Casadio

    Research output: Chapter in Book/Report/Conference proceeding > Chapter

    4 Citations (Scopus)

    Abstract

    In this paper we evaluate the performance of machine-learning methods in predicting the bonding state of cysteines from protein sequences. This task is the first step in identifying disulfide bonds in proteins. We score the performance of three approaches: 1) Hidden Support Vector Machines (HSVMs), which integrate SVM predictions with a Hidden Markov Model; 2) SVM-HMMs, which discriminatively train models that are isomorphic to a kth-order hidden Markov model; and 3) Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs), which we recently introduced. We evaluate two encoding schemes for evolutionary information, the sequence profile and the position-specific scoring matrix (PSSM), as computed with the PSI-BLAST program, and we show that all the methods perform better with the PSSM encoding than with the sequence profile. Among the methods, GRHCRFs perform slightly better than the others, achieving a per-protein accuracy of 87% with a Matthews correlation coefficient (C) of 0.73. Finally, we investigate the difference between disulfide bonding state predictions in Eukaryotes and Prokaryotes. Our analysis shows that the per-protein accuracy is higher for Prokaryotic proteins than for Eukaryotic ones (0.88 vs 0.83). However, given the paucity of bonded cysteines in Prokaryotes compared with Eukaryotes, the Matthews correlation coefficient is drastically lower (0.48 vs 0.80).
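
The contrast between per-protein accuracy and the Matthews correlation coefficient reported in the abstract can be illustrated with a minimal sketch. The assumptions here are not taken from the chapter itself: per-protein accuracy is taken to mean the fraction of proteins in which every cysteine is predicted correctly, the function names `per_protein_accuracy` and `matthews_cc` are hypothetical, and the two toy data sets are invented purely for illustration (they do not reproduce the published figures).

```python
from math import sqrt

# Each protein is a pair (true_states, predicted_states),
# one entry per cysteine: 1 = bonded, 0 = free.

def per_protein_accuracy(proteins):
    """Fraction of proteins whose cysteines are ALL predicted correctly
    (assumed definition, see the lead-in)."""
    correct = sum(1 for true, pred in proteins if true == pred)
    return correct / len(proteins)

def matthews_cc(proteins):
    """Matthews correlation coefficient over all cysteines pooled together."""
    tp = tn = fp = fn = 0
    for true, pred in proteins:
        for t, p in zip(true, pred):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 0:
                tn += 1
            elif t == 0 and p == 1:
                fp += 1
            else:
                fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Invented toy set with many bonded cysteines.
bonded_rich = [
    ([1, 1, 1, 1], [1, 1, 1, 1]),
    ([0, 0], [0, 0]),
    ([1, 1, 0], [1, 1, 0]),
    ([1, 1], [1, 0]),          # one bonded cysteine missed
    ([0, 0, 0], [0, 0, 0]),
]

# Invented toy set with few bonded cysteines.
bonded_poor = [([0, 0], [0, 0])] * 8 + [
    ([1, 1], [0, 0]),          # both bonded cysteines missed
    ([0, 0, 1, 1], [0, 0, 1, 1]),
]

for name, data in [("bonded-rich", bonded_rich), ("bonded-poor", bonded_poor)]:
    print(f"{name}: accuracy={per_protein_accuracy(data):.2f}, "
          f"MCC={matthews_cc(data):.2f}")
```

On these toy sets the bonded-poor collection scores the higher per-protein accuracy (0.90 vs 0.80) but the lower correlation coefficient (0.67 vs 0.87): when bonded cysteines are rare, a handful of missed ones barely affects the accuracy yet weighs heavily on the coefficient.
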
    Language: English
    Title of host publication: Computational Intelligence Methods for Bioinformatics and Biostatistics
    Editors: Riccardo Rizzo, Paulo J. G. Lisboa
    Place of Publication: Berlin
    Pages: 98-111
    Volume: 6685
    DOI: https://doi.org/10.1007/978-3-642-21946-7
    Publication status: Published - Apr 2011

    Fingerprint

    Learning systems
    Proteins
    Hidden Markov models
    Support vector machines

    Keywords

    • Hidden Markov Models (HMM)
    • Support Vector Machines (SVM)
    • Conditional Random Fields (CRF)
    • Cysteine Bonding State
    • Protein Structure Prediction

    Cite this

    Savojardo, C., Fariselli, P., Martelli, P. L., Shukla, P., & Casadio, R. (2011). Prediction of the bonding state of cysteine residues in proteins with machine-learning methods. In R. Rizzo, & P. J. G. Lisboa (Eds.), Computational Intelligence Methods for Bioinformatics and Biostatistics (Vol. 6685, pp. 98-111). Berlin. https://doi.org/10.1007/978-3-642-21946-7
    @inbook{eedbcc5556c340f8b8968f1678803fbc,
    title = "Prediction of the bonding state of cysteine residues in proteins with machine-learning methods",
    keywords = "Hidden Markov Models (HMM), Support Vector Machines (SVM), Conditional Random Fields (CRF), Cysteine Bonding State, Protein Structure Prediction.",
    author = "Castrense Savojardo and Piero Fariselli and Martelli, {Pier Luigi} and Priyank Shukla and Rita Casadio",
    year = "2011",
    month = "4",
    doi = "10.1007/978-3-642-21946-7",
    language = "English",
    isbn = "978-3-642-21945-0",
    volume = "6685",
    pages = "98--111",
    editor = "Riccardo Rizzo and Lisboa, {Paulo J. G.}",
    booktitle = "Computational Intelligence Methods for Bioinformatics and Biostatistics",

    }


