Sentence similarity based on semantic nets and corpus statistics

Yuhua Li, David McLean, Zuhair Bandar, James D. O’Shea, Keeley Crockett

    Research output: Contribution to journal › Article

    512 Citations (Scopus)

    Abstract

    Sentence similarity measures play an increasingly important role in text-related research and applications in areas such as text mining, Web page retrieval, and dialogue systems. Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains. This paper focuses directly on computing the similarity between very short texts of sentence length. It presents an algorithm that takes account of semantic information and word order information implied in the sentences. The semantic similarity of two sentences is calculated using information from a structured lexical database and from corpus statistics. The use of a lexical database enables our method to model human common sense knowledge and the incorporation of corpus statistics allows our method to be adaptable to different domains. The proposed method can be used in a variety of applications that involve text knowledge representation and discovery. Experiments on two sets of selected sentence pairs demonstrate that the proposed method provides a similarity measure that shows a significant correlation to human intuition.
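    The abstract describes a similarity score built from two parts: a semantic term computed over the joint word set of the two sentences, and a word order term. As a rough illustration only, the sketch below uses exact word matching in place of the paper's lexical-database word similarity and corpus-derived weighting, and combines the two terms with a weight `delta` favouring semantics; the vector construction, function names, and the value `delta = 0.85` are illustrative assumptions, not the paper's exact method.

```python
# Toy sketch of a combined semantic / word-order sentence similarity.
# Exact word matching stands in for the paper's lexical-database word
# similarity weighted by corpus statistics (an assumption for brevity).
import math

def semantic_vector(words, joint):
    # 1 if the joint word occurs in this sentence, else 0.
    # (The paper instead uses the best word-to-word similarity from a
    # structured lexical database, scaled by information content.)
    return [1.0 if w in words else 0.0 for w in joint]

def word_order_vector(words, joint):
    # 1-based position of each joint word in the sentence, 0 if absent.
    return [words.index(w) + 1 if w in words else 0 for w in joint]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_similarity(s1, s2, delta=0.85):
    w1, w2 = s1.lower().split(), s2.lower().split()
    joint = sorted(set(w1) | set(w2))
    # Semantic term: cosine of the two semantic vectors.
    ss = cosine(semantic_vector(w1, joint), semantic_vector(w2, joint))
    # Word order term: 1 - ||r1 - r2|| / ||r1 + r2||.
    r1 = word_order_vector(w1, joint)
    r2 = word_order_vector(w2, joint)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    summ = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    sr = 1.0 - diff / summ if summ else 1.0
    return delta * ss + (1 - delta) * sr
```

Identical sentences score 1.0 and sentences with no shared words score near 0; replacing the exact-match test with a graded word similarity is what allows the full method to credit paraphrases that share no surface vocabulary.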
    Language: English
    Pages: 1138-1150
    Journal: IEEE Transactions on Knowledge and Data Engineering
    ISSN: 1041-4347
    URL: http://computer.org/tkde/
    Volume: 18
    Issue number: 8
    Publication status: Published - Dec 2006

    Fingerprint

    Semantics
    Statistics
    Knowledge representation
    Data mining
    Websites
    Experiments

    Cite this

    Li, Y., McLean, D., Bandar, Z., O’Shea, J. D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138-1150.
