Query Representation through Lexical Association for Information Retrieval

Pawan Goyal, Laxmidhar Behera, Martin McGinnity

    Research output: Contribution to journal › Article › peer-review

    5 Citations (Scopus)

    Abstract

    A user query for information retrieval (IR) applications may not contain the most appropriate terms (words) as actually intended by the user. This is usually referred to as the term mismatch problem and is a crucial research issue in IR. Using the notion of relevance, we provide a comprehensive theoretical analysis of a parametric query vector, which is assumed to represent the information needs of the user. A lexical association function has been derived analytically using the system relevance criteria. The derivation is further justified using empirical evidence from the user relevance criteria. The analytical derivation presented in this paper provides a proper mathematical framework for query expansion techniques, which have largely been heuristic in the existing literature. By using the generalized retrieval framework, the proposed query representation model is equally applicable to the vector space model (VSM), Okapi best matching 25 (Okapi BM25), and the language model (LM). Experiments over various datasets from TREC show that the proposed query representation gives statistically significant improvements over the baseline Okapi BM25 and LM as well as other well-known global query expansion techniques. Empirical results along with the theoretical foundations of the query representation confirm that the proposed model extends the state-of-the-art in global query expansion.
    Original language: English
    Pages (from-to): 2260-2273
    Journal: IEEE Transactions on Knowledge and Data Engineering
    Volume: 24
    Issue number: 12
    Publication status: Published (in print/issue) - Dec 2012

    Keywords

    • Information Retrieval
    • Lexical Association
    • Query Expansion
    • Language Model
