Abstract
A user query in an information retrieval (IR) application may not contain the terms (words) that best express what the user actually intends. This is usually referred to as the term mismatch problem and is a crucial research issue in IR. Using the notion of relevance, we provide a comprehensive theoretical analysis of a parametric query vector, which is assumed to represent the information needs of the user. A lexical association function is derived analytically using the system relevance criteria. The derivation is further justified using empirical evidence from the user relevance criteria. The analytical derivation presented in this paper provides a proper mathematical framework for query expansion techniques, which have largely been heuristic in the existing literature. By using the generalized retrieval framework, the proposed query representation model is equally applicable to the vector space model (VSM), Okapi best matching 25 (Okapi BM25) and the Language Model (LM). Experiments over various datasets from TREC show that the proposed query representation gives statistically significant improvements over the baseline Okapi BM25 and LM, as well as over other well-known global query expansion techniques. Empirical results, along with the theoretical foundations of the query representation, confirm that the proposed model extends the state of the art in global query expansion.
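As a rough illustration of what global query expansion with a lexical association function looks like in practice, the sketch below scores term pairs by document-level co-occurrence and appends the most associated terms to a query. It is not the association function derived in the paper: the cosine-style score, the function names `cooccurrence_association` and `expand_query`, the parameter `k`, and the toy documents are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_association(docs):
    """Map each ordered term pair to a cosine-style association score.

    Assumed score (not the paper's): n_ab / sqrt(df_a * df_b), where n_ab is
    the number of documents containing both terms and df_x is the number of
    documents containing term x.
    """
    df = Counter()  # document frequency of each term
    co = Counter()  # number of documents containing both terms of a pair
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        df.update(terms)
        co.update(combinations(terms, 2))
    assoc = {}
    for (a, b), n_ab in co.items():
        score = n_ab / math.sqrt(df[a] * df[b])
        assoc[(a, b)] = score  # store both directions so lookups are symmetric
        assoc[(b, a)] = score
    return assoc


def expand_query(query, assoc, k=2):
    """Return the query terms followed by the k most associated new terms."""
    q_terms = query.lower().split()
    scores = Counter()
    for (a, b), s in assoc.items():
        if a in q_terms and b not in q_terms:
            scores[b] += s
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return q_terms + [term for term, _ in ranked[:k]]


# Hypothetical toy collection, used only to exercise the functions.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "fruit market",
]
assoc = cooccurrence_association(docs)
expanded = expand_query("car", assoc, k=2)
print(expanded)
```

In this toy collection the query "car" picks up terms that co-occur with it across the whole collection, which is the sense in which the expansion is global rather than based on feedback documents retrieved for the query. The expanded term list could then be passed to any ranking function such as BM25 or a language model.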
Original language | English |
---|---|
Pages (from-to) | 2260-2273 |
Journal | IEEE Transactions on Knowledge and Data Engineering |
Volume | 24 |
Issue number | 12 |
DOIs | |
Publication status | Published (in print/issue) - Dec 2012 |
Keywords
- Information Retrieval
- Lexical Association
- Query Expansion
- Language Model