Graph-based term weighting for information retrieval

被引:119
作者
Blanco, Roi [2 ]
Lioma, Christina [1 ]
机构
[1] Univ Stuttgart, Dept Comp Sci, Stuttgart, Germany
[2] Univ A Coruna, Dept Comp Sci, La Coruna, Spain
来源
INFORMATION RETRIEVAL | 2012年 / 15卷 / 01期
关键词
Information retrieval; Graph theory; Natural language processing; SMALL-WORLD; COMMUNITY STRUCTURE; NETWORK; EVOLUTION; CONNECTIVITY; ORGANIZATION; CENTRALITY;
D O I
10.1007/s10791-011-9172-x
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
A standard approach to Information Retrieval (IR) is to model text as a bag of words. Alternatively, text can be modelled as a graph, whose vertices represent words, and whose edges represent relations between the words, defined on the basis of any meaningful statistical or linguistic relation. Given such a text graph, graph theoretic computations can be applied to measure various properties of the graph, and hence of the text. This work explores the usefulness of such graph-based text representations for IR. Specifically, we propose a principled graph-theoretic approach of (1) computing term weights and (2) integrating discourse aspects into retrieval. Given a text graph, whose vertices denote terms linked by co-occurrence and grammatical modification, we use graph ranking computations (e.g. PageRank Page et al. in The pagerank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998) to derive weights for each vertex, i.e. term weights, which we use to rank documents against queries. We reason that our graph-based term weights do not necessarily need to be normalised by document length (unlike existing term weights) because they are already scaled by their graph-ranking computation. This is a departure from existing IR ranking functions, and we experimentally show that it performs comparably to a tuned ranking baseline, such as BM25 (Robertson et al. in NIST Special Publication 500-236: TREC-4, 1995). In addition, we integrate into ranking graph properties, such as the average path length, or clustering coefficient, which represent different aspects of the topology of the graph, and by extension of the document represented as a graph. Integrating such properties into ranking allows us to consider issues such as discourse coherence, flow and density during retrieval. We experimentally show that this type of ranking performs comparably to BM25, and can even outperform it, across different TREC (Voorhees and Harman in TREC: Experiment and evaluation in information retrieval, MIT Press, 2005) datasets and evaluation measures.
引用
收藏
页码:54 / 92
页数:39
相关论文
共 143 条
[1]   Scale-free networks in cell biology [J].
Albert, R .
JOURNAL OF CELL SCIENCE, 2005, 118 (21) :4947-4957
[2]   Statistical mechanics of complex networks [J].
Albert, R ;
Barabási, AL .
REVIEWS OF MODERN PHYSICS, 2002, 74 (01) :47-97
[3]  
ALBERT R, 1999, CONDMAT9907038 CORR
[4]  
Allan J., 2009, P 32 ANN INT ACM SIG
[5]  
Anh V. N., 2005, SIGIR 2005. Proceedings of the Twenty-Eighth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P226, DOI 10.1145/1076034.1076075
[6]  
[Anonymous], 2001, CONDMAT0106096 CORR
[7]  
[Anonymous], 1965, The Structure of Association in Language and Thought
[8]  
[Anonymous], 2009, Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics, DOI DOI 10.3115/1609067.1609070
[9]  
[Anonymous], 2001, IEEE Data Eng. Bull.
[10]  
[Anonymous], 1998, SIGIR 98 P 21 ANN IN, DOI DOI 10.1145/290941.291008