Graph-based term weighting for information retrieval

被引:119
作者
Blanco, Roi [2 ]
Lioma, Christina [1 ]
机构
[1] Univ Stuttgart, Dept Comp Sci, Stuttgart, Germany
[2] Univ A Coruna, Dept Comp Sci, La Coruna, Spain
来源
INFORMATION RETRIEVAL | 2012年 / 15卷 / 01期
关键词
Information retrieval; Graph theory; Natural language processing; SMALL-WORLD; COMMUNITY STRUCTURE; NETWORK; EVOLUTION; CONNECTIVITY; ORGANIZATION; CENTRALITY;
D O I
10.1007/s10791-011-9172-x
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
A standard approach to Information Retrieval (IR) is to model text as a bag of words. Alternatively, text can be modelled as a graph, whose vertices represent words, and whose edges represent relations between the words, defined on the basis of any meaningful statistical or linguistic relation. Given such a text graph, graph theoretic computations can be applied to measure various properties of the graph, and hence of the text. This work explores the usefulness of such graph-based text representations for IR. Specifically, we propose a principled graph-theoretic approach of (1) computing term weights and (2) integrating discourse aspects into retrieval. Given a text graph, whose vertices denote terms linked by co-occurrence and grammatical modification, we use graph ranking computations (e.g. PageRank Page et al. in The pagerank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998) to derive weights for each vertex, i.e. term weights, which we use to rank documents against queries. We reason that our graph-based term weights do not necessarily need to be normalised by document length (unlike existing term weights) because they are already scaled by their graph-ranking computation. This is a departure from existing IR ranking functions, and we experimentally show that it performs comparably to a tuned ranking baseline, such as BM25 (Robertson et al. in NIST Special Publication 500-236: TREC-4, 1995). In addition, we integrate into ranking graph properties, such as the average path length, or clustering coefficient, which represent different aspects of the topology of the graph, and by extension of the document represented as a graph. Integrating such properties into ranking allows us to consider issues such as discourse coherence, flow and density during retrieval. We experimentally show that this type of ranking performs comparably to BM25, and can even outperform it, across different TREC (Voorhees and Harman in TREC: Experiment and evaluation in information retrieval, MIT Press, 2005) datasets and evaluation measures.
引用
收藏
页码:54 / 92
页数:39
相关论文
共 143 条
[51]  
Gamon M., 2006, Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, P17
[52]   Mapping the forms of meaning in small worlds [J].
Gaume, Bruno .
INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2008, 23 (07) :848-862
[53]   Community structure in social and biological networks [J].
Girvan, M ;
Newman, MEJ .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2002, 99 (12) :7821-7826
[54]  
GOLDBERG A.B., 2006, P HLT NAACL WORKSHOP, P45
[55]   The worldwide air transportation network:: Anomalous centrality, community structure, and cities' global roles [J].
Guimerá, R ;
Mossa, S ;
Turtschi, A ;
Amaral, LAN .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2005, 102 (22) :7794-7799
[56]  
Halliday Matthew A. K., 1985, LANGUAGE CONTEXT TEX
[57]  
HARMAN D, 1991, J AM SOC INFORM SCI, V42, P7, DOI 10.1002/(SICI)1097-4571(199101)42:1<7::AID-ASI2>3.0.CO
[58]  
2-P
[59]  
Hassan S., 2006, P TEXTGRAPS 2 WORKSH, P53
[60]  
HO ND, 2004, P RES INF VIETN FRAN, P193