Latent semantic indexing: A probabilistic analysis

被引:218
作者
Papadimitriou, CH [1 ]
Raghavan, P
Tamaki, H
Vempala, S
机构
[1] Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720 USA
[2] IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA
[3] Meiji Univ, Dept Comp Sci, Ikuta, Japan
[4] MIT, Dept Math, Cambridge, MA 02139 USA
关键词
D O I
10.1006/jcss.2000.1711
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering. (C) 2000 Academic Press.
引用
收藏
页码:217 / 235
页数:19
相关论文
共 31 条
[1]  
[Anonymous], 1984, CONTEMP MATH
[2]   Using linear algebra for intelligent information retrieval [J].
Berry, MW ;
Dumais, ST ;
OBrien, GW .
SIAM REVIEW, 1995, 37 (04) :573-595
[3]  
BERRY MW, 1993, SVDPACKC VERSION 1 0
[4]  
BREWER E, 1997, 1997 PODS SIGMOD
[5]  
CHAKRABARTI S, 1997, MINING INFORMATION N
[6]  
CVETKOVIC DM, 1979, SPECTRA GRAPHS
[7]  
DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO
[8]  
2-9
[9]  
DRINEAS P, 1999, IN PRESS P ACM SIAM
[10]   IMPROVING THE RETRIEVAL OF INFORMATION FROM EXTERNAL SOURCES [J].
DUMAIS, ST .
BEHAVIOR RESEARCH METHODS INSTRUMENTS & COMPUTERS, 1991, 23 (02) :229-236