Finding scientific topics

被引:3620
作者
Griffiths, TL
Steyvers, M
机构
[1] Univ Calif Irvine, Dept Cognit Sci, Irvine, CA 92697 USA
[2] MIT, Dept Brain & Cognit Sci, Cambridge, MA 02139 USA
[3] Stanford Univ, Dept Psychol, Stanford, CA 94305 USA
关键词
D O I
10.1073/pnas.0307752101
中图分类号
O [数理科学和化学]; P [天文学、地球科学]; Q [生物科学]; N [自然科学总论];
学科分类号
07 ; 0710 ; 09 ;
摘要
A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
引用
收藏
页码:5228 / 5235
页数:8
相关论文
共 18 条
[1]   Combined models for topic spotting and topic-dependent Language Modeling [J].
Bigi, B ;
De Mori, R ;
El-Beze, M ;
Spriet, T .
1997 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, PROCEEDINGS, 1997, :535-542
[2]   Latent Dirichlet allocation [J].
Blei, DM ;
Ng, AY ;
Jordan, MI .
JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) :993-1022
[3]  
Cohn D, 2001, ADV NEUR IN, V13, P430
[4]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38
[5]  
EROSHEVA EA, 2003, BAYESIAN STAT, V7
[7]   STOCHASTIC RELAXATION, GIBBS DISTRIBUTIONS, AND THE BAYESIAN RESTORATION OF IMAGES [J].
GEMAN, S ;
GEMAN, D .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1984, 6 (06) :721-741
[8]   Unsupervised learning by probabilistic latent semantic analysis [J].
Hofmann, T .
MACHINE LEARNING, 2001, 42 (1-2) :177-196
[9]  
IYER R, 1996, P INT C SPOK LANG PR, V1, P236
[10]   BAYES FACTORS [J].
KASS, RE ;
RAFTERY, AE .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1995, 90 (430) :773-795