Finding scientific topics

Cited by: 3620

Authors:

Griffiths, TL

Steyvers, M

Affiliations:

[1] Univ Calif Irvine, Dept Cognit Sci, Irvine, CA 92697 USA

[2] MIT, Dept Brain & Cognit Sci, Cambridge, MA 02139 USA

[3] Stanford Univ, Dept Psychol, Stanford, CA 94305 USA

Source:

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA | 2004 / Vol. 101

DOI:

10.1073/pnas.0307752101

Chinese Library Classification (CLC) codes:

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
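The generative process summarized in the abstract (choose a distribution over topics for each document, then draw each word from a topic selected according to that distribution) can be sketched roughly as follows. This is only a minimal illustration, not the authors' code: the function name `generate_corpus`, the toy corpus sizes, and the Dirichlet hyperparameter values `alpha` and `beta` are all assumptions made for the example, and the sketch covers only the forward sampling of documents, not the Markov chain Monte Carlo inference step.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=1.0, beta=0.5, seed=0):
    """Sample a toy corpus from a topic model of the kind the abstract describes.

    alpha and beta are assumed symmetric Dirichlet hyperparameters chosen
    for illustration; they are not values taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # One distribution over the vocabulary per topic (each row sums to 1).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    docs, thetas = [], []
    for _ in range(n_docs):
        # Choose this document's distribution over topics.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Select a topic for every word position according to theta.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        # Draw each word (as a vocabulary index) from its selected topic.
        words = np.array([rng.choice(vocab_size, p=phi[t]) for t in z])
        docs.append(words)
        thetas.append(theta)

    return docs, np.array(thetas), phi

if __name__ == "__main__":
    docs, thetas, phi = generate_corpus(n_docs=3, doc_len=10,
                                        n_topics=2, vocab_size=8)
    for d, theta in zip(docs, thetas):
        print(np.round(theta, 2), d.tolist())
```

Inference runs in the opposite direction: given only the observed words, it estimates the per-document topic distributions (`thetas` here) and the per-topic word distributions (`phi` here).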

Pages: 5228-5235

Page count: 8

References (18 in total):

[1] Combined models for topic spotting and topic-dependent Language Modeling [J].