Interesting-Phrase Mining for Ad-Hoc Text Analytics

Cited by: 13
Authors
Bedathur, Srikanta [1]
Berberich, Klaus [1]
Dittrich, Jens [2]
Mamoulis, Nikos [1]
Weikum, Gerhard [1]
Affiliations
[1] Max Planck Inst Informat, Saarbrucken, Germany
[2] Saarland Univ, Saarbrucken, Germany
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2010 / Vol. 3 / Issue 01
DOI
10.14778/1920841.1921007
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer a great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize the analysis of interesting phrases. These include named entities, important quotations, market slogans, and other multi-word phrases that are prominent in a dynamically derived ad-hoc subset of the corpus, e.g., being frequent in the subset but relatively infrequent in the overall corpus. We develop preprocessing and indexing methods for phrases, paired with new search techniques for the top-k most interesting phrases in ad-hoc subsets of the corpus. Our framework is evaluated using a large-scale real-world corpus of New York Times news articles.
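The abstract's notion of an interesting phrase (frequent in an ad-hoc subset but relatively infrequent in the overall corpus) can be illustrated with a small brute-force sketch. The scoring ratio (subset document frequency divided by corpus document frequency), the function names, and the toy corpus are all assumptions made for illustration; the sketch does not reproduce the paper's indexing or search techniques.

```python
from collections import Counter
import heapq


def ngrams(tokens, max_len):
    """Yield every word n-gram of length 2..max_len as a tuple."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def phrase_doc_freq(docs, max_len=3):
    """Count, for each phrase, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() so that a phrase counts once per document
        df.update(set(ngrams(doc.lower().split(), max_len)))
    return df


def top_k_interesting(corpus, subset_ids, k=3, min_subset_df=2, max_len=3):
    """Rank phrases by subset frequency relative to corpus frequency.

    Assumed score: subset_df / corpus_df (hypothetical, for illustration).
    """
    corpus_df = phrase_doc_freq(corpus, max_len)
    subset_df = phrase_doc_freq([corpus[i] for i in subset_ids], max_len)
    scored = (
        (subset_df[p] / corpus_df[p], subset_df[p], p)
        for p in subset_df
        if subset_df[p] >= min_subset_df  # ignore phrases rare in the subset
    )
    best = heapq.nlargest(k, scored)
    return [(" ".join(p), score) for score, _, p in best]


# Toy corpus; documents 0 and 1 form the ad-hoc subset.
corpus = [
    "yes we can win this election",
    "voters chant yes we can tonight",
    "we can see the results of the election",
    "the results of the match",
    "we can go home",
]
result = top_k_interesting(corpus, subset_ids=[0, 1], k=3)
```

In this toy run, "yes we can" occurs only in the subset and so scores higher than "we can", which is spread across the whole corpus. A real system would avoid recounting phrases per query, which is the role of the preprocessing and indexing methods the abstract mentions.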
Pages: 1348-1357
Number of pages: 10