An analytical approach to concept extraction in HTML environments

Cited by: 11
Authors
Fresno, V [1]
Ribeiro, A
Affiliations
[1] Rey Juan Carlos Univ, Escuela Super Ciencias Expt & Tecnol, Madrid 28933, Spain
[2] CSIC, Spanish Council Sci Res, IAI, Madrid 28500, Spain
Keywords
concept extraction; feature vector in HTML texts; Web page characterization; Web page representation; Web page classification
DOI
10.1023/B:JIIS.0000019277.82436.17
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The core of the Internet and World Wide Web revolution lies in their capacity to share huge quantities of data efficiently, but the rapid and chaotic growth of the Net has greatly complicated the task of sharing or mining useful information. Any inference process based on Internet information requires an adequate characterization of Web pages. The textual part of a page is one of the most important aspects to consider in such a characterization, which should be made by extracting an appropriate set of relevant concepts that properly represent the text of the Web page. This paper presents a method to obtain such a set of relevant concepts, essentially based on estimating the relevance of each word in the text of a Web page. Word relevance is defined by a combination of criteria that take into account characteristics of the HTML language as well as more classical measures such as the frequency and position of a word in a document. In addition, heuristic rules for the most suitable fusion of criteria are obtained via a statistical study. Several experiments are conducted to compare the performance of the proposed concept extraction method with other approaches, including a commercial tool. The results show that the proposed technique extracts concepts more successfully than the other tested methods.
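The word-relevance idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the four criteria (frequency, presence in the title, emphasis tags, position of first occurrence), the `WEIGHTS` values, the `EMPHASIS_TAGS` set, and the names `WordCollector` and `rank_concepts` are all assumptions made for illustration. The paper derives its fusion rules from a statistical study, whereas this sketch uses a fixed weighted sum and performs no stop-word removal.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights for a simple linear fusion of criteria (assumed values).
WEIGHTS = {"frequency": 0.3, "title": 0.3, "emphasis": 0.2, "position": 0.2}
# Hypothetical choice of tags that count as emphasis.
EMPHASIS_TAGS = {"b", "strong", "em", "i", "h1", "h2", "h3"}
SKIPPED_TAGS = {"script", "style"}


class WordCollector(HTMLParser):
    """Collect (word, in_title, emphasized) for every word in the page text."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Drop the most recently opened matching tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        in_title = "title" in self.open_tags
        emphasized = any(t in EMPHASIS_TAGS for t in self.open_tags)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.words.append((word, in_title, emphasized))


def rank_concepts(html, top_n=5):
    """Return the top_n (word, score) pairs, highest relevance first."""
    collector = WordCollector()
    collector.feed(html)
    collector.close()
    words = collector.words
    if not words:
        return []

    counts = Counter(w for w, _, _ in words)
    max_count = max(counts.values())
    first_pos, in_title, emphasized = {}, set(), set()
    for pos, (word, title_flag, emphasis_flag) in enumerate(words):
        first_pos.setdefault(word, pos)
        if title_flag:
            in_title.add(word)
        if emphasis_flag:
            emphasized.add(word)

    scores = {}
    for word, count in counts.items():
        scores[word] = (
            WEIGHTS["frequency"] * count / max_count
            + WEIGHTS["title"] * (word in in_title)
            + WEIGHTS["emphasis"] * (word in emphasized)
            # Words that first appear earlier in the page score higher.
            + WEIGHTS["position"] * (1 - first_pos[word] / len(words))
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Calling `rank_concepts(page_html, top_n=10)` on the raw HTML of a page returns candidate concepts with their scores. A fuller version would at least filter stop words and tune the weights on labeled pages.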
Pages: 215-235
Page count: 21