Text categorization with support vector machines. How to represent texts in input space?

Cited by: 262
Authors
Leopold, E [1]
Kindermann, J [1]
Affiliations
[1] GMD German National Research Center for Information Technology, Institute for Autonomous Intelligent Systems, D-53754 Sankt Augustin, Germany
Keywords
support vector machines; text classification; lemmatization; stemming; kernel functions
DOI
10.1023/A:1012491419635
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The choice of the kernel function is crucial to most applications of support vector machines (SVMs). In this paper, however, we show that in the case of text classification, term-frequency transformations have a larger impact on SVM performance than the kernel itself. We discuss the role of importance weights (e.g. document frequency and redundancy), which is not yet fully understood in the light of model complexity and calculation cost, and we show that time-consuming lemmatization or stemming can be avoided even when classifying a highly inflectional language like German.
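As a rough illustration of what a term-frequency transformation combined with an importance weight can look like, the following minimal sketch maps tokenized documents to weighted vectors. The specific choices here (a logarithmic frequency transform, an inverse-document-frequency weight, L2 normalization) and the function name `tfidf_vectors` are assumptions made for illustration; the abstract does not state which transformations the paper evaluates. The toy German sentences are invented, and the word forms are used as-is, without stemming or lemmatization.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to L2-normalized, weighted term vectors.

    docs: list of token lists (raw word forms, no stemming applied).
    Returns (vocab, vectors), where vectors[i][j] is the weight of
    vocab[j] in document i.
    """
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    # Importance weight (assumed here): inverse document frequency.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}

    vectors = []
    for d in docs:
        tf = Counter(d)
        # Term-frequency transformation (assumed here): log(1 + tf).
        vec = [math.log(1 + tf[t]) * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vocab, vectors


# Hypothetical toy corpus with inflected German word forms.
docs = [
    "der hund jagt die katze".split(),
    "die katze jagt die maus".split(),
    "der kurs der aktie steigt".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

The resulting vectors could then be passed to any SVM implementation; with unit-length inputs, a linear kernel reduces to cosine similarity between documents.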
Pages: 423-444
Number of pages: 22