Concept decompositions for large sparse text data using clustering

Cited by: 753
Authors
Dhillon, IS [1]
Modha, DS [2]
Institutions
[1] Univ Texas, Dept Comp Sci, Austin, TX 78712 USA
[2] IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA
Funding
National Science Foundation (NSF);
Keywords
concept vectors; fractals; high-dimensional data; information retrieval; k-means algorithm; least-squares; principal angles; principal component analysis; self-similarity; singular value decomposition; sparsity; vector space models; text mining;
DOI
10.1023/A:1007612920971
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized "basis" for text data sets.
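The two ideas in the abstract can be illustrated with a minimal NumPy sketch: spherical k-means assigns unit-norm document vectors to clusters by maximum cosine similarity and takes unit-normalized centroids as concept vectors, and the concept decomposition is the least-squares approximation of the document matrix in the subspace spanned by those concept vectors. This is an illustrative assumption-laden sketch on a dense random toy matrix, not the authors' code; function names and parameters are invented for the example.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Spherical k-means on the rows of X (documents x words).

    Rows of X are assumed L2-normalized. Each 'concept vector' is the
    centroid of a cluster normalized to unit Euclidean norm, and each
    document is assigned to the concept vector of maximum cosine similarity.
    """
    rng = np.random.default_rng(seed)
    # Initialize concept vectors from k randomly chosen documents.
    C = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest concept vector in cosine similarity.
        labels = np.argmax(X @ C.T, axis=1)
        # Update step: unit-normalized centroid of each cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                C[j] = m / np.linalg.norm(m)
    return C, labels

# Toy stand-in for a sparse document-by-word matrix, rows normalized.
rng = np.random.default_rng(1)
X = rng.random((100, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)

C, labels = spherical_kmeans(X, k=5)

# Concept decomposition: least-squares approximation of X in span(C),
# i.e. X ~ Z @ C with Z minimizing ||X - Z C||_F (via the pseudoinverse).
Z = X @ np.linalg.pinv(C)
err = np.linalg.norm(X - Z @ C, 'fro')
```

The paper's empirical claim is that `err` computed this way is close to the error of the rank-k truncated SVD of `X`, even though the concept vectors, unlike singular vectors, are sparse and localized on real text data.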
Pages: 143-175
Page count: 33
References
41 references in total
[1]  
[Anonymous], 1949, Human Behavior and the Principle of Least Effort
[2]   Using linear algebra for intelligent information retrieval [J].
Berry, MW ;
Dumais, ST ;
OBrien, GW .
SIAM REVIEW, 1995, 37 (04) :573-595
[3]  
Bjorck A., 1973, MATH COMPUTATION, V27
[4]   Document categorization and query generation on the World Wide Web using WebACE [J].
Boley, D ;
Gini, M ;
Gross, R ;
Han, EH ;
Hastings, K ;
Karypis, G ;
Kumar, V ;
Mobasher, B ;
Moore, J .
ARTIFICIAL INTELLIGENCE REVIEW, 1999, 13 (5-6) :365-391
[5]  
Broder A.Z., 1997, 1997015 DIG SYST RES
[6]  
Caid W R, 1997, US Patent, Patent No. [5,619,709, 5619709]
[7]  
CUTTING DR, 1992, P ACM SIGIR
[8]
DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[9]
[10]   A data-clustering algorithm on distributed memory multiprocessors [J].
Dhillon, IS ;
Modha, DS .
LARGE-SCALE PARALLEL DATA MINING, 2000, 1759 :245-260