Web robot detection techniques: overview and limitations

Cited by: 49
Authors
Doran, Derek [1]
Gokhale, Swapna S. [1]
Affiliation
[1] Univ Connecticut, Dept Comp Sci & Engn, Storrs, CT 06269 USA
Keywords
Web Crawler; Web Robot; WWW; Web Robot Detection; Web User Classification; DISCOVERY;
DOI
10.1007/s10618-010-0180-z
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Most modern Web robots that crawl the Internet to support value-added services and technologies possess sophisticated data collection and analysis capabilities. Some of these robots, however, may be ill-behaved or malicious, and hence, may impose a significant strain on a Web server. It is thus necessary to detect Web robots in order to block undesirable ones from accessing the server. Such detection is also essential to ensure that the robot traffic is considered appropriately in the performance and capacity planning of Web servers. Despite a variety of Web robot detection techniques, there is no consensus regarding a single technique, or even a specific "type" of technique, that performs well in practice. Therefore, to aid in the development of a practically applicable robot detection technique, this survey presents a critical analysis and comparison of the prevalent detection approaches. We propose a framework to classify the existing detection techniques into four categories based on their underlying detection philosophy. We compare the different classes to gain insights into those characteristics that make up an effective robot detection scheme. Finally, we discuss why the contemporary techniques fail to offer a general solution to the robot detection problem and propose a set of key ingredients necessary for strong Web robot detection.
Pages: 183-210
Page count: 28