Web Crawlers

A Web crawler, or Web robot, is a program that traverses the Web's hypertext structure by retrieving a document and then recursively retrieving all documents it references. These programs are also called "spiders", "web wanderers", or "web worms". These names, while perhaps more appealing, can be misleading: the terms "spider" and "wanderer" give the false impression that the robot itself moves from site to site. In reality, a robot is a single software system that retrieves documents from remote sites using standard Web protocols.
Normal Web browsers are not robots, because they are operated by a human and do not automatically retrieve referenced documents (other than inline images).

Web crawlers are almost as old as the web itself. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic. Several papers about web crawling were presented at the first two World Wide Web conferences. At the time, however, the web was two to three orders of magnitude smaller than it is today, so those early systems did not address the scaling problems inherent in a crawl of today's web.
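As a concrete illustration of the retrieve-and-recurse loop described above, the sketch below implements a minimal breadth-first crawler in Python using only the standard library. The seed URL and page limit are placeholder assumptions; a real crawler would also honor robots.txt, throttle its requests, and fetch pages in parallel.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href targets from <a> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl from seed_url, visiting each URL at most once."""
    frontier = deque([seed_url])   # URLs waiting to be fetched
    visited = set()                # URLs already fetched (avoids revisiting loops)

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to download

        # Extract links, resolve them against the current page, drop fragments.
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)

    return visited


if __name__ == "__main__":
    # Hypothetical seed URL; replace with a site you are permitted to crawl.
    print(crawl("https://example.com", max_pages=5))

The frontier queue and the visited set are the two data structures every crawler variant shares; the main design choices in larger systems concern how the frontier is ordered and partitioned across machines.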
Robots can be used for a number of purposes:
• Indexing
• HTML validation
• Link validation
• "What's New" monitoring
• Mirroring
