Distributed web crawling

Distributed web crawling is a technique in which Internet search engines use many computers to perform the web crawling needed to index the Internet. The idea is to spread the demands on computing power and bandwidth across many computers and network connections.

As of 2003, most modern commercial search engines use this technique. Companies such as Google use thousands of individual computers in multiple locations to crawl the Web.

Newer projects are attempting a less structured, more ad hoc form of collaboration, enlisting volunteers who in many cases contribute their home or personal computers. LookSmart is the largest search engine to use this technique, through its Grub distributed web-crawling project.

One proposed solution (it is unclear whether Grub or any other project actually uses this algorithm) is to have every computer connected to the Internet crawl some Internet addresses (URLs) in the background. After downloading the pages, each client compresses the new pages and sends them back, together with a status flag (changed, new, down, redirected), to powerful central servers. The servers manage a large database and send out new URLs to be tested to all clients.
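
The client side of this proposal can be illustrated with a minimal Python sketch. It is not taken from Grub or any other real crawler: the names FetchResult, status_flag, build_report and parse_report, the JSON header line, the use of SHA-256 to detect changes, and the extra "unchanged" flag are all assumptions made for illustration. The sketch only shows how a client might classify a downloaded page and pack it, compressed, with its status flag.

```python
import hashlib
import json
import zlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FetchResult:
    """Outcome of one download attempt (hypothetical structure)."""
    url: str                   # URL the server asked the client to test
    final_url: Optional[str]   # URL after redirects, None if the fetch failed
    body: Optional[bytes]      # raw page bytes, None if the fetch failed


def status_flag(result: FetchResult, known_hash: Optional[str]) -> str:
    """Map a fetch outcome to a status flag.

    "down", "redirected", "new" and "changed" come from the text;
    "unchanged" is an assumption added so every case has a label.
    """
    if result.body is None:
        return "down"
    if result.final_url != result.url:
        return "redirected"
    if known_hash is None:
        return "new"
    digest = hashlib.sha256(result.body).hexdigest()
    return "changed" if digest != known_hash else "unchanged"


def build_report(result: FetchResult, known_hash: Optional[str]) -> bytes:
    """Pack the status flag and the compressed page into one message."""
    flag = status_flag(result, known_hash)
    # Only send page content the server does not already have.
    send_body = flag in ("new", "changed", "redirected")
    payload = zlib.compress(result.body) if send_body else b""
    header = json.dumps({
        "url": result.url,
        "final_url": result.final_url,
        "status": flag,
    }).encode("utf-8")
    # One JSON header line, then the compressed bytes (assumed wire format).
    return header + b"\n" + payload


def parse_report(report: bytes) -> Tuple[dict, Optional[bytes]]:
    """Server-side counterpart: split the header and decompress the page."""
    header, _, payload = report.partition(b"\n")
    meta = json.loads(header)
    body = zlib.decompress(payload) if payload else None
    return meta, body
```

In this sketch the central server would keep the last known hash for each URL and hand it out together with the URL, so that a client can skip sending pages that have not changed. That choice is an assumption; the proposal itself does not say how changes are detected.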

According to the Nutch FAQ, the bandwidth savings from distributed crawling are not significant, since a successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages.

