Grub distributed web-crawling project

Grub is a search engine developed by LookSmart that is built on distributed computing. Users may download the Grub client software and let it run during their computer's idle time. The client crawls and indexes URLs and sends the results back to the main Grub server in highly compressed form. The collective cache can then be searched on the Grub website. By asking each of thousands of clients to cache a small portion of the web, Grub is able to build a large cache quickly.
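
Grub's actual client-server protocol is not documented here, so the following is only a minimal sketch of the crawl-and-report loop described above. The server address and the /work and /submit endpoints are hypothetical placeholders, as is the JSON payload format; the sketch simply shows a client fetching a small batch of assigned URLs, crawling them, and returning the pages in compressed form.

    import gzip
    import json
    import urllib.request

    SERVER = "https://grub.example.org"  # hypothetical coordination server

    def fetch_work_unit():
        """Ask the server for a small batch of URLs to crawl (hypothetical /work endpoint)."""
        with urllib.request.urlopen(SERVER + "/work") as resp:
            return json.loads(resp.read())  # e.g. ["http://example.com/a", ...]

    def crawl(url):
        """Download one page; failed fetches are simply skipped."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", "replace")
        except OSError:
            return None

    def submit(pages):
        """Send crawled pages back to the server in compressed form (hypothetical /submit endpoint)."""
        payload = gzip.compress(json.dumps(pages).encode("utf-8"))
        req = urllib.request.Request(
            SERVER + "/submit",
            data=payload,
            headers={"Content-Encoding": "gzip", "Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    def run_once():
        urls = fetch_work_unit()  # the small slice of the web assigned to this client
        pages = {u: body for u in urls if (body := crawl(u)) is not None}
        submit(pages)             # the server merges these into the collective cache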

Though many believe in Grub's novel distributed-computing approach, the search engine has its share of opponents. Many argue that a large cache is not the strength of a good search engine; rather, it is the ability to deliver accurate, relevant results to users. Loyal users of Google say they prefer that search engine for its targeted results and would not switch to Grub unless its search technology were superior to Google's.

Quite a few webmasters are opposed to Grub because it appears to ignore sites' robots.txt files, which tell robots not to crawl certain areas of a site. Because Grub also caches robots.txt, as its developers acknowledge, changes to the file may not be detected promptly. Webmasters counter that Grub disregards even long-standing robots.txt files that block access to all crawlers, which caching cannot explain. According to Wikipedia's own webmasters, the /w/ directory, which holds the scripts for page editing and other site functions and is blocked to robots by robots.txt, is crawled by Grub but by no other search engine. Wikipedia's webmasters also complain that Grub's distributed architecture overloads their servers by keeping a large number of TCP connections open; the effect is essentially the same as a typical distributed denial-of-service attack.
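
For context on the robots.txt complaint: the robots exclusion convention asks a crawler to read a site's current /robots.txt and check every URL against it before fetching. The short sketch below uses Python's standard urllib.robotparser together with the /w/ rule described above; the host name and expected results are illustrative of the article's description of Wikipedia's robots.txt, not a test of its current contents. A crawler that instead reuses a cached copy of the file keeps applying whatever rules it saw when the copy was made.

    from urllib import robotparser

    # A compliant crawler reads the site's current robots.txt before fetching pages.
    rp = robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
    rp.read()

    # Per the article, Wikipedia disallows the /w/ script directory for all robots,
    # so URLs under it must be skipped, while ordinary article URLs remain allowed.
    print(rp.can_fetch("*", "https://en.wikipedia.org/w/index.php?title=Example&action=edit"))  # expected: False
    print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Example"))                           # expected: True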

References

Two posts ([1], [2]) to the Wikitech-L mailing list by Brion Vibber, one of Wikipedia's developers.