Tools/GRCrawler

The E-Business and Web Science Research Group operates a crawler that is currently in alpha stage. It identifies itself with the following user-agent string:
 * grcrawler/0.2 (+http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler)

Our crawler obeys all relevant robots.txt directives: it abandons sites it is not allowed to crawl, skips directories that robots.txt explicitly excludes for crawlers like ours, and avoids overloading servers by respecting the crawl delay. For Web sites that lack a robots.txt file, we apply a politeness policy and set the crawl delay to a default value of 5 seconds. In addition, we made sure that the crawler hits each site with a single thread only.
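The policy above can be sketched in a few lines of Python using the standard-library robots.txt parser. This is an illustrative sketch, not the crawler's actual code; the function names (`load_rules`, `allowed`, `crawl_delay`) and the shortened user-agent token are assumptions made for the example.

```python
import urllib.robotparser

# Assumed user-agent token for robots.txt matching (illustrative).
USER_AGENT = "grcrawler"
# Politeness default when robots.txt specifies no Crawl-delay (per the policy above).
DEFAULT_CRAWL_DELAY = 5.0

def load_rules(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse the raw text of a robots.txt file into a rule set."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

def allowed(rp: urllib.robotparser.RobotFileParser, url: str) -> bool:
    """True if the rules permit our crawler to fetch the given URL."""
    return rp.can_fetch(USER_AGENT, url)

def crawl_delay(rp: urllib.robotparser.RobotFileParser) -> float:
    """Crawl delay from robots.txt, or the politeness default of 5 seconds."""
    delay = rp.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_CRAWL_DELAY
```

A crawler thread would check `allowed()` before each fetch and sleep for `crawl_delay()` seconds between requests to the same host; running one such thread per site gives the single-thread behavior described above.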