
= GoodRelations Crawler =

The E-Business and Web Science Research Group operates a crawler that is currently at a pre-alpha stage. It identifies itself with the following user-agent string:
 * python-grcrawler/0.2 (+http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler)

Our crawler obeys all important robots.txt directives. That is, it leaves sites that it is not allowed to crawl, skips directories that are explicitly excluded for specific crawlers, and avoids overloading servers by respecting the prescribed crawl delay. For Web sites that lack a robots.txt file, we apply our own politeness policy: the crawl delay is set to a default value of 5 seconds.
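The policy described above can be sketched with Python's standard `urllib.robotparser` module. This is a minimal illustration, not the crawler's actual source code: the function names `build_policy`, `may_fetch` and `effective_delay` are hypothetical, and an empty list of lines stands in for a site without a robots.txt file.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "python-grcrawler"   # product token from the user-agent string
DEFAULT_CRAWL_DELAY = 5.0         # seconds; default used when none is prescribed


def build_policy(robots_txt_lines):
    """Parse the lines of a robots.txt file (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


def may_fetch(parser, url):
    """Return True if the rules allow our user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def effective_delay(parser):
    """Return the prescribed crawl delay, or the default if none is given."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_CRAWL_DELAY


if __name__ == "__main__":
    policy = build_policy([
        "User-agent: python-grcrawler",
        "Disallow: /private/",
        "Crawl-delay: 2",
    ])
    print(may_fetch(policy, "http://www.example.com/private/data.html"))
    print(effective_delay(policy))
    print(effective_delay(build_policy([])))  # no rules: default delay applies
```

In this sketch a URL under an excluded directory is rejected before any request is made, and the delay falls back to 5 seconds only when no `Crawl-delay` line applies to the crawler.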

In addition, we made sure that our crawler accesses each site with a single thread only; otherwise, parallel requests could violate the crawl delay.
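To illustrate why requests to one site have to be serialized, the following sketch tracks the earliest permitted request time per host. It is an assumption about how such a constraint could be enforced, not a description of the crawler's internals; the class `HostThrottle` and its method `wait_time` are hypothetical names.

```python
from urllib.parse import urlsplit


class HostThrottle:
    """Track, per host, the earliest time the next request may start."""

    def __init__(self, default_delay=5.0):
        self.default_delay = default_delay
        self._next_allowed = {}  # host -> earliest permitted start time

    def wait_time(self, url, now, delay=None):
        """Return how long to wait before requesting `url`.

        `now` is the current time in seconds; `delay` is the crawl delay
        for the host, falling back to the default when not given.
        """
        host = urlsplit(url).netloc.lower()
        if delay is None:
            delay = self.default_delay
        start = max(now, self._next_allowed.get(host, now))
        # Reserve the slot so the following request to this host waits.
        self._next_allowed[host] = start + delay
        return start - now


if __name__ == "__main__":
    throttle = HostThrottle()
    print(throttle.wait_time("http://a.example/page1", now=0.0))
    print(throttle.wait_time("http://a.example/page2", now=1.0))
    print(throttle.wait_time("http://b.example/page1", now=1.0))
```

Because every request to a host reserves the next time slot, two requests to the same host can never start closer together than the crawl delay, while requests to different hosts do not block each other.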

To control how our crawler interacts with your Web site, you can create a robots.txt file or modify your existing one. For instance, a shorter crawl delay, and thus a higher request rate for your site, can be obtained as follows:

 User-agent: python-grcrawler
 Disallow:
 Crawl-delay: 1

The example above sets the crawl delay to 1 second, compared with the crawler's default value of 5 seconds.
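Site owners who want to check how such a file is interpreted can parse it locally. The sketch below uses Python's standard `urllib.robotparser`; whether the crawler itself relies on this module is an assumption, and the URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content under test, one directive per line.
ROBOTS_TXT = """\
User-agent: python-grcrawler
Disallow:
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Crawl delay that applies to the GoodRelations crawler.
delay = parser.crawl_delay("python-grcrawler")

# An empty Disallow line excludes nothing, so every path is allowed.
allowed = parser.can_fetch(
    "python-grcrawler", "http://www.example.com/products/item.html"
)

print("crawl delay:", delay)
print("allowed:", allowed)
```

Since the rule group names `python-grcrawler` explicitly, the shorter delay applies to this crawler only and leaves the rules for other user agents untouched.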

== Acknowledgements ==
The work on the GoodRelations Crawler has been supported by the German Federal Ministry of Research (BMBF) by a grant under the KMU Innovativ program as part of the Intelligent Match project (FKZ 01IS10022B).