
= GoodRelations Crawler =

The E-Business and Web Science Research Group operates a crawler that collects e-commerce data on the Semantic Web:
 * python-grcrawler/0.3 (http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler)

== Authentication ==

The crawler identifies itself with the user agent python-grcrawler/0.3 (http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler). This way, every Web site owner is able to contact us at any time with questions regarding our crawler.
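
To illustrate, a minimal Python sketch of how a crawler announces such a user agent with every request could look as follows; the target URL is a placeholder, and the snippet is not taken from the crawler's actual code base:

 import urllib.request
 
 # The user agent string documented above.
 USER_AGENT = ("python-grcrawler/0.3 "
               "(http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler)")
 
 # Placeholder URL; every request carries the identifying User-Agent header.
 request = urllib.request.Request(
     "http://example.com/",
     headers={"User-Agent": USER_AGENT},
 )
 with urllib.request.urlopen(request) as response:
     html = response.read()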

== Politeness ==

This focused crawler obeys all important robots.txt directives. That is, it abandons sites that it is not allowed to crawl, skips directories that are explicitly excluded for particular crawlers, and avoids overloading servers by respecting the prescribed crawl delay. For Web sites that lack a robots.txt file, we apply our own politeness policy, i.e. we set the crawl delay to a default value of 5 seconds.
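
In Python, this kind of compliance can be sketched with the standard library's robotparser module; the snippet below illustrates the principle rather than reproducing the crawler's actual code, with example.com as a placeholder domain:

 from urllib import robotparser
 
 DEFAULT_CRAWL_DELAY = 5  # our politeness default of 5 seconds
 
 rp = robotparser.RobotFileParser("http://example.com/robots.txt")
 rp.read()
 
 # Abandon the site (or skip excluded directories) if fetching is disallowed.
 allowed = rp.can_fetch("python-grcrawler", "http://example.com/offers.html")
 
 # Respect the prescribed crawl delay, falling back to the default.
 delay = rp.crawl_delay("python-grcrawler") or DEFAULT_CRAWL_DELAY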

In addition, we made sure that our crawler hits each site with a single thread only; otherwise, these politeness constraints could be violated.

As a Web site owner, you can control how our crawler interacts with your Web site by creating a robots.txt file or modifying it accordingly. For instance, a lower crawl delay, and thus a higher rate of successive requests to your site, can be configured as follows:

 User-agent: python-grcrawler
 Disallow:
 Crawl-delay: 1

The crawl delay in this example is set to 1 second, compared to the crawler's default value of 5 seconds.

== Source URIs ==

We feed our crawler with the seed URIs collected by GR-Notify, a central registry and notification service for GoodRelations-empowered Web pages and Web shops. Furthermore, we sometimes choose from a list of larger Web shops with GoodRelations content that we are aware of; by bilateral agreement, we may then crawl these with a slightly different crawling strategy.

== Crawling Strategy ==
Several domains can be crawled concurrently. However, the architecture of the crawler ensures that no domain is simultaneously hit by more than one process. The maximum number of processes that work in parallel is configurable and limited only by hardware and software constraints.
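
To illustrate this constraint, the following sketch (our own simplification, not the crawler's actual implementation) bundles all URIs of one domain into a single task, so that a process pool never runs two workers against the same domain:

 from concurrent.futures import ProcessPoolExecutor
 from itertools import groupby
 from urllib.parse import urlparse
 
 def crawl_domain(domain, uris):
     # Exactly one process per domain; requests are strictly sequential here.
     for uri in uris:
         pass  # fetch and parse, honoring the crawl delay
 
 def crawl_all(seed_uris, max_workers=8):
     # max_workers corresponds to the configurable process limit.
     def domain_of(uri):
         return urlparse(uri).netloc
     ordered = sorted(seed_uris, key=domain_of)
     with ProcessPoolExecutor(max_workers=max_workers) as pool:
         for domain, uris in groupby(ordered, key=domain_of):
             pool.submit(crawl_domain, domain, list(uris))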

Based on the type of the input URI, the crawler component distinguishes two cases (see the sketch after this list):

 * 1) The URI points to a sitemap file: The crawler reads the sitemap and extracts its contents page by page. Since sitemaps tend to grow very large, site owners can (and should) use sitemap index files that point to further sitemaps; our GoodRelations crawler is able to handle these appropriately.
 * 2) The URI is not a sitemap file: The crawler parses the domain name of the submitted URI and checks whether a robots.txt file is available.
  * If the robots.txt file references a sitemap file, the crawler proceeds as in case 1.
  * If it contains no such reference, the crawler checks the root directory of the domain for a sitemap file. If one is found, the crawler proceeds as in case 1; if not, it starts a depth-first crawl over the whole domain with a configurable maximum crawl depth.
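
The following Python sketch mirrors this decision logic; crawl_sitemap() and depth_first_crawl() are hypothetical stand-ins rather than the crawler's real API, the ".xml" suffix test is a simplification, and site_maps() requires Python 3.8 or later:

 import urllib.error
 import urllib.request
 from urllib import robotparser
 from urllib.parse import urljoin, urlparse
 
 def crawl_sitemap(sitemap_url):
     pass  # hypothetical: read the sitemap (or sitemap index) page by page
 
 def depth_first_crawl(root, max_depth):
     pass  # hypothetical: depth-first crawl bounded by max_depth
 
 def sitemap_at_root(root):
     # Probe the conventional sitemap location in the domain's root directory.
     candidate = urljoin(root, "sitemap.xml")
     try:
         request = urllib.request.Request(candidate, method="HEAD")
         with urllib.request.urlopen(request) as response:
             return [candidate] if response.status == 200 else []
     except urllib.error.URLError:
         return []
 
 def dispatch(uri, max_depth=5):  # the depth limit shown is an assumption
     if urlparse(uri).path.endswith(".xml"):  # case 1: URI is a sitemap
         crawl_sitemap(uri)
         return
     root = "{0.scheme}://{0.netloc}/".format(urlparse(uri))  # case 2
     rp = robotparser.RobotFileParser(urljoin(root, "robots.txt"))
     rp.read()
     sitemaps = rp.site_maps() or sitemap_at_root(root)
     if sitemaps:
         for sitemap in sitemaps:
             crawl_sitemap(sitemap)  # proceed as in case 1
     else:
         depth_first_crawl(root, max_depth)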

== Storage ==

The purpose of our crawler is to find and store the contents of Web pages that contain GoodRelations data. So far, we are able to extract GoodRelations content encoded as RDFa, Microdata, and RDF/XML.
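
As a rough illustration of such extraction (the third-party extruct package shown here is one possible tool, not necessarily the one our crawler uses), RDFa and Microdata can be pulled out of a fetched HTML page like this:

 import extruct  # third-party library: pip install extruct
 
 def extract_structured_data(html, url):
     # Returns a dict keyed by syntax, e.g. data["rdfa"] and data["microdata"].
     return extruct.extract(html, base_url=url,
                            syntaxes=["rdfa", "microdata"])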

At the moment, we store all structured content we gather in N-Triples files, which we afterwards upload into a private endpoint for academic use.
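
A minimal sketch of this step, assuming the rdflib library (version 6 or later, where serialize() returns a string); the file name and base URI are placeholders:

 from rdflib import Graph
 
 def store_as_ntriples(payload, base_uri, out_path="goodrelations.nt"):
     # Parse an RDF/XML payload and append its triples to an N-Triples file.
     g = Graph()
     g.parse(data=payload, format="xml", publicID=base_uri)
     with open(out_path, "a", encoding="utf-8") as f:
         f.write(g.serialize(format="nt"))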

== Acknowledgements ==

The work on the GoodRelations Crawler has been supported by the German Federal Ministry of Education and Research (BMBF) through a grant under the KMU Innovativ program as part of the Intelligent Match project (FKZ 01IS10022B).