If you have reached this page, you are likely trying to find out why our crawler has visited your Web site.
We are the research group that invented the GoodRelations vocabulary for e-commerce, which your site appears to use on its pages to send e-commerce data to Google.
Most likely, you yourself submitted your sitemap.xml to our crawler via the GR-Notify service (many shop extensions use this service automatically), although we occasionally find new shop sites by other means.
We automatically visit shop sites that use GoodRelations for two purposes:
In a nutshell, we visit your site to make GoodRelations a better tool for you and 10,000 other site owners.
We provide GoodRelations and all its documentation free of charge to site owners and to Google, who make a business out of it. In return, we would appreciate it if you tolerated our analyzing the usage of our work from a research perspective.
Also, in the mid term, allowing us to crawl your site will increase the visibility of your items in novel e-commerce applications.
The crawler attempts a deep crawl of all items in a shop, either based on a sitemap.xml file or by spidering. This is necessary for the research questions we are tackling.
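To illustrate the sitemap-based approach, here is a minimal sketch (not the actual crawler code) that extracts the page URLs listed in a sitemap.xml file, using only the Python standard library:

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files (sitemaps.org protocol)
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str) -> list:
    """Return all <loc> entries from a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip()
            for loc in root.iter(SITEMAP_NS + "loc")
            if loc.text]

# Illustrative sitemap snippet (example.com URLs are placeholders)
example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://shop.example.com/item/1</loc></url>
  <url><loc>http://shop.example.com/item/2</loc></url>
</urlset>"""

print(extract_urls(example))
```

The URLs found this way become the frontier for the deep crawl of the shop's item pages.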
The crawler identifies itself in the HTTP request header as
python-grcrawler/<version> (http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler)
This gives every Web site owner the ability to contact us at any time with questions about our crawler.
Note: In theory, illegitimate crawlers can reuse another crawler's identification string to hide their true identity. So far, this has not happened with our signature, but when in doubt, check that the origin IP address is from the range
137.193.166.xxx
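Such a check can also be done programmatically. A small sketch using Python's standard ipaddress module, treating the 137.193.166.xxx range stated above as a /24 network:

```python
import ipaddress

# The crawler's published address range, 137.193.166.xxx, as a /24 network
GRCRAWLER_NET = ipaddress.ip_network("137.193.166.0/24")

def is_grcrawler_ip(addr: str) -> bool:
    """True if the request's origin IP falls inside the published range."""
    return ipaddress.ip_address(addr) in GRCRAWLER_NET

print(is_grcrawler_ip("137.193.166.42"))  # True
print(is_grcrawler_ip("203.0.113.7"))     # False
```

A request claiming to be python-grcrawler but originating outside this range is not from our crawler.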
If you do not want our crawler to visit your site, simply indicate that in your site's robots.txt directives:
User-agent: python-grcrawler
Disallow: /
This focused crawler obeys all important robots.txt directives. That means it abandons sites that it is not allowed to crawl, skips directories that are explicitly excluded for particular crawlers, and avoids overloading servers by respecting the prescribed crawl delay. For sites that lack a robots.txt file, we apply our own politeness policy, i.e. we set the crawl delay to a default value of 5 seconds.
In addition, we made sure that our crawler hits each site with a single thread only; otherwise the politeness constraints above could be violated.
As a Web site owner, you can control how our crawler interacts with your site by creating or modifying a robots.txt file. For instance, a lower crawl delay, and thus more frequent successive requests to your site, can be configured as follows:
User-agent: python-grcrawler
Disallow:
Crawl-delay: 1
The crawl delay in this example is set to 1 second, compared to the crawler's default of 5 seconds.
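The robots.txt handling described above can be reproduced with Python's standard urllib.robotparser. A sketch, where the file contents and user agent mirror the example above:

```python
from urllib.robotparser import RobotFileParser

# robots.txt contents matching the example above
robots_txt = """\
User-agent: python-grcrawler
Disallow:
Crawl-delay: 1
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

agent = "python-grcrawler"
# Empty "Disallow:" means everything is allowed for this user agent
allowed = rp.can_fetch(agent, "http://shop.example.com/item/1")
# crawl_delay() returns None when no Crawl-delay is given, in which
# case our crawler falls back to its 5-second default
delay = rp.crawl_delay(agent)

print(allowed, delay)
```

The example URL is a placeholder; the same checks are applied to every URL before the crawler requests it.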
We feed our crawler with seed URIs collected by GR-Notify, a central registry and notification service for GoodRelations-enabled Web pages and Web shops. In addition, we sometimes choose from a list of larger Web shops with GoodRelations content that we are aware of; by bilateral agreement, we may then crawl them with a slightly different crawling strategy.
Several domains can be crawled concurrently. However, the architecture of the crawler ensures that no domain is simultaneously hit by more than one process. The maximum number of processes that work in parallel is configurable and limited only by hardware and software constraints.
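The one-worker-per-domain guarantee can be sketched with a per-domain lock. This is a hypothetical illustration of the idea, not the crawler's actual implementation:

```python
import threading
from urllib.parse import urlparse

# Hypothetical sketch: one lock per domain, so concurrent workers may
# crawl different domains but never the same domain at the same time.
_locks = {}
_locks_guard = threading.Lock()

def lock_for(domain: str) -> threading.Lock:
    """Return the single lock associated with a domain, creating it once."""
    with _locks_guard:
        return _locks.setdefault(domain, threading.Lock())

fetched = []

def fetch(url: str) -> None:
    domain = urlparse(url).netloc
    with lock_for(domain):   # at most one worker inside per domain
        fetched.append(url)  # stand-in for the actual HTTP request

urls = ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(fetched))
```

Requests to a.example serialize behind one lock while b.example is crawled in parallel, which is the behavior described above.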
Depending on the type of the input URI, the crawler proceeds in one of two ways:
The purpose of our crawler is to find and store the contents of Web pages that contain GoodRelations data. Currently, we can extract GoodRelations content encoded as RDFa, Microdata, and RDF/XML.
At the moment, we store all structured content we gather in N-Triples files, which we later load into a private endpoint for academic use.
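N-Triples is a simple line-based RDF serialization: one triple per line, terminated by a period. A minimal sketch of how a single GoodRelations triple is written in this format (the subject URI is illustrative; a real pipeline would use an RDF library rather than hand-formatting):

```python
def ntriple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Serialize a single triple as one N-Triples line."""
    if literal:
        # Literals are quoted, with backslashes and quotes escaped
        o = '"%s"' % obj.replace('\\', '\\\\').replace('"', '\\"')
    else:
        # URI references are enclosed in angle brackets
        o = "<%s>" % obj
    return "<%s> <%s> %s ." % (subject, predicate, o)

# Illustrative triple: the crawled page describes a gr:Offering
line = ntriple(
    "http://shop.example.com/item/1#offering",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://purl.org/goodrelations/v1#Offering",
)
print(line)
```

Because every line is a self-contained triple, N-Triples files from many crawled shops can simply be concatenated before loading them into the endpoint.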
The work on the GoodRelations Crawler has been supported by the German Federal Ministry of Research (BMBF) by a grant under the KMU Innovativ program as part of the Intelligent Match project (FKZ 01IS10022B).
Univ.-Prof. Dr. Martin Hepp
E-Business and Web Science Research Group
Chair of General Management and E-Business
Universität der Bundeswehr München
Werner-Heisenberg-Weg 39
D-85579 Neubiberg, Germany
Phone: +49 89 6004-4217
eMail: mhepp(at)computer.org (preferred mode of communication)
Web: <http://www.heppnetz.de/>
Web: <http://www.unibw.de/ebusiness/>