GoodRelations Crawler

GoodRelations Tool
Name GRCrawler
Type Crawler
Language Python
Contact Alex Stolz
Tags goodrelations, focused crawler, e-commerce

When you reach this page, you are likely trying to find out

  • why our crawler has visited your site and is trying to fetch a significant number of pages,
  • why this might be in your interest and
  • how you can stop us from crawling your site.

Why are we crawling your site?

We are the research group that invented GoodRelations, the e-commerce vocabulary that your site uses to send e-commerce data to Google. It seems that you are using this markup on your pages.

It is also very likely that you submitted your sitemap.xml to our crawler yourself via the GR-Notify service (many shop extensions use this service automatically), though we occasionally find new shop sites by other means.

Why should I allow you to crawl my site?

We automatically visit shop sites that use GoodRelations for two purposes:

  1. to see whether GoodRelations is used correctly so that we can improve our documentation and tooling (see navigation bar above), and
  2. to share information about products with novel e-commerce services, like mobile applications and price comparison services.

In a nutshell, we visit your site to make GoodRelations a better tool for you and 10,000 other site owners.

We provide GoodRelations and all documentation free of charge to site owners and to Google, who make a business out of it. In return, we would appreciate it if you allowed us to analyze, from a research perspective, how our work is used.

Also, in the medium term, allowing us to crawl your site will increase the visibility of your items in novel e-commerce applications.

Why are you visiting so many pages?

The crawler attempts a deep crawl of all items in a shop, either based on a sitemap.xml file or by spidering. This is necessary for the research questions we are tackling.

How can I see that your crawler has visited my site?

The crawler identifies itself in the HTTP request header as

 python-grcrawler/<version> (http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler)

This way, every Web site owner can contact us at any time with questions about our crawler.

Note: In theory, illegitimate crawlers can use another crawler's identification string to hide their true identity. So far, this has not happened with our signature, but when in doubt, check that the origin IP address is in the range

137.193.166.xxx
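
Both checks can be combined when you inspect your access logs. The following is a minimal sketch, not part of the crawler itself. It assumes that the range above is read as the network 137.193.166.0/24; the function name is_grcrawler_request is hypothetical.

```python
import ipaddress

# Assumption: "137.193.166.xxx" is interpreted as this /24 network.
GRCRAWLER_NETWORK = ipaddress.ip_network("137.193.166.0/24")
# The identification string starts with this prefix, followed by the version.
GRCRAWLER_UA_PREFIX = "python-grcrawler/"


def is_grcrawler_request(user_agent: str, remote_ip: str) -> bool:
    """Return True if both the User-Agent and the origin IP match GRCrawler."""
    if not user_agent.startswith(GRCRAWLER_UA_PREFIX):
        return False
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Not a valid IP address, so it cannot lie in the range.
        return False
    return address in GRCRAWLER_NETWORK
```

A request that carries our identification string but comes from an address outside this range fails the check and is most likely not from us.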

How can we stop your crawler from visiting our site?

Simply indicate that in your site's robots.txt directives:

User-agent: python-grcrawler
Disallow: /
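
If you want to check the effect of these directives before deploying them, a small local test may help. The sketch below uses Python's standard urllib.robotparser module. The helper name is_allowed and the example.com URL are placeholders, and the crawler's own robots.txt handling may differ in detail.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: python-grcrawler
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The rule blocks our crawler but leaves other user agents unaffected.
blocked = not is_allowed(ROBOTS_TXT, "python-grcrawler", "http://example.com/products/1")
others_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/products/1")
```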

Additional Information

Politeness

This focused crawler obeys all important robots.txt directives. That is, it abandons sites that it is not allowed to crawl, skips directories that are explicitly excluded for proprietary crawlers, and avoids overloading servers by respecting the prescribed crawl delay. For Web sites that lack a robots.txt file, we apply our own politeness policy and set the crawl delay to a default value of 5 seconds.

That said, we also made sure that our crawler hits each site with a single thread only; otherwise the politeness constraints would be violated.

As a Web site owner, you can control how our crawler interacts with your site by creating or modifying a robots.txt file. For instance, you can allow a shorter crawl delay, and thus more frequent requests to your site, as follows:

User-agent: python-grcrawler
Disallow:
Crawl-delay: 1

The crawl delay in the example above is set to 1 second, compared to the crawler's default value of 5 seconds.
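
How a crawl delay can be derived from a robots.txt file may be sketched as follows, again with Python's standard urllib.robotparser module. This is an illustration under assumptions, not the crawler's actual code. The name effective_crawl_delay is hypothetical, and falling back to 5 seconds when a robots.txt exists but contains no Crawl-delay line is our assumption for this sketch.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_CRAWL_DELAY = 5.0  # seconds; the default value stated above


def effective_crawl_delay(robots_txt: str, user_agent: str = "python-grcrawler") -> float:
    """Return the Crawl-delay that applies to user_agent, or the default."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    # Assumption: use the default whenever no delay is prescribed.
    return float(delay) if delay is not None else DEFAULT_CRAWL_DELAY
```

With the robots.txt example above, this function yields a delay of 1 second; with an empty file, it yields the default of 5 seconds.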

Source URIs

We feed our crawler with seed URIs collected by GR-Notify, a central registry and notification service for Web pages and Web shops that use GoodRelations. In addition, we sometimes choose from a list of larger Web shops with GoodRelations content that we are aware of. By bilateral agreement, we may then crawl them with a slightly different crawling strategy.

Crawling Strategy

Several domains can be crawled concurrently. However, the architecture of the crawler ensures that no domain is simultaneously hit by more than one process. The maximum number of processes that work in parallel is configurable and limited only by hardware and software constraints.
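
The constraint of one worker per domain can be illustrated with a short sketch. It is not the crawler's actual architecture: a thread pool stands in for the processes mentioned above to keep the example portable, and the names group_by_domain, crawl_domain and crawl_all are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def group_by_domain(uris):
    """Map each host name to the list of URIs that belong to it."""
    groups = defaultdict(list)
    for uri in uris:
        groups[urlsplit(uri).netloc.lower()].append(uri)
    return dict(groups)


def crawl_domain(domain, uris):
    """Placeholder worker: would visit the URIs of ONE domain sequentially."""
    return domain, len(uris)


def crawl_all(uris, max_workers=4):
    """Crawl several domains concurrently, with exactly one task per domain."""
    groups = group_by_domain(uris)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per domain, so no domain is handled by two workers at once.
        futures = [pool.submit(crawl_domain, d, u) for d, u in groups.items()]
        return dict(f.result() for f in futures)
```

Because all URIs of a domain end up in the same task, the crawl delay for that domain can be enforced inside a single worker.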

Based on the type of the input URI, the crawler distinguishes two cases:

  1. URI describes a sitemap file: The crawler reads the sitemap and extracts its contents page by page. If sitemaps grow very large, site owners can (and should) use sitemap index files that point to further sitemaps. Our GoodRelations crawler handles these appropriately.
  2. URI is not a sitemap file: The crawler parses the domain name of the submitted URI and checks whether a robots.txt file is available.
    • robots.txt contains a reference to a sitemap file: Go to step 1.
    • robots.txt contains no reference: The crawler checks the root directory of the domain for a sitemap file. If it finds one, go to step 1. If no sitemap file can be found, the crawler starts a depth-first crawl over the whole domain with a configurable maximum crawl depth.
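
The decision steps above can be sketched as follows. This is a simplified illustration under assumptions, not the crawler's code: it does no network access, all function names are hypothetical, and a missing robots.txt is treated like one without a sitemap reference, which the description above does not spell out.

```python
import xml.etree.ElementTree as ET

# XML namespace of the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_in_robots(robots_txt):
    """Return the URLs of all 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def read_sitemap(xml_text):
    """Return (child_sitemaps, page_urls) for a sitemap or a sitemap index."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    if root.tag == SITEMAP_NS + "sitemapindex":
        return locs, []  # index file: entries point to further sitemaps
    return [], locs      # regular sitemap: entries are page URLs


def choose_strategy(seed_is_sitemap, robots_txt=None, root_sitemap_found=False):
    """Return 'sitemap' (step 1) or 'depth-first' for a seed URI."""
    if seed_is_sitemap:
        return "sitemap"
    # Assumption: robots_txt is None when the site has no robots.txt file.
    if robots_txt is not None and sitemaps_in_robots(robots_txt):
        return "sitemap"
    if root_sitemap_found:
        return "sitemap"
    return "depth-first"
```

A sitemap index yields further sitemap URLs, each of which would be read with read_sitemap again until only page URLs remain.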

Storage

The purpose of our crawler is to find and store the contents of Web pages that contain GoodRelations data. So far, we can extract GoodRelations content encoded as RDFa, Microdata and RDF/XML.

At the moment, we store all structured content we gather in N-Triples files and later upload them to a private endpoint for academic use.
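
To give an idea of the storage format, the following sketch serializes a single statement as one N-Triples line. It is a simplified illustration, not our storage code: it covers only IRIs and plain string literals, and the function names and the shop.example URI are hypothetical.

```python
def escape_literal(text):
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\")  # backslash first, so later escapes stay intact
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t"))


def to_ntriples_line(subject, predicate, obj, obj_is_iri):
    """Format one triple; subject and predicate are assumed to be IRIs."""
    obj_term = f"<{obj}>" if obj_is_iri else f'"{escape_literal(obj)}"'
    return f"<{subject}> <{predicate}> {obj_term} ."


# Example: a page states that a resource is a gr:Offering.
line = to_ntriples_line(
    "http://shop.example/p/1#offer",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://purl.org/goodrelations/v1#Offering",
    obj_is_iri=True,
)
```

Each statement occupies exactly one line, so files produced this way can be concatenated and uploaded in bulk.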

Acknowledgements

The work on the GoodRelations Crawler has been supported by the German Federal Ministry of Research (BMBF) by a grant under the KMU Innovativ program as part of the Intelligent Match project (FKZ 01IS10022B).

Contact

Univ.-Prof. Dr. Martin Hepp
E-Business and Web Science Research Group
Chair of General Management and E-Business
Universität der Bundeswehr München
Werner-Heisenberg-Weg 39
D-85579 Neubiberg, Germany

Phone: +49 89 6004-4217
eMail: mhepp(at)computer.org (preferred mode of communication)
Web: http://www.heppnetz.de/
Web: http://www.unibw.de/ebusiness/