ScoopGracieBot

ScoopGracieBot 2.0 is an experimental Web crawler for Linux written mostly in Python 3, with the robots.txt implementation [1] in C++. Its goal is to create a sort of "map" of the Web. Note that it is still very much a work in progress.

I currently do not run it much, as I have no SSH access to this Web server and limited bandwidth on my home computer. However, once I upgrade my server to a VPS within the next few months, I will run it frequently and possibly make a link search engine based on it.

Code, Documentation, and License

The bot is released under the Apache License 2.0 and is available on my source code downloads page. Build and usage instructions are included in the tarball. I have not written any detailed documentation yet, but reading the code (in the scoopgraciebot/bot.py3 file in the tarball) should explain how it works.

Controlling ScoopGracieBot

Unlike ScoopGracieBot 1.0, ScoopGracieBot 2.0 obeys the robots.txt bot control standard:

If you do not want ScoopGracieBot 2.0 to crawl your site, you can block it in robots.txt:

User-agent: ScoopGracieBot
Disallow: /

Note that it may request up to ten more pages from your site after you block it. This is because, to save bandwidth, it only re-fetches robots.txt once every ten visits to a site.
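
For illustration, here is a minimal Python sketch of that kind of cached robots.txt check. It uses Python's standard urllib.robotparser rather than the Google C++ parser the bot actually uses [1], and the names and structure are hypothetical, not taken from the bot's code:

import urllib.robotparser
from urllib.parse import urlparse

REFRESH_EVERY = 10  # hypothetical: re-fetch robots.txt every ten visits
_cache = {}         # host -> (parser, visits since last fetch)

def allowed(url, user_agent="ScoopGracieBot"):
    host = urlparse(url).netloc
    parser, visits = _cache.get(host, (None, REFRESH_EVERY))
    if parser is None or visits >= REFRESH_EVERY:
        # Fetch and parse robots.txt again; until this happens, a new
        # Disallow rule is not seen, hence the up-to-ten-page lag.
        parser = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
        parser.read()
        visits = 0
    _cache[host] = (parser, visits + 1)
    return parser.can_fetch(user_agent, url)

A crawler would call allowed() before each request and skip the URL whenever it returns False.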

User-Agent String

Requests with the following User-Agent string are from ScoopGracieBot 2.0:
Mozilla/5.0 (compatible, ScoopGracieBot 2.0) https://scoopgracie.com/scoopgraciebot/
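
If you want to identify the bot in your own server logs or request handlers, checking for the "ScoopGracieBot" token is enough. A minimal illustrative Python check (not part of the bot itself):

def is_scoopgraciebot(user_agent):
    # "ScoopGracieBot" is the stable token in the string above.
    return "ScoopGracieBot" in (user_agent or "")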

Duplicate URL Handling

ScoopGracieBot handles duplicate URLs if and only if all of the duplicates have <link rel="canonical"> tags pointing to the canonical URL of the group. (Many SEO experts recommend using these tags on every page, and so do I. Not only do they help ScoopGracieBot crawl your site, they also help Google and other search engines.)
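
As a sketch of what this looks like from the crawler's side, the following example extracts the canonical URL from a page with Python's standard html.parser. It is illustrative only, not the actual ScoopGracieBot code:

from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    # Illustrative only: records the href of a <link rel="canonical"> tag.
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

finder = CanonicalFinder()
finder.feed('<link rel="canonical" href="https://example.com/page">')
print(finder.canonical)  # https://example.com/page

A crawler can then treat every page that shares the same canonical URL as a single entry in its map.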

Notes

  1. I use Google's open-source C++ robots.txt parser.
  2. This description may not accurately describe the version available on my source code page, since I document new features here as soon as I add them to my local copy.