ScoopGracieBot 2.0 is an experimental Web crawler for Linux, written mostly in Python 3 with the robots.txt implementation in C++. Its goal is to create a sort of "map" of the Web. Note that it is still very much a work in progress.
I currently do not run it much, as I have no SSH access to this Web server and limited bandwidth on my home computer. However, once I upgrade my server to a VPS within the next few months, I will run it frequently and possibly make a link search engine based on it.
Code, Documentation, and License
The bot is released under the Apache License 2.0 and is available on my source code downloads page. Build and usage instructions are included in the tarball. I have not written any detailed documentation yet, but reading the code (in the scoopgraciebot/bot.py file in the tarball) should explain how it works.
Unlike ScoopGracieBot 1.0, ScoopGracieBot 2.0 obeys the robots.txt exclusion standard. If you do not want ScoopGracieBot 2.0 to crawl your site, you can block it in your robots.txt file:
User-agent: ScoopGracieBot
Disallow: /
Note that it may request up to ten more pages from your site after you block it. This is because, to save bandwidth, it only re-fetches robots.txt every tenth visit to a site.
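The caching behavior described above could be sketched like this. This is an illustrative assumption, not the bot's actual code; the class and variable names are hypothetical.

```python
import urllib.robotparser
from urllib.parse import urlsplit

REFRESH_EVERY = 10  # re-fetch robots.txt every tenth visit to a site


def _fetch_robots(host):
    """Download and parse https://<host>/robots.txt."""
    parser = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
    parser.read()
    return parser


class RobotsCache:
    """Caches each site's robots.txt, refreshing it every REFRESH_EVERY visits."""

    def __init__(self, user_agent="ScoopGracieBot", fetch=_fetch_robots):
        self.user_agent = user_agent
        self.fetch = fetch   # injectable so the cache can be tested offline
        self.parsers = {}    # host -> RobotFileParser
        self.visits = {}     # host -> visits since last robots.txt fetch

    def can_fetch(self, url):
        host = urlsplit(url).netloc
        # Only hit robots.txt on the first visit and every tenth one after.
        if host not in self.parsers or self.visits.get(host, 0) >= REFRESH_EVERY:
            self.parsers[host] = self.fetch(host)
            self.visits[host] = 0
        self.visits[host] += 1
        return self.parsers[host].can_fetch(self.user_agent, url)
```

A blocked site can therefore be visited a few more times before the stale cached copy of its robots.txt is replaced.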
Requests with the following User-Agent string come from ScoopGracieBot 2.0:
Mozilla/5.0 (compatible, ScoopGracieBot 2.0) https://scoopgracie.com/scoopgraciebot/
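For illustration, here is how a crawler might attach that User-Agent string to its requests using only Python's standard library. This is an assumed sketch, not the bot's actual fetch code.

```python
import urllib.request

# The User-Agent string ScoopGracieBot 2.0 identifies itself with.
USER_AGENT = ("Mozilla/5.0 (compatible, ScoopGracieBot 2.0) "
              "https://scoopgracie.com/scoopgraciebot/")


def fetch(url):
    """Fetch a URL, identifying the request with the bot's User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

Sites can match on this string in their server logs or access rules to identify the bot's traffic.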
Duplicate URL Handling
ScoopGracieBot 2.0 handles duplicate URLs if and only if all of the duplicates have <link rel="canonical"> tags pointing to the canonical URL of the group. (Many SEO experts recommend using these tags on every page, and so do we. Not only does this help ScoopGracieBot crawl your site, but it also helps Google and other search engines.)
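A minimal sketch of collapsing duplicates by their canonical tag, using only Python's standard library. This is an assumption for illustration, not the bot's actual deduplication code.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "link" and self.canonical is None
                and attrs.get("rel", "").lower() == "canonical"):
            self.canonical = attrs.get("href")


def canonical_url(html, fallback):
    """Return the page's declared canonical URL, or `fallback` if none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or fallback
```

Two pages with different URLs but the same canonical tag then map to a single entry in the crawler's URL set; a page without the tag falls back to its own URL, which is why every duplicate must carry the tag for deduplication to work.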
- I use Google's implementation for robots.txt parsing.
- This description may not accurately describe the version downloaded from my source code page, as I add information about new features as soon as I add them to my local copy.