How to block a bad bot or spider
Whenever we put our websites online, we open the doors for any and all spiders and bots to start trawling our content. Most of the time this is not too much of a bother, as long as they don't get stuck in broken loops which can burn up bandwidth, or slow down servers.
In theory, robots.txt rules let us tell spiders what they may and may not look at. In practice, plenty of spiders ignore robots.txt entirely and do whatever their creator has instructed them to do, whether that's something shady or simply gathering data and files for a privately run search engine.
This is where other useful tools such as .htaccess and mod_rewrite come in. A lot of spiders do actually send an identifying user agent string, so by adding the following lines to our .htaccess file we can deny access to a list of these "bad bots".
In this case we're blocking a couple of spambots and a media file crawler called asterias, which is associated with Singingfish - an mp3 search engine. Singingfish has the totally idiotic habit of crawling websites looking for mp3 files, and then immediately downloading every single one it finds. Nice.
# Turn on mod_rewrite
RewriteEngine On
# Expandable bot list (Note: no OR on the last one)
RewriteCond %{HTTP_USER_AGENT} asterias [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Exabot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nicebot [NC]
RewriteRule .* - [F]
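To make the matching logic concrete: each RewriteCond pattern is an unanchored regular expression, [NC] makes it case-insensitive, and [OR] chains the conditions so that any single match triggers the [F] (403 Forbidden) rule. A rough Python sketch of the same logic is below. It only approximates what Apache does, the function name `is_blocked` is made up for illustration, and the sample user agent strings are invented.

```python
import re

# Patterns copied from the RewriteCond lines above.
BLOCKED_PATTERNS = ["asterias", "Exabot", "nicebot"]


def is_blocked(user_agent: str) -> bool:
    """Return True if any pattern matches anywhere in the user agent.

    re.search mirrors the unanchored match, re.IGNORECASE mirrors [NC],
    and any() mirrors the [OR] chaining between conditions.
    """
    return any(
        re.search(pattern, user_agent, re.IGNORECASE)
        for pattern in BLOCKED_PATTERNS
    )


# Invented sample strings, for illustration only.
samples = [
    "asterias/2.0",
    "Mozilla/5.0 (compatible; EXABOT/3.0)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
]

for ua in samples:
    print(ua, "->", "blocked" if is_blocked(ua) else "allowed")
```

Because the match is unanchored, a short pattern will also catch any user agent that merely contains it, which is worth keeping in mind when adding new entries to the list.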
It's a good idea to double-check that only the correct user agents are being blocked, and that you haven't accidentally banned every visitor and search engine under the sun!
Here's a handy HTTP request viewer which can be used to do just that.
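If you'd rather run the check yourself, a small script can send a request with a spoofed user agent and report the status code that comes back. This is a minimal sketch using Python's standard library. The helper names `build_request` and `status_for` are made up for illustration, and you would need to supply your own site's URL.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url: str, user_agent: str) -> int:
    """Return the HTTP status code the server sends for this user agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is still what we want.
        return err.code
```

Once the rules are live, calling `status_for` against your own site with "asterias" as the user agent should come back as 403, while an ordinary browser user agent should still get the page's normal status.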
