Rules in .htaccess to block spiders don't appear to be effective; I still see the crawlers in AWStats
.htaccess files are extremely useful for users who don't have root permissions, or who simply aren't comfortable making changes to their web server's configuration file. Debugging an .htaccess file that isn't working can be tricky, but by reviewing the common problems and troubleshooting tips discussed below, you should have a better grasp of what to modify to get your .htaccess file running smoothly.

Problem: I have put this code in .htaccess to prevent search engines from accessing my site.
However, I still see them listed daily in the AWStats file on my server.
Does this mean they are still searching my site? They haven't infiltrated the site, but are just logged as attempts?
# Stop the Nasties!!
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^autoemailspider [OR]
RewriteCond %{HTTP_USER_AGENT} ^baidu [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider* [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bingbot[OR]
RewriteCond %{HTTP_USER_AGENT} ^Yandex [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sosospider [OR]
RewriteCond %{HTTP_USER_AGENT} ^AhrefsBot[OR]
RewriteCond %{HTTP_USER_AGENT} ^AITCSRobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Arachnophilia [OR]
RewriteCond %{HTTP_USER_AGENT} ^archive.org_bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot[OR]
RewriteCond %{HTTP_USER_AGENT} ^BSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^CFNetwork[OR]
RewriteCond %{HTTP_USER_AGENT} ^CyberPatrol [OR]
RewriteCond %{HTTP_USER_AGENT} ^DeuSu[OR]
RewriteCond %{HTTP_USER_AGENT} ^DotBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^FeedlyBot[OR]
RewriteCond %{HTTP_USER_AGENT} ^Genieo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gluten Free Crawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrapeshotCrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^MaxPointCrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^meanpathbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^PagesInventory [OR]
RewriteCond %{HTTP_USER_AGENT} ^PHP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Plukkie [OR]
RewriteCond %{HTTP_USER_AGENT} ^Qwantify [OR]
RewriteCond %{HTTP_USER_AGENT} ^SemrushBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SentiBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SEOkicks-Robot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SeznamBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^WeSEE_Bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^worldwebheritage.org [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu Link Sleuth [OR]
RewriteCond %{HTTP_USER_AGENT} ^Yahoo! Slurp[OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^SogouwebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^360Spider [OR]
RewriteRule ^.* - [F,L]
Yes, this is the code I am using, copied from here. Looking at my raw log I don't see any forbidden entries. Here are two examples of today's entries. Should I delete the OR entries, and what do they mean?
207.46.13.186 - - [30/Nov/2016:12:05:19 +0000] "GET /comrades/comrades%20football%20team2.jpg HTTP/1.1" 200 47649 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
52.213.197.166 - - [03/Dec/2016:14:54:02 +0000] "GET /robots.txt HTTP/1.0" 200 1473 "-" "IDG/UK (http://spaziodati.eu/)"
RewriteCond %{HTTP_USER_AGENT} ^Yahoo! Slurp[OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^SogouwebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^360Spider [OR]
RewriteRule ^.* - [F,L]
This code is actually "broken" in several places and will never work as intended. In fact, it won't block anything in its current state, which explains your access log.
You need to remove the OR flag on the last RewriteCond directive. This additional OR flag would ordinarily cause all traffic to be blocked! (But since you have further errors - see the next point - this does not happen!)

RewriteCond %{HTTP_USER_AGENT} ^Bingbot[OR]

You are missing a space between the CondPattern (^Bingbot) and the flags argument ([OR]). (It should be ^Bingbot [OR].) This won't match "Bingbot". But, crucially, the condition is now an implicit AND - so your rule block will never succeed and no bot will be blocked! I count at least 7 directives in your code above where the space is missing.

As Stephen has already pointed out in comments, the regexes used to match these bots are not necessarily correct. For example, a pattern such as ^Bingbot matches the exact string "Bingbot" (capital "B") at the start of the user-agent (^ being a start-of-string anchor). But the log entry you've shown contains "bingbot" (all lowercase) in the middle of the user-agent string. This will not match. You probably need a condition like the following, without a ^ prefix and with the NC flag for a case-insensitive match:

RewriteCond %{HTTP_USER_AGENT} bingbot [NC,OR]

You'll need to check the other regexes to see whether they actually match the user-agents you are trying to target. Are you matching at the start of the UA (^)? Should the match be case-insensitive (NC)?

A minor point... given the following two directives, the first one is superfluous. However, the second one looks like an error.

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider* [OR]
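To expand on that last point: in a regular expression, the trailing * quantifier applies only to the preceding character, so ^Baiduspider* matches "Baiduspide" followed by zero or more "r" characters - it is not a shell-style wildcard. And since RewriteCond patterns are not anchored at the end anyway, a single condition (a sketch of my own, not part of the original answer) would do the work of both lines:

```apache
# Matches any user-agent that begins with "Baiduspider"
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]
```

Whether the ^ anchor and case-sensitive match are appropriate here is the same question raised above: Baidu's crawler sends "Baiduspider" in the middle of a longer Mozilla-compatible user-agent string, so a non-anchored, case-insensitive match is probably what's actually wanted.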
However, I still see them listed daily in the AWStats file on my server.
Yes, even if you block the bots (once your code is working), they will still "hit" your server and be logged in the server's access log from which AWStats builds its reports.
However, check your raw access log and you should see a 403 (Forbidden) in the response status for these requests (this is probably reported in AWStats as well). If not, then something is wrong.
The RewriteRule can also be simplified:
RewriteRule ^ - [F]
The L flag is implied when you use the F flag.
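Putting all of these fixes together, a corrected version of the block might look like the following. This is only a sketch, shortened to a handful of bots for illustration; the remaining user-agent patterns from the question would slot in the same way. Note the braces in %{HTTP_USER_AGENT}, the space before every flags argument, the NC flag for case-insensitive matching, and the absence of OR on the final condition:

```apache
# Block unwanted crawlers by user-agent (shortened example)
RewriteEngine on

# NC = case-insensitive; no ^ anchor, so the pattern can match
# anywhere in the user-agent string (e.g. "compatible; bingbot/2.0")
RewriteCond %{HTTP_USER_AGENT} bingbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ahrefsbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} semrushbot [NC,OR]
# Final condition carries no OR flag, otherwise every request is blocked
RewriteCond %{HTTP_USER_AGENT} mj12bot [NC]

# F implies L; blocked requests show up as 403s in the access log
RewriteRule ^ - [F]
```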