Robots.txt to disallow on multi-level subdomains
.htaccess files are extremely useful for users who either do not have root permissions or who simply aren't comfortable making changes in their web server's configuration file. Debugging a .htaccess file that isn't working isn't always easy; hopefully, by reviewing the discussion below about .htaccess, subdomains, robots.txt, and multi-level subdomains, along with common problems and troubleshooting tips, you'll have a better grasp of what you may need to modify to get your .htaccess file running smoothly.

Problem:

I've got a server where I host multiple development projects for clients. While developing, each instance lives on client.dev.example.com.
I need to stop crawlers from crawling the development sites.
Is it possible to make a blanket disallow, for all subdomains of dev.example.com?
Any "polite" bot will request the robots.txt file at http://client.dev.example.com/robots.txt. So this request must serve the necessary response. There is nothing you can do in a "parent" robots.txt file to influence other domains/subdomains, since it's simply never requested for this hostname.
If all your subdomains actually point to the same area of the filesystem then you could simply serve the same robots.txt file.
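For example, a single wildcard virtual host can point every client subdomain at the same document root, so one robots.txt file covers them all (a minimal sketch; the paths and server names here are assumptions, not your actual config):

```apache
# Hypothetical vhost: every client.dev.example.com shares one DocumentRoot,
# so the single file /var/www/dev/robots.txt is served for all of them.
<VirtualHost *:80>
    ServerName dev.example.com
    ServerAlias *.dev.example.com
    DocumentRoot /var/www/dev
</VirtualHost>
```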
Alternatively, you could have a robots-disallow.txt file with each site and conditionally serve this when accessing the dev site, based on the hostname in the request. At least this way you can still distribute the same codebase for the live site without alteration.
For example, in each site's .htaccess file, you could do something like the following near the top:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^[a-z-]+\.dev\.
RewriteRule ^robots\.txt$ robots-disallow.txt [L]
This specifically looks for the .dev. subdomain (after the client subdomain) in the requested hostname. If found, it internally rewrites any request for robots.txt to robots-disallow.txt, where robots-disallow.txt consists of something like:
User-agent: *
Disallow: /
You can also have a robots.txt file which contains the "live" settings and is only served on the live site (when .dev. does not occur in the request).
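For instance, a minimal "live" robots.txt that permits all crawling could look like this (an empty Disallow directive allows everything):

```
User-agent: *
Disallow:
```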
If you have access to the server config, then you could potentially do this once for all sites in the main server config instead of .htaccess. You'll need to tweak the RewriteRule directive if you do, for example:
RewriteRule ^/robots\.txt$ /robots-disallow.txt [L]
You could also remove the robots-disallow.txt file from each website and store this once elsewhere on your server (providing it is accessible) and rewrite to this file instead (using the absolute filesystem path) - you can only do this in the server config, not .htaccess.
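A sketch of that server-config approach might look like the following (the filesystem path is an assumption; substitute wherever you actually store the shared file):

```apache
# In the main server config (not .htaccess): for any *.dev.* hostname,
# rewrite requests for robots.txt to one shared file, referenced by its
# absolute filesystem path.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^[a-z-]+\.dev\.
RewriteRule ^/robots\.txt$ /var/www/shared/robots-disallow.txt [L]
```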
It seems you need to include a robots.txt file for each of these dev sites.
Also, how do crawlers know about your dev sites? Do you have links to them?
If so, you probably just need to remove those links.