Robots.txt to disallow on multi-level subdomains
.htaccess files are extremely useful for users who either do not have root permissions or who simply aren't comfortable making changes in their web server's configuration file. Debugging a .htaccess file that isn't working isn't always easy; hopefully, by reviewing the discussion below about .htaccess, subdomains, robots.txt, and multi-level subdomains, along with common problems and troubleshooting tips, you'll have a better grasp of what you may need to modify to get your .htaccess file running smoothly.

Problem:

I've got a server where I host multiple development projects for clients. While developing, each instance lives on client.dev.example.com.
I need to stop crawlers from crawling the development sites.
Is it possible to make a blanket disallow, for all subdomains of dev.example.com?
Any "polite" bot will request the robots.txt file at http://client.dev.example.com/robots.txt. So this request must serve the necessary response. There is nothing you can do in a "parent" robots.txt file to influence other domains/subdomains, since it's simply never requested for this hostname.
If all your subdomains actually point to the same area of the filesystem then you could simply serve the same robots.txt file.
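For example, a single wildcard virtual host can point every client subdomain at the same document root, so one robots.txt file covers them all (a minimal sketch; the paths and server names here are assumptions, not your actual config):

```apache
# Hypothetical vhost: every client.dev.example.com shares one DocumentRoot,
# so the single file /var/www/dev/robots.txt is served for all of them.
<VirtualHost *:80>
    ServerName dev.example.com
    ServerAlias *.dev.example.com
    DocumentRoot /var/www/dev
</VirtualHost>
```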
Alternatively, you could have a robots-disallow.txt file with each site and conditionally serve this when accessing the dev site, based on the hostname in the request. At least this way you can still distribute the same codebase for the live site without alteration.
For example, in each site's .htaccess file, you could do something like the following near the top:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^[a-z-]+\.dev\.
RewriteRule ^robots\.txt$ robots-disallow.txt [L]
This specifically looks for the .dev. subdomain (after the client subdomain) in the requested hostname. If found, it internally rewrites any request for robots.txt to robots-disallow.txt, where robots-disallow.txt consists of something like:
User-agent: *
Disallow: /
You can also have a robots.txt file which contains the "live" settings and is only served on the live site (when .dev. does not occur in the request).
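For instance, a minimal "live" robots.txt that permits all crawling could look like this (an empty Disallow directive allows everything):

```
User-agent: *
Disallow:
```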
If you have access to the server config, then you could potentially do this once for all sites in the main server config instead of .htaccess. You'll need to tweak the RewriteRule directive if you do, for example:
RewriteRule ^/robots\.txt$ /robots-disallow.txt [L]
You could also remove the robots-disallow.txt file from each website and store this once elsewhere on your server (providing it is accessible) and rewrite to this file instead (using the absolute filesystem path) - you can only do this in the server config, not .htaccess.
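A sketch of that server-config approach might look like the following (the filesystem path is an assumption; substitute wherever you actually store the shared file):

```apache
# In the main server config (not .htaccess): for any *.dev.* hostname,
# rewrite requests for robots.txt to one shared file, referenced by its
# absolute filesystem path.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^[a-z-]+\.dev\.
RewriteRule ^/robots\.txt$ /var/www/shared/robots-disallow.txt [L]
```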
It seems you need to include a robots.txt file for each of these dev sites.
Also, how do crawlers know about your dev sites? Do you have links to them?
If so, you probably just need to remove those links.