7.16.2008

How to kick out the robots & spiders

First of all, I'd like to thank xinbin.chen for his help.
He is a good teacher, the kind who teaches you how to fish rather than just handing you fish.

Without further ado, let's get to the point.

There are two popular methods for kicking out robots and spiders.

1: Put a robots.txt file in the web root.
The rules look like this:

# allow a specific spider to crawl your site
User-agent: somespider
Disallow:

# disallow all spiders from crawling your site
User-agent: *
Disallow: /

# disallow a certain spider from crawling your site
User-agent: spider
Disallow: /
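
Putting these together, a complete robots.txt might look like the sketch below; "somespider" is only a placeholder for whichever crawler you actually want to let in.

# let one well-behaved spider in
User-agent: somespider
Disallow:

# deny the whole site to every other user agent
User-agent: *
Disallow: /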

But many spiders now masquerade as Firefox, Opera, or IE, and robots.txt is only honored voluntarily anyway, so we need another method. That brings us to the second one.

2: If you use Apache, you are in luck.
Add the following lines to httpd.conf:

# flag any request whose User-Agent contains "Robot" or "Spider" (case-insensitive)
SetEnvIfNoCase User-Agent Robot a_robot=1
SetEnvIfNoCase User-Agent Spider a_robot=1

Change Robot/Spider to the names of the spiders or robots you want to forbid, as in the example below.
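
For instance, a couple of extra patterns can be added in the same way; the wget and Baiduspider strings below are just illustrations of user agents you might want to flag.

# also flag the wget download tool and the Baiduspider crawler
SetEnvIfNoCase User-Agent wget a_robot=1
SetEnvIfNoCase User-Agent Baiduspider a_robot=1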

# ... some lines omitted ...

Order allow,deny
Allow from all
Deny from env=a_robot
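
For reference, here is roughly how the pieces fit together in an Apache 2.2-style httpd.conf; the document root path is only an example.

SetEnvIfNoCase User-Agent Robot a_robot=1
SetEnvIfNoCase User-Agent Spider a_robot=1

<Directory "/var/www/html">
    # let normal visitors in, but reject anything flagged as a robot
    Order allow,deny
    Allow from all
    Deny from env=a_robot
</Directory>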

Gracefully restart Apache and it will take effect.
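
On a typical install the graceful restart looks something like this; the exact command name depends on how Apache was packaged.

# re-read httpd.conf without dropping in-flight requests
apachectl graceful
# or, if only the server binary is on the PATH:
# httpd -k graceful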

Here is a story from when I was working on this task. We run Squid in front of our site. I reconfigured the proxy so that it would not cache the destination site, but something was still wrong, and I didn't know how to resolve it until chen explained some of the principles of Squid to me.

Change the following parameters in squid.conf:

# these two lines force requests for the target domain to go directly to the origin server instead of through a cache peer
acl targetdomain dstdomain .urdomain.com
always_direct allow targetdomain

# this line tells Squid not to cache negative responses (errors), such as the forbidden page returned to blocked robots
negative_ttl 0
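
After editing squid.conf, the changes can be applied to the running proxy without a full restart; assuming the squid binary is on the PATH, something like this should do it.

# ask the running Squid process to re-read its configuration
squid -k reconfigure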
