[lug] Web crawler advice

Bear Giles bgiles at coyotesong.com
Mon May 5 19:06:26 MDT 2008


karl horlen wrote:
> Can you say more about how you detect that people are leeching your site content and how you prevent it.  For instance what specific rewrite rules or other techniques do you use to help defeat this type of behavior?
>   
One standard technique is to look at the REFERER (sic) header. It 
contains the URL of the page that referred the browser to the 
graphic/page/whatever being requested. Like any header it's trivially 
forged by a knowledgeable person, but it's an effective defense 
against casual leechers.
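
For instance, with curl it takes a single flag to claim any referring 
page you like (the hostnames below are just placeholders):

  curl --referer http://example.com/their-page.html \
       http://example.com/photo.jpg -o photo.jpg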

It's a little confusing at first. Say you're "pooh at woods.com" and 
you visit the page "badbear.com/lunch.html", which contains a link to 
the image honeypot.com/daisy.jpg. The server at honeypot.com will see 
a "remote addr" of your machine at woods.com and a REFERER header of 
"badbear.com/lunch.html".

It can then decide what to do. Many sites block deep linking by 
checking the REFERER and refusing requests whose referring page is 
outside their own domain. A more casual approach is to only redirect 
requests whose REFERER matches specific blacklisted domains.
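
In Apache that's usually done with mod_rewrite. The rules below are 
only a sketch, reusing the honeypot.com name from the example above; 
adjust the domain and extensions for a real site:

  RewriteEngine On
  # Let through requests with no Referer at all (direct hits, and
  # some proxies/browsers strip the header)
  RewriteCond %{HTTP_REFERER} !^$
  # Let through requests referred from our own pages
  RewriteCond %{HTTP_REFERER} !^https?://(www\.)?honeypot\.com/ [NC]
  # Anything else asking for an image gets a 403
  RewriteRule \.(gif|jpe?g|png)$ - [F,NC]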

> Do you automate the leech detection?  I'd think it would be pretty tedious to periodically manually inspect the logs looking for this type of behavior.    Do you have a cron script that periodically checks for certain logfile entries?  If so would you mind sharing some of it or some techniques used to detect the rogue hits?
>
> Finally. Is there any way that one could "inject" "id info" in site content / pages and then later do a google search with those "id tags" to see if any other site pages have been spidered under those id tags?  I'm thinking that if you injected a really unique id tag in the html code, like an element attribute that wouldn't be displayed, it might actually get flagged by google.  Just a thought?
>   


