[lug] Web crawler advice

karl horlen horlenkarl at yahoo.com
Mon May 5 17:18:35 MDT 2008


Can you say more about how you detect that people are leeching your site content, and how you prevent it?  For instance, what specific rewrite rules or other techniques do you use to defeat this type of behavior?

Do you automate the leech detection?  I'd think it would be pretty tedious to manually inspect the logs for this type of behavior every so often.  Do you have a cron script that periodically checks for certain logfile entries?  If so, would you mind sharing some of it, or some of the techniques you use to detect the rogue hits?
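
To make the question concrete, here is a rough sketch of the kind of check I imagine a cron job could run.  It is only a guess at the approach, not anything Nate has described: it assumes Apache's "combined" log format, and the function name find_hotlinkers, the own_hosts set and the threshold of 100 are all made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: Apache "combined" log format. Other formats need a different pattern.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "[^"]*"'
)
IMAGE_EXT = (".jpg", ".jpeg", ".png", ".gif")


def find_hotlinkers(lines, own_hosts, threshold=100):
    """Count image requests per off-site referring host.

    Returns a dict of {referring_host: hits} for hosts at or above
    the (hypothetical) threshold.
    """
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        path = m.group("path").split("?", 1)[0].lower()
        if not path.endswith(IMAGE_EXT):
            continue  # only image requests are of interest here
        host = urlparse(m.group("referer")).hostname
        if host is None or host in own_hosts:
            continue  # no referer ("-") or a request from our own pages
        counts[host] += 1
    return {h: n for h, n in counts.items() if n >= threshold}
```

A cron job could feed the day's access log through this (for example `find_hotlinkers(open(logfile), {"www.example.com"})`, where the log path and hostname are placeholders) and mail the result if it is non-empty.  Referers are easy to forge or strip, so this would only catch the casual embedding case.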

Finally, is there any way to "inject" identifying info into site content / pages and then later do a Google search for those "id tags" to see whether pages on any other site have been spidered carrying them?  I'm thinking that if you injected a really unique id tag into the HTML code, like an element attribute that wouldn't be displayed, it might actually get picked up by Google.  Just a thought.
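
As a sketch of what I mean by an "id tag": the snippet below derives a token per page from a secret key, so it is unlikely to appear anywhere else on the web, and drops it into the page.  The names make_marker and tag_html, the "zq" prefix and the src-id class are all invented for this example.  I don't know whether search engines index attribute values, so the sketch puts the token in ordinary element text rather than in an attribute; that part is an assumption on my side.

```python
import hashlib
import hmac


def make_marker(secret: bytes, page_path: str, length: int = 16) -> str:
    """Derive a stable, hard-to-guess token for one page.

    The "zq" prefix is arbitrary; it just makes the token start with letters.
    """
    digest = hmac.new(secret, page_path.encode("utf-8"), hashlib.sha256).hexdigest()
    return "zq" + digest[:length]


def tag_html(html: str, marker: str) -> str:
    """Insert the marker as element text just before </body>.

    If the page has no </body> tag, the marker is appended at the end.
    """
    span = '<span class="src-id">' + marker + "</span>"
    idx = html.lower().rfind("</body>")
    if idx == -1:
        return html + span
    return html[:idx] + span + html[idx:]
```

The idea would be to search for the token later, in quotes, and see whether it turns up on a site other than mine.  Someone copying the text could of course strip it out.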

Thanks


> MySpace and people deep-linking to content off-site is really
> annoying on busy pages on their site too, but that's easily handled
> with a rewrite rule to send them off to REALLY nasty photos (if I'm
> in a bad mood) so they'll stop using me as their "image host", by
> linking to only the images in my content and then loading 100 copies
> of it every time some moron hits refresh on a MySpace page where
> some doofus has used my images in their "avatar".
> 
> Nate




