[lug] Web crawler advice

Nate Duehr nate at natetech.com
Mon May 5 22:11:44 MDT 2008


On May 5, 2008, at 7:06 PM, Bear Giles wrote:

> karl horlen wrote:
>> Can you say more about how you detect that people are leeching your  
>> site content and how you prevent it.  For instance what specific  
>> rewrite rules or other techniques do you use to help defeat this  
>> type of behavior?
>>
> One standard technique is to look at the REFERER (sic) header. It  
> contains the URL of the page referring to the graphic/page/whatever.  
> Like all headers it's trivially manipulated by a knowledgeable  
> person, but it's a good approach for the casual user.
>
> It's a little confusing at first. Say you're "pooh at woods.com" and  
> you visit the page "badbear.com/lunch.html" that contains a link to  
> the image honeypot.com/daisy.jpg. The server at honeypot.com will  
> see a "remote addr" of woods.com and a REFERER header of  
> "badbear.com/lunch.html".
>
> It can then decide what to do. Many sites block deep linking by  
> checking the REFERER and blocking queries from outside of their own  
> domains. More casual approaches would redirect queries with a REFERER  
> link from specific blacklisted domains.


Yep, that's how I found it.  I couldn't care less about "casual"
deep-linking to my personal site, but when you're getting bombarded by
the crappy MySpace stuff (and the browser sends the REFERER header
correctly) it's pretty obvious... the web server logs are pounded.

I've since sent not only MySpace referrals but also blogspot and
livejournal to the bit-bucket.  I couldn't care less whether people
linking from those sites see what they want to see on my pages.
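
A rough sketch of that kind of check, written in Python rather than as
Apache rewrite rules.  The domain list, the replacement URL, and the
function names are all assumptions made up for illustration; this is
not the actual configuration on my server, just the general idea of
matching the REFERER host against a blocklist and redirecting.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist: the three sites are named in the post,
# but the exact domain strings are assumed.
BLOCKED_DOMAINS = ("myspace.com", "blogspot.com", "livejournal.com")

# Hypothetical redirect target standing in for the replacement graphic.
REPLACEMENT_URL = "http://example.com/replacement.gif"


def referer_host(referer):
    """Return the lowercase hostname from a Referer value, or '' if none."""
    if not referer:
        return ""
    # Bare values such as 'badbear.com/lunch.html' carry no scheme;
    # prefix '//' so urlsplit treats the first part as the host.
    if "//" not in referer:
        referer = "//" + referer
    try:
        return (urlsplit(referer).hostname or "").lower()
    except ValueError:
        # Malformed header values are treated as "no referer".
        return ""


def is_blocked(referer):
    """True if the Referer host is a blocked domain or one of its subdomains."""
    host = referer_host(referer)
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def decide(referer):
    """Return (status, location): a 302 for blocked referers, else 200."""
    if is_blocked(referer):
        return 302, REPLACEMENT_URL
    return 200, None
```

A request with no REFERER at all is let through here, since plenty of
legitimate browsers and proxies strip the header.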

I even had a guy COMPLAIN that he had been SELLING people "custom  
MySpace pages" that included deep-links to my site, and that I had  
"broke" them.  What a tard.

I suppose I could have turned that into an opportunity of some kind,  
but I just replied saying he was welcome to find the same funny photos  
and things I had on my webserver out on the net and host them on his  
own webservers to deal with the crushing load he'd put on a box on a  
residential connection, that was never meant to service half of the  
world's MySpace teenie boppers saying, "Dude - UR sooo HOOTTT!" to  
some girl they don't know.

I have stuff I don't even know for sure is not copyrighted, up on the  
blog... I would never make a buck on any of it.  It's just posted as a  
"ha-ha funny" type of thing on my blog pages and I always copy it down  
(to save their server from load) and give credit for where it was  
"found" with a link, if it wasn't e-mailed to me.

Anyway... since someone else shared, I redirect them to this:

<http://publishing2.com/images/LostCherry%20MySpace%20Sucks.gif>

[Of course, publishing2 appears to have problems of their own...]

<http://publishing2.com/images>

And the graphic comes from this article:

http://publishing2.com/2006/06/13/lostcherry-takes-aim-at-myspace/

Where there's bitching about MySpace, talk of some anti-MySpace site  
called "LostCherry", and then even more bitching about Digg "burying"  
the "Lost Cherry Story"...

Basically, I redirect the cesspool back to the cesspool, I figure.
Plus it just continues the "controversy chain" ad nauseam.  Might as
well.  These sites love this kind of crap.  More traffic to claim to
their advertisers, else they wouldn't have a business model.

The ADD Poster Children who don't understand HTML or browsers, and who
want to "investigate" why they're getting a "new" graphic in some way
they don't understand, end up chasing around wondering who publishing2
is, find the article, say "ooh, shiny!" and dive into the comment
sections of publishing2, LostCherry, MySpace and Digg to continue the
bitch-fest.

Probably, anyway...

Of course, it's a never-ending game.  I wonder how many redirects from
Apache a browser will follow before it gives up.  Might be fun to
redirect to a pool of high-bandwidth servers in a circular chain,
where one hands to the other, which hands to a third, which hands back  
to the original... but I'm not THAT evil.  If the browsers don't stop  
the chain, and I bet they don't... you could probably lock up  
someone's browser bad enough that they would have to close all of  
their tabs and start over.  Imagine that happening in an image link on  
some doofus's MySpace page.
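
For what it's worth, here is a small Python simulation of what happens
if the client does enforce a hop limit.  The three host names and the
cap of 20 are assumptions picked for illustration, not measurements of
any particular browser.

```python
# Hypothetical three-server loop: each host redirects to the next,
# and the last one hands back to the first.
REDIRECTS = {
    "a.example": "b.example",
    "b.example": "c.example",
    "c.example": "a.example",
}

MAX_HOPS = 20  # assumed cap, for illustration only


def follow(start, redirects, max_hops=MAX_HOPS):
    """Follow redirects from start; return (last_host, hops, gave_up)."""
    host, hops = start, 0
    while host in redirects:
        if hops >= max_hops:
            # The client refuses to go further and reports an error.
            return host, hops, True
        host = redirects[host]
        hops += 1
    return host, hops, False


print(follow("a.example", REDIRECTS))
```

With a cap in place the loop costs the client a fixed number of
requests and then fails; without one it would spin forever.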

Game over.  He who dies with the most bandwidth wins.

--
Nate Duehr
nate at natetech.com





