[lug] Web crawler advice
Nate Duehr
nate at natetech.com
Mon May 5 15:56:51 MDT 2008
Sean Reifschneider wrote:
> gordongoldin at aim.com wrote:
> > pages, but i don't see that it can fetch pages.
>
> import urllib2
> pagedata = urllib2.urlopen(url).read()
>
> However, as someone who regularly has to deal with the fallout of poorly
> behaving web crawlers, I would like to say:
>
> Be sure to honor the robots.txt
>
> Please rate-limit the number of pages per second you get from particular
> sites. Just because you can grab 100 URLs in parallel doesn't mean the
> server can do that without causing other users' sessions to slow to a
> crawl.
>
> Be careful about the number of pages you get from a site. If you start
> getting more than some number of URLs for a single site, eye-ball them
> to see if you're getting useful data, or if you're just crawling, say,
> the Python package index database or a human genome database.
Thanks for mentioning this, Sean. There are some idiots (er, admins) over
at Yahoo I would love to strangle... for not following "sane" practices
along these lines.
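For anyone actually writing the crawler, here is a rough, untested sketch
of what Sean's advice could look like in code. It's Python 2 (urllib2,
robotparser, urlparse), and the user-agent string, the 2-second delay, and
the 500-page cap are made-up placeholders, not recommendations:

    import time
    import urllib2
    import urlparse
    import robotparser

    USER_AGENT = 'mycrawler/0.1'   # pick something identifiable
    DELAY = 2.0                    # seconds between hits to the same site
    MAX_PER_SITE = 500             # eyeball anything that goes past this

    robots_cache = {}   # host -> RobotFileParser
    last_hit = {}       # host -> time of last request to that host
    page_count = {}     # host -> number of pages fetched so far

    def polite_fetch(url):
        host = urlparse.urlparse(url).netloc

        # Honor robots.txt (one cached parser per host).
        rp = robots_cache.get(host)
        if rp is None:
            rp = robotparser.RobotFileParser()
            rp.set_url('http://%s/robots.txt' % host)
            rp.read()
            robots_cache[host] = rp
        if not rp.can_fetch(USER_AGENT, url):
            return None

        # Rate-limit per host instead of hammering it in parallel.
        wait = DELAY - (time.time() - last_hit.get(host, 0))
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()

        # Keep a per-site count so runaway crawls get noticed.
        page_count[host] = page_count.get(host, 0) + 1
        if page_count[host] > MAX_PER_SITE:
            raise RuntimeError('over %d pages from %s -- go eyeball it'
                               % (MAX_PER_SITE, host))

        request = urllib2.Request(url, headers={'User-Agent': USER_AGENT})
        return urllib2.urlopen(request).read()

Sean's single urlopen() line above is the whole fetch; everything else
here is just the politeness bookkeeping he is describing.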
MySpace, and people deep-linking to off-site content from busy pages
there, is really annoying too, but that's easily handled with a rewrite
rule that sends them off to REALLY nasty photos (if I'm in a bad mood) so
they'll stop using me as their "image host". They link straight to the
images in my content, and then 100 copies of them get loaded every time
some moron hits refresh on a MySpace page where some doofus has used my
images in their "avatar".
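For anyone fighting the same thing, this is roughly the kind of Apache
mod_rewrite rule I mean. It's a sketch, not my actual config -- it assumes
mod_rewrite is enabled, and the referer pattern and the replacement image
path are placeholders:

    RewriteEngine On
    # Only fire for requests referred from a MySpace page...
    RewriteCond %{HTTP_REFERER} myspace\.com [NC]
    # ...and don't rewrite the replacement image itself (avoids a loop).
    RewriteCond %{REQUEST_URI} !unwelcome\.jpg$
    # Serve every hot-linked image as the replacement instead.
    RewriteRule \.(gif|jpe?g|png)$ /unwelcome.jpg [L]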
Nate