[lug] Web crawler advice
Nate Duehr
nate at natetech.com
Mon May 5 15:56:51 MDT 2008
Sean Reifschneider wrote:
> gordongoldin at aim.com wrote:
> > pages, but i don't see that it can fetch pages.
>
> import urllib2
> pagedata = urllib2.urlopen(url).read()
>
> However, as someone who regularly has to deal with the fallout of poorly
> behaving web crawlers, I would like to say:
>
> Be sure to honor the robots.txt
>
> Please rate-limit the number of pages per second you get from particular
> sites. Just because you can grab 100 URLs in parallel doesn't mean the
> server can do that without causing other users' sessions to slow to a
> crawl.
>
> Be careful about the number of pages you get from a site. If you start
> getting more than some number of URLs for a single site, eye-ball them
> to see if you're getting useful data, or if you're just crawling, say,
> the Python package index database or a human genome database.
Thanks for mentioning this, Sean. There are some idiots (er, admins) over
at Yahoo I would love to strangle... for not following "sane" practices
along these lines.
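For anyone actually writing the crawler, here is a rough, untested sketch
of what Sean's advice could look like in code. It's Python 2 (urllib2,
robotparser, urlparse), and the user-agent string, the 2-second delay, and
the 500-page cap are made-up placeholders, not recommendations:

    import time
    import urllib2
    import urlparse
    import robotparser

    USER_AGENT = 'mycrawler/0.1'   # pick something identifiable
    DELAY = 2.0                    # seconds between hits to the same site
    MAX_PER_SITE = 500             # eyeball anything that goes past this

    robots_cache = {}   # host -> RobotFileParser
    last_hit = {}       # host -> time of last request to that host
    page_count = {}     # host -> number of pages fetched so far

    def polite_fetch(url):
        host = urlparse.urlparse(url).netloc

        # Honor robots.txt (one cached parser per host).
        rp = robots_cache.get(host)
        if rp is None:
            rp = robotparser.RobotFileParser()
            rp.set_url('http://%s/robots.txt' % host)
            rp.read()
            robots_cache[host] = rp
        if not rp.can_fetch(USER_AGENT, url):
            return None

        # Rate-limit per host instead of hammering it in parallel.
        wait = DELAY - (time.time() - last_hit.get(host, 0))
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()

        # Keep a per-site count so runaway crawls get noticed.
        page_count[host] = page_count.get(host, 0) + 1
        if page_count[host] > MAX_PER_SITE:
            raise RuntimeError('over %d pages from %s -- go eyeball it'
                               % (MAX_PER_SITE, host))

        request = urllib2.Request(url, headers={'User-Agent': USER_AGENT})
        return urllib2.urlopen(request).read()

Sean's single urlopen() line above is the whole fetch; everything else
here is just the politeness bookkeeping he is describing.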
MySpace, and people deep-linking to off-site content from busy pages
there, is really annoying too, but that's easily handled with a rewrite
rule that sends them off to REALLY nasty photos (if I'm in a bad mood) so
they'll stop using me as their "image host". They link straight to the
images in my content, and then 100 copies of them get loaded every time
some moron hits refresh on a MySpace page where some doofus has used my
images in their "avatar".
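For anyone fighting the same thing, this is roughly the kind of Apache
mod_rewrite rule I mean. It's a sketch, not my actual config -- it assumes
mod_rewrite is enabled, and the referer pattern and the replacement image
path are placeholders:

    RewriteEngine On
    # Only fire for requests referred from a MySpace page...
    RewriteCond %{HTTP_REFERER} myspace\.com [NC]
    # ...and don't rewrite the replacement image itself (avoids a loop).
    RewriteCond %{REQUEST_URI} !unwelcome\.jpg$
    # Serve every hot-linked image as the replacement instead.
    RewriteRule \.(gif|jpe?g|png)$ /unwelcome.jpg [L]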
Nate