[lug] Web crawler advice
Sean Reifschneider
jafo at tummy.com
Mon May 5 14:41:33 MDT 2008
gordongoldin at aim.com wrote:
> pages, but i don't see that it can fetch pages.
import urllib2

# Fetch the raw contents of the page at `url`.
pagedata = urllib2.urlopen(url).read()
However, as someone who regularly has to deal with the fallout of poorly
behaving web crawlers, I would like to say:
Be sure to honor each site's robots.txt.
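A minimal sketch of that check using the stdlib robotparser module (the
example.com URLs and the "MyCrawler" user-agent name are placeholders):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Only fetch the page if robots.txt allows our user agent to.
if rp.can_fetch("MyCrawler", "http://example.com/some/page.html"):
    pagedata = urllib2.urlopen("http://example.com/some/page.html").read()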
Please rate-limit the number of pages per second you get from a particular
site. Just because you can grab 100 URLs in parallel doesn't mean the
server can handle that without other users' sessions slowing to a crawl.
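Even something as simple as sleeping between requests to the same host goes
a long way. A rough sketch (the delay value and the urls_for_one_site list
are made-up examples, not anything standard):

import time
import urllib2

DELAY_SECONDS = 2  # assumed polite gap between hits to one server

for url in urls_for_one_site:  # hypothetical list of URLs on one host
    pagedata = urllib2.urlopen(url).read()
    # ... process pagedata ...
    time.sleep(DELAY_SECONDS)  # wait before hitting the same server again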
Be careful about the number of pages you get from a single site. If you
start getting more than some number of URLs from one site, eyeball them to
see whether you're getting useful data or just crawling, say, the Python
package index or a human genome database.
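One way to notice that is to keep a per-host page counter and flag any host
that crosses a threshold for manual review. A sketch, with an arbitrary
limit that you'd tune for your own crawl:

import urlparse

pages_per_host = {}

def should_review(url, limit=1000):
    """Return True once we've pulled more than `limit` pages from url's host."""
    host = urlparse.urlparse(url).netloc
    pages_per_host[host] = pages_per_host.get(host, 0) + 1
    return pages_per_host[host] > limit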
Sean
--
Sean Reifschneider, Member of Technical Staff <jafo at tummy.com>
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability