[lug] Web crawler advice

Sean Reifschneider jafo at tummy.com
Mon May 5 14:41:33 MDT 2008


gordongoldin at aim.com wrote:
 > pages, but i don't see that it can fetch pages.

import urllib2
url = "http://example.com/"   # the page you want to fetch
pagedata = urllib2.urlopen(url).read()

However, as someone who regularly has to deal with the fallout of poorly
behaved web crawlers, I would like to say:

    Be sure to honor each site's robots.txt.
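
    For example, Python's standard robotparser module will do the check for
    you.  The snippet below is only a sketch; the site, page, and user-agent
    string are placeholders:

import robotparser
import urllib2

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # placeholder site
rp.read()

url = "http://example.com/some/page.html"     # placeholder page
if rp.can_fetch("MyCrawler/1.0", url):        # placeholder user-agent name
    pagedata = urllib2.urlopen(url).read()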

    Please rate-limit the number of pages per second you get from any
    particular site.  Just because you can grab 100 URLs in parallel doesn't
    mean the server can serve them without other users' sessions slowing to
    a crawl.
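
    One simple way to do that in a single-threaded crawler is to sleep
    between requests to the same host.  The helper below is only a sketch;
    the polite_fetch name and the one-second delay are arbitrary choices,
    not anything urllib2 gives you:

import time
import urllib2
from urlparse import urlparse

last_fetch = {}      # host -> time of our last request to it
MIN_DELAY = 1.0      # arbitrary per-host delay, in seconds

def polite_fetch(url):
    # Wait out the remainder of the per-host delay, if any, then fetch.
    host = urlparse(url).netloc
    wait = MIN_DELAY - (time.time() - last_fetch.get(host, 0))
    if wait > 0:
        time.sleep(wait)
    last_fetch[host] = time.time()
    return urllib2.urlopen(url).read()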

    Be careful about the number of pages you get from a site.  If you start
    getting more than some number of URLs for a single site, eye-ball them
    to see if you're getting useful data, or if you're just crawling, say,
    the Python package index database or a human genome database.
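
    A cheap way to notice that is to count fetches per host and flag any
    host that crosses some threshold for a manual look.  Again, just a
    sketch: the note_fetch helper and the 500-page cut-off are made up for
    illustration:

from urlparse import urlparse

pages_per_host = {}       # host -> number of URLs fetched from it
REVIEW_THRESHOLD = 500    # arbitrary point at which to eyeball the site

def note_fetch(url):
    # Tally this fetch and warn once when a host hits the threshold.
    host = urlparse(url).netloc
    pages_per_host[host] = pages_per_host.get(host, 0) + 1
    if pages_per_host[host] == REVIEW_THRESHOLD:
        print "Fetched %d pages from %s -- check that they're useful" % (
            REVIEW_THRESHOLD, host)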

Sean
-- 
Sean Reifschneider, Member of Technical Staff <jafo at tummy.com>
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability


