[lug] Web crawler advice

Jeff Falgout jtfalgout at gmail.com
Mon May 5 17:32:59 MDT 2008


On Mon, May 5, 2008 at 2:41 PM, Sean Reifschneider <jafo at tummy.com> wrote:

>  However, as someone who regularly has to deal with the fallout of poorly
>  behaving web crawlers, I would like to say:
>
>    Be sure to honor the robots.txt
>
>    Please rate-limit the number of pages per second you get from particular
>    sites.  Just because you can grab 100 URLs in parallel doesn't mean the
>  server can do that without causing other users' sessions to slow to a
>    crawl.
>
>    Be careful about the number of pages you get from a site.  If you start
>    getting more than some number of URLs for a single site, eye-ball them
>    to see if you're getting useful data, or if you're just crawling, say,
>    the Python package index database or a human genome database.
>
>  Sean
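
For what it's worth, honoring robots.txt and throttling requests per host
only takes a few lines. A rough Python sketch (the delay value and the
helper names are just illustrative, not anyone's production code):

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    CRAWL_DELAY = 2.0   # seconds between hits to the same host (pick something polite)
    _robots = {}        # cached robots.txt parser per host
    _last_hit = {}      # timestamp of the last request per host

    def may_fetch(url, user_agent):
        """Honor robots.txt, caching the parsed file per host."""
        host = urlparse(url).netloc
        if host not in _robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()
            _robots[host] = rp
        return _robots[host].can_fetch(user_agent, url)

    def rate_limit(url):
        """Sleep so the same host is never hit faster than CRAWL_DELAY."""
        host = urlparse(url).netloc
        wait = _last_hit.get(host, 0) + CRAWL_DELAY - time.time()
        if wait > 0:
            time.sleep(wait)
        _last_hit[host] = time.time()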

Adding to what Sean has said . . .

Please provide some sort of contact info in your user agent string. I
handle numerous sites and I'm willing to work with the maintainer of
the "crawler", but if someone is beating up my servers and I can't get
a hold of 'em, I'll send them to the bit bucket real fast!
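
Something as simple as this does the trick (the crawler name, URL, and
address below are made-up placeholders, not real contact info):

    import urllib.request

    # Placeholder identity; substitute your real crawler name and contact address.
    USER_AGENT = "examplebot/0.1 (+http://example.org/bot.html; bot-admin@example.org)"

    url = "http://example.org/"
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()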

Also, be mindful of sites that have a lot of dynamically generated
content - needless hits that put a huge load on the db servers will
also get you blacklisted.
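
If you do have to re-visit dynamic pages, a conditional GET at least keeps
you from pulling (and the server from regenerating) pages that haven't
changed. A rough sketch, assuming the server sends an ETag header:

    import urllib.request
    import urllib.error

    url = "http://example.org/page"
    etag = '"abc123"'   # saved from the ETag header of a previous fetch

    req = urllib.request.Request(url, headers={"If-None-Match": etag})
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()                 # page changed; process it
            etag = resp.headers.get("ETag")    # remember for next time
    except urllib.error.HTTPError as e:
        if e.code != 304:
            raise                              # 304 means "not modified": skip it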

Jeff

(Who's also dealing with misbehaving crawlers)


