[lug] Web crawler advice
Jeff Falgout
jtfalgout at gmail.com
Mon May 5 17:32:59 MDT 2008
On Mon, May 5, 2008 at 2:41 PM, Sean Reifschneider <jafo at tummy.com> wrote:
> However, as someone who regularly has to deal with the fallout of poorly
> behaving web crawlers I would like to say:
>
> Be sure to honor the robots.txt
>
> Please rate-limit the number of pages per second you get from particular
> sites. Just because you can grab 100 URLs in parallel doesn't mean the
> server can do that without causing other users' sessions to slow to a
> crawl.
>
> Be careful about the number of pages you get from a site. If you start
> getting more than some number of URLs for a single site, eye-ball them
> to see if you're getting useful data, or if you're just crawling, say,
> the Python package index database or a human genome database.
>
> Sean
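Sean's first point is cheap to get right - Python's standard library already
ships a robots.txt parser (robotparser on 2.x, urllib.robotparser on 3.x).
Just as a rough sketch, with the crawler name and URLs below being nothing
but placeholders:

    from urllib import robotparser   # plain "robotparser" on Python 2.x

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    url = "http://example.com/some/page.html"
    if rp.can_fetch("mycrawler", url):
        print("ok to fetch", url)
    else:
        print("robots.txt says no -- skipping", url)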
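The rate-limiting doesn't have to be fancy either. A per-host delay along
these lines does the job; the one second here is only an example, tune it to
what the site can actually stand:

    import time
    from urllib.parse import urlparse

    DELAY = 1.0      # seconds between hits to any single host (example value)
    last_hit = {}    # host -> time of the last request we sent it

    def polite_wait(url):
        host = urlparse(url).netloc
        elapsed = time.time() - last_hit.get(host, 0)
        if elapsed < DELAY:
            time.sleep(DELAY - elapsed)
        last_hit[host] = time.time()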
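And for capping how much you pull from any one site, a simple counter per
host is enough to tell you when it's time for a human to eyeball the URLs.
The 500 below is an arbitrary number, not a recommendation:

    from collections import defaultdict
    from urllib.parse import urlparse

    MAX_PAGES_PER_HOST = 500    # arbitrary threshold for a manual review
    pages_fetched = defaultdict(int)

    def should_fetch(url):
        host = urlparse(url).netloc
        if pages_fetched[host] >= MAX_PAGES_PER_HOST:
            return False   # stop and look at what you're actually pulling
        pages_fetched[host] += 1
        return True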
Adding to what Sean has said . . .
Please provide some sort of contact info in your user agent string. I
handle numerous sites and I'm willing to work with the maintainer of
the "crawler", but if someone is beating up my servers and I can't get
a hold of 'em, I'll send them to the bit bucket real fast!
Also, be mindful of sites that have a lot of dynamically generated
content - needless hits that put a huge load on the db servers will
also get you blacklisted.
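One blunt way to stay off those db servers is to treat anything with a query
string as dynamically generated and think twice before fetching it - crude,
but it catches a lot of the worst offenders:

    from urllib.parse import urlparse

    def looks_dynamic(url):
        # A query string usually means a script and a database, not a static file.
        return bool(urlparse(url).query)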
Jeff
(Who's also dealing with misbehaving crawlers)