[lug] Google (and other) ftp crawlers.

Stephen Kraus ub3ratl4sf00 at gmail.com
Fri Feb 25 14:43:07 MST 2011


Well, of course a crawler can ignore it; it's only honored by 'polite'
crawlers.
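
For what it's worth, here's a minimal robots.txt sketch (the path and the
delay are just illustrative) asking the polite ones to skip the big
directories and pace themselves:

    # illustrative entries -- adjust the path and delay to your site
    User-agent: *
    Disallow: /iso/
    Crawl-delay: 30

Crawl-delay isn't part of the original standard and Googlebot reportedly
ignores it, though several other crawlers honor it. Google's documentation
has said it also looks for a robots.txt on FTP sites, so it may help with
the anonymous FTP case too.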

However, you can filter requests for multiple large files at your web
server, or require a minimum interval between requests for large files,
which limits what can be requested and how often.
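
If the anonymous FTP side happens to be running vsftpd, you can also cap
what any one client can pull directly in vsftpd.conf, roughly like this
(the numbers are illustrative):

    # illustrative limits -- tune to your bandwidth
    # ~50 KB/s per anonymous connection:
    anon_max_rate=51200
    # simultaneous connections per client IP:
    max_per_ip=2
    # total simultaneous connections:
    max_clients=20

That way even an impolite crawler can't monopolize the link.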

On Fri, Feb 25, 2011 at 2:40 PM, Bear Giles <bgiles at coyotesong.com> wrote:

> It doesn't /deny/ anything; it's just a hint, and many poorly written
> crawlers will ignore it.
>
> BTW the intent of the file is to identify dynamic content that's pointless
> to crawl. E.g., a cached copy of "current weather" will be pretty useless in
> a week. It's also good to mark large files that are available elsewhere,
> e.g., there's no point in downloading and caching my Ubuntu .iso images
> since they're widely available elsewhere.
>
> A lot of people think it can be used to protect sensitive information and
> that was never its intent. I will leave it to your imagination whether it
> can be productive to troll those files to see if anyone has highlighted 'the
> juicy bits' from that misunderstanding.
>
>
> On Fri, Feb 25, 2011 at 1:49 PM, Stephen Kraus <ub3ratl4sf00 at gmail.com>wrote:
>
>> Yes, there is a config file called robots.txt that you can add to deny
>> crawlers the ability to access and download content from your server.
>>
>> On Fri, Feb 25, 2011 at 1:46 PM, Dave Pitts <dpitts at cozx.com> wrote:
>>
>>> Hello:
>>>
>>> Is there a way to get Google (and other) sites to stop crawling through
>>> my anon
>>> ftp site? They download everything and slow my network access to a
>>> crawl....
>>> sometimes causing my applications to time out and die.
>>>
>>> Thanks in advance.
>>>
>>> --
>>> Dave Pitts             PULLMAN: Travel and sleep in safety and comfort.
>>> dpitts at cozx.com        My other RV IS a Pullman (Colorado Pine).
>>> http://www.cozx.com
>>>