[lug] Rodney - a buddy of mine has a book

gordongoldin at aim.com
Tue May 6 16:42:52 MDT 2008


I could borrow that, if you decide to bring it.

-----Original Message-----
From: lug-request at lug.boulder.co.us
To: lug at lug.boulder.co.us
Sent: Mon, 5 May 2008 10:12 pm
Subject: LUG Digest, Vol 55, Issue 5

Send LUG mailing list submissions to
    lug at lug.boulder.co.us

To subscribe or unsubscribe via the World Wide Web, visit
    http://lists.lug.boulder.co.us/mailman/listinfo/lug
or, via email, send a message with subject or body 'help' to
    lug-request at lug.boulder.co.us

You can reach the person managing the list at
    lug-owner at lug.boulder.co.us

When replying, please edit your Subject line so it is more specific
than "Re: Contents of LUG digest..."


Today's Topics:

   1. Re: Web crawler advice (Jeffrey Haemer)
   2. Re: Web crawler advice (Nate Duehr)
   3. Re: Web crawler advice (George Sexton)
   4. Re: Web crawler advice (Jeffrey Haemer)
   5. Upcoming Installfest (bclarkinco at juno.com)
   6. Re: Web crawler advice (Sean Reifschneider)
   7. Re: ADD-ON to Web crawler advice (Bear Giles)
   8. Re: ADD-ON to Web crawler advice (George Sexton)
   9. Re: Web crawler advice (Nate Duehr)
  10. Re: Web crawler advice (George Sexton)
  11. Re: Web crawler advice (karl horlen)
  12. Re: Web crawler advice (Jeff Falgout)
  13. Re: Web crawler advice (Bear Giles)
  14. Re: Web crawler advice (George Sexton)
  15. Re: ADD-ON to Web crawler advice (Bear Giles)
  16. Re: Web crawler advice (Nate Duehr)


----------------------------------------------------------------------

Message: 1
Date: Mon, 5 May 2008 10:26:11 -0600
From: "Jeffrey Haemer" <jeffrey.haemer at gmail.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID:
    <5808d4420805050926y358ef070ne44d357deae2ff32 at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Gordon,

I have an ORA book on web spidering that you can probably cannibalize useful
stuff from.  If you're coming to the BLUG talk this Thursday, and want to
borrow it, let me know and I'll bring it.

On Mon, May 5, 2008 at 10:18 AM, <gordongoldin at aim.com> wrote:

>  I'm doing a project to analyze text content on the web:
>
> I need to:
>
> start with a list of URLs
> for each URL in the URL list
>    fetch the page
>    throw away non-English pages
>    extract the sentence text content (not hidden text, menus, lists,
> etc.)
>       write that content to a file
>    extract all the links
>       add just the new links to the URL list (not those already in the
> list of URLs)
>
> I could just use Java, but then I would have to write everything.
> Beautiful Soup (written in Python) would probably work well to parse the
> pages, but I don't see that it can fetch pages.
> I can't tell to what extent Nutch can parse the pages. I know it can give
> me the links, but I don't know if it can extract just the text I care about.
>
>
>
> Gordon Golding
>
>
> _______________________________________________
> Web Page:  http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
>
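
For what it's worth, the steps above map onto Python fairly directly.  A rough
sketch, assuming Beautiful Soup 3 and urllib2; the seed list is a placeholder,
and the English-only filter and sentence extraction are left out:

import urllib2
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3.x

start_urls = ["http://example.com/"]      # placeholder seed list
seen = set(start_urls)                    # URLs already queued or fetched
queue = list(start_urls)
out = open("content.txt", "w")

while queue:
    url = queue.pop(0)
    try:
        html = urllib2.urlopen(url).read()
    except Exception:
        continue                          # skip pages that fail to fetch
    soup = BeautifulSoup(html)
    for tag in soup.findAll(["script", "style"]):
        tag.extract()                     # drop non-visible text
    text = " ".join(soup.findAll(text=True))
    out.write(text.encode("utf-8", "replace") + "\n")
    for a in soup.findAll("a", href=True):
        link = urljoin(url, a["href"])
        if link not in seen:              # queue only links we haven't seen
            seen.add(link)
            queue.append(link)
out.close()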



-- 
Jeffrey Haemer <jeffrey.haemer at gmail.com>
720-837-8908 [cell]
http://goyishekop.blogspot.com

------------------------------

Message: 2
Date: Mon, 05 May 2008 12:22:56 -0600
From: Nate Duehr <nate at natetech.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <481F5080.1080207 at natetech.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

George Sexton wrote:

> OTOH, threading in Java is dead easy, and this kind of app would benefit 
> from multi-threading.

Dead-easy until it blows up.  :-)

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html

Threading is starting to get as bad a rap as "goto" got in the 70s.

The author specifically talks about how subtle problems will crop up, 
especially on multi-core machines.

Had something similar lately.  The symptom was that Perl wouldn't start 
on a 4-processor Sun box.

Perl (for some UNHOLY reason) uses floating-point math to compare the 
main perl version number with the version numbers in any modules it 
loads at run-time.

What had happened was that the FPU in CPU #3 on the box was flaky. 
Since it was running very little else that required floating-point 
calculations, the only "symptom" was, "Perl won't run consistently, or 
dies halfway through scripts!"  (The scripts that were dying were 
loading more modules.)

Frackin' ugly troubleshooting session that was... until we "caught" the 
FPU doing naughty things with Sun's hardware test tools.

I shudder to think how long that would have taken on PeeCee hardware 
where such test tools simply don't (really) exist on most hardware/OS 
combinations.

Nate


------------------------------

Message: 3
Date: Mon, 05 May 2008 12:23:55 -0600
From: George Sexton <gsexton at mhsoftware.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <481F50BB.7010005 at mhsoftware.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Writing multi-threaded code takes attention to detail that is beyond the 
ability of some programmers. That doesn't mean it's not useful. What's the 
point of having a nice multi-core machine if you're not using all of its cores?

It's still easier in Java than about anything else.
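
Whatever the language, the shape of a multi-threaded fetcher is the same: a
fixed pool of workers pulling URLs off a shared queue.  A rough sketch in
Python, only because the other snippets in this thread are Python; the URL
list is a placeholder:

import threading
import urllib2
import Queue

urls = Queue.Queue()
for u in ["http://example.com/a", "http://example.com/b"]:   # placeholders
    urls.put(u)

def worker():
    while True:
        try:
            url = urls.get_nowait()
        except Queue.Empty:
            return                        # queue drained, worker exits
        try:
            urllib2.urlopen(url).read()   # real code would store the result
        except Exception:
            pass

threads = [threading.Thread(target=worker) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()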

Nate Duehr wrote:
> George Sexton wrote:
> 
>> OTOH, threading in Java is dead easy, and this kind of app would 
>> benefit from multi-threading.
> 
> Dead-easy until it blows up.  :-)
> 
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html
> 
> Threading is starting to get as bad a rap as "goto" got in the 70s.
> 
> The author specifically talks about how subtle problems will crop up on 
> multi-core machines, especially.
> 
> Had something similar lately.  The symptom was that Perl wouldn't start 
> on a 4-processor Sun box.
> 
> Perl (for some UNHOLY reason) uses floating-point math to compare the 
> main perl version number with the version numbers in any modules it 
> loads at run-time.
> 
> What had happened was that the FPU in CPU #3 on the box was flaky. Since 
> it was running very little else that required floating-point 
> calculations, the only "symptom" was, "Perl won't run consistently, or 
> dies halfway through scripts!"  (The scripts that were dying were 
> loading more modules.)
> 
> Frackin' ugly troubleshooting session that was... until we "caught" the 
> FPU doing naughty things with Sun's hardware test tools.
> 
> I shudder to think how long that would have taken on PeeCee hardware 
> where such test tools simply don't (really) exist on most hardware/OS 
> combinations.
> 
> Nate
> _______________________________________________
> Web Page:  http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
> 

-- 
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL:   http://www.mhsoftware.com/


------------------------------

Message: 4
Date: Mon, 5 May 2008 13:17:07 -0600
From: "Jeffrey Haemer" <jeffrey.haemer at gmail.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID:
    <5808d4420805051217j20623e79x9df3633c160a8739 at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

> Perl (for some UNHOLY reason) uses floating-point math to compare the main
> perl version number with the version numbers in any modules it loads at
> run-time.


In that vein, awk uses floating point for all its arithmetic.  It's also
interpreted, so a simple program like this

 awk 'BEGIN {for (i=0; i<10000; i++) print i }'

requires about a jillion conversions between floats and ints.  (I think the
precise number is pi jillion.)

Long ago, I watched Mark Rochkind run this very program, as a benchmark,
ask, "Why's this so sloooow?" and then smack his forehead; the box he was
running it on had no floating-point processor.

-- 
Jeffrey Haemer <jeffrey.haemer at gmail.com>
720-837-8908 [cell]
http://goyishekop.blogspot.com

------------------------------

Message: 5
Date: Mon, 5 May 2008 18:58:45 GMT
From: "bclarkinco at juno.com" <bclarkinco at juno.com>
Subject: [lug] Upcoming Installfest
To: lug at lug.boulder.co.us
Message-ID: <20080505.125845.28391.0 at webmail11.dca.untd.com>
Content-Type: text/plain; charset="windows-1252"

Hello,
I'm a Linux newbie planning to attend the upcoming InstallFest.  I am looking 
for help installing the Ubuntu Hardy Heron LTS release on a box already running 
WinXP Home.  My goal is to have the HD partitioned into thirds, where each OS 
has its own partition and the remaining third can be read and written to by both.
The present setup contains no data requiring backup.
I also have a Compaq laptop running Vista (ugh!) that I bought last July, and if 
it is possible I would like to set up Linux as described above on that machine 
too.  Again, no data is present requiring backup.
I live in Lafayette and would be happy to carpool with anyone nearby or en 
route; I can be driver or passenger.  If interested, please email me at 
bclarkinco at juno.com, or call 303-666-6449.
Thanks!
Brian Clark


------------------------------

Message: 6
Date: Mon, 05 May 2008 14:41:33 -0600
From: Sean Reifschneider <jafo at tummy.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <481F70FD.2030801 at tummy.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

gordongoldin at aim.com wrote:
 > pages, but i don't see that it can fetch pages.

import urllib2
pagedata = urllib2.urlopen(url).read()

However, as someone who regularly has to deal with the fallout of poorly
behaving web crawlers I would like to say:

    Be sure to honor the robots.txt

    Please rate-limit the number of pages per second you get from particular
    sites.  Just because you can grab 100 URLs in parallel doesn't mean the
    server can do that without causing other users' sessions to slow to a
    crawl.

    Be careful about the number of pages you get from a site.  If you start
    getting more than some number of URLs for a single site, eye-ball them
    to see if you're getting useful data, or if you're just crawling, say,
    the Python package index database or a human genome database.
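
Both of those are cheap to do with Python 2's standard library; a minimal
sketch (the URLs, user-agent string, and two-second delay are placeholders):

import time
import urllib2
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

for url in ["http://example.com/a.html", "http://example.com/b.html"]:
    if not rp.can_fetch("MyCrawler/0.1", url):
        continue                          # robots.txt says keep out
    page = urllib2.urlopen(url).read()
    time.sleep(2)                         # crude per-host politeness delay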

Sean
-- 
Sean Reifschneider, Member of Technical Staff <jafo at tummy.com>
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability


------------------------------

Message: 7
Date: Mon, 05 May 2008 12:00:54 -0600
From: Bear Giles <bgiles at coyotesong.com>
Subject: Re: [lug] ADD-ON to Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <481F4B56.70401 at coyotesong.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

George Sexton wrote:
> gordongoldin at aim.com wrote:
>>
>> See question below - can one get only text - to speed up the 
>> text-only search?
>> To get only English - how reliable is the  lang="en" ?
>
> you could spot check, but I'm guessing that 99% of the pages don't set 
> it.
>
> Charset really won't be helpful. I use UTF-8, so there's no telling 
> from it.
>
> I suppose if it's a non US charset like Windows-1255, or ISO-8859-[<>1]
>
> that might be slightly helpful.

All of the ISO-8859-x have the same ASCII subset so that doesn't help.

(Remember that ASCII is a 7-bit code, with the high bit clear when 
pushed into an 8-bit character.  The ISO-8859-x codes are designed as 
extensions of ASCII, not replacements for it.)



------------------------------

Message: 8
Date: Mon, 05 May 2008 15:44:40 -0600
From: George Sexton <gsexton at mhsoftware.com>
Subject: Re: [lug] ADD-ON to Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <481F7FC8.7000900 at mhsoftware.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed



Bear Giles wrote:
> George Sexton wrote:
>> gordongoldin at aim.com wrote:
>>>
>>> See question below - can one get only text - to speed up the 
>>> text-only search?
>>> To get only English - how reliable is the  lang="en" ?
>>
>> you could spot check, but I'm guessing that 99% of the pages don't set 
>> it.
>>
>> Charset really won't be helpful. I use UTF-8, so there's no telling 
>> from it.
>>
>> I suppose if it's a non US charset like Windows-1255, or ISO-8859-[<>1]
>>
>> that might be slightly helpful.
> 
> All of the ISO-8859-x have the same ASCII subset so that doesn't help.

Actually it does help. ISO-8859-8 has the same characters in the low 
set, but it's fair to assume when you see it that the content of the 
page is Hebrew. As you point out, it's not necessarily non-English, but 
anyone creating a web page with that encoding is either used to writing 
Hebrew pages, or has Hebrew on that page...


> 
> (Remember that ASCII is a 7-bit code, with the high bit clear when 
> pushed into an 8-bit character.  The ISO-8859-x codes are designed as 
> extensions of ASCII, not replacements for it.)

I understand character sets pretty well. The real answer is to use UTF-8 
and then you don't have to worry about it. If you fool around with the 
ISO-8859 series, then you can't have mixed content on the same page.
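
So about the best a crawler can do is glance at the declared charset and the
lang attribute and treat both as weak hints.  A sketch of that heuristic in
Python (the list of non-English charsets is purely illustrative, and the
function name is made up):

import urllib2
from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3.x

NON_ENGLISH_CHARSETS = ("windows-1255", "iso-8859-8", "shift_jis", "gb2312")

def probably_english(url):
    resp = urllib2.urlopen(url)
    ctype = (resp.info().getheader("Content-Type") or "").lower()
    for cs in NON_ENGLISH_CHARSETS:
        if cs in ctype:
            return False                  # charset strongly hints non-English
    html = BeautifulSoup(resp.read()).find("html")
    lang = (html.get("lang") or "") if html else ""
    # Most pages don't set lang at all, so an empty value proves nothing.
    return lang == "" or lang.lower().startswith("en")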



> 
> _______________________________________________
> Web Page:  http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
> 

-- 
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL:   http://www.mhsoftware.com/


------------------------------

Message: 9
Date: Mon, 05 May 2008 15:56:51 -0600
From: Nate Duehr <nate at natetech.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <481F82A3.4000401 at natetech.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Sean Reifschneider wrote:
> gordongoldin at aim.com wrote:
>  > pages, but i don't see that it can fetch pages.
> 
> import urllib2
> pagedata = urllib2.urlopen(url).read()
> 
> However, as someone who regularly has to deal with the fallout of poorly
> behaving web crawlers I would like to say:
> 
>    Be sure to honor the robots.txt
> 
>    Please rate-limit the number of pages per second you get from particular
>    sites.  Just because you can grab 100 URLs in parallel doesn't mean the
>    server can do that without causing other users sessions to slow to a
>    crawl.
> 
>    Be careful about the number of pages you get from a site.  If you start
>    getting more than some number of URLs for a single site, eye-ball them
>    to see if you're getting useful data, or if you're just crawling, say,
>    the Python package index database or a human genome database.

Thanks for mentioning this Sean, there are some idiots (er, admins) over 
at Yahoo I would love to strangle... for not doing "sane" behavior along 
these lines.

MySpace and people deep-linking to content off-site is really annoying 
on busy pages on their site too, but that's easily handled with a 
rewrite rule to send them off to REALLY nasty photos (if I'm in a bad 
mood) so they'll stop using me as their "image host", by linking to only 
the images in my content and then loading 100 copies of it every time 
some moron hits refresh on a MySpace page where some doofus has used my 
images in their "avatar".

Nate


------------------------------

Message: 10
Date: Mon, 05 May 2008 16:04:07 -0600
From: George Sexton <gsexton at mhsoftware.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <481F8457.5010203 at mhsoftware.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed



Nate Duehr wrote:
> MySpace and people deep-linking to content off-site is really annoying 
> on busy pages on their site too, but that's easily handled with a 
> rewrite rule to send them off to REALLY nasty photos (if I'm in a bad 
> mood) so they'll stop using me as their "image host", by linking to only 
> the images in my content and then loading 100 copies of it every time 
> some moron hits refresh on a MySpace page where some doofus has used my 
> images in their "avatar".

Someone on the newsgroup alt.www.webmaster used mod_rewrite to have the 
image redirected to a graphic saying "I LIKE LITTLE BOYS" when it was 
linked from MySpace UNLESS the person viewing was the poster.

So, everyone but that person saw the wrong graphic.

-- 
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL:   http://www.mhsoftware.com/


------------------------------

Message: 11
Date: Mon, 5 May 2008 16:18:35 -0700 (PDT)
From: karl horlen <horlenkarl at yahoo.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <780330.72422.qm at web58907.mail.re1.yahoo.com>
Content-Type: text/plain; charset=us-ascii

Can you say more about how you detect that people are leeching your site content 
and how you prevent it.  For instance what specific rewrite rules or other 
techniques do you use to help defeat this type of behavior?

Do you automate the leech detection?  I'd think it would be pretty tedious to 
periodically manually inspect the logs looking for this type of behavior.    Do 
you have a cron script that periodically checks for certain logfile entries?  If 
so would you mind sharing some of it or some techniques used to detect the rogue 
hits?

Finally, is there any way that one could "inject" "id info" in site content / 
pages and then later do a Google search with those "id tags" to see if any other 
site pages have been spidered under those id tags?  I'm thinking that if you 
injected a really unique id tag in the html code, like an element attribute that 
wouldn't be displayed, it might actually get flagged by Google.  Just a thought.

Thanks


> MySpace and people deep-linking to content off-site is
> really annoying 
> on busy pages on their site too, but that's easily
> handled with a 
> rewrite rule to send them off to REALLY nasty photos (if
> I'm in a bad 
> mood) so they'll stop using me as their "image
> host", by linking to only 
> the images in my content and then loading 100 copies of it
> every time 
> some moron hits refresh on a MySpace page where some doofus
> has used my 
> images in their "avatar".
> 
> Nate
> _______________________________________________
> Web Page:  http://lug.boulder.co.us
> Mailing List:
> http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug




------------------------------

Message: 12
Date: Mon, 5 May 2008 17:32:59 -0600
From: "Jeff Falgout" <jtfalgout at gmail.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID:
    <bf3f7bff0805051632k111953b5ndb57b0ec6543d92b at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Mon, May 5, 2008 at 2:41 PM, Sean Reifschneider <jafo at tummy.com> wrote:

>  However, as someone who regularly has to deal with the fallout of poorly
>  behaving web crawlers I would like to say:
>
>    Be sure to honor the robots.txt
>
>    Please rate-limit the number of pages per second you get from particular
>    sites.  Just because you can grab 100 URLs in parallel doesn't mean the
>    server can do that without causing other users sessions to slow to a
>    crawl.
>
>    Be careful about the number of pages you get from a site.  If you start
>    getting more than some number of URLs for a single site, eye-ball them
>    to see if you're getting useful data, or if you're just crawling, say,
>    the Python package index database or a human genome database.
>
>  Sean

Adding to what Sean has said . . .

Please provide some sort of contact info in your user agent string. I
handle numerous sites and I'm willing to work with the maintainer of
the "crawler", but if someone is beating up my servers and I can't get
a hold of 'em, I'll send them to the bit bucket real fast!

Also, be mindful of sites that have a lot of dynamically generated
content - needless hits that put a huge load on the db servers will
also get you blacklisted.
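
With urllib2 that's a one-liner when building the request; the crawler name
and contact URL here are just placeholders for whatever identifies you:

import urllib2

req = urllib2.Request(
    "http://example.com/page.html",
    headers={"User-Agent":
             "GordonsCrawler/0.1 (+http://example.com/crawler-info.html)"})
page = urllib2.urlopen(req).read()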

Jeff

(Who's also dealing with mis-behaving crawlers)


------------------------------

Message: 13
Date: Mon, 05 May 2008 19:06:26 -0600
From: Bear Giles <bgiles at coyotesong.com>
Subject: Re: [lug] Web crawler advice
To: horlenkarl at yahoo.com,   "Boulder (Colorado) Linux Users Group --
    General Mailing List"   <lug at lug.boulder.co.us>
Message-ID: <481FAF12.9080705 at coyotesong.com>
Content-Type: text/plain; charset=us-ascii; format=flowed

karl horlen wrote:
> Can you say more about how you detect that people are leeching your site
> content and how you prevent it.  For instance what specific rewrite rules or
> other techniques do you use to help defeat this type of behavior?
>
One standard technique is to look at the REFERER (sic) header. It 
contains the URL of the page referring to the graphic/page/whatever. 
Like all headers it's trivially forged by a knowledgeable person, 
but it's good enough to deal with the casual user.

It's a little confusing at first. Say you're "pooh at woods.com" and you 
visit the page "badbear.com/lunch.html", which contains a link to the 
image honeypot.com/daisy.jpg. The server at honeypot.com will see a 
"remote addr" of woods.com and a REFERER header of "badbear.com/lunch.html".

It can then decide what to do. Many sites block deep linking by checking 
the REFERER and rejecting requests from outside their own domain. More 
casual approaches redirect requests whose REFERER is on a blacklist of 
specific domains.
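
In code that decision is just a string check on the header.  A sketch in
Python (the function name is made up, the domains are the ones from the
example above plus an illustrative blacklist; in practice most people do the
same thing with an Apache mod_rewrite rule keyed on %{HTTP_REFERER}):

from urlparse import urlparse

BLACKLIST = ("myspace.com", "blogspot.com")        # illustrative only

def allow_image_request(referer):
    # Decide whether to serve an image, given the Referer header value.
    if not referer:
        return True                                # no header: serve it
    host = urlparse(referer)[1].lower()            # netloc of the referring page
    if host.endswith("honeypot.com"):              # our own domain
        return True
    for bad in BLACKLIST:
        if host.endswith(bad):
            return False                           # deep link from a blacklisted site
    return True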

> Do you automate the leech detection?  I'd think it would be pretty tedious to
> periodically manually inspect the logs looking for this type of behavior.  Do
> you have a cron script that periodically checks for certain logfile entries?  If
> so would you mind sharing some of it or some techniques used to detect the rogue
> hits?
>
> Finally, is there any way that one could "inject" "id info" in site content /
> pages and then later do a Google search with those "id tags" to see if any other
> site pages have been spidered under those id tags?  I'm thinking that if you
> injected a really unique id tag in the html code, like an element attribute that
> wouldn't be displayed, it might actually get flagged by Google.  Just a thought.
>


------------------------------

Message: 14
Date: Mon, 05 May 2008 19:34:40 -0600
From: George Sexton <gsexton at mhsoftware.com>
Subject: Re: [lug] Web crawler advice
To: horlenkarl at yahoo.com,   "Boulder (Colorado) Linux Users Group --
    General Mailing List"   <lug at lug.boulder.co.us>
Message-ID: <481FB5B0.4070002 at mhsoftware.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Go to:

http://www.aww-faq.org/#quickanswers

and read "How can I stop someone from hot-linking to my images?"

karl horlen wrote:
> Can you say more about how you detect that people are leeching your site
> content and how you prevent it.  For instance what specific rewrite rules or
> other techniques do you use to help defeat this type of behavior?
>
> Do you automate the leech detection?  I'd think it would be pretty tedious to
> periodically manually inspect the logs looking for this type of behavior.  Do
> you have a cron script that periodically checks for certain logfile entries?  If
> so would you mind sharing some of it or some techniques used to detect the rogue
> hits?
>
> Finally, is there any way that one could "inject" "id info" in site content /
> pages and then later do a Google search with those "id tags" to see if any other
> site pages have been spidered under those id tags?  I'm thinking that if you
> injected a really unique id tag in the html code, like an element attribute that
> wouldn't be displayed, it might actually get flagged by Google.  Just a thought.
> 
> Thanks
> 
> 
>> MySpace and people deep-linking to content off-site is
>> really annoying 
>> on busy pages on their site too, but that's easily
>> handled with a 
>> rewrite rule to send them off to REALLY nasty photos (if
>> I'm in a bad 
>> mood) so they'll stop using me as their "image
>> host", by linking to only 
>> the images in my content and then loading 100 copies of it
>> every time 
>> some moron hits refresh on a MySpace page where some doofus
>> has used my 
>> images in their "avatar".
>>
>> Nate
>> _______________________________________________
>> Web Page:  http://lug.boulder.co.us
>> Mailing List:
>> http://lists.lug.boulder.co.us/mailman/listinfo/lug
>> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
> 
> 
> _______________________________________________
> Web Page:  http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
> 

-- 
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL:   http://www.mhsoftware.com/


------------------------------

Message: 15
Date: Mon, 05 May 2008 19:17:35 -0600
From: Bear Giles <bgiles at coyotesong.com>
Subject: Re: [lug] ADD-ON to Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <481FB1AF.5040109 at coyotesong.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

George Sexton wrote:
> Bear Giles wrote:
>> All of the ISO-8859-x have the same ASCII subset so that doesn't help.
>
> Actually it does help. ISO-8859-8 has the same characters in the low 
> set...
That's what I said, although maybe it wasn't clear that I was referring 
to "the subset that is the ASCII character set" rather than a subset of 
those characters.
> but it's fair to assume when you see it that the content of the page 
> is Hebrew. As you point out, it's not necessarily non-English, but 
> anyone creating a web page with that encoding is either used to 
> writing Hebrew pages, or has Hebrew on that page...
It's suggestive, but no Monty Hall.  Fortunately it's trivial to filter 
-- simply replace anything with the high bit set with a space.  Anything 
with a clear high bit is in the Latin alphabet.
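
That filter really is a one-liner on the raw byte string; a sketch in Python:

import re

def strip_high_bit(raw):
    # Replace any byte with the high bit set (0x80-0xFF) with a space.
    return re.sub(r"[\x80-\xff]", " ", raw)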


------------------------------

Message: 16
Date: Mon, 5 May 2008 22:11:44 -0600
From: Nate Duehr <nate at natetech.com>
Subject: Re: [lug] Web crawler advice
To: "Boulder (Colorado) Linux Users Group -- General Mailing List"
    <lug at lug.boulder.co.us>
Message-ID: <2A73D08A-009B-40D2-9EC0-14C4B967C825 at natetech.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes


On May 5, 2008, at 7:06 PM, Bear Giles wrote:

> karl horlen wrote:
>> Can you say more about how you detect that people are leeching your  
>> site content and how you prevent it.  For instance what specific  
>> rewrite rules or other techniques do you use to help defeat this  
>> type of behavior?
>>
> One standard technique is to look at the REFERER (sic) header. It  
> contains the URL of the page referring to the graphic/page/whatever.  
> Like all headers it's trivially manipulated by a knowledgeable  
> person, but it's a good approach for the casual user.
>
> It's a little confusing at first. Say you're "pooh at woods.com" and  
> you visit the page "badbear.com/lunch.html" that contains a link to  
> the image honeypot.com/daisy.jpg. The server at honeypot.com will  
> see a "remote addr" of woods.com and a REFERER header of  
> 'badbear.com/lunch.html"
>
> It can then decide what to do. Many sites block deep linking by  
> checking the REFERER and blocking queries from outside of its own  
> domain. More casual approaches would redirect queries with a REFERER  
> link from specific blacklisted domains.


Yep, that's how I found it.  I could care less about "casual" deep-linking 
to my personal site, but when you're getting bombarded by the crappy 
MySpace stuff (and the browser sends the REFERER header correctly) it's 
pretty obvious... the web server logs are pounded.

I've since sent not only MySpace referrals but also Blogspot and 
LiveJournal to the bit-bucket.  Could care less if people linking from 
those sites see what they want to see on my pages.

I even had a guy COMPLAIN that he had been SELLING people "custom  
MySpace pages" that included deep-links to my site, and that I had  
"broke" them.  What a tard.

I suppose I could have turned that into an opportunity of some kind,  
but I just replied saying he was welcome to find the same funny photos  
and things I had on my webserver out on the net and host them on his  
own webservers to deal with the crushing load he'd put on a box on a  
residential connection, that was never meant to service half of the  
world's MySpace teenie boppers saying, "Dude - UR sooo HOOTTT!" to  
some girl they don't know.

I have stuff I don't even know for sure is not copyrighted, up on the  
blog... I would never make a buck on any of it.  It's just posted as a  
"ha-ha funny" type of thing on my blog pages and I always copy it down  
(to save their server from load) and give credit for where it was  
"found" with a link, if it wasn't e-mailed to me.

Anyway... since someone else shared, I redirect them to this:

<http://publishing2.com/images/LostCherry%20MySpace%20Sucks.gif>

[Of course, publishing2 appears to have problems of their own...]

<http://publishing2.com/images>

And the graphic comes from this article:

http://publishing2.com/2006/06/13/lostcherry-takes-aim-at-myspace/

Where there's bitching about MySpace, talk of some anti-MySpace site  
called "LostCherry", and then even more bitching about Digg "burying"  
the "Lost Cherry Story"...

Basically, I redirect the cesspool back to the cesspool, I figure.    
Plus it just continues the "controversy chain" ad nauseam.  Might as  
well.  These sites love this kind of crap.  More traffic to claim to  
their advertisers, else they wouldn't have a business model.

The ADD Poster Children who don't understand HTML or browsers who want  
to "investigate" why they're getting a "new" graphic some way they  
don't understand, end up chasing around wondering who publishing2 is,  
find the article, and say "ooh, shiny!" and dive into the comment  
sections of publishing2, LostCherry, MySpace and Digg to continue the  
bitch-fest.

Probably, anyway...

Of course, it's a never-ending game.  I wonder how many redirects from 
Apache a browser will follow before it gives up.  Might be fun to 
redirect to a pool of high-bandwidth servers in a circular redirect loop, 
where one hands to the other, which hands to a third, which hands back 
to the original... but I'm not THAT evil.  If the browsers don't stop 
the chain, and I bet they don't... you could probably lock up 
someone's browser badly enough that they would have to close all of 
their tabs and start over.  Imagine that happening in an image link on 
some doofus's MySpace page.

Game over.  He who dies with the most bandwidth wins.

--
Nate Duehr
nate at natetech.com





------------------------------

_______________________________________________
LUG mailing list
LUG at lug.boulder.co.us
http://lists.lug.boulder.co.us/mailman/listinfo/lug


End of LUG Digest, Vol 55, Issue 5
**********************************