[lug] ADD-ON to Web crawler advice

Bear Giles bgiles at coyotesong.com
Mon May 5 19:17:35 MDT 2008


George Sexton wrote:
> Bear Giles wrote:
>> All of the ISO-8859-x have the same ASCII subset so that doesn't help.
>
> Actually it does. ISO-8859-5 does have the same characters in the low 
> set...
That's what I said, although maybe it wasn't clear that I was referring 
to "the subset that is the ASCII character set" rather than to some 
subset of those characters.
> but it's fair to assume when you see it that the content of the page 
> is Hebrew. As you point out, it's not necessarily non-English, but 
> anyone creating a web page with that encoding is either used to 
> writing Hebrew pages, or has Hebrew on that page...
It's suggestive, but no Monty Hall.  Fortunately it's trivial to filter 
-- simply replace any byte that has the high bit set with a space.  Any 
byte with the high bit clear is plain ASCII (Latin letters, digits and 
punctuation).
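
A minimal sketch of that filter in Python, assuming the page arrives as 
raw bytes in a single-byte encoding such as one of the ISO-8859-x sets. 
The function name strip_high_bit and the sample bytes are made up for 
illustration only.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Replace every byte with the high bit set (>= 0x80) with a space."""
    # 0x20 is the ASCII space; bytes below 0x80 pass through unchanged.
    return bytes(b if b < 0x80 else 0x20 for b in data)


if __name__ == "__main__":
    # Hypothetical sample: ASCII text around four high-bit bytes.
    sample = b"abc \xf9\xec\xe5\xed xyz"
    print(strip_high_bit(sample).decode("ascii"))
```

Note that this is only tidy for single-byte encodings. With a multi-byte 
encoding such as UTF-8, each non-ASCII character would turn into several 
spaces rather than one.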


