[lug] ADD-ON to Web crawler advice
Bear Giles
bgiles at coyotesong.com
Mon May 5 19:17:35 MDT 2008
George Sexton wrote:
> Bear Giles wrote:
>> All of the ISO-8859-x have the same ASCII subset so that doesn't help.
>
> Actually it does. ISO-8859-5 does have the same characters in the low
> set...
That's what I said, although maybe it wasn't clear that I was referring
to "the subset that is the ASCII character set" rather than to a subset
of those characters.
> but it's fair to assume when you see it that the content of the page
> is Hebrew. As you point out, it's not necessarily non-English, but
> anyone creating a web page with that encoding is either used to
> writing Hebrew pages, or has Hebrew on that page...
It's suggestive, but no Monty Hall. Fortunately it's trivial to filter
-- simply replace any byte that has the high bit set with a space. Any
byte with a clear high bit is plain ASCII, which covers the Latin alphabet.
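
A rough sketch of that filter in Python, in case it helps. It assumes the
page arrives as raw bytes in a single-byte ISO-8859-x encoding; the function
name strip_high_bit is made up for illustration and is not from any crawler
library.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Replace every byte with the high bit set (>= 0x80) with a space.

    Assumes a single-byte encoding such as ISO-8859-x, where bytes
    below 0x80 are plain ASCII.
    """
    return bytes(b if b < 0x80 else 0x20 for b in data)


if __name__ == "__main__":
    # Hypothetical sample: ASCII text around four high-bit bytes.
    sample = b"abc \xf9\xec\xe5\xed def"
    print(strip_high_bit(sample))
```

Replacing with a space rather than deleting keeps the byte offsets the same
and keeps the surrounding ASCII words from running together.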