[lug] ADD-ON to Web crawler advice
Bear Giles
bgiles at coyotesong.com
Mon May 5 12:00:54 MDT 2008
George Sexton wrote:
> gordongoldin at aim.com wrote:
>>
>> See question below - can one get only text - to speed up the
>> text-only search?
>> To get only English - how reliable is the lang="en" ?
>
> You could spot check, but I'm guessing that 99% of the pages don't set
> it.
>
> Charset really won't be helpful. I use UTF-8, so there's no telling
> from it.
>
> I suppose if it's a non-US charset like Windows-1255, or ISO-8859-[<>1],
> that might be slightly helpful.
All of the ISO-8859-x character sets share the same ASCII subset, so the
declared charset doesn't help. (Remember that ASCII is a 7-bit code, so the
high bit is clear when an ASCII character is stored in an 8-bit byte. The
ISO-8859-x codes are designed as extensions of ASCII, not replacements for it.)
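
A minimal sketch of the point, assuming Python 3 and its standard
iso8859_N codecs (the function names and sample strings are made up for
illustration): bytes with the high bit clear decode to the same text under
every ISO-8859-x charset, so the charset label only matters once a byte
>= 0x80 shows up.

```python
# ISO-8859 parts that Python ships as codecs (there is no part 12).
PARTS = [n for n in range(1, 17) if n != 12]


def is_seven_bit(data: bytes) -> bool:
    """True if every byte has the high bit clear, i.e. the data is pure ASCII."""
    return all(b < 0x80 for b in data)


def decode_all(data: bytes) -> set:
    """Decode the same bytes under each ISO-8859-x codec; return the distinct results."""
    # errors="replace" guards against byte values a given part leaves undefined.
    return {data.decode(f"iso8859_{n}", errors="replace") for n in PARTS}


ascii_sample = b"Plain English text, 7-bit only."  # hypothetical page content
high_sample = b"caf\xe9"                           # one byte with the high bit set

if __name__ == "__main__":
    # Pure ASCII: every ISO-8859-x codec yields the same string.
    print(is_seven_bit(ascii_sample), len(decode_all(ascii_sample)) == 1)
    # A high-bit byte: the codecs disagree about which character it is.
    print(is_seven_bit(high_sample), len(decode_all(high_sample)) > 1)
```

For a crawler this means a page of plain English looks identical whichever
ISO-8859-x label it carries; the label says nothing about the language.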