[lug] ADD-ON to Web crawler advice

Bear Giles bgiles at coyotesong.com
Mon May 5 12:00:54 MDT 2008


George Sexton wrote:
> gordongoldin at aim.com wrote:
>>
>> See question below - can one get only text - to speed up the 
>> text-only search?
>> To get only English - how reliable is lang="en"?
>
> You could spot check, but I'm guessing that 99% of the pages don't set 
> it.
>
> Charset really won't be helpful. I use UTF-8, so there's no telling 
> from it.
>
> I suppose if it's a non-US charset like Windows-1255 or ISO-8859-[<>1],
> that might be slightly helpful.

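All of the ISO-8859-x character sets share the same ASCII subset, so a 
declared ISO-8859-x charset doesn't help: plain English text is valid, 
and byte-for-byte identical, in every one of them.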

(Remember that ASCII is a 7-bit code, so the high bit is clear when an 
ASCII character is stored in an 8-bit byte.  The ISO-8859-x codes are 
designed as extensions of ASCII, not replacements for it: they only add 
characters in the range with the high bit set.)
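
If you want to convince yourself, here is a rough Python sketch (mine, 
not part of any crawler; the names decode_all and is_seven_bit are made 
up for illustration).  It decodes the same bytes under each ISO-8859-x 
codec that Python ships and compares the results.  It assumes Python 3 
and its standard codec names iso8859_1 through iso8859_16, with part 12 
absent.

```python
# ISO-8859 parts that Python ships codecs for (there is no iso8859_12).
ISO_8859_PARTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]


def decode_all(data: bytes) -> dict:
    """Decode the same bytes under every ISO-8859-x codec.

    errors="replace" is used because some parts leave a few
    high-bit values unassigned.
    """
    return {
        f"iso8859_{n}": data.decode(f"iso8859_{n}", errors="replace")
        for n in ISO_8859_PARTS
    }


def is_seven_bit(data: bytes) -> bool:
    """True if every byte has the high bit clear, i.e. is plain ASCII."""
    return all(b < 0x80 for b in data)


ascii_sample = b"Plain English text"   # every byte is below 0x80
high_sample = bytes([0xE9])            # one byte with the high bit set

# Collapse the per-codec results into sets of distinct decodings.
ascii_results = set(decode_all(ascii_sample).values())
high_results = set(decode_all(high_sample).values())

print(is_seven_bit(ascii_sample), len(ascii_results))
print(is_seven_bit(high_sample), len(high_results))
```

The ASCII sample decodes to a single result no matter which part you 
pick, while the high-bit byte decodes differently depending on the part.  
So the charset label only tells you something if the page actually 
contains bytes with the high bit set, and even then it hints at a script 
rather than a language.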



