[lug] ADD-ON to Web crawler advice

George Sexton gsexton at mhsoftware.com
Mon May 5 15:44:40 MDT 2008



Bear Giles wrote:
> George Sexton wrote:
>> gordongoldin at aim.com wrote:
>>>
>>> See question below - can one get only text - to speed up the 
>>> text-only search?
>>> To get only English - how reliable is the  lang="en" ?
>>
>> you could spot check, but I'm guessing that 99% of the pages don't set 
>> it.
>>
>> Charset really won't be helpful. I use UTF-8, so there's no telling 
>> from it.
>>
>> I suppose if it's a non US charset like Windows-1255, or ISO-8859-[<>1]
>>
>> that might be slightly helpful.
> 
> All of the ISO-8859-x have the same ASCII subset so that doesn't help.

Actually it does. ISO-8859-8 does have the same characters in the low 
set, but it's fair to assume when you see it that the content of the 
page is Hebrew. As you point out, it's not necessarily non-English, but 
anyone creating a web page with that encoding is either used to writing 
Hebrew pages, or has Hebrew on that page...
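
A rough sketch of how a crawler might turn the declared charset into a
script hint. The table and the function names here are made up for
illustration, and the result is only a hint, not proof of the page's
language:

```python
import re

# Hypothetical hint table: charset label -> script the encoding was built for.
# UTF-8 and ISO-8859-1 are left out on purpose, since they say nothing useful.
SCRIPT_HINTS = {
    "windows-1255": "Hebrew",
    "iso-8859-8": "Hebrew",
    "iso-8859-5": "Cyrillic",
    "iso-8859-6": "Arabic",
    "iso-8859-7": "Greek",
}

def charset_from_content_type(content_type):
    """Pull the charset label out of a Content-Type value, lowercased."""
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None

def script_hint(content_type):
    """Return a likely script for the page, or None if the charset tells us nothing."""
    return SCRIPT_HINTS.get(charset_from_content_type(content_type))

print(script_hint("text/html; charset=ISO-8859-8"))
print(script_hint("text/html; charset=UTF-8"))
```

A page that comes back with None still needs some other check, since
most pages will fall into that bucket.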


> 
> (Remember that ASCII is a 7-bit code, with the high bit clear when 
> pushed into an 8-bit character.  The ISO-8859-x codes are designed as 
> extensions of ASCII, not replacements for it.)

I understand character sets pretty well. The real answer is use UTF-8 
and then you don't have to worry about it. If you fool around with the 
ISO-8859- series, then you can't mix scripts (say, Hebrew and Cyrillic) 
on the same page.
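
To illustrate, here is a small sketch using Python's built-in codecs.
The sample string and the helper name are my own, picked just for the
demonstration:

```python
# "Hello", Hebrew "shalom", and Russian "privet" in one string.
# Escapes are used so the source itself stays plain ASCII.
MIXED = "Hello, \u05e9\u05dc\u05d5\u05dd, \u041f\u0440\u0438\u0432\u0435\u0442"

def can_encode(text, charset):
    """True if every character in text has a representation in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# The ASCII range is byte-for-byte identical across the ISO-8859 family.
same_ascii = "Hello".encode("iso-8859-8") == "Hello".encode("iso-8859-5")

print(same_ascii)                        # ASCII part encodes the same way
print(can_encode(MIXED, "iso-8859-8"))   # Hebrew part fits, Cyrillic does not
print(can_encode(MIXED, "iso-8859-5"))   # Cyrillic part fits, Hebrew does not
print(can_encode(MIXED, "utf-8"))        # everything fits
```

Each ISO-8859 part only has room for one extra script in its upper 128
positions, which is why the mixed string fails in both of them.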



> 
> _______________________________________________
> Web Page:  http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
> 

-- 
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL:   http://www.mhsoftware.com/


