[lug] Locales/character encodings, etc...

Sexton, George gsexton at mhsoftware.com
Tue Feb 11 10:03:22 MST 2003


As you have noticed, it's really complicated.

There are two halves to the picture. The first is the language/country
combination.

en is English

en_US is English as spoken in the United States

More formally, en_US is a Locale. Among other things, a locale specifies the
decimal separator, thousands separator, and default currency symbol. For
dates, it specifies the starting day of the week and the names of the months
and the days of the week.

For example, en_NZ is very similar, but the starting day of the week is
Monday, not Sunday.
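
If you want to poke at this yourself, here is a rough Python sketch that
asks the C library what a given locale specifies. The helper name
describe_locale is made up for this example, and which locales are installed
varies by system, so treat the locale names in the loop as assumptions.
Python's locale module does not expose the starting day of the week, so
that part is not shown.

```python
import locale


def describe_locale(name):
    """Return a few conventions for the named locale.

    describe_locale is a hypothetical helper, not a standard function.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, name)  # raises locale.Error if missing
        conv = locale.localeconv()
        info = {
            "decimal_point": conv["decimal_point"],
            "thousands_sep": conv["thousands_sep"],
            "currency_symbol": conv["currency_symbol"],
        }
        # nl_langinfo is only available on Unix-like systems.
        if hasattr(locale, "nl_langinfo"):
            info["first_month_name"] = locale.nl_langinfo(locale.MON_1)
            # DAY_1 is always the name of Sunday, whatever the locale's
            # starting day of the week is.
            info["sunday_name"] = locale.nl_langinfo(locale.DAY_1)
        return info
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # put things back


if __name__ == "__main__":
    # "C" always exists; the other two names are assumptions about
    # what is installed on your machine.
    for name in ("C", "en_US.UTF-8", "en_NZ.UTF-8"):
        try:
            print(name, describe_locale(name))
        except locale.Error:
            print(name, "is not installed on this system")
```

On a Linux box, "locale -a" lists the locale names you can pass in.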

The second part of the equation is the character set encoding.

ISO-8859-1 pretty much covers all western European languages.

ISO-8859-5 is Cyrillic/Russian. In general, the first 128 characters (the
ASCII range) are the same in all of these character sets. In the ISO-8859
series, the language-specific characters are in the top 128 positions. In
ISO-8859-1, all of the accented characters, umlauts, ligatures, etc. are in
the top 128. In ISO-8859-8, the one for Hebrew, all of the Hebrew characters
are in the top 128.
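
To make that concrete, here is a small Python sketch that decodes the same
byte under three of the ISO-8859 sets. The names decode_in_each, LOW_BYTE
and HIGH_BYTE are just for illustration. It prints code points and official
character names rather than the characters themselves, so the output is
plain ASCII and does not depend on your terminal's encoding.

```python
import unicodedata

LOW_BYTE = bytes([0x41])    # in the bottom 128 (ASCII)
HIGH_BYTE = bytes([0xE9])   # in the top 128

CHARSETS = ("iso-8859-1", "iso-8859-5", "iso-8859-8")


def decode_in_each(raw):
    """Decode the same bytes under each character set."""
    return {name: raw.decode(name) for name in CHARSETS}


if __name__ == "__main__":
    for raw in (LOW_BYTE, HIGH_BYTE):
        for name, text in decode_in_each(raw).items():
            print("0x%02X in %s is U+%04X %s"
                  % (raw[0], name, ord(text), unicodedata.name(text)))
```

The low byte comes out as the same letter every time. The high byte comes
out as a Latin, a Cyrillic, or a Hebrew letter depending on which set you
tell the decoder to use.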

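UTF-8 is a variable-length encoding of Unicode built from 8 bit units: each
character takes from one to four bytes. Unicode itself started out as a 16
bit character set, but it now extends beyond 16 bits.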

Here is a good reference:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
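
As a quick illustration of how UTF-8 spends its bytes, this Python sketch
encodes a handful of characters and reports the length of each. The sample
characters and the helper name utf8_length are my own choices for the
example.

```python
# Characters are written as escapes so this file itself stays ASCII.
SAMPLES = {
    "Latin A": "A",
    "e with acute": "\u00e9",
    "Cyrillic sha": "\u0448",
    "Hebrew shin": "\u05e9",
    "euro sign": "\u20ac",
    "grinning face": "\U0001F600",  # a code point above U+FFFF
}


def utf8_length(ch):
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        encoded = ch.encode("utf-8")
        print("%-14s U+%04X  %d byte(s): %s"
              % (label, ord(ch), len(encoded), encoded.hex()))
```

ASCII characters stay at one byte each, which is why plain English text
looks identical in UTF-8 and in the ISO-8859 sets.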

One problem with using the ISO character sets is that you may want to
display text from scripts that no single set covers, i.e. you might want to
display a web page that has both Hebrew and Russian Cyrillic. Since you can
only specify one encoding for a web page, you are stuck. Using UTF-8, you
can have both types of characters.
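
Here is a short Python sketch of that situation. The two sample words and
the helper name can_encode are mine, picked only for illustration. Each ISO
set can hold its own script but not the mix, while UTF-8 holds everything.

```python
# "shalom" in Hebrew and "privet" in Russian, written as escapes.
HEBREW = "\u05e9\u05dc\u05d5\u05dd"
RUSSIAN = "\u043f\u0440\u0438\u0432\u0435\u0442"
MIXED = HEBREW + " " + RUSSIAN


def can_encode(text, charset):
    """True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for charset in ("iso-8859-8", "iso-8859-5", "utf-8"):
        print("%-10s hebrew=%s russian=%s mixed=%s" % (
            charset,
            can_encode(HEBREW, charset),
            can_encode(RUSSIAN, charset),
            can_encode(MIXED, charset),
        ))
```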

George Sexton
MH Software, Inc.
Home of Connect Daily Web Calendar Software
http://www.mhsoftware.com/connectdaily.htm
Voice: 303 438 9585


-----Original Message-----
From: lug-admin at lug.boulder.co.us [mailto:lug-admin at lug.boulder.co.us]On
Behalf Of Harris, James
Sent: 11 February, 2003 8:50 AM
To: Boulder Linux Users Group (lug at lug.boulder.co.us)
Subject: [lug] Locales/character encodings, etc...


Hey all --

I would like to get a grasp on locales/internationalization and character
sets and googling around has left me with that somewhat overwhelmed feeling.
Basically, I really suffer from being a spoiled, isolated American who has
no clue about any language other than English and I want to understand the
implications of locales better in Linux/computing.  Does anyone have any
recommendations for good reading?  I'm not even 100% concerned with its
direct relation to Linux.  I'd just like to find more out about how all of
this works, on a fundamental level...

For example, I noticed that $LANG in RH 8.0 is now set to en_US.UTF-8.  This
seriously messes up man pages when I try to view them from an xterm on an RH
7.3 machine.  I found that setting LANG to en_US or C or POSIX fixes this.
I also tried launching xterm with '-u8' which supposedly sets 8 bit support,
which helps, but things still aren't 100% right.  Why is that?  What is the
drastic difference between en_US and en_US.UTF-8?

Another example that threw me at home: I've recently ripped a few CDs that
are Icelandic and a few that are Celtic which contain foreign characters in
their titles.  For giggles, I ripped them with "high bit" support in the
file names just to see what it would look like.  To my pleasant surprise,
not only is XMMS completely happy with them, but my portable MP3 player is
also happy.  What blew me away, however, was that any GUI I used to view the
filesystem showed the names fine, but an xterm wouldn't.  Everything was a
bunch of question marks.  This was on my woody system which defaults to 'C',
so I set LANG to POSIX and it didn't change anything.  I then set LANG to
en_US.UTF-8 since I'd seen it on RH 8.0 and the names still looked like
poop.  I then set it to en_US and now everything looks wonderful.  So, as
you can imagine, I'm now really curious to understand all of this.

Thanks all for any resource recommendations you can give me.  Like I said,
this is just interest peeking out of me and nothing terribly important in
the ultimate scheme of things, but I'd sure love to get a better grasp on
it.  ;)

Jim
_______________________________________________
Web Page:  http://lug.boulder.co.us
Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
Join us on IRC: lug.boulder.co.us port=6667 channel=#colug



