Unicode 7 was released in June. Reading the release news prompted me to review various concepts about Unicode and character encoding in general. It is one of those technical issues that one encounters frequently, usually without appreciating or understanding its full technicality (due to its terseness and complexity), and hence without taking sufficient care of it. But if you're unlucky, as everyone will be at some point, it bites back and you'll have to pay off the technical debt.
The first few articles I read on the topic some years ago, besides the obvious Wikipedia articles, included Joel Spolsky's oft-referenced article 〈The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)〉, which humorously introduces the history, motivation, and basic ideas behind character sets, Unicode and UTF-8, along with practical information. In my native tongue, Chinese, RUAN Yifeng (阮一峰) has a reading-note article on the topic that explains the basics clearly and succinctly. Xah Lee wrote a series of concise summaries of Unicode characters. (They are also particularly neatly formatted using HTML & CSS.) Some articles summarize interesting subsets of characters, such as the arrow characters, which I find quite handy as a reference.
My short mnemonic on the topic, for now, is just the following few sentences. Unicode is a character set intended to be a unified set containing the characters of all languages (the more precise term might be writing systems), in order to solve the various technical difficulties of having different character sets for different languages and using them in multilingual contexts. In practice, the most recently released version, Unicode 7.0, defines 113,021 characters, each assigned a unique code point. A Unicode code point such as U+1234 identifies a character uniquely by the hexadecimal number following U+. As a character set, Unicode by itself does not determine how characters are represented on computer storage media as sequences of bits, viz. 0s and 1s.
UTF-8 is a character encoding scheme, i.e. a protocol for representing Unicode code points (the characters) as sequences of bits; for example, it represents the code point U+00FF, i.e. the character ÿ, as 1100001110111111. Conversely, a piece of text data on a computer storage medium, which is ultimately a sequence of bits, cannot be interpreted or decoded unless the encoding it was written in, such as UTF-8, is known. UTF-8 is an efficient encoding scheme. Its advantages include 1) backward compatibility with ASCII: characters in ASCII, including English letters, Arabic numerals, and common English punctuation marks, are represented by exactly the same sequences of bits in ASCII and UTF-8, so old or English-language text data encoded in ASCII can be decoded exactly as UTF-8 as well, which minimizes compatibility glitches; 2) variable-length bit sequences for individual characters, which reduces the space wasted on padding and disambiguation. The encoding scheme sketched in the Wikipedia article "UTF-8" is useful for quickly reminding oneself of the related concepts. I'll see if I can add more to this in future.
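As a minimal sketch of the points above, the following Python snippet encodes the code point U+00FF with UTF-8, renders the result as bits, and checks the ASCII-compatibility and variable-length properties. Python is used purely for illustration, and the helper name to_bits is hypothetical.

```python
def to_bits(data: bytes) -> str:
    """Render a byte string as a string of 0s and 1s (hypothetical helper)."""
    return "".join(f"{byte:08b}" for byte in data)

# A code point is just a number; chr() maps it to the character.
ch = chr(0x00FF)                 # U+00FF, i.e. the character ÿ
encoded = ch.encode("utf-8")     # two bytes: 0xC3 0xBF
print(to_bits(encoded))          # 1100001110111111

# ASCII characters keep the same single byte in UTF-8.
assert "A".encode("utf-8") == "A".encode("ascii") == b"A"

# Variable length: between 1 and 4 bytes per code point.
for sample in ("A", "ÿ", "中", "\U00020000"):
    print(f"U+{ord(sample):04X}", len(sample.encode("utf-8")), "byte(s)")

# The same bits mean different text under a different encoding,
# which is why the encoding has to be known in order to decode.
as_utf8 = encoded.decode("utf-8")       # the single character ÿ
as_latin1 = encoded.decode("latin-1")   # two characters: Ã followed by ¿
assert as_utf8 != as_latin1
```

The last few lines show the "conversely" point: the two bytes are only ÿ if one knows they were written as UTF-8.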
Mathematica, the software that I use all the time in and outside of my work, only supports plane-0 Unicode characters (at least in the front end, i.e. the notebook interface), that is, U+0000 to U+FFFF, which unfortunately leaves out many of the rarer Chinese characters found in classical Chinese texts. I once developed some prototypes for natural language processing of classical Chinese texts in Mathematica, but because of this limitation they could not be done quite nicely. Databases are another context where careful treatment of character sets, character encodings and collations can become involved. My most frequently used database is MySQL; in 5.5+ there is utf8mb4, which supports storing Unicode characters that take up to 4 bytes each, and thus covers characters outside plane 0 as well.
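To make the plane-0 limit and the 4-byte case concrete, here is a small Python sketch. The helper names (is_bmp, utf8_width, outside_bmp) are made up for illustration and belong to neither Mathematica nor MySQL; the sketch only inspects code points and UTF-8 byte lengths and does not talk to either program.

```python
def is_bmp(ch: str) -> bool:
    """True if the character lies in plane 0 (U+0000 to U+FFFF)."""
    return ord(ch) <= 0xFFFF

def utf8_width(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def outside_bmp(text: str) -> list:
    """Code points in `text` outside plane 0, formatted as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text if not is_bmp(ch)]

# U+4E2D is in plane 0; U+20000 is in plane 2.
sample = "\u4e2d\U00020000"
print([utf8_width(ch) for ch in sample])   # [3, 4]
print(outside_bmp(sample))                 # ['U+20000']
```

A check like outside_bmp could be run over a corpus before loading it into a tool or a column that is limited to plane 0.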
The ranges of Unicode code points representing Chinese, Japanese and Korean (CJK) characters (as I identified) are
- [U+4E00, U+9FFF]
- [U+3400, U+4DBF]
- [U+F900, U+FAFF]
- [U+20000, U+2A6DF]
- [U+2F800, U+2FA1F]
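The ranges above can be turned into a simple lookup, sketched here in Python. The names CJK_BLOCKS and cjk_block are hypothetical. The boundaries follow the official Unicode block ranges, under which Extension A ends at U+4DBF (U+4DC0 to U+4DFF holds the Yijing hexagram symbols), and the table covers ideographs only, not kana or hangul.

```python
# (start, end, Unicode block name), using the official block boundaries.
CJK_BLOCKS = [
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2F800, 0x2FA1F, "CJK Compatibility Ideographs Supplement"),
]

def cjk_block(ch):
    """Return the name of the listed block containing `ch`, or None."""
    code_point = ord(ch)
    for start, end, name in CJK_BLOCKS:
        if start <= code_point <= end:
            return name
    return None

print(cjk_block("\u4e2d"))       # CJK Unified Ideographs
print(cjk_block("\U00020000"))   # CJK Unified Ideographs Extension B
print(cjk_block("A"))            # None
```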
Some useful references:
- http://www.fileformat.info/info/unicode/, e.g. ü
- http://www.wolframalpha.com, e.g. ü
- A reference about sorting: http://collation-charts.org
- Unicode.org has some computer-readable data files: http://www.unicode.org/Public/UNIDATA/
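The kind of per-character lookup these references offer can also be done offline. As a minimal sketch, Python's standard unicodedata module ships with a copy of the Unicode Character Database; which Unicode version it contains depends on the Python release, so the version line below will vary.

```python
import unicodedata

ch = "ü"
print(f"U+{ord(ch):04X}")            # U+00FC
print(unicodedata.name(ch))          # LATIN SMALL LETTER U WITH DIAERESIS
print(unicodedata.category(ch))      # Ll (letter, lowercase)
print(unicodedata.unidata_version)   # UCD version bundled with this Python

# Canonical decomposition: u followed by a combining diaeresis.
decomposed = unicodedata.normalize("NFD", ch)
print([f"U+{ord(c):04X}" for c in decomposed])   # ['U+0075', 'U+0308']
```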
And, lastly, the technically precise way to write the two words is Unicode and UTF-8, not unicode, utf8 or UTF8.