Random notes about Unicode

Unicode 7 was released in June. Reading the release news prompted me to review various concepts about Unicode and character encoding in general. Character encoding is one of those technical issues that one encounters frequently, usually without appreciating or understanding its full technicality (because the subject is terse and complex), and hence without handling it carefully enough. But if you are unlucky, as everyone is sooner or later, it bites back and you have to repay the technical debt.

Besides the obvious Wikipedia articles, one of the first articles I read on the topic some years ago was Joel Spolsky's oft-referenced 〈The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)〉, which humorously introduces the history, motivation, and basic ideas behind character sets, Unicode and UTF-8, along with practical information. In my native tongue, Chinese, RUAN Yifeng (阮一峰) has a reading-note article on the topic that explains the basics clearly and succinctly. Xah Lee wrote a series of concise summaries about Unicode characters. (They are also particularly neatly formatted using HTML & CSS.) Some of these articles summarize interesting subsets of characters, such as the arrow characters, which I find quite handy as a reference.

My short mnemonic on the topic, for now, is just the following few sentences.

Unicode is a character set intended to be a unified set containing the characters of all languages (the more precise term might be writing systems). It solves various technical difficulties that arise from having different character sets for different languages and using them in multilingual contexts. In practice, the most recently released version, Unicode 7.0, defines 113,021 characters, each assigned to a unique code point. A Unicode code point such as U+1234 identifies a character by the hexadecimal number following U+. The code point assignment by itself does not specify how characters are represented on computer storage media as sequences of bits, viz. 0s and 1s.

UTF-8 is a character encoding scheme, i.e. a protocol for representing Unicode code points (the characters) as sequences of bits. For example, it represents code point U+00FF, i.e. the character ÿ, as 1100001110111111. Conversely, a piece of text data on a storage medium, which is ultimately a sequence of bits, cannot be reliably interpreted or decoded unless the accompanying encoding, such as UTF-8, is known.

UTF-8 is an efficient encoding scheme. Some of its advantages include 1) backwards compatibility with ASCII: the characters in ASCII, including English letters, Arabic numerals, and common English punctuation marks, are represented by exactly the same sequences of bits in ASCII and UTF-8, so old or English-language text data encoded in ASCII can be decoded exactly as UTF-8 as well, which minimizes compatibility glitches; 2) variable-length bit sequences for individual characters, which reduces the space wasted on padding and disambiguation. The encoding scheme sketched in the Wikipedia article "UTF-8" is useful for quickly reminding oneself of the related concepts. I'll see if I can add more to this in the future.
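To make the mnemonic concrete, here is a minimal Python sketch of the UTF-8 bit layout, written by hand rather than taken from any library. The names utf8_encode and bits are made up for this note. It reproduces the bit pattern for U+00FF quoted above, and shows the ASCII compatibility and what happens when bytes are decoded with the wrong encoding (Latin-1 is used here only as an example of a wrong guess).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by following the bit layout."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def bits(data: bytes) -> str:
    """Render bytes as a string of 0s and 1s, 8 bits per byte."""
    return "".join(f"{byte:08b}" for byte in data)


# U+00FF (ÿ) becomes two bytes, 0xC3 0xBF.
print(bits(utf8_encode(0x00FF)))                    # 1100001110111111

# ASCII characters have the same bytes in ASCII and in UTF-8.
print("A".encode("ascii") == "A".encode("utf-8"))   # True

# The same bytes read with a different encoding give different text.
print("ÿ".encode("utf-8").decode("latin-1"))        # Ã¿
```

The hand-written encoder agrees with Python's built-in str.encode("utf-8") for valid code points, so in real code the built-in is the one to use; the sketch is only there to show where the bits go.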

Mathematica, the software that I use all the time in and outside of my work, supports only plane-0 Unicode characters (at least in the front end, i.e. the notebook interface), that is, U+0000 to U+FFFF. This unfortunately leaves out many Chinese radicals found in classical Chinese texts. I once developed some prototypes for natural language processing of classical Chinese texts in Mathematica, but because of this limitation they could not be done very nicely. Databases are another context where careful treatment of character sets, character encodings and collations can become involved. My most frequently used database is MySQL; versions 5.5 and later provide utf8mb4, which supports storing characters that take up to 4 bytes in UTF-8 and therefore covers all of Unicode.
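Both limitations come down to the same test: whether a character's code point lies above U+FFFF, in which case it sits outside plane 0 and takes 4 bytes in UTF-8. A small Python sketch for checking a text before feeding it to such a system could look like this (the function name outside_plane_0 is my own invention, not part of any tool mentioned here):

```python
def outside_plane_0(text: str) -> list:
    """Return the characters of text whose code points lie above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# "A", "ÿ" (U+00FF) and U+20000, a character from the supplementary planes.
sample = "A\u00ff\U00020000"

for ch in sample:
    # Show each code point with its UTF-8 length in bytes.
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")))
# U+0041 1
# U+00FF 2
# U+20000 4

problem_chars = outside_plane_0(sample)
print(len(problem_chars))   # 1
```

Any character returned by this check would be lost in a plane-0-only environment, and would need a 4-byte-capable character set such as utf8mb4 in MySQL.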

The ranges of Unicode code points representing Chinese, Japanese and Korean (CJK) characters (as I identified) are

[U+4E00, 9FFF]
[U+3400, 4DBF]
[U+F900, FAFF]
[U+20000, 2A6DF]
[U+2F800, 2FA1F]
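These ranges translate directly into a membership test. The following Python sketch uses the five ranges listed above, with one assumption stated plainly: the second range is taken to end at U+4DBF, the last code point of the CJK Unified Ideographs Extension A block, because U+4DC0 to U+4DFF hold the Yijing hexagram symbols. The names CJK_RANGES and is_cjk are made up for illustration.

```python
# Inclusive (first, last) code point pairs, following the list above.
CJK_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]


def is_cjk(ch: str) -> bool:
    """Return True if the single character ch falls in one of CJK_RANGES."""
    code_point = ord(ch)
    return any(first <= code_point <= last for first, last in CJK_RANGES)


print(is_cjk("中"))           # True  (U+4E2D)
print(is_cjk("a"))            # False
print(is_cjk("\U00020000"))   # True  (first code point of Extension B)
print(is_cjk("あ"))           # False (U+3042, hiragana)
```

Note that these ranges cover Han ideographs only, so Japanese kana and Korean hangul are not matched by this check.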

Some useful references:

http://www.fileformat.info/info/unicode/, e.g. ü
http://www.wolframalpha.com, e.g. ü
A reference about sorting http://collation-charts.org
Unicode.org has some computer-readable data files: http://www.unicode.org/Public/UNIDATA/
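For quick lookups without a browser, Python's standard unicodedata module offers similar information offline. A small sketch follows; the helper name describe is hypothetical, and the data reflects whichever Unicode version the installed Python ships with, not necessarily Unicode 7.0.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point, name and general category of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


print(describe("ü"))   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS (Ll)

# Version of the Unicode database bundled with this Python installation.
print(unicodedata.unidata_version)
```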

And, lastly, the technically precise way to write the two words is Unicode and UTF-8, not unicode, utf8 or UTF8.