# Tag Archives: unicode

Anything related to Unicode

# CLDR 1.8 released

On the 17th the Common Locale Data Repository (CLDR) project at Unicode released version 1.8 of the CLDR.

CLDR 1.8 contains data for 186 languages and 159 territories: 501 locales in all. Version 1.8 of the repository contains over 22% more locale data than the previous release, with over 42,000 new or modified data items from over 300 different contributors.

# CLDR 1.8 data submission closing

The data submission phase for CLDR 1.8 should be closed by now (although the survey tool still says it’s accepting submissions). For Dutch (nl_NL), I’ve been going over quite a few items together with the Apple contributor and someone else, so expect quite a few improvements in that area. The release is currently aimed at somewhere in March 2010.

# Python’s sys.stdout loses encoding

When you use Python with sys.stdout you might run into a problem where sys.stdout.encoding suddenly becomes None. This happens because, at least under Unix, Python falls back to not knowing anything about the target once you use a pipe or redirection. To work around this you can add a fallback to locale.getpreferredencoding(). So if you use encode() on a string you can do something like:

    import sys
    from locale import getpreferredencoding

    # Fall back to the locale's encoding, then ASCII, when stdout has no encoding
    text = u"Something special"
    print text.encode(sys.stdout.encoding or getpreferredencoding() or 'ascii', 'replace')

This is how we currently use it within Babel as well for printing the locale list.
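The same fallback chain can be wrapped in a small helper. This is only a minimal sketch for Python 3, not the code Babel uses; `pick_encoding` and `emit` are hypothetical names made up for the illustration.

```python
import locale
import sys


def pick_encoding(stream):
    """Return the stream's encoding, falling back to the locale, then ASCII."""
    # getattr() also covers stream objects that have no `encoding` attribute.
    return getattr(stream, "encoding", None) or locale.getpreferredencoding() or "ascii"


def emit(text, stream=None):
    """Encode text for the stream, replacing anything it cannot represent."""
    if stream is None:
        stream = sys.stdout
    return text.encode(pick_encoding(stream), "replace")
```

Passing 'replace' means an unencodable character turns into a question mark instead of raising UnicodeEncodeError, which is usually what you want for a simple listing.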

# The Beauty of Irony

I needed to look up something within an XHTML specification over at the W3 Consortium website. So I went to the XHTML2 Working Group Home Page. I was greeted with various encoding issues: trademarks showing up as â„¢ character sequences. Now, normally when you see a page with an Â or â at the start of a strange sequence you can be fairly certain it is a Unicode encoding, typically UTF-8, being displayed as something else. So at first I thought auto-detect within Firefox was not turned on. I checked it and no, it was definitely on. I selected Unicode as the encoding myself and, indeed, the page displayed normally. So I checked the page’s source. I was lovingly greeted by the following:

&lt;?xml version="1.0" encoding="iso-8859-1"?&gt;

I am sure most of you can appreciate the delightful irony that the organization that has a multitude of XML-based standards and specifications, which almost always use UTF-8 as the default encoding, encodes a page wrongly. Yes, mistakes are human, but to see something like this on the W3C site…
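The garbled trademark sequence can be reproduced in a few lines of Python 3. This is a sketch under one assumption: that the browser treated the declared iso-8859-1 as windows-1252 (cp1252), which browsers commonly do.

```python
# U+2122 TRADE MARK SIGN is three bytes in UTF-8: E2 84 A2.
utf8_bytes = "\u2122".encode("utf-8")

# Reading those bytes as windows-1252 yields three separate characters:
# U+00E2 (â), U+201E („) and U+00A2 (¢).
mojibake = utf8_bytes.decode("cp1252")

# Undoing the wrong decoding step recovers the original character.
repaired = mojibake.encode("cp1252").decode("utf-8")
```

The leading â is the telltale sign: every three-byte UTF-8 sequence in the range U+2000 to U+2FFF starts with the byte E2, which windows-1252 shows as â.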

Edit: for some reason WordPress keeps converting my less-than and greater-than signs into HTML entities, even when using Unicode entities.

# WordPress, MySQL, UTF-8 or why some links might temporarily not work

So I found out that MySQL had defaulted to latin1_swedish_ci when I first started this weblog database. Silly me for expecting a saner default like UTF-8.

I spent the past two days converting data. The majority of the tables were no problem, but wp_posts.post_name is tied to something that causes a key error to be displayed. I worked around this problem by writing both a PHP and a Python script that took the current data from the table’s column, escaped it as needed, URL-decoded it where necessary, stored it, altered the table to utf8_unicode_ci, and pumped the data back in.
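The URL-decoding step on its own might look roughly like this in Python 3. This is not the script I used, just a sketch: `decode_slug` and the sample value are made up for illustration, and the database reads and writes are left out entirely.

```python
from urllib.parse import unquote


def decode_slug(raw_slug):
    """Turn a percent-encoded slug into readable Unicode text."""
    # unquote() decodes %XX escapes and interprets the bytes as UTF-8 by default.
    return unquote(raw_slug)


# Hypothetical stored value: the UTF-8 bytes of U+9AA8 (骨), percent-encoded.
stored = "%e9%aa%a8"
decoded = decode_slug(stored)
```

Slugs that contain no percent escapes pass through unchanged, so the function can be run over every row without checking first.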

The reason I first had a PHP version was that I did not even think of using Python. I guess I was looking too intently at the WordPress sources and got stuck in thinking ‘PHP’. After many frustrating hours of messing around with PHP’s APIs I went to Python and wrote the resulting script in a fraction of the time.

When I started to verify the data in my mysql console output I was wondering what I was missing, since a SELECT post_name FROM wp_posts; showed only ???? instead of kanji. Question marks are the usual replacement characters when a conversion runs but cannot represent some of the characters in the target character set. Silly me for forgetting I had not done a SET NAMES utf8;.
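The same effect is easy to show outside MySQL. The Python 3 sketch below is only an analogy for what happens on a latin1 connection, not a reproduction of MySQL's own conversion code.

```python
kanji = "\u6f22\u5b57"  # 漢字

# latin1 has no kanji, so each character is replaced by a question mark.
as_latin1 = kanji.encode("latin-1", "replace")

# UTF-8 can represent both characters, using three bytes for each.
as_utf8 = kanji.encode("utf-8")
```

Here `as_latin1` ends up as two question marks while `as_utf8` keeps the full six bytes, which is why switching the connection to utf8 made the kanji reappear.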

Now I am walking all links to see if they’re actually clickable. It seems that editing and saving a post corrects some database entries.

Of course, it seems my slugs vary wildly. Older entries use some weird underscore-based scheme; I wonder if that was a left-over from my Drupal import that I never noticed. Changing them goes against a lot of persistent URL guidelines, but for the sake of consistency I am updating every single post just to be on the safe side. The search engines will correct over time, I just hope I won’t break too many referrers.

# MySQL and UTF-8 – missed chances

One thing I never understood is why MySQL insists on creating a table with ‘latin1_swedish_ci’ as ‘collation’. Now, this does more than just collation: it specifies encoding, collation order, and case sensitivity. That’s not the issue, but why, oh why, does it insist on making this the default? What is wrong with actually using UTF-8? I mean, MySQL is only used across the world, which means the geographic spread when it comes to character sets would be served by actually having a default that could handle those languages! A missed chance if you ask me.

# Bone radical, number 188 – 骨

In the radical classification system called Kang Xi, after the Chinese emperor Kang Xi, we find 214 radicals. At position 188 we have the radical nicknamed ‘bone’ (骨 – hone). It is part of the group of radicals (部首 – bushu) consisting of 10 strokes.

The above image shows the character ‘bone’ in four fonts for the three languages of Chinese, Japanese, and Korean. The fonts used are STSong (Chinese), MingLiU (Chinese), MS Mincho (Japanese) and Batang (Korean). As can be seen, the first Chinese font, STSong, is the only one that squares off the corner in the top part of the character on the left-hand side. The other Chinese font and the Japanese and Korean fonts do so on the right-hand side.

I raised this issue on the Unicode list since the Unicode character charts have three points where ‘bone’ is encoded, namely: CJK Radicals Supplement U+2EE3 (left-hand side), Kangxi Radicals U+2FBB (right-hand side), and CJK Unified Ideographs U+9AA8 (left-hand side).
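The three code points can be inspected with Python 3's unicodedata module. This is just a quick sketch; the exact names returned depend on the Unicode data shipped with your Python version.

```python
import unicodedata

# The three places where 'bone' is encoded.
code_points = [0x2EE3, 0x2FBB, 0x9AA8]
names = {hex(cp): unicodedata.name(chr(cp)) for cp in code_points}

# The Kangxi Radicals block starts at U+2F00 with radical 1,
# so radical 188 sits at offset 187.
kangxi_bone = 0x2F00 + (188 - 1)

# Compatibility normalization folds the Kangxi radical into the unified ideograph.
folded = unicodedata.normalize("NFKC", chr(0x2FBB))
```

Note that the normalization only says the Kangxi radical is a compatibility variant of U+9AA8; it tells you nothing about which side a given font draws the corner on.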

I wonder if the discrepancy is a wrongly written character during Buddhist studies which was taken from China to Japan and later exported to Korea.

# ZSH and Unicode

ZSH in its released versions has no Unicode support at all. In CVS there are some changes to make this work as it should.
Which means I will finally be able to use Nihongo on the command line. Yay.

# X can suck hard at times

And to the question of why people still use Windows: try setting up your X environment to properly support MathML with Firefox.

Truly, using new fonts within X is still a black art, reminiscent of dark and medieval times when we did not know better. I thought we would have progressed past that stage by now.

From a user perspective Windows definitely wins hands down in this: drag a file to a Fonts folder, done.
No, X wants us to use crazy incantations of mkfontdir, mkfontscale, fc-cache, ttmkfdir, and xset with various fp options, and then hope xlsfonts shows the font you are after.

Users do NOT want to be bothered with foundries, weights, encoding types, and what not. They just want to add a font, select it in their favourite application and go: “owww, pretty!”

Is that, anno 2004, too much to ask?

Apparently…