Make Mercurial Filenames Work on Windows

While writing a post for my blog I noticed that on Windows some of the filenames on disk showed encoding problems. I have this stored in Mercurial, so somewhere between Mercurial and the checkout on Windows something goes wrong with the character encoding.

After some research and conversations with people on the #mercurial IRC channel, I understand that Mercurial stores filenames internally as raw bytes (Python byte strings). On Windows it then converts these bytes to the native ANSI code page, in my case code page 1252.
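To make the failure mode concrete, here is a minimal sketch in Python (the filename is a made-up example): a name committed as UTF-8 bytes on another platform comes out mangled when those same bytes are reinterpreted as code page 1252.

# Minimal sketch: a filename stored as UTF-8 bytes in the repository,
# then interpreted with the Windows ANSI code page 1252 on checkout.
stored = "café.txt".encode("utf-8")    # b'caf\xc3\xa9.txt', the bytes kept in the repository
on_disk = stored.decode("cp1252")      # 'café.txt', the mangled name that appears on disk
print(stored, "->", on_disk)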

Thankfully Windows 10 nowadays has a wonderful option to fix this issue. Go to Control Panel, click Clock and Region, click Region, open the Administrative tab, and under Language for non-Unicode programs click Change system locale. In the window that pops up, tick the checkbox in front of Beta: Use Unicode UTF-8 for worldwide language support. Maybe by the time you are reading this the beta label has already been removed. Click OK and the system needs to restart.
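After the restart you can verify that the change took effect: the active ANSI code page should now be 65001 (UTF-8) instead of 1252. A quick way to check from Python, as a small sketch using the Win32 GetACP call:

import ctypes

# GetACP() reports the active ANSI code page:
# 1252 before the change, 65001 (UTF-8) afterwards.
print(ctypes.windll.kernel32.GetACP())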

You will need to clone the repository again, since Mercurial (TortoiseHg) has to regenerate the filenames using the new code page.

Why using 'lorem ipsum' is bad for web site testing

The typesetting and web design industries have apparently been using the 'lorem ipsum' text for a long time as dummy text for testing print and layout.

Aside from the fact that the text is a cut-off section of Cicero's De finibus bonorum et malorum, it also fails in one huge respect, namely globalisation.

The text is Latin, and the Latin script is the simplest set of characters available to us on the world-wide web. If your website is English-only then, yes, you are quite done. However, a lot of us also have to support languages other than English, the easiest of which are those written in Latin-derived scripts.

Latin, and consequently English, is written left-to-right. Hebrew and Arabic, to take two prime examples, are written right-to-left (leaving numerals aside for the moment). Of course, this is very important to test as well, since it requires a lot of changes to your layout.

Especially when testing a design for sites that need to display multiple languages on the same page, it is essential to test with multilingual text. One of the things that should quickly become clear is whether or not a sufficient encoding has been chosen, as the sketch below illustrates.
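As a small sketch (the phrases are illustrative examples of my own, not part of any standard test set), here is the kind of multilingual dummy text one could use instead of 'lorem ipsum', together with a quick check of whether a given encoding can represent it:

# Illustrative dummy text covering Latin, Cyrillic, right-to-left and CJK scripts.
samples = [
    "The quick brown fox jumps over the lazy dog",  # English, left-to-right
    "Fließgewässer überqueren",                     # German, Latin script with diacritics
    "Освоение космоса",                             # Russian, Cyrillic script
    "שלום עולם",                                    # Hebrew, right-to-left
    "مرحبا بالعالم",                                # Arabic, right-to-left
    "こんにちは世界",                                 # Japanese, CJK characters
]

# Trying to encode the samples quickly shows whether the chosen encoding
# is sufficient; iso-8859-1 fails for most of them, UTF-8 does not.
for text in samples:
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        print("not representable in iso-8859-1:", text)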

The Beauty of Irony

I needed to look up something in an XHTML specification over at the W3 Consortium website, so I went to the XHTML2 Working Group Home Page. I was greeted with various encoding issues: trademark signs showing up as â„¢ character sequences. Now, normally when you see a page with an Ã or â at the start of a strange sequence, you can be fairly certain it is a Unicode encoding, typically UTF-8, being displayed as something else. So at first I thought the auto-detect in Firefox was not turned on; I checked, and no, it was definitely on. I selected Unicode as the encoding myself and, indeed, the page displayed normally. So I checked the page's source. I was lovingly greeted by the following:

<?xml version="1.0" encoding="iso-8859-1"?>

I am sure most of you can appreciate the delightful irony: the organization behind a multitude of XML-based standards and specifications, which almost always use UTF-8 as the default encoding, gets the encoding of one of its own pages wrong. Yes, mistakes are human, but to see something like this on the W3C site...
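For the curious, the â„¢ sequence is exactly what you get when the UTF-8 bytes of the trademark sign are rendered as Windows-1252, which is how browsers commonly treat a page labelled iso-8859-1. A small sketch:

# The trademark sign encoded as UTF-8, then misread as Windows-1252.
trademark = "™"
as_utf8 = trademark.encode("utf-8")   # b'\xe2\x84\xa2'
print(as_utf8.decode("cp1252"))       # prints 'â„¢', the sequence seen on the page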