Why using 'lorem ipsum' is bad for website testing

The typesetting and web design industry has apparently been using the 'lorem ipsum' text for a while as dummy text for testing print and layout.

Aside from the fact that the text is a cut-off section of Cicero's De Finibus Bonorum et Malorum, it also fails in one huge respect, namely globalisation.

The text is in Latin, and the Latin script is the simplest of all scripts available to us on the World Wide Web. If your website is English-only then, yes, you are pretty much done. However, many of us also have to support languages other than English, the easiest of which are written in Latin-derived scripts.

Latin, and consequently English, is written left-to-right. Hebrew and Arabic, to take two prime examples, are written right-to-left (leaving numerals aside for the moment). This is important to test as well, since it requires substantial changes to your layout.
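
A quick way to see how a layout copes with right-to-left text is to set the dir attribute on an element and fill it with real Hebrew or Arabic. The fragment below is a minimal sketch; the sample strings simply mean "hello world" in each language:

<p dir="ltr">An English paragraph, written left-to-right.</p>
<p dir="rtl" lang="he">שלום עולם</p>
<p dir="rtl" lang="ar">مرحبا بالعالم</p>

Floats, text alignment and punctuation placement tend to break the moment the rtl paragraphs go in.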

Especially when testing a design for sites that need to display multiple languages on the same page, it is essential to test with multilingual text. One of the things that should quickly become clear is whether a sufficient character encoding has been chosen.
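
As a minimal sketch of such a test page, the fragment below mixes three scripts in one document and declares UTF-8 explicitly; if the declared encoding is insufficient, the non-Latin lines are the first to turn into garbage:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Encoding test</title>
  </head>
  <body>
    <p lang="en">Latin script, left-to-right.</p>
    <p lang="ru">Кириллица, тоже слева направо.</p>
    <p lang="ar" dir="rtl">نص عربي، من اليمين إلى اليسار.</p>
  </body>
</html>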

The Beauty of Irony

I needed to look up something in an XHTML specification on the W3C website, so I went to the XHTML2 Working Group Home Page. I was greeted with various encoding issues: trademark signs showing up as â„¢ character sequences. Now, when you see a page where a strange sequence starts with Ã or â, you can be fairly certain it is Unicode-encoded text, typically UTF-8, being decoded as something else; â„¢ is exactly what the three UTF-8 bytes of ™ (0xE2 0x84 0xA2) look like when rendered as Latin-1/Windows-1252. At first I thought the encoding auto-detection in Firefox was not turned on; I checked, and no, it was definitely on. I selected Unicode as the encoding myself and, indeed, the page displayed correctly. So I checked the page's source, where I was lovingly greeted by the following:

<?xml version="1.0" encoding="iso-8859-1"?>

I am sure most of you can appreciate the delightful irony: the organization behind a multitude of XML-based standards and specifications, which almost all use UTF-8 as the default encoding, declares the wrong encoding on one of its own pages. Yes, to err is human, but to see something like this on the W3C site...
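
For what it is worth, the fix would have been a one-liner: since the page displayed correctly once UTF-8 was selected manually, the bytes were presumably UTF-8 all along, and the declaration merely needed to say so:

<?xml version="1.0" encoding="utf-8"?>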