# MySQL to PostgreSQL

David Fetter informed me that he has some changes to a document I wrote back in 2005 about migration tips going from MySQL to PostgreSQL. As such I started to revamping the document, moving from DocBook 4.3 to 4.5 in the process. The result will be located at http://www.in-nomine.org/~asmodai/mysql-to-pgsql.html. I hope to have his comments merged in in the remainder of the week. Feel free to send me updates after it got properly updated with David’s notes.

# WordPress, MySQL, UTF-8 or why some links might temporarily not work

So I found out that MySQL had defaulted to latin1_swedish_ci when I first started this weblog database. Sily me for expecting a saner default like UTF-8.

I spent the past two days converting data. The majority of the tables were no problem, but wp_posts.post_name is tied with something which causes a key error to be displayed. I worked around this problem by writing both a PHP and Python script that took the current data from the table’s column, escape as needed, URL decode it as necessary, store it, alter the table to utf8_unicode_t, and pump back the data.

The reason I first had a Python version was that I did not even think of using Python. I guess I was looking to intently at the WordPress sources and got stuck in thinking ‘PHP’. After many hours of frustrating around with PHP’s APIs I went to Python and wrote a resulting script in a fraction of the time.

When I stared to verify the data in my mysql console output I was wondering what I was missing since I saw with a SELECT post_name FROM wp_posts; only ???? instead of kanji. The question marks are normally replacement characters used when conversion went ok but with small issues. Silly me for forgetting I had not done a SET NAMES utf8;.

Now I am walking all links to see if they’re actually clickable. Seems after you edit them and save them it corrects some database entries.

Of course, it seems my slugs vary wildly. Older entries use some weird underscore based scheme, I wonder if that was a left-over from my Drupal import that I never noticed. It goes against a lot of persistent URL guidelines, but for the sake of consistency I am updating every single post just to be on the safe side. The search engines will correct over time, I just hope I won’t break too many referrers.

# MySQL and UTF-8 – missed chances

One thing I never understood is why MySQL insists on creating a table with ‘latin1_swedish_ci’ as ‘collation’. Now, this does more than just collation, it specifies encoding, collation order, and case sensitivity. That’s not the issue, but why, oh why, does it insist on making this the default? What is wrong with actually using UTF-8? I mean, MySQL is only used across the world, which means the geographic spread when it comes to character sets would be served by actually having a default that could handle those languages! A missed chance if you ask me.