So I found out that MySQL had defaulted to
latin1_swedish_ci when I first set up this weblog database. Silly me for expecting a saner default like UTF-8.
I spent the past two days converting data. The majority of the tables were no problem, but
wp_posts.post_name is tied to something that triggers a key error. I worked around this by writing both a PHP and a Python script that take the current data from the table’s column, escape and URL-decode it as needed, store it, alter the table to
utf8_unicode_ci, and pump the data back in.
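The workaround above could be sketched roughly like this in Python. This is not the actual script, just a minimal illustration under assumptions: `decode_slug` and `convert_post_names` are made-up names, `conn` is assumed to be a DB-API connection from a MySQL driver that uses the `%s` parameter style, and the `VARCHAR(200)` column definition is a guess that should be checked against the real table first.

```python
from urllib.parse import unquote


def decode_slug(raw):
    """URL-decode a percent-encoded slug, e.g. '%e6%bc%a2%e5%ad%97' into kanji."""
    # errors="strict" raises on bad input instead of silently inserting U+FFFD.
    return unquote(raw, encoding="utf-8", errors="strict")


def convert_post_names(conn):
    """Save, decode, alter, and restore wp_posts.post_name.

    `conn` is assumed to be a DB-API connection from a MySQL driver
    that uses the %s parameter style (hypothetical setup).
    """
    cur = conn.cursor()

    # 1. Take the current data out of the column and decode it in memory.
    cur.execute("SELECT ID, post_name FROM wp_posts")
    saved = [(decode_slug(name), post_id) for post_id, name in cur.fetchall()]

    # 2. Switch the column to UTF-8. The VARCHAR(200) definition is an
    #    assumption; check SHOW CREATE TABLE wp_posts before running this.
    cur.execute(
        "ALTER TABLE wp_posts MODIFY post_name VARCHAR(200) "
        "CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT ''"
    )

    # 3. Pump the decoded values back; the driver handles the escaping.
    cur.executemany("UPDATE wp_posts SET post_name = %s WHERE ID = %s", saved)
    conn.commit()
```

Keeping the decoded values in a plain Python list between the read and the write means the data never depends on the column's character set while the table is being altered.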
The reason I had a PHP version first was that I did not even think of using Python. I guess I was looking too intently at the WordPress sources and got stuck thinking ‘PHP’. After many frustrating hours of wrestling with PHP’s APIs I switched to Python and wrote a working script in a fraction of the time.
When I started to verify the data in my mysql console output I wondered what I was missing, since a
SELECT post_name FROM wp_posts; showed only
???? instead of kanji. The question marks are the replacement characters MySQL substitutes when it has to convert text into a connection character set that cannot represent it. Silly me for forgetting that I had not done a
SET NAMES utf8;.
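The effect is easy to reproduce outside MySQL. The snippet below is only an analogy in Python, not what the server literally runs: it pushes a made-up two-kanji slug through a latin1 encoding and then through UTF-8, standing in for a console session before and after SET NAMES utf8.

```python
slug = "漢字"  # hypothetical example value, not taken from the real table

# A latin1 connection cannot represent kanji, so each character degrades to "?".
as_latin1 = slug.encode("latin-1", errors="replace").decode("latin-1")

# A utf8 connection (what SET NAMES utf8 asks for) round-trips the text intact.
as_utf8 = slug.encode("utf-8").decode("utf-8")

print(as_latin1)
print(as_utf8)
```

The stored data is fine in both cases; only the view of it through the connection differs.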
Now I am walking through all the links to see if they actually work. It seems that editing and saving them corrects some of the database entries.
Of course, it seems my slugs vary wildly. Older entries use some weird underscore-based scheme; I wonder if that is a leftover from my Drupal import that I never noticed. Changing slugs goes against a lot of persistent URL guidelines, but for the sake of consistency I am updating every single post just to be on the safe side. The search engines will catch up over time; I just hope I won’t break too many referrers.