David Fetter informed me that he has some changes to a document I wrote back in 2005 about migration tips going from MySQL to PostgreSQL. As such I started to revamping the document, moving from DocBook 4.3 to 4.5 in the process. The result will be located at http://www.in-nomine.org/~asmodai/mysql-to-pgsql.html. I hope to have his comments merged in in the remainder of the week. Feel free to send me updates after it got properly updated with David’s notes.
So I found out that MySQL had defaulted to
latin1_swedish_ci when I first started this weblog database. Sily me for expecting a saner default like UTF-8.
I spent the past two days converting data. The majority of the tables were no problem, but
wp_posts.post_name is tied with something which causes a key error to be displayed. I worked around this problem by writing both a PHP and Python script that took the current data from the table’s column, escape as needed, URL decode it as necessary, store it, alter the table to
utf8_unicode_t, and pump back the data.
The reason I first had a Python version was that I did not even think of using Python. I guess I was looking to intently at the WordPress sources and got stuck in thinking ‘PHP’. After many hours of frustrating around with PHP’s APIs I went to Python and wrote a resulting script in a fraction of the time.
When I stared to verify the data in my mysql console output I was wondering what I was missing since I saw with a
SELECT post_name FROM wp_posts; only
???? instead of kanji. The question marks are normally replacement characters used when conversion went ok but with small issues. Silly me for forgetting I had not done a
SET NAMES utf8;.
Now I am walking all links to see if they’re actually clickable. Seems after you edit them and save them it corrects some database entries.
Of course, it seems my slugs vary wildly. Older entries use some weird underscore based scheme, I wonder if that was a left-over from my Drupal import that I never noticed. It goes against a lot of persistent URL guidelines, but for the sake of consistency I am updating every single post just to be on the safe side. The search engines will correct over time, I just hope I won’t break too many referrers.
One thing I never understood is why MySQL insists on creating a table with ‘latin1_swedish_ci’ as ‘collation’. Now, this does more than just collation, it specifies encoding, collation order, and case sensitivity. That’s not the issue, but why, oh why, does it insist on making this the default? What is wrong with actually using UTF-8? I mean, MySQL is only used across the world, which means the geographic spread when it comes to character sets would be served by actually having a default that could handle those languages! A missed chance if you ask me.
I’ve been using Bugzilla for a while now for the TenDRA Project. I am quite happy with it. Right now I use it on a MySQL database, but everybody that really knows me knows I prefer PostgreSQL over MySQL because of the real database features it has.
To counter any knee-jerk reactions, no, MySQL is not bad, I am just used to features real databases provide (views, stored procedures, etc) and once you are used to that, MySQL feels, …, inadequate.
Bugzilla has, in the last year, been working hard on making the program database independent.
See the following URLs for developer details on progress:
This is what I also emailed (the URLs that is) to the authors of IEEE’s Software magazine column about Open Source. They wrote that Bugzilla is not database independent, and while that is undisputable (aside from the ‘hackish’ port to PostgreSQL on the side, which I also worked on) there is significant work being done behind the scenes on making it database independent.