Re: The migration to utf-8 delete text, data loss!

Slavok wrote:

I can't reproduce this data loss. I created a db with latin1_spanish_ci collation, wrote a post with message "123á456", and after the migration I do see it. What did I miss?

Hi Slavok,
have you configured your mysql following these screenshots?
http://punbb.informer.com/forums/post/124662/#p124662

Parpalak,
glad to know you will look at the bug next week.

Many thanks to you both
Oliver

http://tinymailto.com/oliversl <-- my email after a captcha

Re: The migration to utf-8 delete text, data loss!

I got past this bug by doing a backup of tables, topics, posts and forums and importing them back with some modifications after the upgrade --> result http://futurama-france.fr/forum/index.php

Re: The migration to utf-8 delete text, data loss!

Parpalak wrote:

Posts and topics processing is required because 1.3 uses UTF-8. Also its parser works in a different way.

Yes, of course. What I am trying to say is that there is no reason to convert the database during upgrade. MySQL will accept and return data in UTF8 regardless of what encoding is used on the tables, as long as SET NAMES utf8 is used.

As for repeatability, my post should do it just fine.
http://punbb.informer.com/forums/post/125628/#p125628

Re: The migration to utf-8 delete text, data loss!

pepak wrote:

You can perform the fixing using a sequence of ALTER TABLE's:
1) Convert all character fields to either BLOB or BINARY without changing charset:
2) Convert all character fields back to the correct type with correct charset:
3) When all fields are converted, change the declaration of the table itself:

As far as I understand the update script works just like you have described:
http://punbb.informer.com/trac/browser/ … e.php#L338

Before 1.3 release we had tested the update process and had added SET NAMES:
http://punbb.informer.com/trac/changese … update.php

Maybe this SET NAMES call is wrong, we'll continue testing.

Re: The migration to utf-8 delete text, data loss!

Parpalak wrote:

As far as I understand the update script works just like you have described:
http://punbb.informer.com/trac/browser/ … e.php#L338

The difference is that the script does not know the correct charset and I see no easy way for it to recognize it.

Re: The migration to utf-8 delete text, data loss!

No wonder the users are losing their data!

This function assumes that the table already holds UTF8 data and changes the structure to match. If that assumption is wrong - and it will often be wrong - it simply takes the data in its current encoding and tells MySQL that it is in fact UTF8. That alone corrupts the data, and if the source data happens to contain byte sequences that are not valid UTF8, the string will likely get truncated at that point. When I was upgrading to 1.3, for example, my data was in cp1250...
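To make the truncation concrete, here is a minimal Python sketch. It is only a model of the effect described above, not MySQL's actual code: the helper store_as_utf8_column is hypothetical and simply keeps the part of the string that is valid UTF8, using the "123á456" test message from earlier in the thread.

```python
def store_as_utf8_column(raw: bytes) -> str:
    """Hypothetical model of a strict UTF-8 column.

    Keeps only the leading part of the byte string that is valid UTF-8,
    which mimics a string being cut off at the first invalid sequence.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that is not valid UTF-8.
        return raw[:err.start].decode("utf-8")


original = "123á456"
raw_cp1250 = original.encode("cp1250")      # the bytes really stored: b'123\xe1456'
survived = store_as_utf8_column(raw_cp1250)  # bytes relabelled as UTF-8, not converted

print(repr(survived))  # '123'
```

The byte 0xE1 is "á" in cp1250, but in UTF8 it announces a multi-byte sequence that never arrives, so everything from that byte onwards is lost in this model.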

Re: The migration to utf-8 delete text, data loss!

Actually, the update script asks for the encoding of the language pack before updating. Then it converts posts to UTF8 by calling this function for every post:
http://punbb.informer.com/trac/browser/ … e.php#L231
And then it tells MySQL that the data encoding is UTF8.

Re: The migration to utf-8 delete text, data loss!

Parpalak wrote:

Actually, the update script asks for the encoding of the language pack before updating. Then it converts posts to UTF8 by calling this function for every post:
http://punbb.informer.com/trac/browser/ … e.php#L231
And then it tells MySQL that the data encoding is UTF8.

And if the old charset is NOT ISO-8859-1 and neither iconv nor mb_convert_encoding exists, it leaves the string unchanged but tells MySQL that it is UTF8.

What the upgrade should do, and what WOULD be foolproof provided that the user gives the correct encoding, is a sequence of ALTER TABLEs:

1) ALTER TABLE ... MODIFY [string_field] BLOB
2) ALTER TABLE ... MODIFY [string_field] [original_type] CHARACTER SET [user's_encoding]
3) ALTER TABLE ... MODIFY [string_field] [original_type] CHARACTER SET utf8

Or even just steps 1 and 2; those would suffice and might be even safer. Conversion to UTF8 can be done on request thanks to SET NAMES utf8.
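The three steps can be sketched as a small Python helper that only builds the SQL text. Everything here is an assumption for illustration: the helper name relabel_statements, the table and column names, and the use of MySQL's MODIFY clause. A real script would also have to pick a binary type of matching size and restate column attributes such as NOT NULL or DEFAULT.

```python
def relabel_statements(table, column, text_type, binary_type,
                       real_charset, to_utf8=False):
    """Build the ALTER TABLE sequence for one column (hypothetical helper).

    Step 1 turns the column into a binary type so MySQL forgets the wrong
    charset label without touching the bytes. Step 2 restores the text type
    with the charset the bytes are really in. Optional step 3 lets MySQL
    convert the now correctly labelled data to utf8.
    """
    statements = [
        f"ALTER TABLE `{table}` MODIFY `{column}` {binary_type}",
        f"ALTER TABLE `{table}` MODIFY `{column}` {text_type} "
        f"CHARACTER SET {real_charset}",
    ]
    if to_utf8:
        statements.append(
            f"ALTER TABLE `{table}` MODIFY `{column}` {text_type} "
            f"CHARACTER SET utf8"
        )
    return statements


# Example with assumed names: a posts.message TEXT column holding cp1250 bytes.
plan = relabel_statements("posts", "message", "TEXT", "BLOB", "cp1250",
                          to_utf8=True)
for sql in plan:
    print(sql)
```

Leaving to_utf8 at False gives only steps 1 and 2, which corresponds to the "safer" variant where the conversion happens on request.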

Re: The migration to utf-8 delete text, data loss!

pepak wrote:

And if the old charset is NOT ISO-8859-1 and neither iconv nor mb_convert_encoding exists, it leaves the string unchanged but tells MySQL that it is UTF8.

No, a message is displayed in this case:
http://punbb.informer.com/trac/browser/ … e.php#L426

Are you sure that these ALTER queries will work on PostgreSQL and SQLite?

It was not me who designed db_update.php so I can't explain its logic in detail. To tell you the truth, I'm still a little confused by all these encodings and collations in databases. But I want to fix bugs if they exist and will continue investigating.

Re: The migration to utf-8 delete text, data loss!

Parpalak wrote:

Are you sure that these ALTER queries will work on PostgreSQL and SQLite?

I fail to see how that is relevant - database conversion code will almost certainly need to be hard-coded for every database separately. That is, if you want reliable code.

It was not me who designed db_update.php so I can't explain its logic in detail. To tell you the truth, I'm still a little confused by all these encodings and collations in databases. But I want to fix bugs if they exist and will continue investigating.

Well, my main database is Firebird so I can't really tell you details about PostgreSQL and SQLite.

With MySQL, you don't need to care what encoding the data is stored in. All you need to do to get UTF8 output, regardless of the encoding actually used by the database, is:

1) Make sure the table structure matches the table data. This is NOT the case with many PunBB 1.2 installations, including mine - PunBB 1.2 did not create the tables correctly.

2) Make sure SET NAMES utf8 is called before any other SQL command.

Even if #1 is not satisfied, this approach will not lead to data loss on old data - old posts will simply display incorrectly, but as soon as table structure is fixed to match the data, everything will be fine.
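Point 2 can be sketched in Python as follows. This assumes a DB-API style cursor with an execute method; the RecordingCursor class is a stand-in invented for this example so the snippet runs without a database, and run_with_utf8_names is a hypothetical helper name.

```python
class RecordingCursor:
    """Stand-in for a DB-API cursor; it only records the statements it gets."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


def run_with_utf8_names(cursor, statements):
    """Send SET NAMES utf8 first, then every other statement in order."""
    cursor.execute("SET NAMES utf8")
    for sql in statements:
        cursor.execute(sql)


cursor = RecordingCursor()
run_with_utf8_names(cursor, ["SELECT message FROM posts"])
print(cursor.statements)
```

With a real MySQL connection the same ordering would make the server convert between the column charset and UTF8 on every read and write, which is why point 1 has to hold for the result to be correct.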

The upgrade script uses a much more dangerous approach: it reads all data, converts it to UTF8 and writes it back.
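A small Python sketch of why read-convert-write is risky when the declared source encoding is wrong. The helper convert_to_utf8 is hypothetical and only models the conversion step, it is not the code from the upgrade script. The sample word is stored as cp1250 but converted once with the right charset and once as if it were ISO-8859-1.

```python
def convert_to_utf8(raw: bytes, declared_charset: str) -> bytes:
    """Model of a read-convert-write step: decode with the charset the
    user declared, then re-encode as UTF-8 (hypothetical helper)."""
    return raw.decode(declared_charset).encode("utf-8")


original = "háček"
stored = original.encode("cp1250")  # what is really in the table

right = convert_to_utf8(stored, "cp1250").decode("utf-8")
wrong = convert_to_utf8(stored, "iso-8859-1").decode("utf-8")

print(right)  # háček
print(wrong)  # háèek
```

The wrong result is valid UTF8, so nothing fails and nothing gets truncated, but the text has been rewritten with the wrong characters, and once it is written back the original bytes are gone.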

Re: The migration to utf-8 delete text, data loss!

Mr.Awesome wrote:

I got past this bug by doing a backup of tables, topics, posts and forums and importing them back with some modifications after the upgrade --> result http://futurama-france.fr/forum/index.php

Could you please explain what you did in more detail?

I tried to do what has been said here, but with no success.

Or does anyone have an idea how to import the data back with modifications?

Last edited by gorsan (2009-04-08 05:32:57)