Topic: The migration to utf-8 delete text, data loss!

Hi,
the conversion to utf-8 in the punbb installation/upgrade script deletes all the text that comes after an accented character (like áéíóú).

This means that upgrading to punbb 1.3 will mean loss of data. You will end up with text that just stops. It's a problem in the database; I confirmed it with phpMyAdmin.

Don't know why a forum upgrade has to modify any user data in the database.

Don't know why forum software cannot use the official php files, instead of shipping with some copy of php4 files.

Right now I have to roll back and lose hours of work. The punbb migration from 1.2 to 1.3 is not looking good. Sorry about the flame, but I never found any problem with punbb until now, and I'm starting not to like it as much as I used to ...

http://tinymailto.com/oliversl <-- my email after a captcha

Re: The migration to utf-8 delete text, data loss!

What DB, version are you using? Could you send us the dump of your database, please?

Re: The migration to utf-8 delete text, data loss!

Slavok wrote:

What DB, version are you using? Could you send us the dump of your database, please?

I was using MySQL. Just create an empty DB, install punbb 1.2, insert a text like "123á456" and upgrade to punbb 1.3. You will lose the string "456" in all the text columns of your DB :(

I will try to make a dump of the DB.

Regards,
Oliver

Re: The migration to utf-8 delete text, data loss!

Hi Slavok,
I have a DB dump now, how can I send it to you?

I have CentOS 4.7, MySQL 4.1.22, MySQL charset UTF-8, MySQL tables with collation latin1_spanish_ci

Meanwhile I will do the upgrade again with a small amount of data in the DB.

Thanks for your time
Oliver

Re: The migration to utf-8 delete text, data loss!

Hi Slavok,
I have now confirmed and reproduced the bug.

I created a topic with this subject:
"Topic 1 áéíóú this is a test"

and after the upgrade, the topic subject ended up as (I checked in phpMyAdmin too):
"Topic 1 "

Please configure any punbb 1.2.x test forum like this, and then upgrade it to punbb 1.3.2 to confirm the bug:

Image 1
Image 2
Image 3

Thanks
Oliver
P.S.: this does not work on punbb.org; how do I post screenshots?

Re: The migration to utf-8 delete text, data loss!

I noticed this morning that exactly the same issue will block upgrading my forum from 1.2.19 to 1.3.2.

My current installation of punBB is on MySQL 4.1.20, MySQL charset UTF-8, MySQL tables with collation latin1_general_ci

Upon upgrade I had exactly the same result as oliver: all database entries were chopped off at the point where a non-ASCII character (typically é or è in my case) was encountered.
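
One possible explanation for the cut-off, offered as an assumption and not as something confirmed in this thread: if the stored bytes are latin1 but get treated as UTF-8 somewhere during the upgrade, the first accented byte is not valid UTF-8, and only the text before it survives. A minimal Python sketch of that effect (valid_utf8_prefix is a hypothetical helper, not part of punbb or MySQL):

```python
def valid_utf8_prefix(raw: bytes) -> str:
    """Return the longest leading part of raw that is valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that cannot be decoded.
        return raw[:err.start].decode("utf-8")


# The subject from the report above, as the bytes a latin1 column would hold.
subject = "Topic 1 áéíóú this is a test"
latin1_bytes = subject.encode("latin-1")

# Everything from the first accented character onward is dropped.
print(repr(valid_utf8_prefix(latin1_bytes)))
```

This only models the symptom (text that stops at the first non-ASCII character); it does not show what the upgrade script actually does.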

Cheers,
kingka

Re: The migration to utf-8 delete text, data loss!

Slavok, how can we move forward on a solution?

It's clear punbb 1.3.2 is causing massive data loss when a forum is not in English.

Many thanks
Oliver

Re: The migration to utf-8 delete text, data loss!

Actually, the problem is almost certainly caused by a mismatch between the charset declared in your table definition and the charset of the data actually stored in the tables: your tables are declared as latin1_spanish_ci, but the actual data is something else (e.g. cp1250). I had that problem myself and had to fix it first. After I made sure the definition and the data matched, the upgrade went just fine.

You can fix this with a sequence of ALTER TABLE statements (tablename, fieldname, real_charset and real_collation are placeholders):
1) Convert all character fields to either BLOB or BINARY without changing charset:

ALTER TABLE tablename MODIFY fieldname BLOB;

2) Convert all character fields back to the correct type with the correct charset:

ALTER TABLE tablename MODIFY fieldname VARCHAR(100) CHARACTER SET real_charset COLLATE real_collation NOT NULL;

3) When all fields are converted, change the declaration of the table itself:

ALTER TABLE tablename CHARACTER SET real_charset COLLATE real_collation;

Also, you may prefer to try it out on a copy of your data - just copy all tables to a new name using this sequence:

CREATE TABLE xyz_posts LIKE punbb_posts;
INSERT INTO xyz_posts SELECT * FROM punbb_posts;

Then just rewrite $db_prefix in your config.php from punbb_ to xyz_
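
Since every text column needs the same treatment, the ALTER TABLE sequence described in this post can be generated instead of typed by hand. Below is a minimal Python sketch of such a generator. It is an illustration only: charset_fix_statements is a hypothetical helper, the column list is made up and must be replaced with your real schema, and cp1250 / cp1250_general_ci are just example values for the real charset and collation.

```python
def charset_fix_statements(table, columns, charset, collation):
    """Build the ALTER TABLE statements for one table.

    columns is a list of (name, type, constraints) tuples,
    e.g. ("subject", "VARCHAR(255)", "NOT NULL").
    """
    statements = []
    # Step 1: switch every text column to BLOB so the stored bytes are not converted.
    for name, _type, _constraints in columns:
        statements.append(f"ALTER TABLE {table} MODIFY {name} BLOB;")
    # Step 2: switch back to the text type, this time declaring the real charset.
    for name, coltype, constraints in columns:
        parts = [
            f"ALTER TABLE {table} MODIFY {name} {coltype}",
            f"CHARACTER SET {charset} COLLATE {collation}",
            constraints,
        ]
        statements.append(" ".join(p for p in parts if p) + ";")
    # Step 3: update the table's own default charset.
    statements.append(
        f"ALTER TABLE {table} CHARACTER SET {charset} COLLATE {collation};"
    )
    return statements


# Hypothetical example: one column on the copied table.
for sql in charset_fix_statements(
    "xyz_posts", [("message", "TEXT", "")], "cp1250", "cp1250_general_ci"
):
    print(sql)
```

The sketch only prints SQL; run the output against the copied tables first. Indexed columns may need extra handling, which this sketch does not attempt.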

Re: The migration to utf-8 delete text, data loss!

Thanks pepak,
have you done this all manually? There are many many columns in each table.

Shouldn't the "upgrade" script have error checking in such a case?

Oliver

Re: The migration to utf-8 delete text, data loss!

oliversl wrote:

have you done this all manually? There are many many columns in each table.

Semi-manually. I wrote a script that can generate the sequence of SQL commands, as long as you take care to set up its content correctly.

Shouldn't the "upgrade" script have error checking in such a case?

The main problem is finding which character set is actually used for the data. I can't really imagine how a script would do that - it really has precious little information to work with.
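
To illustrate how little there is to work with, here is a minimal Python sketch that checks which encodings can decode a byte string without errors. candidate_encodings is a hypothetical helper and the candidate list is only an example; Python's codecs are used as stand-ins for the MySQL charsets.

```python
def candidate_encodings(raw, candidates=("utf-8", "latin-1", "cp1250")):
    """Return the candidate encodings that decode raw without an error."""
    plausible = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        plausible.append(name)
    return plausible


# The same text stored as latin1 bytes and as UTF-8 bytes.
print(candidate_encodings("123á456".encode("latin-1")))
print(candidate_encodings("123á456".encode("utf-8")))
```

In both cases more than one candidate is left over. A failed UTF-8 decode can rule UTF-8 out, but single-byte charsets accept almost any byte sequence, so a check like this cannot tell them apart; a person who can read the text has to decide.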

You can have my script, but YOU MUST HAVE A BACKUP OF YOUR DATA BEFORE YOU USE IT and it is your responsibility to find the correct character sets. If you fuck up your data because you didn't back up your table and didn't find out the correct character set, don't ask me to help you - in all probability your data will have been lost already.

http://www.studna.net/temp/convertmysql.zip

Re: The migration to utf-8 delete text, data loss!

Many thanks pepak.

What I want is to stop the further data loss that the upgrade script causes. The users should be the ones who migrate the data to UTF-8, *not* punbb's upgrade script.

Slavok, please remove the UTF-8 conversion code from the migration script. It is not fail-safe, and it is not safe to keep distributing it.

Thanks
Oliver

Re: The migration to utf-8 delete text, data loss!

I can't reproduce this data loss. I created a db with latin1_spanish_ci collation, wrote a post with message "123á456", and after the migration I do see it. What did I miss?

Re: The migration to utf-8 delete text, data loss!

Got the same problem, what the hell? Can you help me? I'm no expert with databases.

Re: The migration to utf-8 delete text, data loss!

Slavok wrote:

I can't reproduce this data loss. I created a db with latin1_spanish_ci collation, wrote a post with message "123á456", and after the migration I do see it. What did I miss?

1) Create a table:

CREATE TABLE chartest (
  id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
  text VARCHAR(50) CHARACTER SET latin1 COLLATE latin1_swedish_ci
);

2) Create a form which accepts data in a different encoding than latin1:

<?php
// Insert the submitted bytes as-is; nothing here tells MySQL they are Windows-1250.
if (!empty($_GET['text']))
  if ($conn = mysql_connect(...))
    if (mysql_select_db(...))
      // Name the column so that MySQL fills in the auto-increment id.
      mysql_query("INSERT INTO chartest (text) VALUES ('".addslashes($_GET['text'])."')");
?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1250">
</head>
<body>
<form method="get">
<input type="text" name="text" value="<?php echo isset($_GET['text']) ? htmlspecialchars($_GET['text']) : ''; ?>">
<input type="submit">
</form>
</body>
</html>

3) Now submit some texts containing characters that are not present in latin1.

Příliš žluťoučký kůň úpěl ďábelské ódy

4) As long as you keep MySQL's charset to latin1 and page's charset to Windows-1250, the texts read from the table will appear correctly. But you must not perform any conversions - those will lead to data loss. E.g. this is a no-no:

ALTER TABLE chartest
  MODIFY text VARCHAR(50) CHARACTER SET utf8;

This, too:

SET NAMES utf8; -- page is now in UTF-8
SELECT * FROM chartest;
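
The effect of such a conversion can be modeled outside the database. The Python sketch below is a simplified model, not a reproduction of MySQL: Python's latin-1 codec stands in for MySQL's latin1 (the two differ for some bytes in the 0x80-0x9F range), and the sketch shows garbling only, not the truncation reported earlier in this thread.

```python
original = "Příliš žluťoučký kůň úpěl ďábelské ódy"

# The browser sends Windows-1250 bytes; the latin1 column stores them unchanged.
stored = original.encode("cp1250")

# As long as the page stays in Windows-1250, the same bytes display correctly.
assert stored.decode("cp1250") == original

# A conversion to UTF-8 trusts the declared charset, so each byte is read as latin1 first.
converted = stored.decode("latin-1").encode("utf-8")

# The result is valid UTF-8, but it is no longer the original text.
garbled = converted.decode("utf-8")
print(garbled == original)  # prints False
```

The conversion itself does nothing wrong with the information it has; the damage comes from the declared charset not matching the bytes.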

Re: The migration to utf-8 delete text, data loss!

So how can I fix this?

Re: The migration to utf-8 delete text, data loss!

Mr.Awesome wrote:

So how can I fix this?

There is no fix from punbb; just revert your DB to a recent backup and stay on punbb 1.2.

HTH
Oliver

Re: The migration to utf-8 delete text, data loss!

Are you serious? It is amateur work to have such a flaw :s

Re: The migration to utf-8 delete text, data loss!

I think the problem is that punbb cannot confirm it, so there is no fix. All we can do is keep reporting it until they can confirm it.

For me it was a big deal: I lost some data because my backup was not very recent. Nowhere in the upgrade instructions was there a warning about modifying the DB data. Never before has a punbb upgrade modified user DB data :(

Re: The migration to utf-8 delete text, data loss!

oliversl wrote:

Nowhere in the upgrade instructions was there a warning about modifying the DB data.

Seems like the warning is there.

Re: The migration to utf-8 delete text, data loss!

If you have been updating punbb 1.2.x in the past, you know that punbb always updated the DB, but only the data used by the punbb installation itself.
Never before has punbb modified user data (posts and topics), and no warning about that change in behavior was ever given.

People usually read the documentation before they see any warnings in an install script.
http://punbb.informer.com/docs/install.html

It would make sense for the punbb developers to acknowledge the problem and try to find a solution with the community. It would also really help to document big changes in the upgrade procedure, like parsing and changing the charset of your posts and topics.

HTH
Oliver

Re: The migration to utf-8 delete text, data loss!

Once again, developers can't do much if the data in the tables doesn't match table structures - there is very little (if anything) they can do to detect this problem, even less to fix it. This needs to be done manually, e.g. with the help of my script above.

Re: The migration to utf-8 delete text, data loss!

(Another issue is that there was no point in converting the tables to UTF8. The MySQL server would happily convert them to UTF8 "on the fly". That would not solve the problem of incorrect characters appearing, but it would solve these issues of data loss. Too late to do something about that now if you did do a 1.3 upgrade; the developers could, and IMHO should, release a new version without UTF conversion for those who didn't upgrade yet.)

Re: The migration to utf-8 delete text, data loss!

Posts and topics processing is required because 1.3 uses UTF-8. Also its parser works in a different way.

oliversl wrote:

People usually read the documentation before they see any warnings in an install script.
http://punbb.informer.com/docs/install.html

Actually, that page has no instructions for upgrading from 1.2 to 1.3. The instructions are here:
http://punbb.informer.com/wiki/punbb13/ … _punbb_1.2

It's our oversight if people use the old documentation instead of the new one. We'll move all docs to the wiki.

The issue has already been acknowledged:
http://punbb.informer.com/wiki/punbb13/ … 1.3.2_bugs
However, we have no precise description of the issue (due to a lack of repeatability) and consequently can't suggest a fix.

Re: The migration to utf-8 delete text, data loss!

I can send you my database and you'll be able to reproduce the bug as much as you want lol. Just use a punbb 1.2 with a French language pack, post some messages with "é,è,à .." and upgrade to 1.3. The tables which are in latin1 are going to be converted to utf-8, and all the data after an "é,è,à .." is going to be lost.

Re: The migration to utf-8 delete text, data loss!

Ok, I'll try to do this next week.