<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<title type="html"><![CDATA[PunBB Forums - The migration to utf-8 delete text, data loss!]]></title>
	<link rel="self" href="http://punbb.informer.com/forums/feed/atom/topic/21099/"/>
	<updated>2009-04-08T05:31:44Z</updated>
	<generator>PunBB</generator>
	<id>http://punbb.informer.com/forums/topic/21099/the-migration-to-utf8-delete-text-data-loss/</id>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/126511/#p126511"/>
			<content type="html"><![CDATA[<div class="quotebox"><cite>Mr.Awesome wrote:</cite><blockquote><p>I&#039;ve past this bug by doing a backup of tables, topics, posts and forums and import them back with some modification after the upgrade --&gt; result <a href="http://futurama-france.fr/forum/index.php">http://futurama-france.fr/forum/index.php</a></p></blockquote></div><p>could you please explain what you did in more details</p><p>i tried to do what has been said here but no success.</p><p>Or anyone have an idea how to import back with modification ?</p>]]></content>
			<author>
				<name><![CDATA[gorsan]]></name>
				<uri>http://punbb.informer.com/forums/user/13930/</uri>
			</author>
			<updated>2009-04-08T05:31:44Z</updated>
			<id>http://punbb.informer.com/forums/post/126511/#p126511</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125954/#p125954"/>
			<content type="html"><![CDATA[<div class="quotebox"><cite>Parpalak wrote:</cite><blockquote><p>Are you sure that these ALTER queries will work on PostgreSQL and SQLite?</p></blockquote></div><p>I fail to see how that is relevant - database conversion code will almost certainly need to be hard-coded for every database separately. That is, if you want a reliable code.</p><div class="quotebox"><blockquote><p>It was not me who designed db_update.php so I can&#039;t explain its logic in details. To tell you the truth, I&#039;m still confused a little with all these encodings and collations in databases. But I want to fix bugs if they exists and will continue investigating.</p></blockquote></div><p>Well, my main database is Firebird so I can&#039;t really tell you details about PostgreSQL and SQLite.</p><p>With MySQL, you don&#039;t care what encoding the data is stored in the database. All you need to do to get UTF8 output, regardless of encoding actually used by the database, is:</p><p>1) Make sure table structure matches table data. Which is <strong>NOT</strong> the case with many PunBB 1.2 installations, including mine - PunBB 1.2 did not create the tables correctly.</p><p>2) Make sure SET NAMES utf8 is called before any other SQL command.</p><p>Even if #1 is not satisfied, this approach will not lead to data loss on old data - old posts will simply display incorrectly, but as soon as table structure is fixed to match the data, everything will be fine.</p><p>Upgrade script uses a much more dangerous approach of reading all data, converting it to UTF8 and writing it back.</p>]]></content>
			<author>
				<name><![CDATA[pepak]]></name>
				<uri>http://punbb.informer.com/forums/user/14781/</uri>
			</author>
			<updated>2009-03-21T05:48:38Z</updated>
			<id>http://punbb.informer.com/forums/post/125954/#p125954</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125938/#p125938"/>
			<content type="html"><![CDATA[<div class="quotebox"><cite>pepak wrote:</cite><blockquote><p>And if old charset is NOT ISO-8859-1 and neither iconv and mb_convert_encoding exist, leaves the string unchanged but tells MySQL that it is UTF8.</p></blockquote></div><p>No, a message is displayed in this case:<br /><a href="http://punbb.informer.com/trac/browser/punbb/trunk/admin/db_update.php#L426">http://punbb.informer.com/trac/browser/ &#133; e.php#L426</a></p><p>Are you sure that these ALTER queries will work on PostgreSQL and SQLite?</p><p>It was not me who designed db_update.php so I can&#039;t explain its logic in details. To tell you the truth, I&#039;m still confused a little with all these encodings and collations in databases. But I want to fix bugs if they exists and will continue investigating.</p>]]></content>
			<author>
				<name><![CDATA[Parpalak]]></name>
				<uri>http://punbb.informer.com/forums/user/13581/</uri>
			</author>
			<updated>2009-03-20T22:11:10Z</updated>
			<id>http://punbb.informer.com/forums/post/125938/#p125938</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125935/#p125935"/>
			<content type="html"><![CDATA[<div class="quotebox"><cite>Parpalak wrote:</cite><blockquote><p>Actually, the update script asks the encoding of the language pack before updating. Then it converts posts to UTF8 by calling this function for every post:<br /><a href="http://punbb.informer.com/trac/browser/punbb/trunk/admin/db_update.php#L231">http://punbb.informer.com/trac/browser/ &#133; e.php#L231</a><br />And then it tells MySQL that the data encoding is UTF8.</p></blockquote></div><p>And if old charset is NOT ISO-8859-1 and neither iconv and mb_convert_encoding exist, leaves the string unchanged but tells MySQL that it is UTF8.</p><p>What the upgrade should do, and what WOULD be foolproof provided that the user tells the correct encoding, would be a sequence of ALTER TABLEs:</p><p>1) ALTER TABLE ... ALTER [string_field] BLOB<br />2) ALTER TABLE ... ALTER [string_field] [original_type] [user&#039;s_encoding]<br />3) ALTER TABLE ... ALTER [string_field] [original_type] CHARACTER SET utf8</p><p>Or even just steps 1 and 2, those would suffice and might be even safer. Conversion to UTF8 can be done on-request thanks to SET NAMES utf8.</p>]]></content>
			<author>
				<name><![CDATA[pepak]]></name>
				<uri>http://punbb.informer.com/forums/user/14781/</uri>
			</author>
			<updated>2009-03-20T20:27:12Z</updated>
			<id>http://punbb.informer.com/forums/post/125935/#p125935</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125934/#p125934"/>
			<content type="html"><![CDATA[<p>Actually, the update script asks the encoding of the language pack before updating. Then it converts posts to UTF8 by calling this function for every post:<br /><a href="http://punbb.informer.com/trac/browser/punbb/trunk/admin/db_update.php#L231">http://punbb.informer.com/trac/browser/ &#133; e.php#L231</a><br />And then it tells MySQL that the data encoding is UTF8.</p>]]></content>
			<author>
				<name><![CDATA[Parpalak]]></name>
				<uri>http://punbb.informer.com/forums/user/13581/</uri>
			</author>
			<updated>2009-03-20T19:18:06Z</updated>
			<id>http://punbb.informer.com/forums/post/125934/#p125934</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125924/#p125924"/>
			<content type="html"><![CDATA[<div class="quotebox"><cite>Parpalak wrote:</cite><blockquote><p><a href="http://punbb.informer.com/trac/browser/punbb/trunk/admin/db_update.php#L338">http://punbb.informer.com/trac/browser/ &#133; e.php#L338</a></p></blockquote></div><p>No wonder the users are losing their data!</p><p>What this function does is, it assumes that there is UTF8 data in the table and modifies the structure to match that. If the assumption is wrong - and it will often be wrong - it will simply take data in current encoding and tell MySQL that it is in fact UTF8. Which leads to data loss in itself, and if it just happens that the source data contain sequences not permitted under UTF8, the string will likely get truncated at that point. When I was upgrading to 1.3, for example, my data was in cp1250...</p>]]></content>
			<author>
				<name><![CDATA[pepak]]></name>
				<uri>http://punbb.informer.com/forums/user/14781/</uri>
			</author>
			<updated>2009-03-20T14:41:05Z</updated>
			<id>http://punbb.informer.com/forums/post/125924/#p125924</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125923/#p125923"/>
			<content type="html"><![CDATA[<div class="quotebox"><cite>Parpalak wrote:</cite><blockquote><p>As far as I understand the update script works just like you have described:<br /><a href="http://punbb.informer.com/trac/browser/punbb/trunk/admin/db_update.php#L338">http://punbb.informer.com/trac/browser/ &#133; e.php#L338</a></p></blockquote></div><p>The difference is that the script <strong>does not know the correct charset</strong> and I see no easy way for it to recognize it.</p>]]></content>
			<author>
				<name><![CDATA[pepak]]></name>
				<uri>http://punbb.informer.com/forums/user/14781/</uri>
			</author>
			<updated>2009-03-20T14:35:48Z</updated>
			<id>http://punbb.informer.com/forums/post/125923/#p125923</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125917/#p125917"/>
			<content type="html"><![CDATA[<div class="quotebox"><cite>pepak wrote:</cite><blockquote><p>You can perform the fixing using a sequence of ALTER TABLE&#039;s:<br />1) Convert all character fields to either BLOB or BINARY without changing charset:<br />2) Convert all character fields back to the correct type with correct charset:<br />3) When all fields are converted, change the declaration of the table itself:</p></blockquote></div><p>As far as I understand the update script works just like you have described:<br /><a href="http://punbb.informer.com/trac/browser/punbb/trunk/admin/db_update.php#L338">http://punbb.informer.com/trac/browser/ &#133; e.php#L338</a></p><p>Before 1.3 release we had tested the update process and had added SET NAMES:<br /><a href="http://punbb.informer.com/trac/changeset/858/punbb/branches/punbb-1.3-dev/admin/db_update.php">http://punbb.informer.com/trac/changese &#133; update.php</a></p><p>Maybe this SET NAMES call is wrong, we&#039;ll continue testing.</p>]]></content>
			<author>
				<name><![CDATA[Parpalak]]></name>
				<uri>http://punbb.informer.com/forums/user/13581/</uri>
			</author>
			<updated>2009-03-20T10:52:39Z</updated>
			<id>http://punbb.informer.com/forums/post/125917/#p125917</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125680/#p125680"/>
			<content type="html"><![CDATA[<div class="quotebox"><cite>Parpalak wrote:</cite><blockquote><p>Posts and topics processing is required because 1.3 uses UTF-8. Also its parser works in a different way.</p></blockquote></div><p>Yes, of course. What I am trying to say is that there is no reason to convert the <strong>database</strong> during upgrade. MySQL will accept and return data in UTF8 regardless of what encoding is used on the tables, as long as SET NAMES utf8 is used.</p><p>As for repeatability, my post should do it just fine.<br /><a href="http://punbb.informer.com/forums/post/125628/#p125628">http://punbb.informer.com/forums/post/125628/#p125628</a></p>]]></content>
			<author>
				<name><![CDATA[pepak]]></name>
				<uri>http://punbb.informer.com/forums/user/14781/</uri>
			</author>
			<updated>2009-03-15T06:12:27Z</updated>
			<id>http://punbb.informer.com/forums/post/125680/#p125680</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125677/#p125677"/>
			<content type="html"><![CDATA[<p>I&#039;ve past this bug by doing a backup of tables, topics, posts and forums and import them back with some modification after the upgrade --&gt; result <a href="http://futurama-france.fr/forum/index.php">http://futurama-france.fr/forum/index.php</a></p>]]></content>
			<author>
				<name><![CDATA[Mr.Awesome]]></name>
				<uri>http://punbb.informer.com/forums/user/15046/</uri>
			</author>
			<updated>2009-03-15T02:03:19Z</updated>
			<id>http://punbb.informer.com/forums/post/125677/#p125677</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125676/#p125676"/>
			<content type="html"><![CDATA[<div class="quotebox"><cite>Slavok wrote:</cite><blockquote><p>I can&#039;t reproduce this data loss. I created a db with latin1_spanish_ci collation, wrote a post with message &quot;123á456&quot;, and after the migration I do see it. What did I miss?</p></blockquote></div><p>Hi Slavok,<br />have you configured your mysql following these screenshots?<br /><a href="http://punbb.informer.com/forums/post/124662/#p124662">http://punbb.informer.com/forums/post/124662/#p124662</a></p><p>Parpalak,<br />glad to know you will look a the bug next week. </p><p>Many thanks to you both<br />Oliver</p>]]></content>
			<author>
				<name><![CDATA[oliversl]]></name>
				<uri>http://punbb.informer.com/forums/user/7537/</uri>
			</author>
			<updated>2009-03-15T01:40:44Z</updated>
			<id>http://punbb.informer.com/forums/post/125676/#p125676</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125673/#p125673"/>
			<content type="html"><![CDATA[<p>Ok, I&#039;ll try to do this next week.</p>]]></content>
			<author>
				<name><![CDATA[Parpalak]]></name>
				<uri>http://punbb.informer.com/forums/user/13581/</uri>
			</author>
			<updated>2009-03-14T22:57:05Z</updated>
			<id>http://punbb.informer.com/forums/post/125673/#p125673</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125672/#p125672"/>
			<content type="html"><![CDATA[<p>I can send you my database and you&#039;ll reproduce the bug as much as you want lol. Just use a punbb 1.2 with a french language pack, post some messages with &quot;é,è,à ..&quot; and do an upgradeto 1.3. The tables wich are in latin1 are gonna be converted in utf-8 and all the data after a &quot;é,è,à ..&quot; are gonna be lost.</p>]]></content>
			<author>
				<name><![CDATA[Mr.Awesome]]></name>
				<uri>http://punbb.informer.com/forums/user/15046/</uri>
			</author>
			<updated>2009-03-14T22:49:52Z</updated>
			<id>http://punbb.informer.com/forums/post/125672/#p125672</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125671/#p125671"/>
			<content type="html"><![CDATA[<p>Posts and topics processing is required because 1.3 uses UTF-8. Also its parser works in a different way.</p><div class="quotebox"><cite>oliversl wrote:</cite><blockquote><p>People use to read documentation prior warnings in a install script.<br /><a href="http://punbb.informer.com/docs/install.html">http://punbb.informer.com/docs/install.html</a></p></blockquote></div><p>Actually there is no instruction of upgrading from 1.2 to 1.3. The instruction is here:<br /><a href="http://punbb.informer.com/wiki/punbb13/faq#migration_from_punbb_1.2">http://punbb.informer.com/wiki/punbb13/ &#133; _punbb_1.2</a></p><p>It&#039;s our oversight if people use old documentation instead new one. We&#039;ll move all docs to wiki.</p><p>The issue has already been acknowledged:<br /><a href="http://punbb.informer.com/wiki/punbb13/bugs#punbb_1.3.2_bugs">http://punbb.informer.com/wiki/punbb13/ &#133; 1.3.2_bugs</a><br />However, we have no precise description of the issue (due to a lack of repeatability) and consequently can&#039;t suggest a fix.</p>]]></content>
			<author>
				<name><![CDATA[Parpalak]]></name>
				<uri>http://punbb.informer.com/forums/user/13581/</uri>
			</author>
			<updated>2009-03-14T22:32:00Z</updated>
			<id>http://punbb.informer.com/forums/post/125671/#p125671</id>
		</entry>
		<entry>
			<title type="html"><![CDATA[Re: The migration to utf-8 delete text, data loss!]]></title>
			<link rel="alternate" href="http://punbb.informer.com/forums/post/125667/#p125667"/>
			<content type="html"><![CDATA[<p>(Another issue is that there was no point in converting the tables to UTF8. The MySQL server would happily convert them to UTF8 &quot;on the fly&quot;. That would not solve the problem of incorrect characters appearing, but it <strong>would</strong> solve these issues of data loss. Too late to do something about that now if you did do a 1.3 upgrade; the developers could, and IMHO should, release a new version without UTF conversion for those who didn&#039;t upgrade yet.)</p>]]></content>
			<author>
				<name><![CDATA[pepak]]></name>
				<uri>http://punbb.informer.com/forums/user/14781/</uri>
			</author>
			<updated>2009-03-14T21:56:03Z</updated>
			<id>http://punbb.informer.com/forums/post/125667/#p125667</id>
		</entry>
</feed>
