1

Topic: Indexing is not unicode-safe

Hello, a wonderful engine and wonderful people here :-)

I installed punBB on my site. Did some little modifications to it as well (needed to integrate it into our own login system, but thanks to the advent of PostgreSQL this was a matter of making a view and a couple of triggers). The question follows.

I am storing my database in UTF-8. I used a russian localization and changed locale definitions there to ru_RU.UTF-8, after which I converted the localization files themselves into UTF-8. The board was working OK, but when creating a new message (or a new topic) with more than one word in it, the engine was giving me a "Cannot create search index" error (or the like), referencing to line 127 of search_idx.php

Currently I solved this problem by reinstalling russian language packs and hacking the DBDriver of PunBB th set the client encoding of the database connection accordingly. However, there is "one more thing" - alll of the output from the site goes into a UTF-8 encoded XSL template. If I will leave the forum engine in Win-1251 I am in for all kinds of problems when submitting forms (even if I will convert the forum data after with ob_get_contents()). I cannot use Win-1251 (most XML processors do not support it, and the main part of the site is anyways UTF-8 already).

The question follows - how can I modify the regex at line 127 so that it is multibyte-safe, or maybe it is possible to overload the regex engine with mbstring functions so that they can become multibyte-safe automatically?

Thanks in advance, and keep up the good work. PunBB certainly thrilled me :-)

I am running v. 1.1.5

Re: Indexing is not unicode-safe

Try replacing that line with this:

$result = $db->query('SELECT id, word FROM '.$db->prefix.'search_words WHERE word IN('.implode(',', preg_replace('#^(.*)$#u', '\'\1\'', $unique_words)).')') or error('Unable to fetch search index words', __FILE__, __LINE__, $db->error());

The only difference is the modifier u at the end.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

3

Re: Indexing is not unicode-safe

Tried it, still a no go.

Here is what I get when I look at my PostgreSQL statement log (enabled it especially for the case). Please note that queries before this one are logged with proper letters I am using (russian UTF-8 in this case). What comes out of this regex certainly doesn't look healthy to me.

Aug 28 23:26:19 exile PostgresCluster[3218]: [16-1] LOG:  statement: SELECT id, word FROM punbb_search_words WHERE word IN('XX','XX')
Aug 28 23:26:19 exile PostgresCluster[3218]: [17-1] ERROR:  Unicode characters greater than or equal to 0x10000 are not supported

In place of XX I see 2 garbage characters of unknow origin. What is this query supposed to do actually? Sorry to be so anal, I am just wondering.

Re: Indexing is not unicode-safe

Hmm, I'm sorry, but I have no idea what's going wrong. I have to confess I am far from an expert on localisation.

$unique_words contains a list of all the words in the post to be indexed. The select query then determines which of these words are already indexed so that it can insert only words that are new to the index.

Have you tried it with different browsers? I know it sounds silly, but I know IE handles "exotic" locales differently than e.g. Firefox.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

5 (edited by Julik 2004-08-29 13:03)

Re: Indexing is not unicode-safe

I tried it with all good-mannered browsers I have (Firefox and Safari). Got this practice of developing for good user agents before anything even gets to IE.

There may (!) be a problem with the fact that the error screen I get is sent as ISO charset, but the POST data should still be accurate.

I have to find out what is going on with the words before they get to the database (instead of your regex I tried to substitute a generic database quoting routine from ADODB, same result, so I guess the characters get mangled even before they get to be quoted for the database), because everything else just works fine - postin g(if your post contains one long word), userlists etc.

How can I trace variables that go through this script, I tried putting print_r there but your script evidently suppresses all output.

Probably this is just a case of some array_ function of PHP being not multibyte-safe.

6 (edited by Julik 2004-08-29 14:48)

Re: Indexing is not unicode-safe

Looks like I nailed it.

The regexes in the split_words functrion were destroying all multibyte text.

Currently I made the following fixes.

1. Went to the .htaccess and set the following:

php_value mbstring.internal_encoding 'UTF-8'
php_value mbstring.func_overload '7'

This way the mbstring extension will overload all string manipulation functions and make the multibyte-safe for us. I assume it is possible to include these calls in locale files of punBB (make a variable called, say, $mbstring_encoding, and set it to 'UTF-8' or whatever). If the mbstring extension is compiled, the respective calls should be made (see http://www.php.net/manual/en/function.m … coding.php).

2. Replaced regexes in split_words with array walks and string functions.
3. Replaced your regexes in index insertion SQL requests with pg_escape_string() calls (it is myltibyte-safe and also a native libpq function that should be faster). Essentially it is a good idea to make a qstr handler in the DBLayer (in a fashion of $db->quote ($words_array))

Seems to work, let me know if you need the code.

Re: Indexing is not unicode-safe

Julik wrote:

How can I trace variables that go through this script, I tried putting print_r there but your script evidently suppresses all output.

Try dump($foo). It's a small debugging function I use very frequently :)

Julik wrote:

Looks like I nailed it.

The regexes in the split_words functrion were destroying all multibyte text.

Currently I made the following fixes.

1. Went to the .htaccess and set the following:

php_value mbstring.internal_encoding 'UTF-8'
php_value mbstring.func_overload '7'

This way the mbstring extension will overload all string manipulation functions and make the multibyte-safe for us. I assume it is possible to include these calls in locale files of punBB (make a variable called, say, $mbstring_encoding, and set it to 'UTF-8' or whatever). If the mbstring extension is compiled, the respective calls should be made (see http://www.php.net/manual/en/function.m … coding.php).

2. Replaced regexes in split_words with array walks and string functions.
3. Replaced your regexes in index insertion SQL requests with pg_escape_string() calls (it is myltibyte-safe and also a native libpq function that should be faster). Essentially it is a good idea to make a qstr handler in the DBLayer (in a fashion of $db->quote ($words_array))

Seems to work, let me know if you need the code.

Nice job! I would love to have a look at the code changes you've done (you can e-mail it to me). However, I don't believe it is material for the official codebase just yet. mbstring isn't included in the default install of PHP and thus, lots of systems are up and running without it. It also comes with a performance penalty in comparison to the native singlebyte string routines in PHP (which are used very frequently I might add). It's a sad story, but I don't believe it's possible to write a UTF-8 capable discussion board in PHP today that works with multiple databases and the millions of different configurations that occur. There's always going to be something missing or something misconfigured.

Oh, regarding DB specific quoting. From the changelog for 1.2:

*  Moved function escape() into the DB layer. To call the function, use
   $db->escape(). The move of the function to the DB layer is required
   because different databases use different escaping mechanisms. E.g. just
   using addslashes() doesn't work for SQLite.
"Programming is like sex: one mistake and you have to support it for the rest of your life."

8 (edited by Julik 2004-08-29 21:37)

Re: Indexing is not unicode-safe

Sure it is not. However, instead of declaring that 'punBB is not multibyte-safe' you can make a code fork (not a big one, I mind you).

What I thouhgt is that, essentially, you write in the readme - 'you need it to work multibyte - presto, substitute file X with file Y and you are all set, you need extension Z and extension K for this to work'). People who gonna need this (for example Asian PHP installs) usually have the extension builtin and usually they are familiar with the issue.

PostgreSQL for instance (as well as MySQL of recent versions) both have calls to determine the current client charset and change it. Th only problem I encountered as of yet is with this indexing feature (which is apparently related to the regexps, and it seems that mbstring is not overloading the PCRE extension).

I will send you the code - just make a diff with the one of yours (mine will surely work slowly because I never could grasp regexps).

BTW. a good candidate for a UTF-8 solution is this very forum - just look at these fancy ????? in my post in the russian section :-)

Re: Indexing is not unicode-safe

A good idea. I'll have a look at it tomorrow. I'm just too tired right now :)

"Programming is like sex: one mistake and you have to support it for the rest of your life."

10

Re: Indexing is not unicode-safe

I'm having the same problems with my romanian forum... is this fixed already, or can u Julik sen me a copy of that code... thanx.

11

Re: Indexing is not unicode-safe

I don't know, I made this patch maybe a couple of years ago. Now I would suggest you to push Rickard to do it internally in the proper fashion.

12

Re: Indexing is not unicode-safe

I added that htaccess thingy, not sure, but seems fixed, even reindexing works...

Re: Indexing is not unicode-safe

Rickard wrote:

However, I don't believe it is material for the official codebase just yet. mbstring isn't included in the default install of PHP and thus, lots of systems are up and running without it. It also comes with a performance penalty in comparison to the native singlebyte string routines in PHP (which are used very frequently I might add). It's a sad story, but I don't believe it's possible to write a UTF-8 capable discussion board in PHP today that works with multiple databases and the millions of different configurations that occur. There's always going to be something missing or something misconfigured.

Textpattern is no board, but it handle utf-8 quite nice. wink

14 (edited by Eugene One Gin 2006-07-12 04:39)

Re: Indexing is not unicode-safe

Tried some of the changes above.

Looks like the the function strtolower was killing the multibyte characters, so, the regex was not really the problem. I had to replace strtolower  with mb_strtolower in both search_idx.php and search.php. After doing that, the words are inserted into the database properly and searching works too.

PHP5, MySQL 4.1, Apache 2.2, Win32. Will try my changes later on PHP 4.x on Apache 1.3.x / Unix.

So, an mbstring-based fork may be desirable, because hosts do run mbstring.

15

Re: Indexing is not unicode-safe

This one helped me out:

http://punbb.org/forums/viewtopic.php?pid=54551#p54551

http://www.info-mob.com/forum/ - Croatian forum only, don't bother if you don't speak Croatian :)

16 (edited by Eugene One Gin 2006-07-12 19:22)

Re: Indexing is not unicode-safe

To summarize what I have seen so far in this forum:

1) 'lang_encoding'    =>'utf-8' in /lang/English/common.php
2) ini_set("default_charset", "utf-8"); in /config.php
3) $db->query("SET NAMES 'UTF8'"); in /include/dblayer/common_db.php
4) utf_general_ci for database
5) word varbinary(20) for 'word' column in 'search_words' table

IMHO (correct me if I am wrong here)

(1) puts charset=utf-8 in mega tags
(2) makes PunBB output charset=utf-8 in HTTP headers.
(3) and (4) allows inserting UTF-8 into the database
(5) gets rid of the error "Unable to insert search index words".

However, there is this code in search_idx.php

    $text = preg_replace($patterns, ' ', ' '.strtolower($text).' ');

which is not Unicode-safe. And at exactly this step, the text gets mangled and after that, the junk is inserted into the search_words table instead of the lowercase words.

In search.php, there is this code

strtolower(trim($_GET['keywords']))

and this passes the junk to the search function.

The strtolower alters the individual bytes (because it assumes that characters are single byte); therefore it is not suitable for altering UTF-8 strings, where some bytes are not altered the same way. When I look into the database with phpMyAdmin, I see question marks inside black losanges instead of Russian words.

The solution posted by Julik is

mbstring.func_overload = 7
mbstring.internal_encoding = UTF-8

Setting mbstring.func_overload to 7 to replace string functions with their multibyte equivalents automatically. I tested it and found it working with PHP5 and .htaccess (because it is INI_PERDIR in PHP5). In PHP4, that parameter is INI_SYSTEM which is not usable in shared hosting environment. In addition to that, if I use that parameter, there is a performance hit. I only need some of the function to be UTF-8-aware, not all of them.

String functions that do not operate on individual characters in strings should be (IMHO) UTF-8 safe, because UTF-8 is backwards-compatible with 8-bit single-byte encoding. Sorting should work too, to some extent (alphabetical case-sensitive Unicode sorting). Functions like changing case, however, are not UTF-8 safe. This is why I replaced strtolower with mb_strtolower function where strtolower occurred.

I just changed varbinary(20) back to varchar(20) and rebuilt the index. The forum still works and I did not get the pesky "Unable to insert search index words" error. So, it looks like varbinary is not necessary. Now I think that varbinary(20) was used to accomodate the characters mangled by strtolower function.

I will do another install (with the latest release instead of the 1.3-dev this time) and then will post my own instructions on how to make PunBB work with UTF-8 (including searching).

17

Re: Indexing is not unicode-safe

http://punbb.org/forums/viewtopic.php?pid=71384#p71384

You are right about utf-8 safe functions, but I don't think that you can use mb* functions on every hosting provider.

http://www.info-mob.com/forum/ - Croatian forum only, don't bother if you don't speak Croatian :)

18

Re: Indexing is not unicode-safe

And (3) enables inserting and reading to/from database in UTF-8...

http://www.info-mob.com/forum/ - Croatian forum only, don't bother if you don't speak Croatian :)

19

Re: Indexing is not unicode-safe

One more bad thing, split_words() in search_idx.php at line 63 trashes some of UTF strings badly. The solution is:

$text = trim(preg_replace('#\s+#u', ' ', $text));

It's just adding of 'u' to regex.

Without that it behaves badly on russian word pahat'.

Re: Indexing is not unicode-safe

I found this page, PHP UTF-8 cheatsheet, which is essentially a quick guide for converting PHP applications to UTF-8.

However, they do not mention the importance of issuing SET NAMES 'utf8' to the database and do not mention the PCRE_UTF8 /u modifier for regular expressions.

Re: Indexing is not unicode-safe

Solution for error Unable to insert search index words:
word varchar(40) or
word varbinary(40)

for non-8bit UTF-8 words when PHP mb_strlen($word) == 20 the MySQL length(word) == 40

Russian-reading users may look this: http://punbb.ru/viewtopic.php?id=1222

DigitalOcean: VPS from $5/mon. Get $10 bonus!.