1

(88 replies, posted in Archive)

Took a closer look (on TextDrive, actually).

Created a test for a utf8 string and measured it with mb_strlen.
By default mb_internal_encoding() is set to ISO-8859-1 (not a good choice), but once I switched it to utf-8, everything became shining bright: mb_strlen shows the correct length of the string. I have yet to check how the mb_* analogs of the strto(lower|upper) functions work.
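
For anyone who wants to repeat the check, here is a minimal sketch of it, assuming the mbstring extension is loaded. The test word and the variable names are mine, not from the original test; the string is written as raw bytes so the encoding of the script file doesn't matter.

```php
<?php
// Hypothetical test string: the Russian word "Privet" in Cyrillic,
// 6 letters, each taking 2 bytes in UTF-8 (12 bytes total).
$s = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

// Tell mbstring that strings are UTF-8 instead of the ISO-8859-1 default.
mb_internal_encoding('UTF-8');

$bytes = strlen($s);                         // counts bytes
$chars = mb_strlen($s);                      // counts characters (uses internal encoding)
$charsAsLatin1 = mb_strlen($s, 'ISO-8859-1'); // what the old default would report

// Case conversion with the mb_* analog of strtoupper.
$upper = mb_strtoupper($s);

echo "strlen: $bytes, mb_strlen: $chars, mb_strlen as ISO-8859-1: $charsAsLatin1\n";
echo "upper-cased bytes: ", bin2hex($upper), "\n";
```

With ISO-8859-1 as the encoding every byte counts as one character, which is why the default setting gives the wrong length for UTF-8 data.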

And the main piece of good news is the /u modifier in the preg_* functions:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32.

It basically means that once you have a reasonable set-up and a sane hosting company (one that gives you mbstring in PHP and lets you set MySQL to store strings in UTF-8 instead of some single national charset; utf8 allows correct sorting and substring searches), you are open to new adventures with much less i18n hassle on your hands.
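
A small sketch of what the /u modifier changes in practice. The sample string and variable names are made up for illustration; only preg_match_all and preg_split from the standard PCRE extension are used.

```php
<?php
// Hypothetical sample: 6 Cyrillic letters, 12 bytes in UTF-8.
$s = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

// Without /u the dot matches one byte at a time.
$byteMatches = preg_match_all('/./s', $s, $m1);

// With /u the pattern and subject are treated as UTF-8,
// so the dot matches one whole character at a time.
$charMatches = preg_match_all('/./su', $s, $m2);

// Splitting on an empty pattern with /u yields one array item per character.
$letters = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

echo "matches without /u: $byteMatches\n";
echo "matches with /u: $charMatches\n";
echo "letters after split: ", count($letters), "\n";
```

Note that with /u the subject has to be valid UTF-8, otherwise the preg_* call fails instead of matching.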

2

(88 replies, posted in Archive)

??? ???? 1251: ????? ??????????????? ????? ???? ????????, ?????, ???????? ? ??? ??????????????? ?????. ??? ???? ????? ??????????? ? ?????? ???????????? ????????? (??? ? ????????) ??? ????? ?? ??????????? ?????????????? (??? ? ????????), ?? ??? ?? ???????????? ???????.

3

(88 replies, posted in Archive)

*sigh*
Looks like it will be a long wait, even though work on native Unicode support was presented more than a year ago:
http://www.phpn.org/item/15777_IPC_Day_ … azine.html

Also there's a post by Sam Ruby on the issue:
http://intertwingly.net/blog/2004/10/01 … d-Unicode/

This means I'll have to lobby hosting support to get mbstring included in PHP. It also means delayed Unicode support in projects like PunBB and others, because there's only a limited number of uses for software that can't handle multibyte strings correctly. And it raises the question of whether it is better to support mbstring for the small number of installations that will have it enabled, or not to bother developing this branch at all.

Bleh. How depressing these things can get at times.
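
One possible middle ground, sketched here as an assumption rather than anything PunBB actually does: use mbstring where the host has it and fall back to a /u regex where it doesn't. The function name utf8_strlen is hypothetical.

```php
<?php
// Hypothetical helper: character count for a UTF-8 string that works
// with or without the mbstring extension.
function utf8_strlen($s)
{
    if (function_exists('mb_strlen')) {
        // Preferred path: mbstring is available on this host.
        return mb_strlen($s, 'UTF-8');
    }
    // Fallback: count one match per UTF-8 character using PCRE's /u modifier.
    return preg_match_all('/./su', $s, $matches);
}

// 6 Cyrillic letters stored as 12 UTF-8 bytes.
$sample = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
$length = utf8_strlen($sample);
echo "characters: $length\n";
```

This only covers string length; case conversion and substring functions would each need their own fallback, which is where the real cost of supporting both kinds of installation shows up.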

4

(88 replies, posted in Archive)

Thanks for pointing it out, I'll take a closer look at that (because I am building a multilanguage system that has to work with UTF8 from LiveJournal servers and I plan to move to TextDrive for its unbeatable offer of being the most technically advanced hosting service).

5

(88 replies, posted in Archive)

What percentage? Very small, since it's a widespread issue and until now everybody seemed to get away with either ignoring multilanguage issues or sticking with a two-language setup (English + some national encoding).

But PHP changes. Slowly but steadily, hosting providers upgrade and migrate to PHP5, which is unicode-friendly and uses it as a native charset (as Java and Python have for years already). So going the Unicode/PHP5 way would in fact imply limited backwards compatibility, but with the promise of a more accessible, more universal application.

But what I was talking about is a simple replacement of "iso-8859-1" with "utf-8" in the lang pack, to force all browsers in all user locales to stick to Unicode (since the official PunBB forum itself has become multilanguage, and for Cyrillic the implied iso-8859-* encoding doesn't help anyway: without the Russian locale enabled in PHP, sorting suffers, even though string functions work fine).
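
To make the idea concrete, here is a rough sketch of how a charset declared in a lang pack could end up in the HTTP header and the page's meta tag. The array key and both helper functions are hypothetical names for illustration; I am not claiming this is how PunBB's lang files or templates are actually laid out.

```php
<?php
// Hypothetical lang-pack fragment: the only change is the charset value.
$lang = array(
    'encoding' => 'utf-8', // previously 'iso-8859-1'
);

// Hypothetical helper: builds the Content-Type header line from the lang pack.
function content_type_header($lang)
{
    return 'Content-Type: text/html; charset=' . $lang['encoding'];
}

// Hypothetical helper: builds the matching meta tag for the page head.
function charset_meta_tag($lang)
{
    return '<meta http-equiv="Content-Type" content="text/html; charset='
        . htmlspecialchars($lang['encoding']) . '" />';
}

$headerLine = content_type_header($lang);
$metaTag = charset_meta_tag($lang);

// In a real page the header would be sent before any output: header($headerLine);
echo $headerLine, "\n", $metaTag, "\n";
```

Once the page is served this way, browsers also submit form data in the same charset, so posts arrive as UTF-8 regardless of the user's locale.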

6

(88 replies, posted in Archive)

? ??????? ? Byte Order Mark (BOM) -- ??? ?? ??? ???????????? ?? ??????????? W3C. ??????? ????? ????? ?????? ?????????? ??????? ? ??????, ? ???????? ? ??? ?????? ? ???????.

?????-?? ?????? ??????? ??? ????, ????? ???????? ???8-?????, ?? ????????? ? ?????????? ?????? ?????????? Content-Type, ? ????????? ? ??? charset=utf-8. ??? ?????? ????? ??? ????????.

?? ????, ??? ???????? -- ???-8 ? ?????. ?? ??? ???? ???? ???????: GMail, ? ???????, ??????? ?????? ?????????? ????? ? ???????, ???????? ???? ??????????????? ?? ???? ????????????? utf8 ? base64. ??? ??? ?????? ???? ?????? ?????? ??? ? Outlook Express.

7

(9 replies, posted in Archive)

It would be interesting, of course, to know how exactly human-readable URLs (ЧПУ) get in the way of relative links.

8

(88 replies, posted in Archive)

By the way, it would be quite a wise move to switch the default language pack from iso-8859-1 to UTF-8, because then browsers no longer have to bother choosing the right charset and, on post, simply send everything as Unicode. To my great surprise, this works reasonably well even with the older MySQL 3.23 (even though I realize that I lose correct sorting on text fields this way). And here on TextDrive, the fully Unicode-compatible MySQL 4.1 should handle Unicode just fine.

The overhead of a multibyte charset doesn't really look like a threat to me: gzip/deflate has been around for more than ten years now, and I have yet to see real benchmarks showing a significant (i.e. tens of percent) advantage of gzipped single-byte over gzipped multi-byte. And, after all, UTF-8 doesn't hurt former iso-8859-1 pages even a little, because English text always stays in its natural single-byte form, while the added features (accommodating other languages and accented characters) really expand the possibilities.
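
If somebody wants to measure the overhead instead of guessing, something like the following would do as a starting point. It is only a sketch: the sample text is invented, it assumes the mbstring and zlib extensions are loaded, and a repeated sentence is not a realistic benchmark, so run it on your own pages before drawing conclusions.

```php
<?php
// Hypothetical sample: a Latin-1 line with 5 accented letters, repeated 200 times.
$line = "Caf\xE9 cr\xE8me, na\xEFve r\xE9sum\xE9. Plain English stays one byte per letter.\n";
$latin1 = str_repeat($line, 200);

// The same text re-encoded as UTF-8: each accented letter becomes 2 bytes.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Report raw and gzipped sizes for one encoding of the text.
function size_report($label, $data)
{
    return sprintf("%-10s raw=%d gzip=%d", $label, strlen($data), strlen(gzencode($data, 9)));
}

echo size_report('iso-8859-1', $latin1), "\n";
echo size_report('utf-8', $utf8), "\n";

// Pure ASCII text is byte-for-byte identical in both encodings.
$ascii = "Plain English text";
$asciiUnchanged = (mb_convert_encoding($ascii, 'UTF-8', 'ISO-8859-1') === $ascii);

// A single accented Latin-1 letter grows from 1 byte to 2.
$eAcute = mb_convert_encoding("\xE9", 'UTF-8', 'ISO-8859-1');
echo "ASCII unchanged: ", ($asciiUnchanged ? 'yes' : 'no'), "\n";
echo "e-acute in UTF-8: ", bin2hex($eAcute), "\n";
```

The interesting number is the difference between the two gzip columns, not the raw ones, since that is what actually goes over the wire when compression is on.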

9

(88 replies, posted in Archive)

??? ??? ?????????? ???? ???????????? UTF8-?????? lang-????, ??????? ? ???? ???????? ??????, ??????? ????? ?????, ?? ?????, ? ?????????? ???????? ?? ? UTF-8, ?????? ???????? ?? ??????, ??????? ??? ????. ???? ????-?? ??? ?????:
http://ambience.ru/tools/punbb.Russian-UTF8.rar

? ????? ??????, ????? ???????, ???? ru-UTF8 ???????? ????? ? ??????? ??????? ru-1251 ? Downloads, ?????? ??? ??? ?????? ?????????? ??????? ? ?????? ?? ???????, ???????, ??????, ????? ??? ???? ?? ???????, ??? ??? ?? ??????????????? (??? ?????) ???????? ???????? ????????? ????????????. ? ??? ????? ???????? ????? (???? ??????) ???????.