1 (edited by vankon 2008-11-19 05:10)

Topic: The search engine in Chinese (or non-English)

PunBB version 1.3 final

The search engine works very well for English, but not for Chinese.
It returns results only for English keywords; Chinese keywords return no results!

I debugged the code and found this:

include/search_functions.php
@108 'WHERE' statement

$query = array(
    'SELECT' => 'm.post_id',
    'FROM'   => 'search_words AS w',
    'JOINS'  => array(
        array(
            'INNER JOIN' => 'search_matches AS m',
            'ON'         => 'm.word_id=w.id'
        )
    ),
    'WHERE'  => 'w.word LIKE \''.$forum_db->escape(str_replace('*', '%', $cur_word)).'\''
);

The resulting query looks like this:

SELECT m.post_id FROM search_words AS w INNER JOIN search_matches AS m ON m.word_id=w.id WHERE w.word LIKE '数据库'

My debug code is:

echo 'search_functions.php@116@query=SELECT '.$query['SELECT'].' FROM '.$query['FROM'].' INNER JOIN '.'search_matches AS m'. ' ON '.'m.word_id=w.id'.' WHERE '.$query['WHERE'];

The URL is http://cilinux.cn/forum2/search.php?act … C%E7%B4%A2

The keyword is '数据库', which means 'database' in English.

The final SQL statement is missing the '%' wildcards around the keyword.

This bug may affect all non-English languages.

The code

'WHERE'        => 'w.word LIKE \''.$forum_db->escape(str_replace('*', '%', $cur_word)).'\''

can be changed to

'WHERE'        => 'w.word LIKE \'%'.$forum_db->escape(str_replace('*', '%', $cur_word)).'%\''

Then the search works for the keyword '数据库',

but other keywords still return no results.

Most languages use a single byte per character, but Chinese uses two bytes per character (word);
in UTF-8, most Chinese characters take 2 bytes, and the rest take 3 or 4 bytes.

This is a problem!

And so on.

2 (edited by vankon 2008-11-19 05:01)

Re: The search engine in Chinese (or non-English)

I don't know how the search engine works. smile:):)

3 (edited by vankon 2008-11-19 05:07)

Re: The search engine in Chinese (or non-English)

Another issue: smilies are not rendered unless there is a blank before the smiley code.

quote

smile:):):)

code

:):):):)

quote

smile smile smile smile

code

:) :) :) :)

4 (edited by vankon 2008-11-19 06:04)

Re: The search engine in Chinese (or non-English)

The most precise search statement is the following, but its performance is poor!

select * from f_topics where id in
(select id from f_topics where subject like '%keywords%'
union 
select topic_id id from f_posts where message like '%keywords%')
order by id;

Re: The search engine in Chinese (or non-English)

vankon wrote:

Most languages use a single byte per character, but Chinese uses two bytes per character (word);
in UTF-8, most Chinese characters take 2 bytes, and the rest take 3 or 4 bytes.

This is a problem!

This should be fine. Russian (my native language) takes 2 bytes per character in UTF-8 too.

vankon wrote:

str_replace('*', '%', $cur_word)

This replaces * with % in the SQL query. You should search for *数据库* if you want to search for *database*, and for 数据库 if you want just database; these may give two different results. Using %word% every time is not good, because I would find carpet, scare and careful instead of just the car I was really looking for.
Am I missing something?
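
To make the wildcard behaviour concrete, here is a minimal sketch. The helper names to_like_pattern() and like_matches() and the word list are hypothetical and not part of PunBB; like_matches() only approximates SQL LIKE for the '%' wildcard (it ignores '_' and escaping) so the example can run without a database.

```php
<?php
// Hypothetical helper: same str_replace() call as in the quoted WHERE clause.
function to_like_pattern(string $keyword): string
{
    return str_replace('*', '%', $keyword);
}

// Rough stand-in for SQL LIKE, handling only the '%' wildcard (illustration only).
function like_matches(string $pattern, string $word): bool
{
    $parts = array_map(function ($p) {
        return preg_quote($p, '/');
    }, explode('%', $pattern));

    return preg_match('/^' . implode('.*', $parts) . '$/us', $word) === 1;
}

// Made-up index contents, taken from the words mentioned in the post.
$indexed = array('car', 'carpet', 'scare', 'careful');

foreach (array('car', 'car*', '*car*') as $keyword) {
    $pattern = to_like_pattern($keyword);
    $hits = array_values(array_filter($indexed, function ($w) use ($pattern) {
        return like_matches($pattern, $w);
    }));
    echo $keyword, " => LIKE '", $pattern, "' => ", implode(', ', $hits), "\n";
}
```

Without an asterisk the pattern has no wildcard, so only the exact word matches; each asterisk the user types becomes one '%' in the LIKE pattern.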

vankon wrote:

smilies are not rendered unless there is a blank before the smiley code

We suppose this is a feature, not a bug :-)

Carpe diem

Re: The search engine in Chinese (or non-English)

Thank you for answering my questions.

In China, people usually type just the keyword without *, and many people use a lot of smilies.
This is a difference in habits.

To continue...

When I search for the keyword '*数据库*', PunBB shows 10 records, but my own SQL statement returns 12 records.
I don't know why the search results differ.

I don't know how PunBB searches for the keyword.

My SQL statement is:

/*start*/
/*prefix with f_ */
select count(id) from f_topics where id in
(select id from f_topics where subject like '%数据库%'
union
select topic_id id from f_posts where message like '%数据库%')
/*end*/

What does your SQL statement look like?

I look forward to your reply. (<-- This sentence is a translation provided by Google; I understand little English.)

Re: The search engine in Chinese (or non-English)

I searched the web today. Searching Chinese text is a hard problem: segmenting Chinese into words is not easy, because unlike Latin-script languages, the words are not separated by spaces. So it is normal that PunBB and other forum software give poor results for Chinese searches. The Chinese search in current forum software is just a simple %keyword% search on the keywords, while PunBB's word extraction does not work for Chinese: it can only extract a large number of sentences, and the search results are not accurate.

Would someone who knows both Chinese and English please translate this passage into English? Machine translation always gives poor results.

Re: The search engine in Chinese (or non-English)

vankon wrote:

[first post quoted in full]

This is very good... thanks for sharing.

Re: The search engine in Chinese (or non-English)

Anatoly wrote:
vankon wrote:

Most languages use a single byte per character, but Chinese uses two bytes per character (word);
in UTF-8, most Chinese characters take 2 bytes, and the rest take 3 or 4 bytes.

This is a problem!

This should be fine. Russian (my native language) takes 2 bytes per character in UTF-8 too.

vankon wrote:

str_replace('*', '%', $cur_word)

This replaces * with % in the SQL query. You should search for *数据库* if you want to search for *database*, and for 数据库 if you want just database; these may give two different results. Using %word% every time is not good, because I would find carpet, scare and careful instead of just the car I was really looking for.
Am I missing something?

Yeah, this is not a 'multibyte' problem. It comes down to whether a sentence is split by spaces or not. Neither Chinese nor Japanese strings have whitespace between words, so explode(' ', $keywords) at line 76 does nothing...
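
A small sketch of that point, using the same explode(' ', ...) call on two made-up search phrases (the sample strings are assumptions chosen for illustration):

```php
<?php
// Same splitting call as the explode(' ', $keywords) mentioned above.
$english = explode(' ', 'search the database');
$chinese = explode(' ', '搜索数据库'); // roughly "search database", written without spaces

echo count($english), "\n"; // 3 separate words
echo count($chinese), "\n"; // 1 token: the whole phrase stays in one piece
```

Because the Chinese phrase comes back as a single token, an exact LIKE comparison can only succeed if the word index happens to contain that whole phrase.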

I added some code that puts '%' around $cur_word only when it is a multibyte search term. This avoids the unnecessary burden of '%' wildcards on every word, and it also avoids the 'car/carpet/careful' problem for words that are not multibyte strings.


At line 99: if the string is multibyte, $mbAdd = '%'.

$mbAdd = 
  (mb_strlen($cur_word,'UTF-8')==strlen($cur_word,'UTF-8')) ?
  '' : '%';

At line 108 (109): add $mbAdd as a prefix and suffix to $cur_word

'WHERE' => 'w.word LIKE \''.$mbAdd.$forum_db->escape(
  str_replace('*', '%', $cur_word)).$mbAdd.'\''

Could somebody who uses Chinese or Japanese try this and test it?

Re: The search engine in Chinese (or non-English)

Erratum: strlen() does not take an encoding argument...

$mbAdd = (mb_strlen($cur_word,'UTF-8')==strlen($cur_word)) ?
  '' : '%';

Re: The search engine in Chinese (or non-English)

Russian letters are encoded in two bytes in UTF-8, and Russian words are separated by spaces. So your condition will not work for Russian. Maybe we have to make an entry in lang files containing the word separator.
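
The objection can be checked with a short sketch. is_multibyte() wraps the mb_strlen()/strlen() comparison from the erratum above; contains_cjk() is a hypothetical alternative (not something proposed in this thread) that looks for Han, Hiragana or Katakana characters instead, assuming PHP's PCRE is built with Unicode property support and the mbstring extension is loaded.

```php
<?php
// The check from the erratum: true whenever the string contains multibyte characters.
function is_multibyte(string $word): bool
{
    return mb_strlen($word, 'UTF-8') !== strlen($word);
}

// Hypothetical alternative: true only for scripts that are written without spaces.
function contains_cjk(string $word): bool
{
    return preg_match('/[\p{Han}\p{Hiragana}\p{Katakana}]/u', $word) === 1;
}

// 'база' is Russian and '数据库' is Chinese; both mean "database".
foreach (array('database', 'база', '数据库') as $word) {
    printf(
        "%s: bytes=%d chars=%d multibyte=%s cjk=%s\n",
        $word,
        strlen($word),
        mb_strlen($word, 'UTF-8'),
        var_export(is_multibyte($word), true),
        var_export(contains_cjk($word), true)
    );
}
```

The byte-length comparison flags the Russian word as well, while the script-based test flags only the Chinese one. A per-language word separator setting in the lang files, as suggested above, would be another way to make the same distinction.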

Re: The search engine in Chinese (or non-English)

Splitting a Japanese sentence into correct words is very difficult, so I used a web API provided by Yahoo! Japan.

http://developer.yahoo.co.jp/webapi/jlp … parse.html (in Japanese)

I've just written a wiki page on solving the problem of searching in Japanese.

http://punbb.informer.com/wiki/searchin … r_japanese


P.S. Please add a 'ja' option to 'translate this page' on the wiki!

13

Re: The search engine in Chinese (or non-English)

iobataya wrote:

P.S. Please add a 'ja' option to 'translate this page' on the wiki!

Done.