Searching index for Japanese language
The latest version of PunBB (1.3.4) cannot search for Japanese words (nor for Chinese ones).
This is because the latest version of PunBB does not build the search words (table: search_words) correctly.
The search words are split and created in search_idx.php, but no Japanese words are created because the text is split on whitespace, and Japanese words are not separated by whitespace. Splitting a Japanese sentence into individual words is therefore very difficult.
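To make the problem concrete, here is a minimal sketch of whitespace tokenization applied to an English and a Japanese sentence. The helper split_on_whitespace() is hypothetical and only illustrates the idea; it is not the actual code in search_idx.php.

```php
<?php
// Illustration only: whitespace tokenization of English vs. Japanese text.
// split_on_whitespace() is a hypothetical helper, not PunBB's actual code.
function split_on_whitespace($text)
{
    // The /u flag treats the input as UTF-8;
    // PREG_SPLIT_NO_EMPTY drops empty tokens.
    return preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$english  = split_on_whitespace('search the forum index');
$japanese = split_on_whitespace('フォーラムの索引を検索する');

echo count($english), "\n";   // 4 tokens: search, the, forum, index
echo count($japanese), "\n";  // 1 token: the whole sentence
```

The Japanese sentence comes back as a single token, so none of the individual words in it would end up in the search index.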
How to solve
Using Yahoo Web API
Yahoo! Japan provides a sophisticated Web API that splits Japanese sentences into words. This is a form of morphological analysis.
Yahoo!Developers Japanese language morphological analysis (in Japanese)
This Web API lets us generate appropriate words for the search index.
Modification of code: search_idx.php
Insert the following code around line 36 of search_idx.php to split Japanese sentences into words separated by whitespace.
// Split Japanese words by using Yahoo Web API if there are Japanese chars.
$text = split_japanese_words($text);
Additional code: search_idx.php
The implementation is as follows.
- YAHOO_API_CODE: Set your Yahoo API code. You must obtain it from Yahoo. Each application is limited to 50,000 requests per 24 hours.
- function has_japanese_chars: Checks whether the text contains Japanese characters.
- function split_japanese_words: Sends a request to the Yahoo Web API with parameters. The filter is set to 9|10 to request 'noun' and 'verb' words. The words are extracted from the XML response and joined with spaces.
if (!defined('YAHOO_API_CODE'))
    define('YAHOO_API_CODE','_F3TRHexg64WGN7BkqBt03OePtRKDon8qrFE6wWnEY.R7OWqPXVkHxJsTokT_Ijfa5w-');

/*
 * Check whether the text includes Japanese chars
 */
function has_japanese_chars($text)
{
    $hiragana = mb_ereg('[ぁ-ん]', $text)?TRUE:FALSE;
    $katakana = mb_ereg('[ァ-ヶ]', $text)?TRUE:FALSE;
    return ($hiragana||$katakana);
}

/*
 * Analyze and split Japanese sentence by Yahoo!'s Web API.
 * This service needs Yahoo Web API app code.
 * See detail http://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html
 */
function split_japanese_words($text)
{
    if(!has_japanese_chars($text))
        return $text;

    $yahoo_get_url ='http://jlp.yahooapis.jp/MAService/V1/parse';
    $yahoo_get_url.='?appid='.YAHOO_API_CODE;
    $yahoo_get_url.='&filter=9|10';
    $yahoo_get_url.='&uniq_filter=9|10';
    $yahoo_get_url.='&sentence='.rawurlencode($text);

    $xml = @file_get_contents($yahoo_get_url);

    if(preg_match('/filtered_count>(\d+)<\/filtered_count/',$xml,$fm))
    {
        $f_count = $fm[1]; // filtered words count
        if($f_count>0)
        {
            preg_match_all('/surface>([^<]+)<\/surface/',$xml,$words);
            $text = implode(' ',$words[1]);
        }
    }

    return $text;
}
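Note that has_japanese_chars only looks for hiragana and katakana, so a text written entirely in kanji is returned unchanged without calling the API. If that matters for your forum, one possible variant is sketched below. The function name has_cjk_chars is hypothetical, and the sketch assumes UTF-8 input and a PCRE build with Unicode property support. Because \p{Han} also matches Chinese text, it cannot tell Japanese from Chinese.

```php
<?php
// Hypothetical variant of has_japanese_chars() that also detects kanji.
// Assumption: $text is UTF-8 and PCRE supports Unicode script properties.
function has_cjk_chars($text)
{
    // \p{Hiragana}, \p{Katakana} and \p{Han} are Unicode script properties.
    // preg_match() returns 1 on a match, 0 on no match, false on error
    // (for example invalid UTF-8), so compare strictly against 1.
    return preg_match('/[\p{Hiragana}\p{Katakana}\p{Han}]/u', $text) === 1;
}

var_dump(has_cjk_chars('検索'));          // kanji only
var_dump(has_cjk_chars('ひらがな'));      // hiragana only
var_dump(has_cjk_chars('search index'));  // ASCII only
```

Whether to use this depends on how the Web API handles the extra input, which this sketch does not verify.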
Modification of SQL
See also this post in the forum.