


Search index for the Japanese language

Searching for Japanese (or Chinese) words does not work in the latest version of PunBB (1.3.4).

This is because the latest version of PunBB does not build the search words (table: search_words) appropriately.

The search words are split and created in search_idx.php, but no Japanese words are produced, because the words in a Japanese sentence are not separated by whitespace or other delimiter characters. Splitting a sentence into appropriate words requires a large Japanese dictionary and a dedicated analysis program, which is no small task.
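To illustrate the problem, here is a minimal sketch. It is not PunBB's actual code, and the function name naive_split_words is hypothetical. It splits text on whitespace, roughly as a whitespace-based indexer would. The English sentence yields separate words, while the Japanese sentence comes back as a single token, so no useful search word is produced.

```php
<?php
// Hypothetical helper for illustration only; not part of PunBB.
// Splits text on runs of whitespace, as a whitespace-based indexer would.
function naive_split_words($text)
{
  return preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$english  = naive_split_words('search index for forum posts');
$japanese = naive_split_words('今日は天気が良いです');

echo count($english), "\n";   // 5: one token per English word
echo count($japanese), "\n";  // 1: the whole sentence is a single token
```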

How to solve

Using Yahoo Web API

Yahoo! Japan provides a sophisticated Web API that splits Japanese sentences into words. This is a form of morphological analysis.

Yahoo! Developers: Japanese language morphological analysis (in Japanese)

With this Web API we can produce appropriate words for the search index.

Modification of code: search_idx.php
Insert the following code around line 36 of search_idx.php to split Japanese sentences into appropriate words separated by whitespace.

// Split Japanese words by using Yahoo Web API if there are Japanese chars.
$text = split_japanese_words($text);

Additional code: search_idx.php

The additional code consists of the following.

  1. YAHOO_API_CODE: Set your Yahoo API code here. You must obtain it from Yahoo. Each application is limited to 50,000 requests per 24 hours.
  2. function has_japanese_chars: Checks whether the text contains Japanese characters.
  3. function split_japanese_words: Sends a request with parameters to the Yahoo Web API. The filter is set to 9|10 to request only nouns and verbs. The words are extracted from the returned XML and joined with spaces.
if (!defined('YAHOO_API_CODE'))
  define('YAHOO_API_CODE','_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-');
 
/*
 * Check whether the text includes Japanese chars.
 * Only hiragana and katakana are tested, so text written
 * purely in kanji is not detected.
 */
function has_japanese_chars($text)
{
  $hiragana = mb_ereg('[ぁ-ん]', $text)?TRUE:FALSE;
  $katakana = mb_ereg('[ァ-ヶ]', $text)?TRUE:FALSE;
  return ($hiragana||$katakana);
}
/*
 * Analyze and split a Japanese sentence using Yahoo!'s Web API.
 * This service needs a Yahoo Web API app code.
 * For details, see http://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html
 */
function split_japanese_words($text)
{
  if(!has_japanese_chars($text)) return $text;
 
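  $yahoo_get_url ='http://jlp.yahooapis.jp/MAService/V1/parse';
  $yahoo_get_url.='?appid='.rawurlencode(YAHOO_API_CODE);
  // Request nouns and verbs only; '|' is URL-encoded to keep the query string valid.
  $yahoo_get_url.='&filter='.rawurlencode('9|10');
  $yahoo_get_url.='&uniq_filter='.rawurlencode('9|10');
  $yahoo_get_url.='&sentence='.rawurlencode($text);
 
  // If the request fails, keep the original text so indexing still proceeds.
  $xml = @file_get_contents($yahoo_get_url);
  if($xml === FALSE) return $text;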
  if(preg_match('/filtered_count>(\d+)<\/filtered_count/',$xml,$fm))
  {
    $f_count = $fm[1]; // filtered words count
    if($f_count>0)
    {
      preg_match_all('/surface>([^<]+)<\/surface/',$xml,$words);
      $text = implode(' ',$words[1]);
    }
  }
  return $text;
}
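
The extraction step in split_japanese_words can be tried without calling the Web API. The sketch below runs the same two regular expressions against a hand-written XML fragment. The fragment is an assumption modeled only on the element names the code looks for (filtered_count and surface); a real response contains more elements and may be laid out differently. The function name extract_surface_words is hypothetical.

```php
<?php
// Hypothetical helper: the parsing half of split_japanese_words(),
// separated from the HTTP request so it can be tried offline.
function extract_surface_words($xml, $fallback)
{
  if (preg_match('/filtered_count>(\d+)<\/filtered_count/', $xml, $fm) && $fm[1] > 0)
  {
    preg_match_all('/surface>([^<]+)<\/surface/', $xml, $words);
    return implode(' ', $words[1]);
  }
  return $fallback; // nothing usable in the response: keep the original text
}

// Hand-written sample, NOT a captured API response.
$sample_xml = '<ma_result><filtered_count>2</filtered_count><word_list>'
  .'<word><surface>天気</surface></word>'
  .'<word><surface>良い</surface></word>'
  .'</word_list></ma_result>';

echo extract_surface_words($sample_xml, '今日は天気が良いです'), "\n"; // 天気 良い
echo extract_surface_words('', '今日は天気が良いです'), "\n";          // 今日は天気が良いです
```

Falling back to the original text mirrors the behavior of split_japanese_words: when no filtered words come back, the text is passed on to the indexer unchanged.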

Modification of SQL

See also this post in the forum.

