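Japanese search index

In the latest PunBB (1.3.4), searching in Japanese or Chinese does not work. This is because PunBB does not build its keyword search index correctly for these languages (table: search_words).

Search keywords are generated in the 'search_idx.php' script. The script strips BBCode and splits the text on half-width (ASCII) symbols and similar delimiters to extract individual words, but this approach extracts no Japanese words, because Japanese text does not separate words with half-width spaces. Segmenting Japanese text correctly would likely require a substantial Japanese dictionary and a sophisticated analysis program.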
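
The problem can be reproduced in isolation. The following is a minimal sketch, not PunBB's actual code: `naive_split_words` is a hypothetical helper that splits only on whitespace and ASCII punctuation, which is roughly the kind of splitting described above.

```php
<?php
// Hypothetical helper (not part of PunBB): split on whitespace and
// ASCII punctuation only, similar in spirit to the splitting described above.
function naive_split_words($text)
{
    return preg_split('/[\s!-\/:-@\[-`{-~]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$english  = naive_split_words('Search works, for English text.');
$japanese = naive_split_words('日本語の文章は単語をスペースで区切らない');

echo count($english), "\n";  // 5: one element per English word
echo count($japanese), "\n"; // 1: the whole sentence stays a single "word"
```

Because the Japanese sentence contains no ASCII delimiters, it is indexed as one long token, and a search for any word inside it finds nothing.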

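How to solve it

Using the Yahoo Web API

Yahoo! Japan provides a tool called 日本語形態素解析 (Japanese morphological analysis) free of charge as a Web API. It breaks a sentence into words and identifies the part of speech of each one.

Yahoo!Developers 日本語形態素解析 (Japanese morphological analysis)

If this tool is used to build PunBB's search index, searching the forum in Japanese becomes possible.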

Code change: search_idx.php

Insert the following code around line 36 of search_idx.php. The return value is the Japanese words separated by half-width spaces.

// Split Japanese words by using Yahoo Web API if there are Japanese chars.
$text = split_japanese_words($text);

Because even a single kanji can carry meaning, change the minimum search keyword length to 1.

if (!defined('FORUM_SEARCH_MIN_WORD'))
	define('FORUM_SEARCH_MIN_WORD', 1);
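
To illustrate why the minimum matters, here is a small sketch. `passes_min_length` is a hypothetical helper, not PunBB's code, and the value 3 is only a comparison value chosen for this example. It assumes the mbstring extension and UTF-8 text.

```php
<?php
// Hypothetical helper: does a word meet a minimum length, counted in
// characters rather than bytes? Assumes UTF-8 input and ext/mbstring.
function passes_min_length($word, $min)
{
    return mb_strlen($word, 'UTF-8') >= $min;
}

$word = '庭'; // a single kanji meaning "garden"

var_dump(passes_min_length($word, 3)); // bool(false): dropped with a minimum of 3
var_dump(passes_min_length($word, 1)); // bool(true): kept with a minimum of 1
```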

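Code addition: search_idx.php

Implementation of the function called above.

YAHOO_API_CODE: The application ID required to use the Yahoo API. Yahoo issues it free of charge. Each application is reportedly allowed 50,000 requests per 24 hours, so take care if the number of topics approaches that figure.
function has_japanese_chars: Determines whether the text contains Japanese.
function split_japanese_words: Sends the text and parameters to the Yahoo Web API. The filter is set so that only nouns and verbs are returned. The result comes back as XML, from which the words are extracted.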

if (!defined('YAHOO_API_CODE'))
  define('YAHOO_API_CODE','_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-');

/*
 * Check whether the text includes Japanese chars.
 * Only hiragana and katakana are tested, so text written entirely
 * in kanji is not detected and is left unsplit by the caller.
 */
function has_japanese_chars($text)
{
  $hiragana = mb_ereg('[ぁ-ん]', $text)?TRUE:FALSE;
  $katakana = mb_ereg('[ァ-ヶ]', $text)?TRUE:FALSE;
  return ($hiragana||$katakana);
}

/*
 * Analyze and split Japanese sentence by Yahoo!'s Web API.
 * This service needs Yahoo Web API app code.
 * See detail http://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html
 */
function split_japanese_words($text)
{
  if(!has_japanese_chars($text)) return $text;

  // Build the request URL; the filter value 9|10 is URL-encoded
  $yahoo_get_url ='http://jlp.yahooapis.jp/MAService/V1/parse';
  $yahoo_get_url.='?appid='.rawurlencode(YAHOO_API_CODE);
  $yahoo_get_url.='&filter='.rawurlencode('9|10');
  $yahoo_get_url.='&uniq_filter='.rawurlencode('9|10');
  $yahoo_get_url.='&sentence='.rawurlencode($text);

  // If the request fails, fall back to the original text
  $xml = @file_get_contents($yahoo_get_url);
  if($xml === FALSE) return $text;

  if(preg_match('/filtered_count>(\d+)<\/filtered_count/',$xml,$fm))
  {
    $f_count = (int)$fm[1]; // filtered words count
    if($f_count>0)
    {
      // Collect the surface form of every word that passed the filter
      preg_match_all('/surface>([^<]+)<\/surface/',$xml,$words);
      $text = implode(' ',$words[1]);
    }
  }
  return $text;
}
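
The extraction step can be tried without calling the API. The sketch below applies the same two regular expressions to a hard-coded string. `$sample_xml` is a hypothetical, abridged response written for this example (only the `filtered_count` and `surface` element names come from the code above), and `extract_surfaces` is a helper name introduced here.

```php
<?php
// Hypothetical, abridged response used only for illustration.
// A real response may contain additional elements and attributes.
$sample_xml = '<ResultSet><ma_result>'
    . '<total_count>5</total_count><filtered_count>2</filtered_count>'
    . '<word_list>'
    . '<word><surface>庭</surface><pos>名詞</pos></word>'
    . '<word><surface>鶏</surface><pos>名詞</pos></word>'
    . '</word_list></ma_result></ResultSet>';

// Hypothetical helper: return the surface forms joined by half-width
// spaces, or an empty string when nothing usable was returned.
function extract_surfaces($xml)
{
    if (!is_string($xml)) {
        return ''; // e.g. file_get_contents() returned false
    }
    if (!preg_match('/filtered_count>(\d+)<\/filtered_count/', $xml, $m) || (int) $m[1] === 0) {
        return ''; // no words passed the filter
    }
    preg_match_all('/surface>([^<]+)<\/surface/', $xml, $words);
    return implode(' ', $words[1]);
}

echo extract_surfaces($sample_xml), "\n"; // 庭 鶏
```

Keeping the parsing in a separate function like this makes it possible to check the index text for a given response before enabling the live request.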

SQL change

An alternative approach is to make every search a partial-match search in SQL.

http://punbb.informer.com/forums/post/119577/
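
As a rough sketch of what a partial-match search involves (this is not taken from the linked post), the keyword is wrapped in `%` wildcards and matched with `LIKE`. The helper name `like_pattern`, and the table and column names `posts` and `message` in the query string, are assumptions made for illustration. Backslash is assumed to be the `LIKE` escape character, as is the default in MySQL.

```php
<?php
// Hypothetical helper: turn a keyword into a LIKE pattern for a
// partial match, escaping the LIKE wildcards so they match literally.
function like_pattern($keyword)
{
    // Escape the backslash first, then the wildcards % and _
    $escaped = str_replace(
        array('\\', '%', '_'),
        array('\\\\', '\\%', '\\_'),
        $keyword
    );
    return '%' . $escaped . '%';
}

// Assumed table and column names; bind the pattern as a parameter
// (for example with a PDO prepared statement) instead of concatenating it.
$sql = 'SELECT id FROM posts WHERE message LIKE ?';

echo like_pattern('日本語'), "\n"; // %日本語%
```

A pattern with a leading wildcard generally cannot use an ordinary index, so this approach trades search speed for not needing a word-splitting step.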
