Topic: Improved search with Porter Stemmer Mod
Here's a mod that integrates a Porter stemmer into PunBB, giving
better search results and smaller search index tables (see below
for details). To try the enhanced search, go here.
Comments welcome! :)
Stemmer Mod version 1.3
===========
for PunBB v.1.1.x
Changes from initial release: problems would occur when rebuilding
the search index - fixed. Default search behavior "and" was added.
Fixed warnings when editing a post.
This modification is for English language boards only - it will
have no effect on boards where $language is not set to "en" in
"config.php".
Stemming reduces keywords to their basic form, which improves
search results - for example, a search for "explosive" will also
find posts containing the word "explosion", and vice versa.
As a side effect, your search tables will become smaller, because
"explosion" and "explosive" are stored as a single entry in the
keyword table. You might expect this to make searches execute
faster too, but apparently it does not: my tests show a 30%
smaller search table, but the stemming operation adds some
overhead of its own, so with this mod searches run about 2-3%
slower.
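The effect on the keyword table can be sketched roughly as follows.
This is only an illustration: toy_stem() is a made-up suffix stripper,
not the Porter algorithm and not part of the Stemmer class, and the
word list is invented.

```php
<?php
// Toy suffix stripper for illustration only; NOT the Porter algorithm.
function toy_stem($word)
{
    foreach (array('ions', 'ion', 'ives', 'ive') as $suffix) {
        $len = strlen($suffix);
        // Only strip when a reasonably long stem is left over
        if (strlen($word) > $len + 2 && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

// Hypothetical words collected from a few posts
$words = array('explosion', 'explosive', 'explosions', 'forum');

// Stem every word, then drop duplicates, as a keyword table would
$index = array_values(array_unique(array_map('toy_stem', $words)));

// $index now holds two entries ('explos' and 'forum') instead of four
print_r($index);
```

Because the search keywords go through the same stemming step, a
search for any of the three "explos..." variants ends up looking for
the same table entry.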
To make this modification, first download the Stemmer class here:
http://www.chuggnutt.com/stemmer.php
Place the "class.stemmer.inc" file in your PunBB home folder, and
rename it to "class.stemmer.php" to avoid security issues (many web
servers will send an ".inc" file to the browser as plain text).
Open "include/search_idx.php" and find this section of code:
// Split old and new post/subject to obtain array of 'words'
$words_message = split_words($message);
$words_subject = ($subject) ? split_words($subject) : array();
Insert the following section of code after it:
// Stem words:
global $language, $stemmer;
if ($language == 'en') {
    $words_message = $stemmer->stem_list($words_message);
    if (!$words_message)
        $words_message = array();
    $words_subject = $stemmer->stem_list($words_subject);
    if (!$words_subject)
        $words_subject = array();
}
Now go to the top of the file, and find this statement:
if (!defined('PUN'))
    exit;
Insert the following code after it:
// Initialize the Stemmer:
if ($language == 'en') {
    global $stemmer;
    require('class.stemmer.php');
    $stemmer = new Stemmer();
}
Now open "search.php" and find this:
// Split up keywords
$keywords_array = preg_split('#[\s]+#', trim($keywords));
And insert the following code after it:
// Stem keywords
if ($language == 'en') {
    require('class.stemmer.php');
    $stemmer = new Stemmer();
    $keywords_array = $stemmer->stem_list($keywords_array);
    unset($stemmer);
}
That's it - now go to the admin control panel, and rebuild the search index!
Lastly, I would recommend the following small change - open "search.php" and
find this statement:
$match_type = 'or';
Change it to:
$match_type = 'and';
This will change the default search behavior to "and", which is what
nearly all other search engines (Google, for example) use by default.
Users expect entering more keywords to narrow the search; with "or"
as the default, entering more keywords widens it instead.
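To illustrate the difference between the two match types, here is a
minimal sketch. It is not PunBB's actual search code: the
combine_matches() function and the post id lists are made up for the
example, and it simply assumes each keyword maps to a list of
matching post ids.

```php
<?php
// Hypothetical data: stemmed keyword => ids of posts containing it
$matches = array(
    'explos' => array(1, 2, 5),
    'test'   => array(2, 3, 5, 8),
);

// Combine the per-keyword id lists using 'and' or 'or' matching
function combine_matches($lists, $match_type)
{
    $result = null;
    foreach ($lists as $ids) {
        if ($result === null) {
            $result = $ids;                                      // first keyword
        } elseif ($match_type == 'and') {
            $result = array_intersect($result, $ids);            // must match all
        } else {
            $result = array_unique(array_merge($result, $ids));  // may match any
        }
    }
    return $result === null ? array() : array_values($result);
}

$and_hits = combine_matches($matches, 'and');   // posts 2 and 5
$or_hits  = combine_matches($matches, 'or');    // posts 1, 2, 5, 3 and 8
```

With "and", the second keyword cuts the result down to two posts;
with "or", the same keyword grows it to five.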