1 (edited by mindplay 2005-01-10 21:24)

Topic: Improved search with Porter Stemmer Mod

Here's a mod, which will integrate a Porter Stemmer into PunBB, thus
giving better search results and reducing the size of the search index
tables - see below for details. To try the enhanced search, go here.

Comments welcome! :)

Stemmer Mod version 1.3
===========

for PunBB v.1.1.x

Changes from initial release: problems would occur when rebuilding
the search index - fixed. Default search behavior "and" was added.
Fixed warnings when editing a post.

This modification is for english language boards only - it will
have no effect on boards where $language is not set to "en" in
"config.php".

By stemming keywords (reducing words to their basic form), you will
get improved search results - for example, a search for "explosive"
will also include posts with the word "explosion", and vice versa.
And as a side effect, your search tables will also become smaller,
as "explosion" and "explosive" will only be stored as one entry in
the keyword table; you would assume that this would also make
searches execute faster, but this is apparently not the case -
although my tests show a 30% smaller search table, there is also
some added overhead from the stemming operation itself, which
means that this mod makes searches about 2-3% slower.

To make this modification, first download the Stemmer class here:

http://www.chuggnutt.com/stemmer.php

Place the "class.stemmer.inc" file in your PunBB home folder, and
rename it to "class.stemmer.php" to avoid security issues.

Open "include/search_idx.php" and find this section of code:

    // Split old and new post/subject to obtain array of 'words'
    $words_message = split_words($message);
    $words_subject = ($subject) ? split_words($subject) : array();

Insert the following section of code after it:

    // Stem words:
    global $language, $stemmer;
    if ($language == 'en') {
        $words_message = $stemmer -> stem_list ($words_message);
        if (!$words_message)
            $words_message = array();
        $words_subject = $stemmer -> stem_list ($words_subject);
        if (!$words_subject)
            $words_subject = array();
    }

Now go to the top of the file, and find this statement:

if (!defined('PUN'))
    exit;

Insert the following code after it:

// Initialize the Stemmer:
if ($language == 'en') {
    global $stemmer;
    require ('class.stemmer.php');
    $stemmer = new Stemmer();
}  

Now open "search.php" and find this:

                    // Split up keywords
                    $keywords_array = preg_split('#[\s]+#', trim($keywords));

And insert the following code after it:

                    // Stem keywords
                    if ($language == 'en') {
                        require ('class.stemmer.php');
                        $stemmer = new Stemmer();
                        $keywords_array = $stemmer -> stem_list ($keywords_array);
                        unset ($stemmer);
                    }

That's it - now go to the admin control panel, and rebuild the search index!

Lastly, I would recommend the following small change - open "search.php" and
find this statement:

                $match_type = 'or';

Change it to:

                $match_type = 'and';

This will change the default search behavior to "and", which is how nearly
all other search engines work by default - the user will expect to be able
to narrow down the search by entering my keywords, as opposed to using "or"
by default, which will widen the search by entering more keywords; that is
not how other search engines work, e.g. Google uses "and" by default.

Re: Improved search with Porter Stemmer Mod

It's a nice mod. I will try it out later tonight.

I wonder which implementation is superior, the one from chuggnutt.com or the original by Martin Porter (here). The chuggnutt.com page says it's a "fairly faithful implementation" of the orignal algoritm. Not sure what that means though.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

It's largely just a translation, as far as I can tell - the original was written in Java.

Re: Improved search with Porter Stemmer Mod

note - For those who need stemming algorithms in other languages, check the snowball page, which has stemmers in C++ (and some very complete stop word lists) in french, spanish, portuguese, italian, german/dutch, swedish, norwegian, danish, russian and finnish.

Re: Improved search with Porter Stemmer Mod

mindplay wrote:

It's largely just a translation, as far as I can tell - the original was written in Java.

But there's a PHP implementation on Porters page as well.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

oh smile ... didn't notice. Well, the php version on porter's page does appear to be the 6-step algo, as opposed to the one I used, which appears to be the 5-step algo ... I'm not sure which one is newer though, the 5-step or the 6-step?

Re: Improved search with Porter Stemmer Mod

6 is a higher number than 5? :D

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

Yes, but that doesn't necessarily mean it's newer? Five could be more effective than six, who knows.

Re: Improved search with Porter Stemmer Mod

I was kidding.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

Duely noted - I'll move you up a notch on my funny-scale wink

Re: Improved search with Porter Stemmer Mod

Thanks. I needed that.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

I've updated the mod (see edited post at the beginning of this page) to resolve some warnings that would occur when editing an existing post.

13 (edited by bobitt 2004-12-23 23:19)

Re: Improved search with Porter Stemmer Mod

Seems really nice... but will this work for other languages than english?

EDIT: oops didnt read good enough.. smile seems it wont work for other languages.. too bad..

Most people are other people. Their thoughts are someone else's opinions, their lives a mimicry, their passions a quotation - Oscar Wilde

Re: Improved search with Porter Stemmer Mod

Stemmers are available for a number of different languages here - however, they're available only in C and Java, so someone would have to translate them to PHP ... or you could use stem-php (A PHP module) if you know how to build it, and if you have root access to install it on your server...

15

Re: Improved search with Porter Stemmer Mod

okej sure might try it... big_smile

Most people are other people. Their thoughts are someone else's opinions, their lives a mimicry, their passions a quotation - Oscar Wilde

Re: Improved search with Porter Stemmer Mod

I'm thinking about an updated version of this mod, in which I would use metaphones to further enhance the search ...

this would enable searches similar to those on google, where for example, a search for "inovative" would produce a link at the top of the page saying, did you mean "innovative" - that is, it'll help you out if you can't spell your search terms correctly.

this would of course be implemented as a "fall-back" only - that is, if a search for "inovative" returns no results, it will then try a search for something that sounds like "inovative" instead.

this would mean an increase in search index tables, and a times two increase in search time for searches that would otherwise return zero results - so of course I would make this extra feature optional.

any comments on this idea?

Re: Improved search with Porter Stemmer Mod

How would that be implemented in practice? Wouldn't you have to index the metaphone string as well as the word itself?

The search feature in PunBB is by far the least attractive part of the package. I would much rather use MySQL's built-in fulltext indexing, but that will lead to problems with PostgreSQL and SQLite.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

Yes - I would add an indexed column to the 'search_words' table, containing the metaphone representation of the words.

Why is the search feature unattractive? It is working very well for us - we've got 11.000 posts and only about 15.000 words in the search index, that's not even 1.5 words per post, that's pretty good ... and it doesn't seem to grow significantly anymore.

I've tried to rebuild the search index with and without the stemmer mod - with stemming, the search index for our body of 11.000 posts is 15.026 words, without stemming it's 20.789 words, which is about 30% more ... which probably means a speed increase of 30%.

I may have some other ideas for optimization, I have to do some tests - it has to be a generic solution though, something that works for all SQL engines. I'll let you know if I find anything useful.

Re: Improved search with Porter Stemmer Mod

Well hey, will you look at that ...

I guess I assumed wrong - even though the search word base is 30% smaller, this apparently makes no real difference in terms of performance at all. In fact, with the stemmer mod enabled, it's slightly slower - I guess the stemming operation itself takes a little time too.

To profile the search engine, I'm executing 20 specific searches in row, and measuring the total time taken to execute each - the 20 searches are executed over again 10 times, and the average total time for those 20 searches calculated.

For added accuracy, I'm executing this test 10 times over, and the shortest total average execution time is the one I use.

Shortest execution time with the stemmer mod enabled: 4.090 seconds

Shortest execution with original search engine: 3.981 seconds

So it's about 2-3% slower.

Re: Improved search with Porter Stemmer Mod

Then again, the stemmer search probably returns better results.

mindplay wrote:

Why is the search feature unattractive? It is working very well for us - we've got 11.000 posts and only about 15.000 words in the search index, that's not even 1.5 words per post, that's pretty good ... and it doesn't seem to grow significantly anymore.

The problem arises when you've got millions of posts. A homegrown PHP indexing service just doesn't cut it. We've had to switch over to mysql fulltext indexing at Sweclockers.com due to vBulletin's own search indexing absolutely killing performance. Fulltext is still hard on the server, but no where near what the PHP solution is.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

Hmm, well still - there might be room for optimization.

I examined the SQL query times, and broke down some of the joint queries to see which part of them is taking up the time, and it seems (as you would expect) the bottleneck is the lookup in search_matches ... the bottleneck operation seems to be:

SELECT m.post_id FROM search_matches AS m WHERE m.word_id=...

So I wonder, can we speed this up somehow?

Some ideas: use a secondary index to group the records in thousands, or maybe tenthousands - I'm not sure if this would just add more overhead, what do you think?

Or, separate the records into multiple tables with a thousand or tenthousand records in each - this way you'd get smaller indexes. Although again, this might just add more overhead?

Maybe such "clever tricks" have already all been tried...?

Re: Improved search with Porter Stemmer Mod

Yes, I'm sure any such variants have been tested before. If properly indexed, databases generally have no problem searching through large masses of data. It's essentially what they were made for. I'm not sure what you mean by "use a secondary index to group the records in thousands, or maybe tenthousands" though.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

I meant, adding another indexed column, containing 0 for ids 0-999, 1 for ids 1000-1999, 2 for ids 2000-2999 etc. - this should give a small index for this column, which might help locate the correct group of records faster ... probably the underlying implementation already makes similar optimizations though hmm

Re: Improved search with Porter Stemmer Mod

Aha, I see. Well, like you said, I do believe MySQL itself is generally better at optimizing such operatings.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

How about separating into individual, numbered tables? Keeping maybe 10.000 records per table - since MySQL uses one physical file per table, maybe this will speed things up, as you will also get a considerably smaller index on each table?