(edited by mindplay 2004-05-14 13:45)

Topic: search engine thoughts

The search engine in PunBB is remarkably fast, and this is of course a great strength. However...

Try a search for "cats" on this forum - one topic with the title "Wedding" shows up.

Now try a search for "cat" - a bunch of posts turn up, but the topic with the title "Wedding" is not among them.

This demonstrates that the search engine is apparently not very accurate.

What's missing is a stemmer - an algorithm that reduces words to their base forms. Before a post is indexed, and before a search is executed, you reduce each word to its simplest form: "cats" becomes "cat", "flying" becomes "fly", etc. Similar words are then indexed as if they were the same, which means the search index tables become smaller, the search becomes even faster, and the results become a lot more accurate and usable.
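To make the idea concrete, here is a minimal sketch in Python (not PunBB's PHP). The function name `naive_stem` and its suffix rules are made up for illustration; a real stemmer such as the Porter algorithm has far more rules and handles many cases this toy gets wrong.

```python
def naive_stem(word: str) -> str:
    """Toy suffix stripper (hypothetical rules, NOT the Porter algorithm)."""
    word = word.lower()
    # Longer suffixes are checked first so that "ies" wins over "s".
    rules = (("'s", ""), ("ies", "y"), ("ing", ""), ("es", ""), ("s", ""))
    for suffix, replacement in rules:
        stem_length = len(word) - len(suffix)
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and stem_length >= 3:
            return word[:stem_length] + replacement
    return word


# The same function is applied when indexing a post and when parsing a query,
# so both sides end up with the same key.
indexed_key = naive_stem("cats")
query_key = naive_stem("cat")
```

With these rules "cats" and "cat" both map to the key "cat", so a search for either form would find the same posts.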

There's an open-source stemming algo available here:

http://www.chuggnutt.com/stemmer.php

The problem is that this particular stemming algorithm works for only one language, namely English.

Some research has been done into language-independent stemming, and from what I've heard, it should actually be possible, although I have no idea how, or if the technology is free or open - there's some basic information here:

http://www.dei.unipd.it/~ims/multilingual.html

The search could also be made more accurate by implementing a vector space - at the cost, of course, of some performance and a larger search index. A vector space basically means that you record how many times a word occurs in a post, not just whether it occurs - and when searching, posts in which a given keyword is repeated more often rank higher. A more detailed vector space explanation is here:

http://www.perl.com/lpt/a/2003/02/19/engine.html
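A rough Python sketch of the counting part described above. The names `term_counts` and `rank_posts` are hypothetical, and this only ranks by raw occurrence count; a full vector space model would typically also weight terms and compare whole vectors.

```python
from collections import Counter


def term_counts(text: str) -> Counter:
    """Count how often each lowercased word occurs in a post."""
    return Counter(text.lower().split())


def rank_posts(posts: dict, keyword: str) -> list:
    """Return ids of posts containing the keyword, most occurrences first."""
    keyword = keyword.lower()
    scores = {post_id: term_counts(text)[keyword] for post_id, text in posts.items()}
    hits = [post_id for post_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda post_id: scores[post_id], reverse=True)


# Made-up sample data: post 1 mentions "cat" twice, post 2 once, post 3 never.
posts = {
    1: "my cat chased the other cat",
    2: "a cat sat on the mat",
    3: "dogs only here",
}
ranking = rank_posts(posts, "cat")
```

Here post 1 ranks above post 2, and post 3 is left out because it never mentions the keyword.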

Re: search engine thoughts

I'm not sure I understand what you mean by accurate. I would argue that searching for cat and getting a hit from the word cats means that the search engine is _inaccurate_. The search engine does support wildcards so searching for cat* is quite possible.

Stemming is interesting. I've looked at it before. Not sure I want to implement it though. An alternative to regular stemming is a soundex/levenshtein algorithm that basically searches based on how similar words sound. It suffers from the same issues with non-English languages though.
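For reference, a minimal Python sketch of the Levenshtein half of that idea. Strictly speaking, Levenshtein distance counts spelling edits rather than comparing sounds (Soundex is the phonetic one), but either can be used to match near-miss words. The function below is a plain textbook version, not code from PunBB.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


distance = levenshtein("cat", "cats")
```

A search could then accept indexed words within some small distance of the query word, at the price of comparing against many candidates.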

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: search engine thoughts

Rickard wrote:

I would argue that searching for cat and getting a hit from the word cats means that the search engine is _inaccurate_

In that case, you'll be arguing against everyone who's ever implemented a search engine before you ;)

Think about it: if I'm searching for "PunBB", an important post won't turn up in my results if it said "PunBB's nice features" somewhere and the word was indexed as "PunBBs". This is one of the first problems you run into when developing a search engine.

"More accurate" is not simply the same as "fewer results" - in many cases, leaving out results because people use a different form of a word (singular/plural etc.) gives you less accurate results. The technical aspects of searching are one thing, but you have to consider that searching is a human activity - we're not machines, and we do not want to have to learn how to write regular expressions or even simple wildcards before we can get down to business - the computer has to do the dirty work for us, that's what it's there for ;) ... Leaving out results because people fail to spell correctly is another cause of inaccurate results, which is why phonetic searching was invented. These (and many others) are the reasons why modern search engines like Google are so effective...

Re: search engine thoughts

I agree that searching for cat and getting hits on cats would be great. It would make the search engine better. However, the term accurate hardly applies.

Anyway, since there seems to be no effective way of doing multilingual stemming or phonetic searches, I wonder why we are even having this discussion?

Re: search engine thoughts

You're so full of positive energy, Rickard - a real inspiration.

;)

Re: search engine thoughts

Everyone comes to a point when they can't be all nice and friendly all the time. Honestly, do you think a stemming engine would be suitable for PunBB?

Re: search engine thoughts

As an option, sure - why not? Most boards are in English anyway, but of course it should be entirely optional, and the search should still work without it. You have an extremely fast search engine - apparently faster than in most PHP-based boards - why not make it great as well? :)

Re: search engine thoughts

Because the search behavior will then be different for different languages. It will be confusing. Also, I'm not too fond of the idea of having two separate search features.

Re: search engine thoughts

I don't see how it could be confusing. The search feature as such won't change from the user's point of view at all - it'll simply function better, probably a lot more like the user would expect.

And there's no need to change any tables or alter the behavior of the code at all, the only change will be a couple of calls to the stemmer here and there.
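Roughly what "a couple of calls to the stemmer" could look like, sketched in Python with an in-memory index. `TinyIndex` and `strip_plural` are hypothetical stand-ins, not PunBB's actual tables or code; the point is only that the stemmer is an optional hook called in exactly two places.

```python
from collections import defaultdict
from typing import Callable


class TinyIndex:
    """Hypothetical in-memory word index with an optional stemmer hook."""

    def __init__(self, stem: Callable[[str], str] = lambda word: word):
        self.stem = stem                  # identity by default: stemming is off
        self.postings = defaultdict(set)  # normalized word -> set of post ids

    def _normalize(self, word: str) -> str:
        return self.stem(word.lower())

    def add_post(self, post_id: int, text: str) -> None:
        for word in text.split():
            # Call 1: stem each word before it is indexed.
            self.postings[self._normalize(word)].add(post_id)

    def search(self, word: str) -> set:
        # Call 2: stem the query the same way before the lookup.
        return set(self.postings.get(self._normalize(word), set()))


def strip_plural(word: str) -> str:
    """Stand-in stemmer for the demo: drops one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


plain = TinyIndex()
plain.add_post(1, "Wedding cats")
stemmed = TinyIndex(stem=strip_plural)
stemmed.add_post(1, "Wedding cats")
```

Without the hook, searching `plain` for "cat" finds nothing; with it, searching `stemmed` for "cat" or "cats" both find post 1, and the storage layout is the same either way.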

I'll post a mod.

Re: search engine thoughts

Mod posted! :)