Topic: search engine thoughts
The search engine in PunBB is remarkably fast, and this is of course a great strength. However...
Try a search for "cats" on this forum - one topic with the title "Wedding" shows up.
Now try a search for "cat" - a bunch of posts turn up, but the topic with the title "Wedding" is not among them.
This shows that the search engine only matches the exact word form you typed, so it misses posts that use a different form of the same word.
What's missing is a stemmer - an algorithm that reduces words to their basic forms. Before a post is indexed, and before a search is executed, you reduce each word to its simplest form: "cats" becomes "cat", "flying" becomes "fly", etc. Similar words are then indexed as if they were the same, which means the search index tables become smaller, the search becomes even faster, and the results become a lot more accurate and usable.
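To make the idea concrete, here is a rough sketch in Python (not PunBB's PHP, and not how PunBB's indexer actually works). The stem, build_index and search functions and the two sample posts are made up for illustration, and the suffix rules are a toy stand-in for a real stemming algorithm, so they will get plenty of English words wrong.

```python
from collections import defaultdict

# Toy suffix rules, checked in order. This is NOT a real stemming
# algorithm such as Porter's; it only illustrates the principle.
SUFFIXES = ("ing", "s")


def stem(word):
    """Reduce a word to a crude base form by stripping one common suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(posts):
    """Map each stemmed word to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in text.split():
            index[stem(word.strip('.,!?"'))].add(post_id)
    return index


def search(index, query):
    """Stem the query the same way the posts were stemmed, then look it up."""
    return sorted(index.get(stem(query), set()))


# Hypothetical sample posts, keyed by post id.
posts = {
    1: "Our cats ruined the wedding cake",
    2: "My cat is flying to Rome",
}
index = build_index(posts)

print(search(index, "cat"))   # [1, 2]
print(search(index, "cats"))  # [1, 2]
```

The important part is that the same stem function runs at index time and at search time, so "cat" and "cats" end up under one index entry and either query finds both posts.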
There's an open-source stemming algo available here:
http://www.chuggnutt.com/stemmer.php
The problem is that this particular stemming algo works for only one language, namely English.
Some research has been done into language-independent stemming, and from what I've heard, it should actually be possible, although I have no idea how, or if the technology is free or open - there's some basic information here:
http://www.dei.unipd.it/~ims/multilingual.html
The search could also be made more accurate by implementing a vector space model - this of course comes with some performance hit and a larger search index. A vector space basically means that you record how many times a word occurs in a post, not just whether it occurs or not - and when searching, posts in which a given keyword is repeated more often rank higher. A more detailed vector space explanation is here: