Re: Improved search with Porter Stemmer Mod

I doubt it. Also, that will be a whole lot of tables as the forum grows. The search index at sweclockers.com was around 45 million rows long before we switched to mysql fulltext indexing.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

Then how about indexing topics instead of posts? In search_matches, we currently have:

post_id <-> posts.id
word_id <-> search_words.id
subject_match

Suppose we now change it to:

topic_id <-> topics.id
post_id -> posts.id
word_id <-> search_words.id
subject_match

And then store only one record per topic, instead of one per post - the post_id you store for every topic, would be simply the last occurence of the word in each topic ... the way that "search for topics" works, it only displays each topic once, so the information about which other posts may have contained the same word in the same topic, is essentially never used, and as far as I can figure, could be safely discarded?

Of course then, "search for posts" wouldn't be possible - but the output from this mode of search is extremely confusing to browse anyways, as everything is displayed out of context ... I don't know anyone who uses that feature.

Even so, if you wanted to preserve this feature, you could do both - have two tables, one that indexes posts and one that indexes topics ... of course, this would make your search tables possibly take up a considerable amount of extra space. But searches for topics is definitely a lot more common than search for posts - I would guess at least 90% of all searches are probably searches for topics; if those 90% of all searches were working on a table that is probably one fourth of the size (assuming an average 4 posts per topics, which is probably low anyways), that should speed things up considerably.

What do you think?

Re: Improved search with Porter Stemmer Mod

I don't think the number of rows the search index contains is a problem per se. Properly indexed, searching through a table with a huge amount of rows is relatively fast for the database. There's topic somewhere on foruminsider.com discussing search indexing. It was quite interesting.

Also, I use "show as posts" all the time. I agree, it can be confusing, but if you know what you're searching for, it sure beats wading through 10 pages of posts.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

Then what is the problem? ... is it simple the number of search queries? which undoubtedly is also a lot higher on forums with that much activity. If so, maybe we could improve things by somehow caching common searches? On a huge forum with a certain topic, there must be lots of searches that get repeated over and over again?

Re: Improved search with Porter Stemmer Mod

I think the problem is that PunBB need to "juggle" around with quite large arrays of ID's. I haven't had time to do any real profiling on this, but I think that's where the bottleneck is. If the database is properly indexed and configured and you have performance issues, it is almost always in the scripts. I very seldom see a website being limited by e.g. MySQL. Apache/PHP are the thiefs.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

Well, would there be some way to get your IDs from the query as one comma-separated string, instead of as an array? You need to pass it in comma-separated form to subsequent SQL queries anyways, maybe there is no point in using arrays in the first place?

You should maybe try adding more timers around more blocks of code, not just SQL queries, to discover exactly where the bottleneck is, and then see if there's any way to optimize on it? Of course, for starters, try timing the SQL queries, then compare against the total time taken to generate the entire page, to see how much of the total time is spent by PHP vs how much time is spent by the SQL engine?

Re: Improved search with Porter Stemmer Mod

I would use something like XDebug instead. It provides much more accurate information.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

I didn't know this type of tool even existed - looks very useful ... I'm not sure I have time to get into it now though, I think I should rather concentrate on finishing PunMod first wink

Re: Improved search with Porter Stemmer Mod

Rickard, take a look here:

http://wordindex.sourceforge.net/

This search engine is incredibly fast - and as far as I can see, the only major difference between this system and yours, is that the wordlinks (links from word IDs to post IDs) is split into multiple tables. Despite your faith in MySQL's ability to search a single enormous table quickly enough, this does not seem to be the case.

The indexer is done in PERL, but as said, is not really that different from yours, as far as I can tell - the table structures are very similar as well, but you may want to examine them for key differences; for one, a double-key is used in the word-link tables, I think this may be one of the key points to it's high performance.

The example on the website indexes 22.000 posts with 3 million wordlinks split across 100 tables, e.g. 20-50 thousand wordlinks per table. A search typically takes less than one second. Now, I know you may have forums with 100 times that number of posts - but the thing is, according to the programmer of this search engine, it doesn't have to get any slower, you can just split across an even greater number of tables; he has successfully split 14 GB of posts over 1000 tables, and still been able to search them in less than one second even on a "modest server".

The technique used to determine which table a word is located in, is simple - just take the modulus of the word ID and the number of tables, e.g. word 12345:

12345 % 100 = table 45

The search engine itself is done in PHP, so it should be easy for you to compare it to yours.

Just try the demo on the website, and you'll see, it's incredibly fast smile

I don't know if splitting across multiple tables is the only key to it's speed, there may be other key factors in the PHP script...

Re: Improved search with Porter Stemmer Mod

If you don't have time, let me know, and I'll see if I can find time to do some tests with PunBB smile

Re: Improved search with Porter Stemmer Mod

Rickard: upgraded to PunBB v1.2.5 - I wanted to do some search testing ... in common.php I enabled the following:

define('PUN_DEBUG', 1);
define('PUN_SHOW_QUERIES', 1);

but it doesn't seem to work properly, it doesn't display all the queries? namely the interesting queries - word lookups and wordlink lookups - are not shown in the output??

Re: Improved search with Porter Stemmer Mod

You'll have to remove the header redirect (to search.php?id=12345) in order to see those queries.

I'll look into the search indexer you linked to, but I probably won't be able to do it until sometime next week. Feel free to play with it if you have the time.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

I have been doing a closer examination of the data structures.

You have chosen to make `search_words`. `word` a VARCHAR(20), which doesn't really seem to make any sense ... a CHAR would give faster searches ... of course, this comes at the cost of more diskspace usage, but in my database with about 15.000 words, it only grew from 590KB to 648KB - because yes, the word records as such will take up more space, but the search index will also take up less space. So this seems like a sensible change - using VARCHAR instead of CHAR for strings as small as 20 chars, doesn't really make any sense.

This seems to be the only relevant difference between your table structures and the ones used in wordindex.

Re: Improved search with Porter Stemmer Mod

Ah, maybe that's not such a bad idea. Have you measured any noticeable speed increase?

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Improved search with Porter Stemmer Mod

As you noted earlier, a major bottleneck at the moment seems to be the PHP script time itself, perhaps more so than the actual database queries, so the above change may give a barely noticeable performance increase for most searches...

So I'm looking at your search code, and there is a major difference between the way you do it, and the way it's done in wordindex.

PunBB walks through the wordlinks one at a time (the while-loop starting at line 167 in search.php).

Wordindex gets all of the wordlinks, sticks them in one array per word, and then uses one function call to combine those arrays.

Clearly, if you can get away with one function call instead of a whole bunch of function calls, this is going to give you a massive speed advantage in those cases where a word has lots of wordlinks, because looping through them in your script is a slow and heavy process compared to what a native, compiled function can achieve.

So what you might want to do, is try to avoid using loops - use single calls instead, and you should get a considerable performance increase.

Strangely enough, the MySQL library apparently does not provide a function to get all rows from a result to an array with one function call? You will still have to get get them one at a time and push them to an array.

Once you have all of your wordlink arrays (one array per word) filled, you can then join them with a single function call, which should give a notable speed increase.

Try array_intersect() for "OR" searches, and array_diff() for "AND" searches ... should considerably faster than doing similar operations manually?

Re: Improved search with Porter Stemmer Mod

Rickard wrote:

Ah, maybe that's not such a bad idea. Have you measured any noticeable speed increase?

I haven't time the queries to see what's faster, no - I'm out of time for today, sorry smile