Topic: Purpose of search tables

I'm trying to figure out the purpose of the search tables....as well they are becomming quite large and backing up is starting to suck

forums_search_matches                571,736 - 15.9 MB
forums_search_results                3 - 21.3 KB     
forums_search_words                57,703 - 2.4 MB 

Anyways I'm to lazy to search through the code, so I figured I would ask

I enjoy pie :)

Re: Purpose of search tables

search_words contains a list of all individual words entered in all posts. search_matches is a list of word-to-post relations. search_results is a cache for search results when search results span several pages.

It's basically the search functionality of phpBB adapted to PunBB. The code I wrote for the search was very much inspired by phpBB.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Purpose of search tables

The purpose are that you don't have to search through all the posts/topics for your words. Instead the database just search through these tables to find out which posts to show to the user.

Re: Purpose of search tables

And if I empty the tables?

I enjoy pie :)

Re: Purpose of search tables

Your search won't work. You can however make the tables a bit smaller by adding common stopwords to your language packs stopwords file. Try running the query:

SELECT sw.word, COUNT(sm.post_id) AS hits FROM search_words AS sw INNER JOIN search_matches AS sm ON sw.id = sm.word_id GROUP BY sw.id ORDER BY hits DESC LIMIT 50

What you'll get is a list of the 50 most common words in the search_matches table. If you find any words in that list that you consider to be stopwords (i.e. words that we use to communicate but doesn't really "mean" anything), you can add them to your stopwords list and rebuild the search index in admin/maintenance.

Edit. If you feel like it, please post your results here. I might have missed a few common stopwords myself. I've added the following stopwords in 1.2: now, well, ive, really, thats, ill. Please note that all words are lowercased and singlequotes and such stuff is removed before the words are indexed, so "I've" should be "ive".

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Purpose of search tables

how come the records in forums_search_matches are almost 30 bytes each? it only contains two integers and a tiny, so it shouldn't be more than 9 bytes or so?

although if you dump your tables as flat text files, the size makes sense ... isn't there some way to back up in binary with mysql instead of dumping as textfiles?

hmm...

Re: Purpose of search tables

hmm... I had to test aswell ... some more stopwords for english perhaps wink
(from 9 registered users, 522 topics and 7547 posts.) wink

think, 1203
thats, 1006
ive, 740
actually, 478
etc, 476
btw, 386

some of the words in my top 50 ... that's about 4000 that I can remove ... i.e. smaller tables smile

Re: Purpose of search tables

mindplay wrote:

how come the records in forums_search_matches are almost 30 bytes each? it only contains two integers and a tiny, so it shouldn't be more than 9 bytes or so?

although if you dump your tables as flat text files, the size makes sense ... isn't there some way to back up in binary with mysql instead of dumping as textfiles?

hmm...

You have indexes that take up space aswell ... looks like this for me

Space usage :
Type     Usage
Data     2,028     KB
Index     3,842     KB
Total     5,870     KB

that is ... the index is bigger than the two ints and the tiny ...

Re: Purpose of search tables

Frank H wrote:

You have indexes that take up space aswell ... looks like this for me

er, yes, but surely your indexes don't get dumped - those are reconstructed as the exported SQL statements execute back into the database, if you have to restore it.

of course if you're talking about the filesize of the binaries on your hard drive, then yes. but most people don't back those up directly, since you can't do so without first stopping your server.

Re: Purpose of search tables

Frank H: Cool. Thanks. Maybe I'll add "think" to the stopwords as well. I can't think of a situation where you would want to seach for the word think.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

11 (edited by Jansson 2004-05-11 14:23)

Re: Purpose of search tables

What about "how to think" then? ;)

Re: Purpose of search tables

well, think twice!

my jokes just reached an all-time low!

Re: Purpose of search tables

Cogito ergo sum.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Purpose of search tables

word  hits  
script 1442
good 1437
well 1318
post 1315
now 1231
think 1199
people 1091
need 1067
work 1039
ill 972
thats 939
scar 937
help 894
time 866
new 838
rs2 812
com 787
try 782
back 737
still 709
right 701
say 693
someone 654
because 653
first 653
really 652
shit 629
made 621
give 598
add 593
sure 588
doesnt 588
name 583
program 554
nice 551
start 549
banned 547
fucking 546
works 541
scripts 536
anyone 523
fuck 500
msn 496
stuff 491
read 491
yeah 485
ive 482
again 481
look 480
wont 477

Ok......well my forums has 3030 users and 76693 posts.....the results of the search where a bit scewed as I had to empty the search tables a few weeks ago as they where killing my backups.

Anyways......thats what I got

I enjoy pie :)

Re: Purpose of search tables

wow, I see similarities ...
'think', 'thats' & 'ill' is very high for you too smile

hmm... no need to backup the search indexes ... as you can rebuild them in the forum admin panel, it takes time, but you have time for a cofeebreak (I did it myself after adding a few new stopwords). (I hope I'm correct, othervise Rickard probably see this wink)

Re: Purpose of search tables

Mediator: Great. I see a few more words that could be added to the list in 1.2.

Frank H: You're correct :)

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Purpose of search tables

Ah jesus......I am such a newb, never even noticed that function.....anyways here are the new results from the search

word  hits  
good 5370 
now 5056 
think 5026 
work 4956 
well 4781 
need 4693 
post 4568 
help 4179 
time 3868 
ill 3745 
people 3676 
thats 3613 
new 3275 
try 3157 
doesnt 2859 
still 2849 
right 2823 
really 2712 
com 2709 
back 2702 
because 2697 
say 2598 
program 2572 
first 2542 
someone 2535 
made 2534 
works 2435 
sure 2381 
add 2357 
anyone 2337 
ive 2281 
name 2229 
nice 2143 
again 2127 
same 2095 
shit 2089 
banned 2074 
look 2056 
start 2040 
give 2027 
put 2014 
tell 2005 
using 1972 
working 1912 
yeah 1900 
wont 1880 
fucking 1870 
stuff 1824 
maybe 1823 
read 1804 
trying 1777 
change 1761 
used 1736 
done 1731 
site 1695 
says 1694 
man 1690 
fuck 1682 
anything 1674 
fix 1671 
msn 1636 
wrong 1633 
edit 1630 
while 1604 
better 1599 
little 1526 
problem 1525 
sorry 1454 
down 1431 
long 1422 
mean 1407 
found 1404 
mine 1403 
check 1402 
plz 1397 
already 1368 
few 1359 
might 1358 
free 1354 
I enjoy pie :)

Re: Purpose of search tables

"fucking" is quite high on all forum word lists, it's remarkable how rude people on forums can get ... ah, fuck it! wink

Re: Purpose of search tables

I few more candidate stopwords -- these appeared in my common words and are also included in the two lists above:

now well good time back still sure really

Re: Purpose of search tables

sleddog: I'll revise the list for 1.2. Not sure all of those words are stopwords though. I'm primarily thinking of "good" and "time".

"Programming is like sex: one mistake and you have to support it for the rest of your life."

21 (edited by sleddog 2004-05-18 10:17)

Re: Purpose of search tables

I agree. "Good" and "time" might be very significant words depending on the forum context -- for example, in a forum about racing.

Thinking about this, it seems that what is or isn't a stopword is quite dependent on the nature of the forum. What is a stopword for one forum may not be for another.

Which suggests the idea of a script to allow forum administrators to customize his/her stopwords; either a standalone script or something incorporated into the admin panel. I'd see it doing something like this:

1. On first run, briefly explain the concept of stopwords and why a forum administrator may want to customize the list;
2. On clicking a "Next" button, produce a list of common words (like the query above), except that the output would be loaded in a form, with each word a checkbox item;
3. The admin would checkmark some common words, and click a "Add to stopword list" button at page bottom.

Re: Purpose of search tables

An alternative would be to just inform the administrator of the existence of the stopwords list and suggest that he/she review it and make any changes they feel suitable for that perticular forum.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Purpose of search tables

Yes, that's an alternative. Though making changes to the stopword list would be greatly facilitated by having available a most-commonly-used wordlist generated from the forum database. If the forum administrator does not have access to a tool like phpmyadmin and isn't comfortable writing his/her own query script, there's no way of generating that commonly-used wordlist.

Re: Purpose of search tables

But the query to generate that list can take a very long time to execute. I recently ran it on the forum database at sweclockers.com and I believe it took over a minute. By then, the script has terminated.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Purpose of search tables

you could always cheat, and force the server to let you run it a long time....I use ini_set("max_execution_time","999999"); all the time.....then again most people don't have a 12 gig mysql of md5 hashes, and their matching words

I enjoy pie :)