Topic: [HOWTO] - Create a stopwords list

The information below only applies to language pack authors.

All PunBB language packs contain a list of stopwords. Stopwords are words that help us humans communicate but don't necessarily "mean" anything or have any real "search value". The English word "the" is a classic example. Other classics are "that", "this", "for" and "with". Stopwords are excluded from the search index, which makes searching faster and makes PunBB take up less space in the database. Stopwords are sometimes referred to as noise words. Here's how to create a stopwords list:

1. Try to obtain a list of the most common words in your language (Google is your friend here). This step isn't necessary, but it can help you a lot. See "Alternative to 1" below if you are unable to find such a list.

2. Go through the first 100 words or so and pick out words that you consider to be stopwords. Words that are shorter than three characters should be left out of the stopwords list as they are ignored by the search engine anyway. Also, stopwords must not contain spaces, quotes or any other "special characters".

3. You should now have a list of anything between 20 and 200 words. If your list is shorter than 20 words or longer than 200 words, start over :)
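
The rules in steps 2 and 3 can be sketched as a small filter. This is only an illustration in Python, not part of PunBB. The function names are made up for the example, and reading "special characters" as "anything that is not a letter" is an assumption.

```python
def is_valid_stopword(word):
    """Return True if the word satisfies the rules from step 2."""
    # Words shorter than three characters are ignored by the search engine anyway.
    if len(word) < 3:
        return False
    # Assumption: "special characters" means anything that is not a letter.
    # str.isalpha() is False for spaces, quotes, digits and punctuation.
    return word.isalpha()


def check_list_size(words):
    """Return True if the list has between 20 and 200 words (step 3)."""
    return 20 <= len(words) <= 200


candidates = ["the", "that", "of", "don't", "with", "for", "this"]
stopwords = [w.lower() for w in candidates if is_valid_stopword(w)]
print(stopwords)  # ['the', 'that', 'with', 'for', 'this']
```

Here "of" is dropped for being too short and "don't" for containing a quote. A real list would of course need at least 20 entries to pass check_list_size.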

Alternative to 1. If you already have a forum set up with posts in your language, you can run a database query to determine what the most common words are in your forum. The query looks like this:

SELECT sw.word, COUNT(sm.post_id) AS hits FROM search_words AS sw INNER JOIN search_matches AS sm ON sw.id = sm.word_id GROUP BY sw.id ORDER BY hits DESC LIMIT 50

The query will display the 50 most common words currently in the forum. You can then use that list to determine what words should be included in the stopwords list. Please note that not all words in that list are stopwords. Some words might be very common even though they aren't stopwords.
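
To see what the query returns, here is a minimal sketch that runs it against a throwaway in-memory SQLite database from Python. The two tables contain only the columns the query references, which is an assumption about the schema, and the sample rows are invented for the illustration.

```python
import sqlite3

QUERY = """
SELECT sw.word, COUNT(sm.post_id) AS hits
FROM search_words AS sw
INNER JOIN search_matches AS sm ON sw.id = sm.word_id
GROUP BY sw.id
ORDER BY hits DESC
LIMIT 50
"""

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the columns used by the query above.
conn.execute("CREATE TABLE search_words (id INTEGER PRIMARY KEY, word TEXT)")
conn.execute("CREATE TABLE search_matches (post_id INTEGER, word_id INTEGER)")

# Invented sample data: "the" appears in 3 posts, "punbb" in 2, "search" in 1.
conn.executemany(
    "INSERT INTO search_words VALUES (?, ?)",
    [(1, "the"), (2, "punbb"), (3, "search")],
)
conn.executemany(
    "INSERT INTO search_matches VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 3)],
)

rows = conn.execute(QUERY).fetchall()
for word, hits in rows:
    print(word, hits)
# the 3
# punbb 2
# search 1
```

In this made-up data "punbb" ranks high without being a stopword, which is exactly the kind of word you have to weed out by hand.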

Important! The stopwords list in your language should NOT just be a translation of the English stopwords list. It should be a list of words that are considered stopwords in your language. A lot of stopwords in one language are also stopwords in another language, but just translating the English stopwords list doesn't help at all. It only makes things worse.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: [HOWTO] - Create a stopwords list

Also note that special chars like ' and others should not be in the stopword list. Right? :P

Re: [HOWTO] - Create a stopwords list

Good idea. I'll add it to the text.

Re: [HOWTO] - Create a stopwords list

200 words?  Are these turned into a hash to cross check against words before saving to a search database?  What happens if you use 300 words?

I'm having problems with my phpbb and checking out alternative boards (I'm using 300 words right now).

Re: [HOWTO] - Create a stopwords list

300 is no problem. You could use 3000 if you wanted to, but posting performance would decrease.
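
As an illustration of the kind of check being asked about, here is a minimal sketch in Python. It is not PunBB's code (PunBB is written in PHP), and how PunBB stores the list internally is not stated in this thread. The names are hypothetical. It only shows the general idea that each word of a post is compared against the stopwords before it goes into the search index.

```python
# Hypothetical stopword list; a real one would come from the language pack.
STOPWORDS = frozenset(["the", "that", "this", "for", "with"])


def words_to_index(post_text, stopwords=STOPWORDS):
    """Return the words of a post that would be kept for the search index."""
    kept = []
    for word in post_text.lower().split():
        # Words shorter than three characters are ignored, as described in the how-to.
        if len(word) < 3:
            continue
        # Skip anything that is on the stopwords list.
        if word in stopwords:
            continue
        kept.append(word)
    return kept


print(words_to_index("This is the list for search with stopwords"))
# ['list', 'search', 'stopwords']
```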

Re: [HOWTO] - Create a stopwords list

A good rule of thumb is to use conjunctions as stopwords:

for
and
nor
but
or 
yet
so

And articles:

a
an
the

And other determiners:

this
that
these
those
which
what

Re: [HOWTO] - Create a stopwords list

Rickard wrote:

...

The query looks like this:

SELECT sw.word, COUNT(sm.post_id) AS hits FROM search_words AS sw INNER JOIN search_matches AS sm ON sw.id = sm.word_id GROUP BY sw.id ORDER BY hits DESC LIMIT 50

...

What kind of DB does it work on?

Re: [HOWTO] - Create a stopwords list

scottywz: It should work on any DB. If you're using a table prefix, you'll have to insert it in front of search_words and search_matches though.
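
For example, the prefix can be dropped into the query like this. The sketch below is in Python, and the helper name and the prefix "punbb_" are made up for the illustration, so use whatever prefix your own installation has.

```python
def most_common_words_query(prefix="", limit=50):
    """Build the most-common-words query for a given table prefix."""
    # int() makes sure only a number is ever placed after LIMIT.
    return (
        "SELECT sw.word, COUNT(sm.post_id) AS hits "
        f"FROM {prefix}search_words AS sw "
        f"INNER JOIN {prefix}search_matches AS sm ON sw.id = sm.word_id "
        "GROUP BY sw.id ORDER BY hits DESC "
        f"LIMIT {int(limit)}"
    )


# "punbb_" is only an example prefix.
print(most_common_words_query("punbb_"))
```

With an empty prefix the function returns the query exactly as posted above.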

Re: [HOWTO] - Create a stopwords list

How do I make a "Sub Forum" in MyPunBB?

I really need help with this.

Thanks.

Plx!! i want my own forum

Re: [HOWTO] - Create a stopwords list

How is that at all related to this topic?
And anyway, you're using MyPunBB: that means you can't have subforums, since Connor hasn't installed the mod.
Plus, next time, ask for MyPunBB support in the MyPunBB forums: http://www.mypunbb.com/forum.php

11 (edited by Kilise 2005-11-23 14:27)

Re: [HOWTO] - Create a stopwords list

Smartys wrote:

How is that at all related to this topic?
And anyway, you're using MyPunBB: that means you can't have subforums, since Connor hasn't installed the mod.
Plus, next time, ask for MyPunBB support in the MyPunBB forums: http://www.mypunbb.com/forum.php

So you mean that I can't make a subforum? :(

Bah, I really need a subforum... hmm

Re: [HOWTO] - Create a stopwords list

Then host your own forums ;)
Using MyPunBB is a tradeoff: you don't have to worry about webspace, etc., but you lose the ability to add your own mods ;)

13 (edited by Kryo 2006-01-23 22:35)

Re: [HOWTO] - Create a stopwords list

I'm currently translating PunBB into Portuguese (the available pack is rather incomplete and somewhat poor), and I'd like to know whether the letter ç counts as a special character. This letter is used a lot in our language, and I'm not sure whether it's a special character.

Forgot to mention: in Portuguese we use the symbols ` ´ ^ ~ a lot. Can the ASCII code be used in the stopword list?

Re: [HOWTO] - Create a stopwords list

I wasn't aware PunBB had this feature, but, since it does, there should be an option to allow stopwords in a search... I can think of a couple situations you'd maybe want it... maybe.

Re: [HOWTO] - Create a stopwords list

CReatiVe4w3 wrote:

I wasn't aware PunBB had this feature, but, since it does, there should be an option to allow stopwords in a search... I can think of a couple situations you'd maybe want it... maybe.

The whole point of stopwords is that they're so common they're not helpful: by having them you optimize the search :P