1 (edited by Cubiq 2005-10-31 16:40)

Topic: search words bugs?

I just installed PunBB and I absolutely love it!

I was working on a version of the stopwords file for my language (Italian), but looking at the search_words table I found some oddities, most of them caused by punctuation.

I added the word "noi" ("we" in Italian) to stopwords.txt, but if a post contains "noi." (noi + [period]), the period is stripped and the word "noi" is inserted into the search_words table anyway.

The same strange behaviour happens when a post contains something like "word..." (word + [3 periods]): the word is stored in the database as "word.." (word + [2 periods]).

Another example is "dr.": it's a two-letter word, so it should not be indexed, but instead it is counted as a three-letter word, then the period is stripped and "dr" (without the period) is inserted into the DB.

Is this a bug or a feature? :)

Re: search words bugs?

I see the problem. For some reason, PunBB doesn't strip out periods. I can't believe I've missed that. I will fix it for the next release. Thanks for the detailed report.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: search words bugs?

I don't know if by "next release" you meant v1.2.10 (it would have been the fastest bug fix ever...). Anyway, just to let you know that the bug is still there :P (or at least the search_words table is not fixed by the "Rebuild search index" tool).

Re: search words bugs?

He did mean 1.2.10 http://dev.punbb.org/changeset/283

5 (edited by Cubiq 2005-11-01 09:05)

Re: search words bugs?

Looking at the file search_idx.php in 1.2.10, the changes pointed out by Connorhd do not seem to be present:

        $noise_match =         array('[quote', '[code', '[url', '[img', '[email', '[color', '[colour', 'quote]', 'code]', 'url]', 'img]', 'email]', 'color]', 'colour]', '^', '$', '&', '(', ')', '<', '>', '`', '\'', '"', '|', ',', '@', '_', '?', '%', '~', '+', '[', ']', '{', '}', ':', '\\', '/', '=', '#', ';', '!', '*');
        $noise_replace =    array('',       '',      '',     '',     '',       '',       '',        '',       '',      '',     '',     '',       '',       '',        ' ', ' ', ' ', ' ', ' ', ' ', ' ', '',  '',   ' ', ' ', ' ', ' ', '',  ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '' ,  ' ', ' ', ' ', ' ', ' ', ' ');

UPDATE: yes, adding "." and " " to the match/replace arrays works like a charm.

Re: search words bugs?

Actually, that was an error. Adding it to the end of the match/replace arrays had some unwanted side effects. I "re-removed" the period from those arrays and instead filtered it out with trim().

http://dev.punbb.org/changeset/284

Re: search words bugs?

Rickard wrote:

Actually, that was an error. Adding it to the end of the match/replace arrays had some unwanted side effects. I "re-removed" the period from those arrays and instead filtered it out with trim().

http://dev.punbb.org/changeset/284

Oh, I see... are the side effects related to the indexing of web domains (e.g. www.punbb.org)?

Hmm, I guess something else still needs to be done for this issue then. Words in the stopwords file followed by a "." [period] are still inserted into the DB, and two-letter words followed by a period are treated as three-letter words (e.g. "dr." is inserted as "dr").

Re: search words bugs?

Cubiq wrote:

Oh, I see... are the side effects related to the indexing of web domains (e.g. www.punbb.org)?

Yes, but also for words such as "file.php".

Cubiq wrote:

Hmm, I guess something else still needs to be done for this issue then. Words in the stopwords file followed by a "." [period] are still inserted into the DB, and two-letter words followed by a period are treated as three-letter words (e.g. "dr." is inserted as "dr").

I can't see what's wrong with the current implementation. It filters out any periods at the beginning and the end of words, but it allows them in the middle of words.
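
As a rough illustration of the difference (a sketch, not the actual search_idx.php code): replacing every period with a space splits tokens such as www.punbb.org or file.php, whereas trim() with a '.' mask only touches the edges of a word. The helper names below are made up for the example.

```php
<?php
// Hypothetical helpers for illustration; they are not part of PunBB.

// Approach 1: treat the period as noise and replace it with a space everywhere.
function split_on_periods($text)
{
    $text = str_replace('.', ' ', $text);
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Approach 2: keep periods inside a word and strip them only at the edges.
function trim_edge_periods($word)
{
    return trim($word, '.');
}

print_r(split_on_periods('www.punbb.org'));     // the domain is broken into separate tokens
echo trim_edge_periods('www.punbb.org'), "\n";  // inner periods are left alone
echo trim_edge_periods('word...'), "\n";        // trailing periods are removed
```

With the second approach a domain or a file name stays searchable as one word, which seems to be the reason for preferring trim() here.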

Re: search words bugs?

Probably there's something I'm doing wrong.

I added the word "noi" to the stopwords, but if a post contains "noi." (noi + [period]), "noi" (without the period) is inserted into search_words. If the word "noi" appears alone (without a period), it is correctly stripped out.

You can try copying some of your stopwords, putting a period at the end of each of them and pasting them into a post. Looking at the search_words table, you should see all of the words even though they are stopwords.

I updated to 1.2.10 and ran "Rebuild search index", but the issue still persists... any clue?

Re: search words bugs?

Ah, of course. I see the problem now.

                        $num_chars = pun_strlen($word);

                        if ($num_chars < 3 || $num_chars > 20 || in_array($word, $stopwords))
                                unset($words[$i]);

should be

                        $num_chars = pun_strlen($words[$i]);

                        if ($num_chars < 3 || $num_chars > 20 || in_array($words[$i], $stopwords))
                                unset($words[$i]);

It's not the end of the world though, so I'll hold off on fixing it until 1.3.
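
To make the effect of that change concrete, here is a small self-contained sketch. It is not the real search_idx.php loop: the loop shape and the place where trim() is applied are assumptions inferred from this thread, strlen() stands in for pun_strlen(), and the function name and $use_trimmed flag are invented for the comparison.

```php
<?php
// Sketch only: loop structure and trim() placement are assumptions, not PunBB's code.
function filter_words(array $words, array $stopwords, $use_trimmed)
{
    foreach ($words as $i => $word) {
        // The cleaned value is stored in $words[$i]; $word still holds the raw token.
        $words[$i] = trim($word, '.');

        // Old behaviour checks the raw $word, the proposed fix checks $words[$i].
        $candidate = $use_trimmed ? $words[$i] : $word;
        $num_chars = strlen($candidate); // stand-in for pun_strlen()

        if ($num_chars < 3 || $num_chars > 20 || in_array($candidate, $stopwords)) {
            unset($words[$i]);
        }
    }

    return array_values($words);
}

$stopwords = array('noi');
$words = array('noi.', 'dr.', 'forum', 'file.php');

print_r(filter_words($words, $stopwords, false)); // checks the untrimmed token
print_r(filter_words($words, $stopwords, true));  // checks the trimmed token
```

In the first call "noi." and "dr." slip past the stopword and length checks and end up indexed as "noi" and "dr"; in the second call both are dropped, while "forum" and "file.php" are kept either way.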

Re: search words bugs?

Yes, it worked. Thank you, man!

Gotta fuel your PayPal account ;)