1 (edited by Yann 2008-04-11 11:12)

Topic: Punbb search & large boards

Hello everybody,

Our board continues to grow (>1.5M posts), and I had to disable the search because it was just unusable. PHP uses more than 50 MB of RAM for not-so-complex searches; even when the SQL server is not heavily loaded, the searches take a long time; and because of MyISAM table-level locks, a search makes the board appear frozen most of the time. The search results are also completely irrelevant most of the time.

Has anybody tried a more advanced search engine, like Lucene or Sphinx? I am also following Wikia Search quite closely... What other options do I have?

Re: Punbb search & large boards

In 1.3 there will be fulltext search. I think that should be faster...

It can't be too long until it is released (at least I hope so).

FluxBB - v1.4.8

Re: Punbb search & large boards

As lie said, you'll just have to be patient.
Last time you brought this up, I think I mentioned the fulltext modification. Remind me again why that's not suitable?

4

Re: Punbb search & large boards

Hello Smartys,

Because it was buggy and enabled people to search in forums they shouldn't have access to. I don't think fulltext is a suitable solution for a big board; it's probably better, but not good enough. There will probably still be locks, and the search results won't be more accurate...

I am more looking for something like xapian, sphinx, lucene... If anybody has time...

Re: Punbb search & large boards

"Because it was buggy and enabled people to search in forums they shouldn't have access to"
And as I pointed out to you:
http://www.punres.org/viewtopic.php?pid=15699#p15699

"There will probably still be locks"
They shouldn't be nearly as bad

"and the search results won't be more accurate"
Do you have something to base that statement on or are you just saying it?

"I am more looking for something like xapian, sphinx, lucene"
Someone could develop it as an extension for 1.3, but I doubt anyone is going to devote the time and effort at this point to completely rewriting 1.2's search to use a new system.

Re: Punbb search & large boards

What systems are used in phpBB, vBulletin, SMF, or IPB?

Just asking.

MyFootballCafe.com  is Now Online!

Re: Punbb search & large boards

Smartys wrote:

...but I doubt anyone is going to devote the time and effort at this point to completely rewriting 1.2's search to use a new system.

Not all PunBB webmasters are in ecstasy over 1.3 and its fulltext search. As for me, fulltext has critical disadvantages.
I'm working on some 1.2 search improvements.

DigitalOcean: VPS from $5/mon. Get $10 bonus!.

Re: Punbb search & large boards

What are the disadvantages of fulltext search? I mean, it is used in vBulletin, and it's fast on boards that have more than 1 million users.

Here is what vBulletin's search uses:

Search Type 
vBulletin supports two types of search indexing. Fulltext searching uses a search index that is constructed by MySQL itself, whereas vBulletin's own search feature uses its own index.

You set the search type here:

Admin CP -> vBulletin Options -> Search Type

By default, vBulletin will use its internal indexing feature. The results of this indexing process are stored in two tables, word and postindex. This provides a fast search mechanism but can cause problems on larger forums due to the ever-increasing size of these tables. Each unique word is indexed in the word table and each occurrence of the word is indexed in the postindex table. To get around the large amount of space these tables can occupy we implemented MySQL Fulltext Search. The search type screen allows you to switch between the two. It is a simple toggle, so submitting the screen switches between the two modes.

When switching a forum to the fulltext search mode, you will want to consider emptying the indices that the default search engine built. These indices are not used by the fulltext search and consume a large portion of your database. You should be certain that you are going to permanently use the fulltext search before removing these indices since, generally, it takes a lot of time and server load to rebuild these indices. Another consideration is during any time that the fulltext option is enabled, these indices will not be updated by any new posts. Using fulltext search for an extended period of time will leave these indices stale and you may still wish to rebuild them.
Note:
The minimum and maximum length of words to be indexed is defined by the ft_min_word_len and ft_max_word_len system variables (available as of MySQL 4.0.0). The default minimum value is four characters. The default maximum depends on your version of MySQL. If you change either value, you must rebuild your FULLTEXT indexes. For example, if you want three-character words to be searchable, you can set the ft_min_word_len variable by putting the following lines in an option file:

[mysqld]
ft_min_word_len=3

Then restart the server and rebuild your FULLTEXT indexes. Also note particularly the remarks regarding myisamchk in the instructions following this list.

For more on Fulltext Search from MySQL please visit:
http://dev.mysql.com/doc/refman/5.0/en/ … uning.html
You can also empty these indices in the Update Counters section of Maintenance.

You may want to optimize the postindex and word tables afterwards by going to the Repair / Optimize Tables section of Maintenance.


Re: Punbb search & large boards

artoodetoo wrote:
Smartys wrote:

...but I doubt anyone is going to devote the time and effort at this point to completely rewriting 1.2's search to use a new system.

Not all PunBB webmasters are in ecstasy over 1.3 and its fulltext search. As for me, fulltext has critical disadvantages.
I'm working on some 1.2 search improvements.

artoodetoo: For example?
And if you like 1.2's search so much, you could probably reimplement it as an extension. I wouldn't spend my time writing a "better" search for 1.2 though, unless you think you'll be using it for a while.

10 (edited by artoodetoo 2008-04-23 11:20)

Re: Punbb search & large boards

Smartys, you're a happy 8-bit human.

I'm a needy Russian, and my bitter experience is that fulltext sometimes gives strange results.
It's common that US and other Latin-alphabet users cannot understand the problem.

By the way, do you think 1.3 is Unicode-ready? I don't think so...

So, I have to "reinvent the wheel".


Re: Punbb search & large boards

artoodetoo: As Smartys said, it will be easy to use the 1.2 search on 1.3, but apart from that, what ideas do you have for improving the 1.2 search?

Re: Punbb search & large boards

artoodetoo: It's hard to understand a problem when the only given information on it is "there's a problem. you won't understand it."

Re: Punbb search & large boards

Smartys:

First of all, UTF-8 is still a one-byte encoding as long as all your text is plain Latin (ASCII). This is why you and other developers and testers can't hear me.
But there are millions and millions of users who use non-Latin languages (multibyte in UTF-8).

1. Fulltext search is not relevant enough for Russian (and very probably for other languages).
The solution is to give the board admin an OPTION: Fulltext|Legacy search.

2. PunBB 1.3 uses neither the mbstring extension nor "workarounds". PHP itself IS NOT Unicode-compatible. All string functions such as strtolower, strlen, etc. will produce WRONG, UNPREDICTABLE RESULTS for multibyte strings. In general, this trouble IS NOT curable by extensions. There are too many string function calls in the code...
The solution (not described here) might be function overloading. Unfortunately, it is not possible on most hosts.

The worst thing is that the forum owner is forced to change the base engine code. I hope you'll understand me at last.
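
To make point 2 concrete, here is a minimal sketch of how byte-oriented functions and their mbstring counterparts disagree on Cyrillic text. The sample word is an assumption chosen for illustration, the file must be saved as UTF-8, and the mbstring extension must be loaded.

```php
<?php
// Illustrative only: the sample word is not taken from the thread.
$word = 'Привет'; // 6 Cyrillic letters, 2 bytes each in UTF-8

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

$cut_bytes = substr($word, 0, 3);             // cuts through the middle of the 2nd letter
$cut_chars = mb_substr($word, 0, 3, 'UTF-8'); // keeps the first three letters intact

echo "strlen: $bytes, mb_strlen: $chars\n";
echo 'substr result is valid UTF-8: ';
var_export(mb_check_encoding($cut_bytes, 'UTF-8'));
echo "\nmb_substr: $cut_chars\n";
```

The byte count is twice the character count here, and the byte-based cut leaves a dangling half character, which is the kind of damage that happens before the text ever reaches the search index.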

Best regards!


14

Re: Punbb search & large boards

How do other open source LAMP forums (eg phpBB, Phorum, SMF etc) handle non-Latin language full-text search?

15 (edited by artoodetoo 2008-05-09 08:55)

Re: Punbb search & large boards

I don't know about the others. Is it UTF-8? Does it care about multibyte?

Look at PunBB, search.php:

        $keywords = (isset($_GET['keywords'])) ? strtolower(trim($_GET['keywords'])) : null;

In the UTF-8 version, strtolower breaks Cyrillic text BEFORE the real search.

I've made a test script with this fragment for illustration: testmb.php

<?php
    mb_internal_encoding('utf-8');

    $keywords = (isset($_GET['keywords'])) ? strtolower(trim($_GET['keywords'])) : null;
    $keywords_len = strlen($keywords);
    $author = (isset($_GET['author'])) ? strtolower(trim($_GET['author'])) : null;

    $keywords_mb = (isset($_GET['keywords'])) ? mb_strtolower(trim($_GET['keywords'])) : null;
    $keywords_len_mb = mb_strlen($keywords_mb);
    $author_mb = (isset($_GET['author'])) ? mb_strtolower(trim($_GET['author'])) : null;

?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>UTF-8 text transformation test</title>
<style>
FORM {padding: 0; margin: 0; float: left; overflow: hidden; width: auto;}
FIELDSET, DIV, P {padding: 0; margin: 0; overflow: hidden; width: auto; clear: both;}
SPAN.lbl {position: absolute; left: 10px;}
P {padding: 10px; position: relative;}
INPUT {float: left; margin-left: 10em;}
SPAN {font-weight: bold;}
</style>
</head>
<body>
    <form action="testmb.php" method="get">
    <fieldset>
        <p><label><span class="lbl">Keywords:</span> <input name="keywords" type="text" value="<?php echo htmlspecialchars(isset($_GET['keywords']) ? $_GET['keywords'] : '') ?>" /></label></p>
        <p><label><span class="lbl">Author:</span> <input name="author" type="text" value="<?php echo htmlspecialchars(isset($_GET['author']) ? $_GET['author'] : '') ?>" /></label></p>
        <p>
        <input type="submit" value="OK" />
        </p>
    </fieldset>
    </form>

    <div>
        <h3>Like in PunBB (non-multibyte):</h3>
        <p><span>keywords:</span> <?php echo $keywords ?></p>
        <p><span>keywords len:</span> <?php echo $keywords_len ?></p>
        <p><span>author:</span> <?php echo $author ?></p>

        <h3>Multibyte:</h3>
        <p><span>keywords(mb):</span> <?php echo $keywords_mb ?></p>
        <p><span>keywords len(mb):</span> <?php echo $keywords_len_mb ?></p>
        <p><span>author(mb):</span> <?php echo $author_mb ?></p>
    </div>

</body>
</html>

Screenshot 1 (Windows): broken characters
http://img167.imageshack.us/img167/3504/testmbkc5.th.gif

Screenshot 2 (Unix): case not changed
http://img242.imageshack.us/img242/1893/testmb2fn8.th.gif

Maybe it depends on the locale or PHP version... but in both cases it is wrong, because PHP is not multibyte at its core. It should use the mbstring extension!

Edited: I've added strlen/mb_strlen to the example code.
Live example with Russian text: http://tlogr.com/testmb.php?keywords=%D … 0%B0%D0%BD

P.S. I found why and when the Windows default behaviour differs from Unix. In phpBB3:

// Enforce ASCII only string handling
setlocale(LC_CTYPE, 'C');

When I copy this part into the test script, both installations do the same (as in the Unix screenshot).
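
As a rough sketch of that "Unix screenshot" behaviour: with the C locale in effect, strtolower only touches ASCII letters, so Cyrillic keeps its case, while mb_strtolower lowercases both. The input string is a hypothetical keyword, the file must be saved as UTF-8, and mbstring must be loaded.

```php
<?php
// Same locale call as quoted from phpBB3 above.
setlocale(LC_CTYPE, 'C');

$input = 'ПОИСК Search'; // hypothetical search keywords

$byte_lower = strtolower($input);             // only A-Z are lowercased
$mb_lower   = mb_strtolower($input, 'UTF-8'); // Cyrillic is lowercased too

echo $byte_lower, "\n"; // ПОИСК search
echo $mb_lower, "\n";   // поиск search
```

So the locale call stops the byte corruption, but a case-insensitive search still needs the multibyte function.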


Re: Punbb search & large boards

And CMSes, and wikis, and a lot of other software, by the way...

17 (edited by artoodetoo 2008-05-09 08:52)

Re: Punbb search & large boards

A lot of software uses mbstring, or it is not PHP,
and (I think) they use international testers during development.

Compare with the latest phpBB:

By the way, the phpBB3 engine tests for the presence of the mbstring extension and, if it is not found, emulates it.
Also, phpBB3 gives a choice of how to search: by default it does NOT use MySQL fulltext, and it cares about non-Latin languages.

http://img514.imageshack.us/img514/8149/phpbbsearchadminef9.th.gif

phpBB's search tables are very similar to PunBB 1.2's: there are wordlist, wordmatch and searchresult tables.
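
For readers unfamiliar with this kind of native search, the idea behind a word list plus a word-to-post match table can be sketched in memory as follows. This is only an illustration under assumptions: the function names and sample posts are hypothetical, plain PHP arrays stand in for the database tables, and it is not the actual phpBB or PunBB code.

```php
<?php
// Split text into unique lowercase words, multibyte-aware.
function search_tokenize($text)
{
    $text = mb_strtolower($text, 'UTF-8');
    // Split on anything that is not a Unicode letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_unique($words);
}

// Build word => list of post IDs (stand-in for the words and matches tables).
function build_index(array $posts)
{
    $index = array();
    foreach ($posts as $post_id => $message) {
        foreach (search_tokenize($message) as $word) {
            $index[$word][] = $post_id;
        }
    }
    return $index;
}

// Return IDs of posts that contain every word of the query.
function search_index(array $index, $query)
{
    $result = null;
    foreach (search_tokenize($query) as $word) {
        $ids = isset($index[$word]) ? $index[$word] : array();
        $result = ($result === null) ? $ids : array_values(array_intersect($result, $ids));
    }
    return ($result === null) ? array() : $result;
}

// Hypothetical sample posts.
$posts = array(
    1 => 'Поиск по форуму работает медленно',
    2 => 'Search is slow on large boards',
    3 => 'Медленно работает ПОИСК',
);
$index = build_index($posts);
print_r(search_index($index, 'поиск медленно'));
```

Because the words are normalized by the application before they are stored, the quality of the results depends on the multibyte handling in PHP rather than on the database's fulltext parser.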

My Request:
Dear PunBB developers!
- The next PunBB should have an administrative option to choose between the MySQL fulltext and native (like in 1.2) search engines.
- In general, all string functions should be multibyte. If the mbstring extension is not available, PunBB should use its own workaround.
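
A minimal sketch of what such a workaround could look like for one function, assuming valid UTF-8 input. The helper names are hypothetical and exist neither in PunBB nor in phpBB; the fallback counts every byte that is not a UTF-8 continuation byte.

```php
<?php
// Hypothetical fallback: works without the mbstring extension.
function utf8_strlen_fallback($str)
{
    // Continuation bytes look like 10xxxxxx (0x80-0xBF); every other byte starts a character.
    return preg_match_all('/[^\x80-\xBF]/', $str, $matches);
}

// Hypothetical wrapper: prefer mbstring when it is available.
function utf8_strlen_compat($str)
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($str, 'UTF-8');
    }
    return utf8_strlen_fallback($str);
}

echo utf8_strlen_compat('Привет'), "\n"; // 6
```

The same wrapper pattern would have to be repeated for strtolower, substr and the other string functions, which is where most of the effort lies.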


Re: Punbb search & large boards

artoodetoo: And THAT is exactly what we needed to hear. Thanks for the report, I'll look into it.

Re: Punbb search & large boards

I hope so. Thank you!


20

Re: Punbb search & large boards

Bump.

I know the wagon has moved on here, in a sense, but is this still part of the development effort for PunBB/FluxBB?

I know the new PunBB devs (as Cyrillic users) might be more interested in the issues of both full-text and multibyte, but did anything make it into any new code anywhere, for any of the 1.2* or 1.3 families?

Re: Punbb search & large boards

sirena wrote:

Bump.

I know the wagon has moved on here, in a sense, but is this still part of the development effort for PunBB/FluxBB?

I know the new PunBB devs (as Cyrillic users) might be more interested in the issues of both full-text and multibyte, but did anything make it into any new code anywhere, for any of the 1.2* or 1.3 families?

Regarding UTF-8 support: this should be completely done by now. Please report any issues.

About very large forums: we haven't done our own benchmarks and tuning yet, but this is one of our tasks. Nevertheless, something like Sphinx will always be a better choice (otherwise we would have to reimplement Sphinx inside PunBB).

Carpe diem