Topic: spambots illustrate problems with punbb data model

Some time ago (measured in years) I posed  questions about the data model commonly used in php applications like punbb and wordpress-- like why isn't it normalized; why does punbb keep track of post counts in tables, instead of counting as part of a query; why isn't the user id always used, instead of sometimes using the username (poster)?

I was told that it was for 'efficiency' and responsiveness, so the database didn't have to have to do all those calculations.  That didn't satisfy me then, and I'm now seeing problems with the data model as I try to prune spam.

It's an inconvenience to have to write pruning queries using id to prune posts, but 'poster' to prune topics.  But then, none of the topic counts and page counts are correct, because these are maintained by the application instead of the database!  So still more queries needed to count them up by groups, and then manually correct them. 

it would be so much better IMHO to use the database instead of the app -- they're fast for this kind of thing.  And it also allows scalability (multiple web servers hitting one or more punbb databases; which can't be done if topic and post counts are handled by the app.

Re: spambots illustrate problems with punbb data model

It is for efficiency, in some cases: the cost of running the query every time the number is needed outweighs the cost of running the query once and storing the number and keeping it up to date.
In terms of user post count, there's also the fact that we don't want it to decrement automatically when a post is deleted.
In terms of poster/poster_id, what happens when Guests post? We need to store their poster name somewhere and the posts table is where we do it.

Now, if you explain a little more clearly what problems you're having with pruning, we might be able to help you wink

Re: spambots illustrate problems with punbb data model

There's also a cost to not being scalable.  and my conjecture is there are more cycles needed to run a php interpreter than an optimized query, but that's irrelevant really.  These are design choices, and I respect (even if I disagree with) the decisions.
But, I think that along with those decisions comes the need to provide tools that allow intelligent pruning.



Now for  more explanation:

1. I manually deleted:

delete from pun_posts where poster_id=21
delete from pun_topics where poster='amandaloan'

2.  I then noticed that the post and topic counts were erroneous, and there were more page links built than needed.

3. I tried to rebuild the index, hoping that would figure things out, but I get an error:

An error was encountered
Error: Unable to update table auto_increment.

Can you help me with the queries needed to reset the topic and post counts, please?
Something like
update pun_posts set num_replies =  ....

I'm running mysql4, so no subselects.  probably need to loop this in php.

thx.

Re: spambots illustrate problems with punbb data model

I would argue that PunBB is scalable. My understanding of replication is somewhat limited, but I believe you would have one master, several slaves, make all selects come from the slaves, and all inserts/deletes/updates go to the master. I don't see any case where PunBB makes that difficult.

Deleting all posts/topics by a user: check out it's done when users are deleted by PunBB.

        // Should we delete all posts made by this user?
        if (isset($_POST['delete_posts']))
        {
            require PUN_ROOT.'include/search_idx.php';
            @set_time_limit(0);

            // Find all posts made by this user
            $result = $db->query('SELECT p.id, p.topic_id, t.forum_id FROM '.$db->prefix.'posts AS p INNER JOIN '.$db->prefix.'topics AS t ON t.id=p.topic_id INNER JOIN '.$db->prefix.'forums AS f ON f.id=t.forum_id WHERE p.poster_id='.$id) or error('Unable to fetch posts', __FILE__, __LINE__, $db->error());
            if ($db->num_rows($result))
            {
                while ($cur_post = $db->fetch_assoc($result))
                {
                    // Determine whether this post is the "topic post" or not
                    $result2 = $db->query('SELECT id FROM '.$db->prefix.'posts WHERE topic_id='.$cur_post['topic_id'].' ORDER BY posted LIMIT 1') or error('Unable to fetch post info', __FILE__, __LINE__, $db->error());

                    if ($db->result($result2) == $cur_post['id'])
                        delete_topic($cur_post['topic_id']);
                    else
                        delete_post($cur_post['id'], $cur_post['topic_id']);

                    update_forum($cur_post['forum_id']);
                }
            }
        }

Now, to optimize that a bit more, you can choose to run update_forum for all forums after the loop is done. And in 1.3, the query inside the loop will be unnecessary thanks to the new topics column first_post_id. However, the code is really not that difficult to understand and I think PunBB handles it quite nicely.

5 (edited by wmconlon 2007-10-18 03:50)

Re: spambots illustrate problems with punbb data model

Smartys wrote:

I would argue that PunBB is scalable. My understanding of replication is somewhat limited, but I believe you would have one master, several slaves, make all selects come from the slaves, and all inserts/deletes/updates go to the master. I don't see any case where PunBB makes that difficult.

php application code is  NOT atomic, since there can be multiple concurrent http POST operations, so the count accuracy cannot be guaranteed.  This is the case for punbb rcode running on a single server or multiple servers.

It might be that punbb is not not even scalable to multiple httpd daemons on the same server (the usual case) -- in other words it's only accurate for a single user forum, unless the counts are maintained as globals or you provide blocking to prevent simultaneous posts.

Thanks for pointing out the code.  This would be a great feature to have for admin.