1 (edited by wmconlon 2006-06-10 16:21)

Topic: Spidering issues

I've integrated punBB into my site, which uses swish-e for site-wide spidering and search.  Now that the forum has been running for a while, I decided to run a few queries, for example thy. I was surprised to see multiple references to the same page, each with a different viewtopic.php?pid=.  I see some are from the author links, but there must be some other sources.

Each page is superficially identical, with the same byte count, but presumably not byte-for-byte, since swish-e takes an MD5 hash of each page to avoid duplicates. They also have unique URIs, so the same page gets indexed repeatedly.

But I'm curious about the design intent in using two URIs to reference the same resource.  Is this so viewtopic.php can go to the second or later page of a multipage topic?  If so, why not calculate the page number in advance and use viewtopic.php?id=n&p=page?  This way, each Resource would be Uniformly Identified.

Meanwhile, I will write a Perl callback to exclude the ?pid= URIs from getting spidered.
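
A minimal sketch of the filter logic such a callback would need, written in Python for illustration rather than as the Perl callback itself. The function name `should_spider` is hypothetical, and the sketch assumes only what the thread describes: post links carry a `pid` query parameter, topic links carry `id`.

```python
from urllib.parse import urlsplit, parse_qs

def should_spider(url: str) -> bool:
    """Return False for viewtopic.php links addressed by post ID (?pid=)."""
    parts = urlsplit(url)
    if not parts.path.endswith("viewtopic.php"):
        # Not a topic view; this filter has no opinion on other pages.
        return True
    query = parse_qs(parts.query)
    # Keep ?id=... links, skip the ?pid=... aliases of the same page.
    return "pid" not in query
```

With this rule, each topic page is reached only through its `id` URI, so the spider never sees the per-post aliases.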

Re: Spidering issues

It's so you can link to specific posts rather than topics (e.g. from the last-post link on index.php).

3 (edited by wmconlon 2006-06-11 06:20)

Re: Spidering issues

But it doesn't seem to link to a specific post, as in an anchor on a page (id=n#p, where n is the topic, and p is the ordinal of the post number) but instead renders the same page as for the parent topic.  Of course, if the post is on page 2, 3, 4, etc., then yes, you go to that specific page.  But following the link takes you to the top of the page -- not to a particular post.

Say there are 15 posts per page -- we end up with 16 links (topic plus 15 posts) that all render the same page.  This bothers me, as I would much prefer a one-to-one mapping of pages and URIs.

Maybe I'm missing something.  It seems to me that it would be more useful for an author's link, for example, to go to the topic id, with the page number and anchor.  I'll have to take a look at what's involved in the SQL.

Re: Spidering issues

But it doesn't seem to link to a specific post, as in an anchor on a page (id=n#p, where n is the topic, and p is the ordinal of the post number) but instead renders the same page as for the parent topic

Right. The links look like this:
http://punbb.org/forums/viewtopic.php?pid=70546#p70546
It actually, as you guessed, goes to the page the post is on.
This way you need to store only one variable: post ID. Your way, you would need to store post ID, topic ID, and page in order to link to the post in the same way.
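
To make the lookup concrete, here is a rough sketch of how a page number could be derived from a post ID alone. It is not punBB's actual code; `page_for_post` and its arguments are hypothetical, and the default of 15 posts per page is just the figure used earlier in this thread.

```python
import math

def page_for_post(post_ids_in_topic, pid, posts_per_page=15):
    """Return the 1-based page on which post `pid` appears.

    `post_ids_in_topic` must already be in display order
    (for example, as returned by a query ordered by post time).
    """
    position = post_ids_in_topic.index(pid) + 1  # 1-based ordinal in the topic
    return math.ceil(position / posts_per_page)
```

The trade-off is the one described above: the link stores only the post ID, and the topic and page are worked out when the link is followed.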

Re: Spidering issues

Smartys wrote:

This way you need to store only one variable: post ID. Your way, you would need to store post ID, topic ID, and page in order to link to the post in the same way.

I see. Yes, one might want to store the post ordinal for efficiency, but I don't think it's necessary as I imagine punBB is already using ORDER BY post_timestamp.  The LIMIT clause is probably simpler if a page number is included, since the start row is POSTS_PER_PAGE x (PAGE_NUMBER - 1) when pages are numbered from 1.
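
The start-row arithmetic can be sketched as below. This assumes pages are numbered from 1, so the offset is the page size times one less than the page number (with zero-based page numbers it would be the plain product). `limit_clause` is a hypothetical helper, the value 15 is the per-page figure from earlier in the thread, and the `LIMIT offset, count` form is MySQL-style syntax.

```python
POSTS_PER_PAGE = 15  # assumed page size, taken from the example in this thread

def limit_clause(page_number: int, posts_per_page: int = POSTS_PER_PAGE) -> str:
    """Build a MySQL-style LIMIT clause for a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    start_row = posts_per_page * (page_number - 1)  # rows to skip
    return f"LIMIT {start_row}, {posts_per_page}"
```

For example, `limit_clause(3)` skips the first 30 posts and returns the next 15.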

Semantically, it makes more sense to me to always go to viewtopic.php?id=topicid, instead of sometimes to viewtopic.php?pid=postid.  And it would make my external spidering easier.

But maybe there are plans for trackback that would make use of the postid?