Topic: Spidering issues
I've integrated punBB into my site, which uses swish-e for site-wide spidering and search. Now that the forum has been running for a while, I decided to run a few queries, for example thy. I was surprised to see multiple references to the same page, each with a different viewtopic.php?pid=. I see some are from the author links, but there must be some other sources.
Each page is superficially identical, with the same byte count. Since swish-e takes an MD5 hash of each page to avoid duplicates, the copies presumably differ somewhere in their content, and they have unique URIs, so the same page gets indexed repeatedly.
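To show what I mean about the hash, here is a rough Python sketch (not swish-e's actual code). The two page bodies are made up for illustration; I am only assuming the copies differ by something small such as an embedded post id. Equal byte counts do not imply equal MD5 digests, so a digest-based duplicate check would let both through.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Return the MD5 hex digest of a page body."""
    return hashlib.md5(body).hexdigest()


# Hypothetical page bodies: same length, but a different post id inside.
page_a = b'<a href="viewtopic.php?pid=101#p101">post</a>'
page_b = b'<a href="viewtopic.php?pid=102#p102">post</a>'

same_length = len(page_a) == len(page_b)
same_digest = content_digest(page_a) == content_digest(page_b)

print("same length:", same_length)
print("same digest:", same_digest)
```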
But I'm curious about the design intent in using two URIs to reference the same resource. Is this so viewtopic.php can go to the second or later page of a multipage topic? If so, why not calculate the page number in advance and use viewtopic.php?id=n&p=page? This way, each Resource would be Uniformly Identified.
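A minimal sketch of the calculation I have in mind, in Python for brevity. I am assuming the post's 1-based position within its topic is known, and the default of 25 posts per page is a made-up value; the function names are mine, not punBB's.

```python
def page_for_post(position: int, posts_per_page: int) -> int:
    """Map a post's 1-based position in its topic to a 1-based page number."""
    if position < 1 or posts_per_page < 1:
        raise ValueError("position and posts_per_page must be >= 1")
    return (position - 1) // posts_per_page + 1


def canonical_url(topic_id: int, position: int, posts_per_page: int = 25) -> str:
    """Build the single topic URL that contains the given post."""
    page = page_for_post(position, posts_per_page)
    return f"viewtopic.php?id={topic_id}&p={page}"
```

With this, every post on the same page of a topic links to the same URI, so a spider sees that page only once.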
Meanwhile, I will write a Perl callback to exclude the ?pid= URLs from getting spidered.
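The logic of that filter, sketched in Python so it is easy to try out; the real callback would be the same test written in Perl. The function name should_spider and the example.com URLs are placeholders of mine.

```python
from urllib.parse import parse_qs, urlsplit


def should_spider(url: str) -> bool:
    """Return False for viewtopic.php URLs that carry a pid parameter."""
    parts = urlsplit(url)
    if not parts.path.endswith("viewtopic.php"):
        return True  # not a topic page, leave it alone
    # keep_blank_values so that a bare "pid=" is still detected
    query = parse_qs(parts.query, keep_blank_values=True)
    return "pid" not in query
```

Checking the parsed query rather than matching the raw string avoids rejecting other parameters that merely end in "pid".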