1 (edited by colak 2006-06-05 09:39)

Topic: smart quotes

I just started a forum using PunBB. Although I registered here some time ago, this is my first real project using PunBB.

Question: is there a way for PunBB to recognise smart quotes copied and pasted from other programs (e.g. Word, Pages, whatever)? Our new forum seems to be attracting a lot of copy/pasters, and when they paste, smart quotes end up in the posts, making them invalid XHTML Strict.

I realise that forums cannot be so 'strict' about XHTML validation, but the problem with smart quotes remains: some browsers do not display them (they are replaced by question marks in IE Mac... OK, dead, but still kicking for some users).

For the sake of this post here are some smart quotes: ? and here is the validation page

2

Re: smart quotes

anyone?

Re: smart quotes

Hrmm, you could replace them by their HTML entity...

4

Re: smart quotes

Doesn't PHP have a function for dealing with special characters, and doesn't PunBB have its own version of it? Wouldn't running the message returned by parser.php through that, or even better the message as it is being posted, solve the problem? Maybe somebody else could comment on whether that works as a solution.

Re: smart quotes

I think the problem is with the browser not transmitting it properly, because PunBB does indeed use htmlspecialchars().

6

Re: smart quotes

hi,
and thanks for your replies
Well, I could replace them with their HTML entities, but not all users can. Shouldn't this be part of the PunBB parsing? I do not believe the problem is with the browser, as the browser transmits whatever is in the fields at the time. If what is there is a copy/paste from a text processing program, that is what it is going to transmit. I believe the character is not a UTF-8 one, and that is where the problem is.

Having non valid xhtml is not a big thing in forums but it would be cool if a global solution were to be found for these non utf-8 characters.

7

Re: smart quotes

The difficulty we are having is that a PHP function is already being applied which is supposed to take care of it. Working out why it is not doing its job is the difficult part. I don't see that the browser could be the problem, since this character is supposed to be removed before it ever gets to the browser. I'm sure there is a simple answer to this, I just don't know what it is.

This has to be fixed for version 1.3 since it is intended to be capable of being served as application/xhtml+xml and this problem could well cause a page to fail completely. I wonder if we need htmlentities() rather than htmlspecialchars().
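
A small sketch of why htmlspecialchars() alone does not help here. It escapes only the HTML-significant characters (& < > " '), so the Windows-1252 smart-quote bytes pass straight through. The sample string and the explicit 'ISO-8859-1' charset are assumptions for illustration, not taken from PunBB's source.

```php
<?php
// Hypothetical sample: smart-quoted text plus a tag, using Windows-1252
// bytes 0x93 (chr(147)) and 0x94 (chr(148)) for the quotes.
$text = chr(147) . 'quoted' . chr(148) . ' <b>';

// The charset is passed explicitly; ISO-8859-1 is an assumption about the forum setup.
$escaped = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');

// Only & < > " ' are escaped, so the smart-quote bytes are still present afterwards.
$quotesSurvive = strpos($escaped, chr(147)) !== false
    && strpos($escaped, chr(148)) !== false;

echo bin2hex($escaped), "\n";
var_dump($quotesSurvive);
```

The tag is escaped to &lt;b&gt;, but the quote bytes are left as they were, which is consistent with what is described above.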

Re: smart quotes

I looked at the htmlentities page and saw this

wwb at 3dwargamer dot net
31-Mar-2004 05:49
htmlentities is a very handy function, but it fails to fix one thing which I deal with a lot: Word 'smart' quotes and em dashes.

The below function replaces the funky double quotes with ", funky single quotes with standard single quotes and fixes emdashes.

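   // Replace Windows-1252 "smart" punctuation bytes with safe equivalents.
   function CleanupSmartQuotes($text)
   {
       $badwordchars = array(
           chr(145), // left single quote
           chr(146), // right single quote
           chr(147), // left double quote
           chr(148), // right double quote
           chr(151)  // em dash
       );
       $fixedwordchars = array(
           "'",
           "'",
           '"',
           '"',
           '&#8212;' // numeric reference, so the dash survives any page charset
       );
       return str_replace($badwordchars, $fixedwordchars, $text);
   }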

I also found this: http://shiflett.org/archive/165

9

Re: smart quotes

I think I prefer the shiflett version, but it is nice to know there is an easy way around this. My question is: do you run that against the output, or do you run it before you save the message in the first place?

Re: smart quotes

Just as something in parser.php I guess... unless MySQL has problems with it.

11

Re: smart quotes

elbekko wrote:

Just as something in parser.php I guess... unless MySQL has problems with it.

What I had in mind is that parser.php is far more critical to performance than the routines run when a message is saved, so the less it has to do the better. It just occurred to me that even though it's best to save messages in as pure a form as possible, if something is obviously crud, wouldn't it be more efficient not to save it in the first place?

EDIT: Ignore me, that's only a good idea on a new forum; it doesn't help one that's already up and running, does it.

12 (edited by colak 2006-06-08 14:35)

Re: smart quotes

smartys
does your code automatically correct already posted threads?

>edit: corrected typos

13

Re: smart quotes

All posts, including existing ones, go through parser.php before they are sent to the browser. If the function Smartys posted is added to parser.php, then it will deal with existing posts. I would just paste the function into parser.php somewhere near the top, then look for the line which says $text = htmlspecialchars($text) and put $text = CleanupSmartQuotes($text); immediately following it. You can't do any harm by doing that, since you are only changing what is sent to the browser, i.e. you are not altering the stored messages in any way.

P.S. I think I prefer the function given here http://shiflett.org/archive/165 because it just uses standard quotes.
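
The integration described in this post could look roughly like the sketch below. render_message_sketch() is a hypothetical stand-in for the relevant part of parser.php, not PunBB's actual code, and the numeric reference used for the em dash is an assumption. Running the cleanup after htmlspecialchars() matters: done the other way round, the & in an inserted reference would itself be escaped.

```php
<?php
// Replace Windows-1252 "smart" punctuation bytes with plain equivalents.
function CleanupSmartQuotes($text)
{
    $badwordchars = array(chr(145), chr(146), chr(147), chr(148), chr(151));
    $fixedwordchars = array("'", "'", '"', '"', '&#8212;');
    return str_replace($badwordchars, $fixedwordchars, $text);
}

// Hypothetical stand-in for the escaping step in parser.php.
function render_message_sketch($text)
{
    // Escape HTML first (charset is an assumption about the forum setup)...
    $text = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');
    // ...then clean up the smart punctuation, as suggested above.
    $text = CleanupSmartQuotes($text);
    return $text;
}

// The stored message is untouched; only the rendered output changes.
$stored = chr(147) . 'a < b' . chr(148) . ' ' . chr(151) . ' done';
$rendered = render_message_sketch($stored);
echo $rendered, "\n";
```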

14

Re: smart quotes

I have copy/pasted the above script and it seems to have reduced the errors. I'm determined to have this forum validating now :)

Another question, which has to do with other non-UTF-8 characters,
e.g. the ellipsis (…, numeric reference 8230).
Can somebody please explain to me how to parse this one through?
I'm not sure how the char array works (i.e. chr(145)). Is there a number attached to each character?
I'm trying to understand the logic of it so that I can do the rest myself and stop pestering you people :)

15

Re: smart quotes

The ellipsis is chr(133), so you need to add chr(133) to the first array and the numeric reference 8230 (written &#8230;) to the second array. It shouldn't really cause a problem though, because PunBB actually uses it for page numbering and it's perfectly valid, though PunBB uses the entity reference, not the character itself.
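
As a sketch of that change, here is the same idea with the ellipsis added. It is written as a single byte => replacement map passed to strtr() instead of two parallel arrays, which is a swapped-in style choice rather than the function from the earlier post. The names $cp1252Map and cleanup_cp1252_sketch() are hypothetical.

```php
<?php
// Windows-1252 byte => replacement. Keeping each pair on one line makes it
// harder for the two lists to drift out of step when entries are added.
$cp1252Map = array(
    chr(133) => '&#8230;', // ellipsis
    chr(145) => "'",       // left single quote
    chr(146) => "'",       // right single quote
    chr(147) => '"',       // left double quote
    chr(148) => '"',       // right double quote
    chr(151) => '&#8212;', // em dash
);

// Hypothetical helper: apply the map in a single pass.
function cleanup_cp1252_sketch($text, array $map)
{
    return strtr($text, $map);
}

$cleaned = cleanup_cp1252_sketch('Wait' . chr(133) . ' ' . chr(147) . 'ok' . chr(148), $cp1252Map);
echo $cleaned, "\n";
```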

Yes there is a number attached to each letter/character.

16

Re: smart quotes

Paul wrote:

Yes there is a number attached to each letter/character.

Hi paul, Any idea where I can find a list of those numbers attached to the characters?

ps. did you see my invite in the Show off forum?

Re: smart quotes

I encountered the same problem today, fixed it, and posted it in another thread, but I was unaware of this thread. :)

Re: smart quotes

colak wrote:

Hi paul, Any idea where I can find a list of those numbers attached to the characters?

http://www.lookuptables.com/
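
For what it's worth, PHP can show the numbers directly: ord() gives the number of a one-byte character and chr() goes the other way. The small sketch below uses sample values chosen for illustration. Note that the chr() numbers in this thread (133, 145 to 148, 151) are Windows-1252 byte values, while 8230 is the Unicode code point used in the numeric reference, so the two kinds of number come from different tables.

```php
<?php
// ord() returns the byte value of the first character of a string.
$letterNumber = ord('A');

// chr() builds a one-byte string from a number between 0 and 255.
$leftDoubleQuote = chr(147);

// Round trip: the byte produced by chr(147) has the value 147 (hex 93).
$roundTrip = ord($leftDoubleQuote);
$hexValue = bin2hex($leftDoubleQuote);

echo $letterNumber, "\n";
echo $roundTrip, "\n";
echo $hexValue, "\n";
```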

Re: smart quotes

This problem will fix itself when we switch to UTF-8, am I right?

"Programming is like sex: one mistake and you have to support it for the rest of your life."

20 (edited by Jérémie 2006-06-16 14:48)

Re: smart quotes

Well, that depends on how you handle UTF-8. Right now, a 1.2 PunBB set to UTF-8 has some problems. They don't show up in IE6 or Firefox, but the W3C HTML validator, for example, doesn't like them. Maybe you should apply Textile to the posts' text? :P
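
One way to read "depends on how you handle UTF-8": if the page is served as UTF-8 but some text arrives as Windows-1252 bytes, those bytes are not valid UTF-8 and need converting. A minimal sketch follows, assuming the mbstring extension is available and assuming that anything which is not valid UTF-8 is Windows-1252 (a guess, since an encoding cannot be detected reliably). to_utf8_sketch() is a hypothetical helper, not part of PunBB.

```php
<?php
// Hypothetical helper: return $text as UTF-8, converting from Windows-1252
// when the input is not already valid UTF-8. Requires the mbstring extension.
function to_utf8_sketch($text)
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (plain ASCII included)
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Smart-quoted text as Windows-1252 bytes, as pasted from a word processor.
$pasted = chr(147) . 'hello' . chr(148);
$utf8 = to_utf8_sketch($pasted);

echo bin2hex($utf8), "\n";
```

Text that is already UTF-8 passes through unchanged, so the helper can be applied more than once without harm.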

21

Re: smart quotes

I'm sure I've seen some one-line functions using regular expressions on the PHP site which just take all the illegal characters and convert them to numeric references, which is what's really required.
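
A rough sketch of that kind of function, not one of the versions from the PHP site. It assumes the text is already valid UTF-8 and that the mbstring extension is available (mb_ord() needs PHP 7.2 or later). numeric_refs_sketch() is a hypothetical name.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8 string
// with its decimal numeric character reference.
function numeric_refs_sketch($utf8Text)
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u', // the u modifier makes the pattern match whole code points
        function ($match) {
            return '&#' . mb_ord($match[0], 'UTF-8') . ';';
        },
        $utf8Text
    );
}

// Smart quotes and an ellipsis, written as UTF-8 byte sequences.
$sample = "\xE2\x80\x9Chi\xE2\x80\x9D \xE2\x80\xA6";
$converted = numeric_refs_sketch($sample);

echo $converted, "\n";
```

With the u modifier, preg_replace_callback() returns null when the input is not valid UTF-8, so the result should be checked before it is used.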