1 (edited by Keulig 2008-02-02 15:07)

Topic: Url rewriting ° & ¤ issue

Hi,

I wanted to notify you of a big bug I just discovered: you can't use the characters "°" and "¤" in a word of more than 3 letters in a topic title, or you get a 403 error when trying to view the topic itself.

I haven't tested a lot of chars but imho you should restrict your rewriting function to lowercase chars and numbers (for classy urls and better seo). This might be a problem for cyrillic and asian chars, though.

Re: Url rewriting ° & ¤ issue

Any news about fixing this ?

Re: Url rewriting ° & ¤ issue

Would just using rawurlencode work?

Re: Url rewriting ° & ¤ issue

It actually does work but look at this fucked up url : "topic10-Rock-n%25A462-Fevrier-2006.html"

I think that sef_friendly() needs a little more work to be perfect :) (I mean, do we need °¤(){}[] etc. in our URLs?)
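A guess at where the "%25A4" in that URL comes from: if "¤" is percent-encoded once it becomes %A4 (its byte value in ISO-8859-1), and if the result is then encoded a second time, the "%" itself turns into %25. A minimal Python sketch of that effect, assuming an ISO-8859-1 title; this is not PunBB's PHP code:

```python
from urllib.parse import quote

# Assumption: the title is ISO-8859-1, where "¤" is the single byte 0xA4.
once = quote("¤", encoding="latin-1")
# Encoding the already-encoded string again escapes the "%" itself.
twice = quote(once)

print(once)   # %A4
print(twice)  # %25A4
```

If that is what happens, the fix is to make sure the title is encoded exactly once on its way into the link.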

Re: Url rewriting ° & ¤ issue

I agree that sef_friendly() needs a bit more work. Essentially, we only want A-Za-z0-9 in the URL.

"Programming is like sex: one mistake and you have to support it for the rest of your life."

Re: Url rewriting ° & ¤ issue

Actually a-z0-9 should be better, what do you think? Uppercase letters in URLs help neither search engine ranking nor user experience, imho.

Re: Url rewriting ° & ¤ issue

Rickard wrote:

I agree that sef_friendly() needs a bit more work. Essentially, we only want A-Za-z0-9 in the URL.

We need valid language characters in the URL, rather than symbols. And that is not limited to A-Z. A-Z is only good for English.

Re: Url rewriting ° & ¤ issue

Correct about lowercase.

Solovey wrote:

We need valid language characters in the URL, rather than symbols. And that is not limited to A-Z. A-Z is only good for English.

Well, there's only so much we can do for Russian, Chinese etc. We can replace stuff like é with e and ä with a, and after that, enforce a-z0-9-, but that's about it. As far as I know, you can't put non-ASCII characters into a URL. If you do, they'll get URL encoded by the browser.
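To make the "replace accents, then enforce a-z0-9-" idea concrete, here is a rough Python sketch. The function name slugify is hypothetical and this is not the actual sef_friendly() code; it only illustrates the approach:

```python
import re
import unicodedata

def slugify(title: str) -> str:
    """Hypothetical helper: reduce a topic title to a-z, 0-9 and '-'."""
    # Decompose accented letters (é becomes e + a combining accent),
    # then drop every character that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase, turn each run of other characters into a single dash,
    # and trim dashes from both ends.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")

print(slugify("Rock n¤62 Février 2006"))  # rock-n62-fevrier-2006
```

Note that a title written entirely in Cyrillic comes out as an empty string with this approach, which is exactly the limitation described above.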

Re: Url rewriting ° & ¤ issue

That's right, Rickard, we can't hope for more than this. In a few years we'll be able to include UTF-8 chars in URLs, but at the moment this is useless and makes URLs unreadable for the user.

Re: Url rewriting ° & ¤ issue

I've committed an updated version of sef_friendly. It doesn't handle non iso-8859-1 characters because adding support for that leads to A LOT more code.

Re: Url rewriting ° & ¤ issue

THAT is good, my friend :)

12 (edited by Solovey 2008-02-25 06:23)

Re: Url rewriting ° & ¤ issue

Rickard wrote:

Correct about lowercase.

Solovey wrote:

We need valid language characters in the URL, rather than symbols. And that is not limited to A-Z. A-Z is only good for English.

Well, there's only so much we can do for Russian, Chinese etc. We can replace stuff like é with e and ä with a, and after that, enforce a-z0-9-, but that's about it. As far as I know, you can't put non-ASCII characters into a URL. If you do, they'll get URL encoded by the browser.

All that needs to be done is conversion of the entire URL (except the domain name) to % encoding.

The better browsers will display it as real Unicode rather than % encoding, but urls should be issued in % encoding (http://www.php.net/manual/en/function.urlencode.php). This is the right way to do this rather than stripping accents etc. And it works for all languages.

Only older browsers will actually display this as %bla.

Google correctly indexes both real unicode and % encoded URLs in the index, so you get SEO power with URL rewriting even of non-ascii URLs. Google then displays them as real unicode in search results.

Just take a look at what Wikipedia is doing. They do it correctly with 100% non-ascii URLs for all wikis.

Example in English:

1) Regular URL:  http://en.wikipedia.org/wiki/É
2) % encoded URL: http://en.wikipedia.org/wiki/%C3%89   <--- even this will display as 1 in Google.
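For anyone who wants to check the two forms against each other, a small Python sketch (Python's urllib here stands in for PHP's urlencode, purely as an illustration):

```python
from urllib.parse import quote, unquote

encoded = quote("É")         # UTF-8 is the default encoding
decoded = unquote("%C3%89")  # and the default when decoding

print(encoded)  # %C3%89
print(decoded)  # É
```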

Also, the official word from Google is to use _ instead of %20 (space) in URLs, so that should be done too.

The other way is to support an optional numbers-only URL rewriting scheme, in the style http://test.com/10001/ for forum post 10001.

Here is a scheme that works for all languages:

1) Lowercase.

2) Strip stop words (can be obtained from any localized search list for any language).

3) Shorten to max valid URL length (but we should be short enough with a message title).

4) Replace space with _ and then get rid of double _.

5) If the result is empty or only spaces, get the first sentence of the message and go back to step 1.

6) Convert to % encoding.
http://www.php.net/manual/en/function.urlencode.php
string urlencode ( string $str )
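The steps above could be sketched roughly like this in Python. The stop word list, the length limit and the function name are all made-up placeholders, and quote() stands in for PHP's urlencode, so treat it as an illustration rather than a finished implementation:

```python
import re
from urllib.parse import quote

STOP_WORDS = {"a", "an", "the", "of"}  # hypothetical; real lists are per language
MAX_LENGTH = 80                        # hypothetical limit, in characters

def scheme_slug(title: str, first_sentence: str = "") -> str:
    """Hypothetical helper: try the title, then fall back to the first sentence."""
    for text in (title, first_sentence):
        # Lowercase, split on whitespace and drop stop words.
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        # Join with "_" and shorten.
        slug = "_".join(words)[:MAX_LENGTH]
        # Collapse repeated "_" and trim it from both ends.
        slug = re.sub(r"_+", "_", slug).strip("_")
        if slug:
            # Percent-encode whatever is left (UTF-8 by default).
            return quote(slug)
    return ""

print(scheme_slug("The Été of Code"))  # %C3%A9t%C3%A9_code
```

Every language needs its own stop word list, and lowercasing relies on Unicode-aware string handling.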

Re: Url rewriting ° & ¤ issue

I haven't tried Opera, but so far, the only browser I've come across that supports the display of Unicode characters in the address bar is Firefox 3. In Firefox 2, IE6 and IE7, it looks horrible.

Your scheme for all languages isn't as easy as it looks. For example, lowercasing UTF-8 text is a major pain if mbstring isn't available.
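A small Python analogy of the lowercasing problem (not PHP, and only an approximation of what a byte-oriented lowercase routine does): working on the raw UTF-8 bytes only touches ASCII letters, so accented capitals survive.

```python
title = "ÉCOLE"

# Byte-wise lowercasing only changes ASCII letters, so the two UTF-8
# bytes that make up "É" are left alone.
bytewise = title.encode("utf-8").lower().decode("utf-8")

# Unicode-aware lowercasing handles the accented letter as well.
unicode_aware = title.lower()

print(bytewise)       # École
print(unicode_aware)  # école
```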

We support a straight numeric URL rewriting scheme as well. Two of them actually.

There's also a hook in sef_friendly() so an extension can easily replace the functionality we have today.

Re: Url rewriting ° & ¤ issue

I don't like non-ascii chars in urls because when I copy-paste a link, my friends can't read %bla chars.

Re: Url rewriting ° & ¤ issue

Solovey wrote:

The better browsers will display it as real Unicode rather than % encoding, but urls should be issued in % encoding (http://www.php.net/manual/en/function.urlencode.php). This is the right way to do this rather than stripping accents etc.

No it's not.

Or, to be more precise, it really depends on the context: the language, the application, the website, etc.

For example, the Wikipedia way of systematically using % encoding is the right way for Wikipedia. It's a very simple point of entry: you want the article about "Jérémie", you add this to the last part of the URL and it works.

On the other hand, it renders the URL unreadable by humans and by some older machines. For forum software such as PunBB, for French, I can clearly state that romanization is the way to go.

And it works for all languages.

Nope. There's a huge, colossal phishing issue with Unicode URLs. Some characters are almost impossible to distinguish (and that's for an expert; there are a large number of characters the general public won't identify as non-Roman), and right now it's a real security threat. It has to be solved, and it will be, because there's no justification for domain names and URLs limited to basic English characters; but we're not there yet.

And there's also an issue with human interaction. How would I go to punbb.org/??????????? with my azerty (or qwerty, or whatever) keyboard? Yes, the general public Googles it, and almost never types a URL by hand. But I would say no to ignoring an issue and lowering usability to the lowest common denominator; as a web user I _demand_ to keep my own usability.

Only older browsers will actually display this as %bla.

I doubt Firefox 2 qualifies as an old browser, just to point out one example.

And remember that domains & URLs are used by user agents (UAs), not just browsers. Googlebot is a UA, for example, and wget, and zillions of similar tools that have very little in common with a web browser as you may know it.

Google correctly indexes both real unicode and % encoded URLs in the index, so you get SEO power with URL rewriting even of non-ascii URLs. Google then displays them as real unicode in search results.

That's good. But I think it's a minority right now.

As a user, I would advocate such encoding in the future, but I think it's counter productive right now. If I want to buy a TV spot for jérémie.fr, I *know* it will create some confusion and dispersion, and I *know* that I will have to buy jeremie.fr or someone will steal my prospects.

Also, the official word from Google is to use _ instead of %20 (space) in URLs, so that should be done too.

Google doesn't design web standards. The W3C does, and the RFC. I don't know what the standard is, and I know that if it exists it needs to be perfected to handle phishing, human error, and human usability.

The other way is to support an optional numbers only URL rewriting scheme

The only right way of doing this is allowing everyone to choose what URL scheme they want. And PunBB 1.3 does this already with its extension system.

Here is a scheme that works for all languages:

By definition, this doesn't exist. Let's take it apart:

1) lowercase

Nope. In some languages, the same word with or without uppercase doesn't mean the same thing. If you want to advocate the "right" thing, and the "future right now", then respecting the language is the only way to go, and that means going all the way.

Lowercase is the best option for the current romanized scheme, yup. But that's it.

2) Strip stop words (can be obtained from any localized search list for any language).

Why?

Again, stop words may be useless, or in the way, for some applications, but may not be in other cases. "You" is a stop word, YouTube agrees.

Again, there is no "one rule to rule them all". Each one, if competent, chooses what's best, and can hook into PunBB to make it work that way.

4) Replace space with _ and then get rid of double _

From a pure, very basic seo point of view that's nonsense. Google sees - as a word separator, not _.

Re: Url rewriting ° & ¤ issue

Jérémie, Wikipedia uses URLs as keys to its articles, whereas PunBB uses numbers as keys to topics. That's why non-ASCII chars can be stripped from our URLs, and it also makes URLs more readable. (topic-1-jérémie is the same as topic-1-jeremie, but "jeremie" looks better in the address bar of our browser.)

17 (edited by SuperMAG 2008-02-25 10:15)

Re: Url rewriting ° & ¤ issue

By the way, look at the Atom and RSS feeds: even when I change the rewrite scheme to the fancy type, the Atom and RSS links keep the post type. Check for yourself, or here:

http://supermag.wsnw.net/1.3/topic2-sim … -work.html

atom:

  <?xml version="1.0" encoding="utf-8" ?> 
  <feed xmlns="http://www.w3.org/2005/Atom">
  <title type="html">SuperMAG - simbols topic ~!@#$%^&()&_+_+|??><{":L if work</title> 
  <link rel="self" href="/1.3/rewrite.php" /> 
  <updated>2008-02-25T09:57:51Z</updated> 
  <generator>PunBB</generator> 
  <id>http://supermag.wsnw.net/1.3/topic2$2.html</id> 
  <entry>
  <title type="html">simbols topic ~!@#$%^&()&_+_+|??><{":L if work</title> 
  <link rel="alternate" href="http://supermag.wsnw.net/1.3/post2.html#p2" /> 
  <content type="html">simbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if 
worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if work</content> 
  <author>
  <name>SuperMAG</name> 
  </author>
  <updated>2008-02-25T09:57:51Z</updated> 
  <id>http://supermag.wsnw.net/1.3/post2.html#p2</id> 
  </entry>
  </feed>

rss:

<?xml version="1.0" encoding="utf-8" ?> 
  <rss version="2.0">
  <channel>
  <title>SuperMAG - simbols topic ~!@#$%^&()&_+_+|??><{":L if work</title> 
  <link>http://supermag.wsnw.net/1.3/topic2$2.html</link> 
  <description>The most recent posts in simbols topic ~!@#$%^&()&_+_+|??><{":L if work.</description> 
  <lastBuildDate>Mon, 25 Feb 2008 09:57:51 +0000</lastBuildDate> 
  <generator>PunBB</generator> 
  <item>
  <title>simbols topic ~!@#$%^&()&_+_+|??><{":L if work</title> 
  <link>http://supermag.wsnw.net/1.3/post2.html#p2</link> 
  <description>simbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if 
worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if worksimbols topic ~!@#$%^&()&_+_+|??><{":L if work</description> 
  <author>dummy@example.com (SuperMAG)</author> 
  <pubDate>Mon, 25 Feb 2008 09:57:51 +0000</pubDate> 
  <guid>http://supermag.wsnw.net/1.3/post2.html#p2</guid> 
  </item>
  </channel>
  </rss>

Look at the urls inside it :

http://supermag.wsnw.net/1.3/topic2-sim … -work.html
http://supermag.wsnw.net/1.3/topic2$2.html

This creates duplicate URLs.

--------------------------------------------------

About the duplicate URLs:

The old problem still exists:

http://supermag.wsnw.net/1.3/topic2-sim … -work.html

http://supermag.wsnw.net/1.3/topic2.html

http://supermag.wsnw.net/1.3/topic/2/

http://supermag.wsnw.net/1.3/viewtopic.php?id=2

MyFootballCafe.com  is Now Online!

Re: Url rewriting ° & ¤ issue

Keulig wrote:

Jérémie, Wikipedia uses URLs as keys to its articles, whereas PunBB uses numbers as keys to topics. That's why non-ASCII chars can be stripped from our URLs, and it also makes URLs more readable. (topic-1-jérémie is the same as topic-1-jeremie, but "jeremie" looks better in the address bar of our browser.)

Yes, exactly. That's what I'm saying in this and other threads :)

Quite frankly, that's because PunBB aims to be fast and simple, and it was perceived that keying on a text key was too complex (more queries, duplicates, etc.). But again, extensions apply here. If the basic, lowest-common-denominator vanilla URL scheme doesn't work for me, I can use my own, as complex as I want (well, if we trust the coders around here to request the appropriate hooks :P ).

19 (edited by Solovey 2008-02-25 14:51)

Re: Url rewriting ° & ¤ issue

Jérémie wrote:

For example, the Wikipedia way of systematically using % encoding is the right way. It's a very simple point of entry: you want the article about "Jérémie", you add this to the last part of the URL and it works.

On the other hand, it renders the URL unreadable by humans and by some older machines. For forum software such as PunBB, for French, I can clearly state that romanization is the way to go.

You may want to do romanization for French, but that is not applicable for every language.

Nope. There's a huge, colossal, phishing issue about unicode URL.

This pertains to domain names themselves, and has absolutely nothing to do with the URL or rewriting, and as such is beyond the scope of the URL rewriting feature.


And there's also an issue with human interaction. How would I go to punbb.org/??????????? with my azerty (or qwerty, or whatever) keyboard?

How would you type in the address of a website in Hebrew? Well, that really is for YOU to figure out. The solution is simple: learn Hebrew! Or use copy and paste.

The point of URL rewriting is not to let people type-in URLs. The point is to make them friendly for search engines to score higher on keyword to URL matching in search results.

I doubt Firefox 2 qualifies as an old browser, just to point out one example.

Firefox qualifies as a browser that is marginal and has low internationalization support.

If I want to buy a TV spot for jérémie.fr, I *know* it will create some confusion and dispersion, and I *know* that I will have to buy jeremie.fr or someone will steal my prospects.

Once again you are confusing IDN (internationalized domain names) with URLs.

We are talking about test.com/THIS_PART_OF_THE_URL not the domain name.

Google doesn't design web standards. The W3C does, and the RFC. I don't know what the standard is

That's absolutely right. Google is merely following W3C standards and RFC.

Here is the RFC: http://www.ietf.org/rfc/rfc2396.txt

Uniform Resource Identifiers (URI): Generic Syntax
August 1998

Notice the date? 1998? 10 years ago? And you are suggesting it is too early for PunBB to support it?

It is regrettable that closed-minded and possibly xenophobic Western developers like you are so blockheaded about standards that support languages of the world.

It is unfortunate that you still resist internationalization, even when you promise us UTF-8 support in the new PunBB.

This is not a small issue. Do it right the first time, and you won't have complaints in the future.

From a pure, very basic seo point of view that's nonsense. Google sees - as a word separator, not _.

I suggest you do some reading on Google's latest thoughts on this.

July 23, 2007 10:24 PM PDT
Underscores are now word separators, proclaims Google

http://www.news.com/8301-10784_3-9748779-7.html

20 (edited by Jérémie 2008-02-26 03:11)

Re: Url rewriting ° & ¤ issue

I was going to answer point by point, especially with that amount of wrapping around my meaning, I even read again the appropriate RFC. Then, I saw that I'm closed minded and xenophobic, so I won't bother.

21 (edited by SuperMAG 2008-02-27 20:17)

Re: Url rewriting ° & ¤ issue

____ I WAS WRONG, SO I DON'T WANT ANYONE TO BE CONFUSED BY MY POST ____

Re: Url rewriting ° & ¤ issue

Uh, as far as I know we're not using spaces.

Re: Url rewriting ° & ¤ issue

Smartys wrote:

Uh, as far as I know we're not using spaces.

I know we are not using spaces, but we are using - as a separator. Can you guys change it to _ as the separator?

Re: Url rewriting ° & ¤ issue

What for? Why would they do that?

Re: Url rewriting ° & ¤ issue

http://www.mattcutts.com/blog/dashes-vs-underscores/
A reason why not, even if the entry IS over two years old :P

Edit:
http://www.mattcutts.com/blog/guest-pos … w-session/ | (search for the word hyphen)
http://www.mattcutts.com/blog/whitehat- … -bloggers/

If you read Stephan Spencer's write-up, he says some people thought that underscores are the same as dashes to Google now, and I didn't quite say that in the talk. I said that we had someone looking at that now. So I wouldn't consider it a completely done deal at this point. But note that I also said if you'd already made your site with underscores, it probably wasn't worth trying to migrate all your urls over to dashes. If you're starting fresh, I'd still pick dashes.