Anne van Kesteren

Character references to UTF-8 converter in PHP

9 May 2005

Besides going to school, the supermarket and studying for a bit, my main project of the day was coding something in PHP that turns character references into UTF-8 encoded characters. Considering that XML parsers must treat character references correctly anyway, you might call it a useless project. Nevertheless, I tackled regular expressions for a bit, learned a lot from Henri Sivonen and used a function of his that he had rewritten from the Mozilla C++ implementation. I also fixed an ugly bug regarding the twenty-seven differences between iso-8859-1 and windows-1252. Yes, although I’m using UTF-8 for this weblog, it was still possible to enter some of these invalid characters as character references. Now they are converted to their safe equivalents and after that converted to UTF-8 characters. I think this might be a bug in the XML PHP parser regarding those characters. It should positively reject them right away. But then again, PHP is a messy language with ugly bugs. A correction is in order here: those characters are supposed to be control characters of no use, so PHP is treating them correctly as being valid.

Why am I still using PHP? Shoot me; I’m not a programmer and PHP was the first thing I learned after writing valid HTML and CSS. PHP is a language that lets me set up things quickly without asking too many questions, unless of course things get tricky, like today. A future project should probably be based on Python or the “hey, I set it up in less than 2 hours” Ruby on Rails, in combination with a PostgreSQL database to please Faruk and to not use MySQL like everyone else. Because, you know, being the same isn’t always fun.

(That I’m probably contradicting myself in several ways in that last sentence is fortunately not the point; nor the topic of today’s conversation.)

To get to the point: I implemented Jacques Distler’s HTML+MathML entities replacement trick in PHP. On top of that, as that was the easy part, I use the function string2utf8, which basically makes two preg_replace_callback calls to get the job done. I made up the first and final argument and I really appreciate KtK’s input on the second. Especially using $s was something that hadn’t crossed my mind and isn’t really documented in a way that is obvious to me. Here is that function:

function string2utf8($string){
 $string = preg_replace_callback('/&#([0-9]+);/',create_function('$s','return dcr2utf8($s[1]);'),$string);
 return preg_replace_callback('/&#x([a-f0-9]+);/i',create_function('$s','return dcr2utf8(hexdec($s[1]));'),$string);
}
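A note for readers on current PHP: create_function was deprecated in PHP 7.2 and removed in PHP 8.0, so the function above no longer runs there as written. The following is a minimal sketch of the same two passes using anonymous functions (PHP 5.3 and later). The names string2utf8_closures and cp2utf8_standin are made up for this sketch; the latter is only a compact stand-in for dcr2utf8, included so that the snippet runs on its own.

```php
<?php
// Compact stand-in for dcr2utf8() so this sketch is self-contained.
// Hypothetical name; it uses the same bit arithmetic as dcr2utf8().
function cp2utf8_standin($cp) {
    if ($cp < 0 || $cp > 0x10ffff || ($cp >= 0xd800 && $cp <= 0xdfff)) {
        return false; // negative, out of range, or a surrogate
    }
    if ($cp == 0xfeff) {
        return ''; // drop the BOM, as dcr2utf8() does
    }
    if ($cp <= 0x7f) {
        return chr($cp);
    }
    if ($cp <= 0x7ff) {
        return chr(0xc0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3f));
    }
    if ($cp <= 0xffff) {
        return chr(0xe0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3f))
             . chr(0x80 | ($cp & 0x3f));
    }
    return chr(0xf0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3f))
         . chr(0x80 | (($cp >> 6) & 0x3f))
         . chr(0x80 | ($cp & 0x3f));
}

// Closure-based variant of string2utf8(); hypothetical name.
function string2utf8_closures($string) {
    // First pass: decimal references such as &#233;
    $string = preg_replace_callback('/&#([0-9]+);/', function ($s) {
        return cp2utf8_standin($s[1]);
    }, $string);
    // Second pass: hexadecimal references such as &#xE9;
    return preg_replace_callback('/&#x([a-f0-9]+);/i', function ($s) {
        return cp2utf8_standin(hexdec($s[1]));
    }, $string);
}

echo bin2hex(string2utf8_closures('&#233;&#xE9;')), "\n"; // prints "c3a9c3a9"
```

As in the original, $s is the array of matches that preg_replace_callback hands to the callback: $s[0] is the whole reference and $s[1] is the captured number.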

As you have undoubtedly noted, that function calls another function, which accepts the decimal value of a character reference. (That is the exact reason for using hexdec in the second preg_replace_callback call.) The function that the callbacks call was written by Henri Sivonen, as mentioned earlier, who based it on Mozilla’s code base. (See: UTF-8 to Code Point Array Converter in PHP.) I slightly modified it and renamed it:

function dcr2utf8($src){
 $dest = '';
 if($src < 0){
  return false;
 }elseif($src <= 0x007f){
  $dest .= chr($src);
 }elseif($src <= 0x07ff){
  $dest .= chr(0xc0 | ($src >> 6));
  $dest .= chr(0x80 | ($src & 0x003f));
 }elseif($src == 0xFEFF){
  // nop -- zap the BOM
 }elseif ($src >= 0xD800 && $src <= 0xDFFF){
  // found a surrogate
  return false;
 }elseif($src <= 0xffff){
  $dest .= chr(0xe0 | ($src >> 12));
  $dest .= chr(0x80 | (($src >> 6) & 0x003f));
  $dest .= chr(0x80 | ($src & 0x003f));
 }elseif($src <= 0x10ffff){
  $dest .= chr(0xf0 | ($src >> 18));
  $dest .= chr(0x80 | (($src >> 12) & 0x3f));
  $dest .= chr(0x80 | (($src >> 6) & 0x3f));
  $dest .= chr(0x80 | ($src & 0x3f));
 }else{ 
  // out of range
  return false;
 }
 return $dest;
}
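To make the shifting and masking in dcr2utf8 a bit more concrete, here is a small worked example of the three-byte branch for a single code point, U+20AC (the euro sign). It only repeats the arithmetic from the function above; the variable names $b1, $b2, $b3 and $euro are made up for the illustration.

```php
<?php
// Worked example of the three-byte branch for U+20AC (EURO SIGN).
$src = 0x20AC; // binary 0010 0000 1010 1100

$b1 = 0xe0 | ($src >> 12);         // lead byte 1110xxxx: top 4 bits
$b2 = 0x80 | (($src >> 6) & 0x3f); // continuation 10xxxxxx: middle 6 bits
$b3 = 0x80 | ($src & 0x3f);        // continuation 10xxxxxx: low 6 bits

printf("%02X %02X %02X\n", $b1, $b2, $b3); // prints "E2 82 AC"

// The three bytes concatenated form the UTF-8 encoding of the euro sign.
$euro = chr($b1) . chr($b2) . chr($b3);
```

The other branches work the same way; only the lead byte prefix (0xc0, 0xe0 or 0xf0) and the number of six-bit continuation bytes change.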

I’m not going to post the function safe_cr here, which by the way stands for ‘safe character references’ and converts the twenty-seven differences to their safe equivalents and also makes it possible for you, the holy end user, to enter HTML and MathML entities (over twenty-one-hundred) into the comment system. I convert them and then I give it to the XML parser. I published safe_cr along with the other functions here: Character references to UTF-8 functions plus HTML and MathML entities converter. Have fun.
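For readers who only want the gist of the first half of that job, here is a rough sketch of how the twenty-seven windows-1252 values could be remapped before the UTF-8 conversion. It is not safe_cr itself, and it leaves out the named entities entirely. The function names cp1252_c1_map and remap_cp1252_references are invented for this sketch, and rewriting the references to decimal form is an assumption about how one might do it.

```php
<?php
// The 27 values in 0x80-0x9F that windows-1252 assigns to printable
// characters, mapped to the Unicode code points of those characters.
// (0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in windows-1252.)
function cp1252_c1_map() {
    return array(
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    );
}

// Rewrites numeric references to those 27 values into decimal references
// to the intended Unicode characters; everything else is left untouched.
function remap_cp1252_references($string) {
    $map = cp1252_c1_map();
    $rewrite = function ($cp, $original) use ($map) {
        // hexdec() returns a float for very large values; skip those.
        return (is_int($cp) && isset($map[$cp]))
            ? '&#' . $map[$cp] . ';'
            : $original;
    };
    // Decimal references such as &#128;
    $string = preg_replace_callback('/&#([0-9]+);/', function ($s) use ($rewrite) {
        return $rewrite((int) $s[1], $s[0]);
    }, $string);
    // Hexadecimal references such as &#x92;
    return preg_replace_callback('/&#x([a-f0-9]+);/i', function ($s) use ($rewrite) {
        return $rewrite(hexdec($s[1]), $s[0]);
    }, $string);
}

echo remap_cp1252_references('&#128;5 &#x92;quoted&#x92; &#233;'), "\n";
// prints "&#8364;5 &#8217;quoted&#8217; &#233;"
```

The output of such a step can then be handed to string2utf8 like any other string containing numeric character references.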

For those who want to know, dcr2utf8 stands for ‘decimal character reference to UTF-8’. Something just occurred to me: by implementing this, my comment section is partly conforming (I guess as conforming as possible) to the UTF-8+names proposal from Tim Bray. There is also the start of an official RFC. Tim, should you ever read this, why didn’t you push this further?

Comments

/&#x([a-zA-Z0-9]+);/ seems like a weird regular expression for hex values.
Shouldn't it be /&#x([a-fA-F0-9]+);/?
Posted by Tommy Olsson at 1:37PM
Yeah, that might improve it just a little bit. Note though that invalid characters are rejected anyway later in the run, but you’re right. Paying attention to details didn’t cross my mind when writing this. Actually, it never crosses my mind when writing PHP.
Posted by Anne at 2:22PM
Bwuh, he uses my deprecated nickname :)
I thought about what Tommy said when I was lying in my bed, but that's kinda mustard-like.
My simple version, which doesn't take those 27 differences into account, can be found here btw.
Also, an example of using $s inside create_function() together with preg_replace_callback() is waiting for you on php.net. There's way too much (documented in an obvious way) on that site already, if you ask me :)
Posted by Krijn Hoetmer at 3:00PM
[...] being the same isn’t always fun.

;)
But, it's almost no difference between MySQL and PostgreSQL, isn't it? I can hardly remember the times I worked with it, flanked by a nice J2EE framework running on WebLogic...
Posted by Jens Meiert at 3:45PM
But, it's almost no difference between MySQL and PostgreSQL, isn't it?
Check out this comparison.
Posted by Krijn Hoetmer at 4:07PM
Thank you, Krijn.
Posted by Jens Meiert at 4:15PM
WARNING WARNING BULLSHIT ALERT!
Sorry, but this comparison that Krijn linked to is horrendously worthless. They have no idea how to do proper benchmark tests, as they have no idea how PostgreSQL really works (nor how it is optimized). As a result, their results indicate that PostgreSQL would be much slower. However, using proper transactions (auto-commits are not the same!) will speed up PostgreSQL way more than in their tests, so the whole article on that page is futile.
Furthermore, the only thing they really tested are some basic INSERTs and SELECTs. Yes, if you're gonna do nothing more than what MySQL was built on (a fast INSERT-and-SELECT SQL system), then you'll favor MySQL. When doing some actual RDBMS tests, one quickly encounters a problem: MySQL isn't a fully qualified RDBMS, as it doesn't natively support Transactions (only on InnoDB tables). Okay, so that aside, let's just test features. Oh wait, no dice. Half of all the useful features that you'd want to test don't exist in MySQL. Planned for 5.0 or 5.1. It'll be 2 more years before those releases will be anywhere near as stable as PostgreSQL, and only God knows when they'll be as fast as PostgreSQL.
To sum up: ignore that silly comparison article. There's a world of difference between MySQL and PostgreSQL, and for anything more complicated than a simple card-catalog system (i.e. basic INSERT, UPDATE and SELECT stuff), PostgreSQL is a much better choice. Don't need anything whatsoever beyond simple queries? Go with MySQL. Want to have some database intelligence? Security and reliability? Useful features (functions, triggers, transactions, views, subqueries, procedural languages, constraints, etc.)? Go with PostgreSQL.
Also, /&#x([a-f0-9]+);/i would make even more sense. :)
Posted by Faruk Ates at 5:31PM
I’m not going to post the function safe_cr here, which by the way stands for ‘safe character references’ and converts the twenty-seven differences to their safe equivalents and also makes it possible for you, the holy end user, to enter HTML and MathML entities (over twenty-one-hundred) into the comment system […]

Why not? It should be interesting to see… or has it already been posted elsewhere?
By the way, I think the <code> tag extended too far there (safe_cr here instead of safe_cr).
Posted by Aankhen at 5:54PM
The main reason is that it is over twenty-one-hundred lines and about 67kB large. (Mainly due to the entities.) I guess I could post it in a separate file though. Check back later today.
Posted by Anne at 6:14PM
The XML PHP parser operating correctly
Those 27 characters are not invalid Unicode characters. They are simply rarely used control characters. Given that you are unlikely to be wanting to display such control characters on your website, and given how frequently you will be encountering the win-1252 versions of this data, it makes sense to convert them to their true Unicode equivalents.
Posted by Sam Ruby at 6:44PM
Sam, thanks. I added an erratum to that statement.
In related news, I also published the functions here.
Posted by Anne at 7:05PM
When I type &Theta; into this comment form, and click on Preview, it is converted to utf-8, and displayed as Θ. In the <textarea>, in which I can re-edit my comment, it is also converted to Θ. If that was a typo on my part, and what I really meant was &theta; (θ), I need to delete and retype, rather than changing a "T" to a "t" (or maybe I really meant ϑ (ϑ) ).

Personally, I'd return the user what he typed, and convert to utf-8 on display (and, when the comment is finally POSTed, on storage to the database). One of the reasons for having named entities is that they are easier to enter/edit/remember than the corresponding Unicode codepoints.

Posted by Jacques Distler at 10:40PM
Good point. I wanted to make it that way but I forgot to actually do it. I will modify it once I have the time.
Posted by Anne at 10:50PM
@Faruk: I'm with you 100%. Too bad none of the hosting companies I'm working with right now allow Postgres. Stuck in the MySQL world again.
Posted by Daniel Morrison at 12:29AM
Haha, thank you, too, Faruk ;)
Posted by Jens Meiert at 1:02AM
In related news, I also published the functions here.

Thank you. :-)
Posted by Aankhen at 1:58AM
Your code should work in most cases. However, in SGML a character reference does not always have to end with a semicolon (;). In some cases it is allowed, but recommended against, to omit the semicolon:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

This is supported in Firefox and to an even greater extent in Internet Explorer, where I&ntildet&eumlrn&acircti&ocircn&agraveliz&aeligti&oslashn will be decoded as Iñtërnâtiônàlizætiøn. If you want to decode named character references for security reasons, for example for filtering 'javascript:' URIs, you need to be just as liberal as IE, because otherwise you run the risk of not detecting every single 'javascript:' URI.
Recently I've created some similar functions that are just as liberal as IE. The library file can be downloaded from my website: http://www.rakaz.nl/projects/entity.phps
Posted by Niels Leenheer at 4:52AM
Niels, thanks for your comment. Much appreciated. However, I’m not sure I need to be as liberal as Internet Explorer, as such character reference usage is simply rejected and not allowed here.
Posted by Anne at 6:59AM
Jacques Distler, consider it fixed. If you find anything that doesn’t work please let me know.
Posted by Anne at 7:08PM
One question regarding UTF-8 and printing the actual language on screen! How do you print Ø± Ø§Ù„Ø³Ù„Ø§Ù… Ø§Ù„Ø§Ø³ØªØ«Ù…Ø§Ø±ÙŠ Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠ (which is some UTF-8 jibberish) in its native form (i.e. Arabic)? Sorry if this is the wrong place to ask.
Posted by Abu Aaminah at 11:29AM