Wednesday, August 17, 2011

UTF-8 and multibyte functions in a PHP web application

Output text/string in UTF-8 encoding

There are several ways to tell the browser which encoding a page uses. One way is to specify a meta tag in the HTML:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

This approach is simple and easy, but it has some disadvantages. One issue is that a browser may have to restart parsing the document when it reaches the meta tag, because it may already have parsed part of the document with the wrong encoding. This can delay page rendering. For this reason, it is better to send the UTF-8 header from PHP.

Output UTF-8 header with PHP:

We use PHP's header() function, which must be called before any output is sent: header("Content-Type:text/html;charset=utf-8");

This method reliably tells the browser that the page is encoded in UTF-8. However, it is not as convenient as the http-equiv meta tag, which can be placed once in a layout template or a header file that is then included in other pages.
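One way to reduce that inconvenience is to send the header from a single shared file that every page includes before producing output. The following is only a minimal sketch: the file name bootstrap.php and the function name utf8_content_type_header() are hypothetical, not something the original setup uses.

```php
<?php
// bootstrap.php: a hypothetical shared file, included before any output.

// Build the header line in one place so every page sends the same value.
function utf8_content_type_header($mimeType = 'text/html')
{
    return 'Content-Type: ' . $mimeType . '; charset=utf-8';
}

// header() only works before output has started; headers_sent() guards a late call.
if (!headers_sent()) {
    header(utf8_content_type_header());
}
```

Each page would then start with something like require 'bootstrap.php'; before it echoes anything.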

There is also a third solution. Assuming we are using the Apache web server (I believe most PHP apps run on Apache), we can specify the charset in an .htaccess file.

Specify charset in .htaccess file:

AddDefaultCharset utf-8

This way, the web server configuration sends the UTF-8 charset in the regular HTTP Content-Type header.

PHP multibyte string functions

We know that PHP provides a set of multibyte string functions, which are prefixed with 'mb_'. There are some interesting things about them. Let's take mb_substr for example. I'm using PHP 5.3.3 with the default php.ini configuration. To do the test, nothing could be better than using my native language: Chinese.

<?php
header("Content-Type:text/html;charset=utf-8");
$str = '我爱编程';
echo substr($str, 2, 2), '<br>';
echo mb_substr($str, 2, 2), '<br>';
?>

OK, let me explain. The first line, header("Content-Type:text/html;charset=utf-8");, simply sends a UTF-8 header so the browser decodes the output as UTF-8. The second line, $str = '我爱编程', is a string of 4 Chinese characters. In Chinese, 1 character is also 1 word, so the string is 4 words as well. In English it means 'I love programming', which is 3 words, or 18 characters including spaces.

Now, I want to return part of the Chinese string, starting from position 2 (counting from 0) with a length of 2. The correct result should be 编程.
The third line: echo substr($str, 2, 2), '<br>'. Here we use the normal substr. As we would expect, the output is wrong. On my screen, the output is ��
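Why the byte-oriented call produces garbage can be seen by counting bytes instead of characters. The following is a minimal sketch, assuming the source file is saved as UTF-8 and the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumes this source file itself is saved as UTF-8.
$str = '我爱编程';

// strlen() counts bytes; each of these characters takes 3 bytes in UTF-8.
$byteLength = strlen($str);

// mb_strlen() counts characters when told the encoding explicitly.
$charLength = mb_strlen($str, 'UTF-8');

// substr() slices by byte offset, so (2, 2) takes the last byte of 我
// and the first byte of 爱.
$slice = substr($str, 2, 2);

// The two bytes do not form a valid UTF-8 sequence on their own.
$sliceIsValid = mb_check_encoding($slice, 'UTF-8');

echo $byteLength, ' bytes, ', $charLength, " characters\n";
echo 'hex of substr slice: ', bin2hex($slice), "\n";
echo 'valid UTF-8: ', var_export($sliceIsValid, true), "\n";
```

The string is 12 bytes but only 4 characters, and the slice consists of the bytes 91 e7, which a browser can only render as replacement characters.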
Next, the last line: echo mb_substr($str, 2, 2), '<br>'. Now we use PHP's mb_substr and expect it to work properly. Does it? Unfortunately, it doesn't! The output is still ��. Let's check the PHP manual for mb_substr: "string mb_substr ( string $str , int $start [, int $length [, string $encoding ]] ). The encoding parameter is the character encoding. If it is omitted, the internal character encoding value will be used." So, based on the manual, we change our code to:

<?php
header("Content-Type:text/html;charset=utf-8");
$str = '我爱编程';
echo substr($str, 2, 2), '<br>';
echo mb_substr($str, 2, 2, 'UTF-8'), '<br>';
?>

This time, it works! The output is 编程, as expected. So in this case, we can't simply call mb_substr and expect it to work properly; we still have to specify the encoding. If we don't, PHP uses the internal character encoding value. That raises another question: what is the internal character encoding value? To answer it, we have to look at our PHP configuration file. Let's open php.ini and find the [mbstring] section.

[mbstring]
...
;mbstring.internal_encoding = EUC-JP
...

OK, that is quite clear now. Let's uncomment this line, change the value to UTF-8, and then restart the Apache server (don't forget this step).

[mbstring]
...
mbstring.internal_encoding = UTF-8
...

Now, let's try this code again:

<?php
header("Content-Type:text/html;charset=utf-8");
$str = '我爱编程';
echo substr($str, 2, 2), '<br>';
echo mb_substr($str, 2, 2), '<br>';
?>

Now it works: mb_substr uses the internal encoding value, and that value has been set to UTF-8.
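If editing php.ini is not an option, the internal encoding can also be set at runtime with mb_internal_encoding(). This is a minimal sketch of that alternative, assuming the mbstring extension is loaded and the source file is saved as UTF-8. It is not what the setup above uses.

```php
<?php
// Set the internal encoding for the current request only; php.ini is untouched.
mb_internal_encoding('UTF-8');

$str = '我爱编程';

// With the internal encoding set, the fourth argument can be omitted.
$result = mb_substr($str, 2, 2);

echo $result, "\n";
```

The call only affects the running script, so it would typically go in a file that every page includes.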

One thing I don't like about PHP (and JavaScript) is that, for the same task, it always provides a few different ways to do it. I used to call it too much flexibility (http://hengrui-li.blogspot.com/2011/04/too-much-language-flexibility-good-or.html). Recently I learned the core philosophy of Python: "There should be one - and preferably only one - obvious way to do it", and I realized why I don't think PHP is a great programming language (purely from a programming language perspective, not in terms of how it has boosted the web and made web programming so easy).

So, let's suppose we only want to use one function for the task "return part of a string". Obviously, mb_substr is our choice, and replacing every substr in an old system with mb_substr is not hard. But what if we don't want to change our code? Or what if we simply think typing mb_substr is less efficient than typing substr?

Actually, mbstring supports overloading the existing string manipulation functions. If we enable overloading, when we call substr(), PHP will actually call mb_substr() automatically. Let's see how to enable overloading in php.ini:

[mbstring]
...
; overload(replace) single byte functions by mbstring functions.
; mail(), ereg(), etc are overloaded by mb_send_mail(), mb_ereg(),
; etc. Possible values are 0,1,2,4 or combination of them.
; For example, 7 for overload everything.
; 0: No overload
; 1: Overload mail() function
; 2: Overload str*() functions
; 4: Overload ereg*() functions
;mbstring.func_overload = 0
...

So we simply uncomment the last line and set the value to 7 (mbstring.func_overload = 7), restart Apache, and try the code:

<?php
header("Content-Type:text/html;charset=utf-8");
$str = '我爱编程';
echo substr($str, 2, 2), '<br>';
echo mb_substr($str, 2, 2), '<br>';
?>

Now both functions work fine! But overloading can cause issues. If we use the normal string functions to handle real binary data (meaning actual binary data, NOT a text string treated as binary), enabling overloading could break the binary handling code. Although I rarely see PHP code that needs to handle real binary data, it is safest to simply call the mb_ string functions explicitly. Also keep in mind this note from the PHP manual: "It is not recommended to use the function overloading option in the per-directory context, because it's not confirmed yet to be stable enough in a production environment and may lead to undefined behaviour."
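The binary data problem can be illustrated without turning overloading on. This is a minimal sketch, assuming the mbstring extension is loaded; the four bytes are made-up sample data. It is also worth knowing that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so the sketch only simulates what an overloaded strlen() would compute.

```php
<?php
// Four raw bytes standing in for binary data; they are not meant as text.
$binary = "\xE6\x88\x91\x00";

// Byte-oriented count: the '8bit' encoding treats every byte as one unit.
$byteCount = mb_strlen($binary, '8bit');

// Character-oriented count: roughly what an overloaded strlen() would report
// with a UTF-8 internal encoding. E6 88 91 decodes to one character,
// and the NUL byte is a second one.
$charCount = mb_strlen($binary, 'UTF-8');

echo $byteCount, ' bytes, but ', $charCount, " characters\n";
```

Code that relies on strlen() to measure a binary payload would see 2 instead of 4 under overloading. Passing '8bit' explicitly keeps byte-oriented code correct regardless of configuration.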
