PHP and UTF-8 Howto
PHP and UTF-8 Howto - Experiences from WebCollab

Writing the UTF-8 version of WebCollab in early 2004 was not straightforward. There was not much good information on PHP with UTF-8, and a lot of bad information. However, contrary to many doomsayers, PHP can be made to run with UTF-8 without too much trouble. This page documents how we made WebCollab work with UTF-8.

PHP mbstring library

PHP has an optional library specifically for handling multi-byte strings, known as mbstring (short for multi-byte string). This library makes using UTF-8 much easier. Because the library is optional, not all web hosting providers enable mbstring in their PHP builds. However, given that UTF-8 is becoming more widespread, most providers should now provide it.

For most of the mbstring functions there are also pure PHP code equivalents and workarounds that can be found on the web. There is no real advantage in using these workarounds, and a number of disadvantages:
For a working example of a PHP UTF-8 application, visit the demo website for WebCollab.

HTTP Headers

First, we must set the HTTP headers correctly to instruct the browser to use UTF-8:

    header( 'Content-Type: text/html; charset=UTF-8' );

Then, to make doubly sure the browser uses UTF-8, we send a meta tag in the HTML head:

    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

PHP Internal Encoding

By default, older versions of PHP use 'ISO-8859-1' as their internal encoding (UTF-8 became the default in PHP 5.6). Change this to UTF-8:

    mb_internal_encoding( 'UTF-8' );

This makes the mbstring functions treat strings as UTF-8 by default; ordinary byte-oriented functions such as strlen() are not affected. It also ensures that input and output are in UTF-8 without PHP trying to force character set changes.

HTTP Form Submission

Although not specifically mandated by the W3C, almost all web browsers will submit an HTTP form in the same character set as the page was served in. Put another way, if you deliver your pages in UTF-8, then submitted responses will also be in UTF-8.

There is no need to try to verify the character set of a submitted HTTP form. Our experience has been that trying to determine the submitted character set produces more 'false positive' errors than simply accepting that it is correct.

Character Validation

Overly long UTF-8 sequences and UTF-16 surrogates are a serious security threat, because they can smuggle characters past filters that only look for the shortest encoding. Validation of input data is therefore very important. An algorithm using preg_replace() is given below, and is used in current versions of WebCollab.

    $body = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.

In the above algorithm the first preg_replace() only allows well-formed Unicode (and rejects overly long 2-byte sequences, as well as characters at or above U+10000). The second preg_replace() removes overly long 3-byte sequences and UTF-16 surrogates. Several points to be noted here:
Alternate Methods of Validation
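One alternative can be sketched as follows, assuming the mbstring extension is enabled: let mb_check_encoding() decide whether the input is well-formed UTF-8, instead of using a hand-written regular expression. The helper name is_valid_utf8_input() is hypothetical and is not taken from WebCollab, and the set of control characters it rejects is this sketch's own choice.

```php
<?php
// Hypothetical helper, not part of WebCollab: returns true only for
// well-formed UTF-8 that contains no unwanted control characters.
function is_valid_utf8_input(string $input): bool
{
    // Reject byte sequences that are not valid UTF-8
    // (for example overlong forms and encoded UTF-16 surrogates).
    if (!mb_check_encoding($input, 'UTF-8')) {
        return false;
    }

    // Assumption of this sketch: disallow ASCII control characters
    // other than tab, line feed and carriage return.
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', $input) === 0;
}

$ok      = is_valid_utf8_input("Gr\xC3\xBC\xC3\x9Fe"); // "Grüße", valid
$overlong = is_valid_utf8_input("\xC0\x80");           // overlong NUL, invalid
```

Unlike the preg_replace() approach, this sketch rejects bad input outright rather than replacing the offending bytes, so the caller has to decide what to do with a failed check.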
Equivalent Functions

Some common text handling functions do not work directly in UTF-8 and have equivalent multibyte functions. Some of the more common equivalents are listed below:
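To illustrate why the multibyte equivalents matter, here is a minimal sketch that compares a few byte-oriented functions with their mbstring counterparts. It assumes the mbstring extension is enabled; the sample word and variable names are only for illustration.

```php
<?php
mb_internal_encoding('UTF-8');

// "Zürich": 6 characters, but 7 bytes because "ü" takes 2 bytes in UTF-8.
$word = "Z\xC3\xBCrich";

$byteLength = strlen($word);     // counts bytes
$charLength = mb_strlen($word);  // counts characters

// substr() cuts in the middle of "ü" and leaves a broken byte sequence;
// mb_substr() cuts on character boundaries.
$brokenPrefix = substr($word, 0, 2);
$cleanPrefix  = mb_substr($word, 0, 2);

// mb_strtoupper() also converts non-ASCII letters.
$upper = mb_strtoupper($word);
```

Here strlen() reports 7 while mb_strlen() reports 6, and only the mb_substr() result is still valid UTF-8.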
Regular Expressions

The PCRE regular expression functions require the pattern modifier 'u' to make the PCRE engine treat the pattern and the subject string as UTF-8. The POSIX regular expression functions have equivalent multibyte functions, such as those below:
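The effect of the 'u' modifier can be seen in a short sketch. It covers only the PCRE side, and the sample string and variable names are assumptions made for illustration.

```php
<?php
// "naïve": 5 characters, 6 bytes ("ï" is 2 bytes in UTF-8).
$text = "na\xC3\xAFve";

// Without 'u', '.' matches one byte at a time.
$byteCount = preg_match_all('/./', $text, $byteMatches);

// With 'u', '.' matches one UTF-8 character at a time.
$charCount = preg_match_all('/./u', $text, $charMatches);

// Unicode properties such as \p{L} (any letter) are available in UTF-8 mode.
$allLetters = preg_match('/^\p{L}+$/u', $text);
```

With the modifier, the match count equals the number of characters (5) rather than the number of bytes (6). Note that with 'u' the preg functions fail on a subject that is not valid UTF-8, so the return value is worth checking.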
MySQL Database

You must use at least MySQL 4.1 for Unicode support. When creating a database for PHP and UTF-8, use the command:

    CREATE DATABASE database_name DEFAULT CHARACTER SET utf8;

Note: There is no '-' (dash) in 'utf8' for MySQL.

All tables and character columns built after this will default to the UTF-8 character set. If you have an existing database converted to UTF-8, or create individual tables with UTF-8 columns, we have found that you must also set the database itself to UTF-8 to avoid problems.

    ALTER DATABASE database_name DEFAULT CHARACTER SET utf8;

When connecting to MySQL with PHP, you should tell MySQL what character set to expect by using two commands:

    mysql_query( "SET NAMES utf8", $database_connection );

MySQL will then expect input data to be in UTF-8, and will output results in UTF-8. (The mysql_* functions shown here were removed in PHP 7.0; with the mysqli extension, mysqli_set_charset() serves the same purpose.)

It is possible to use a connection character set that differs from the back end database character set. MySQL will convert seamlessly between them; however, characters not available in one or the other character set will be converted to '?'.

PostgreSQL Database

PostgreSQL has good UTF-8 support. Ideally, you should create databases with UTF-8 encoding:

    CREATE DATABASE database_name WITH ENCODING 'UTF8';

After connecting, PHP has a built-in function for setting the client encoding:

    pg_set_client_encoding( $database_connection, 'UTF8' );

Note: This function returns -1 for an error condition and 0 for success, rather than the boolean false and true that would be usual.

You can also use SQL commands:

    SET CLIENT_ENCODING TO 'UTF8';

Or you can use the standard SQL syntax SET NAMES:

    SET NAMES 'UTF8';

It is possible to use a connection character set that differs from the back end database character set. The PostgreSQL client will convert seamlessly between them; however, characters not available in one or the other character set will be converted to '?'.

PostgreSQL checks the validity of UTF-8 on input, and will abort with an error message if an invalid byte is found.
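The '?' substitution described above for mismatched character sets can be imitated in PHP. This is only an illustration of the effect, not of what either database does internally: it assumes the mbstring extension and uses mb_convert_encoding() to convert a UTF-8 string to ISO-8859-1, which cannot represent every character in the sample.

```php
<?php
// Make the replacement character explicit: 0x3F is '?'.
mb_substitute_character(0x3F);

// "café 日本" in UTF-8.
$utf8 = "caf\xC3\xA9 \xE6\x97\xA5\xE6\x9C\xAC";

// "é" exists in ISO-8859-1 (byte 0xE9); the two kanji do not,
// so each of them is replaced by '?'.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
```

The replacement is lossy: once a character has become '?', converting back to UTF-8 cannot recover it, which is why keeping the connection and the database both in UTF-8 is the safer choice.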
Links

UTF-8 Sampler
UTF-8 and Unicode FAQ
UTF-8 Test Page

Comments, Criticisms and Suggestions

andrewsimpson at users dot sourceforge dot net