Reading or Writing Unicode Characters (PHP Cookbook) - e-Reading Library

16.12. Reading or Writing Unicode Characters

16.12.1. Problem

You want to read Unicode-encoded characters from a file, database, or form; or, you want to write Unicode-encoded characters.

16.12.2. Solution

Use utf8_encode( ) to convert single-byte ISO-8859-1 encoded characters to UTF-8:

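// \xf6 is the single ISO-8859-1 byte for ö, so the input does not depend on how this file is saved
print utf8_encode("Kurt G\xf6del is swell.");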

Use utf8_decode( ) to convert UTF-8 encoded characters to single-byte ISO-8859-1 encoded characters:

print utf8_decode("Kurt G\xc3\xb6del is swell.");
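Both functions are deprecated as of PHP 8.2, so the same conversions may be worth writing another way on newer installations. The following is a minimal sketch using iconv( ), assuming the iconv extension is loaded (it usually is, but that is an assumption about your build); the variable names are illustrative only:

```php
<?php
// Assumption: the iconv extension is available in this PHP build.
$latin1 = "Kurt G\xf6del is swell.";   // ö stored as the single byte 0xF6

// ISO-8859-1 -> UTF-8, the same direction as utf8_encode()
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
print bin2hex($utf8) . "\n";           // hex dump makes the two-byte ö visible

// UTF-8 -> ISO-8859-1, the same direction as utf8_decode()
$back = iconv('UTF-8', 'ISO-8859-1', $utf8);
print ($back === $latin1 ? "round trip ok" : "round trip failed") . "\n";
```

Printing the result through bin2hex( ) avoids depending on the terminal's own encoding when checking which bytes were produced.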

16.12.3. Discussion

A single byte can hold 256 different values. ASCII standardizes only the characters for codes 0 through 127: control characters, letters and numbers, and punctuation. There are different rules, however, for the characters that codes 128-255 map to. One encoding is called ISO-8859-1, which includes characters necessary for writing most Western European languages, such as the ö in Gödel or the ñ in pestaña. Many languages, though, require more than 256 characters, and a character set that can express more than one language requires even more characters. This is where Unicode saves the day; its UTF-8 encoding can represent more than a million characters.

This increased functionality comes at the cost of space. ASCII characters are stored in just one byte; UTF-8 encoded characters need up to four bytes. Table 16-2 shows the byte representations of UTF-8 encoded characters.
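To see the space cost concretely, the sketch below measures a few sample characters. The strings are written as explicit UTF-8 bytes so the result does not depend on how the script file is saved; the choice of sample code points is arbitrary:

```php
<?php
// One character per entry, spelled out as raw UTF-8 bytes
$samples = [
    'U+0041'  => "A",                  // Latin capital A
    'U+00F6'  => "\xc3\xb6",           // o with diaeresis
    'U+20AC'  => "\xe2\x82\xac",       // euro sign
    'U+1F600' => "\xf0\x9f\x98\x80",   // an emoji outside the 16-bit range
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes, not characters
    printf("%s: %d byte(s), hex %s\n", $label, strlen($char), bin2hex($char));
}
```

Because strlen( ) counts bytes, it reports 1, 2, 3, and 4 for these four single characters.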

Table 16-2. UTF-8 byte representation

Character code range	Bytes used	Byte 1	Byte 2	Byte 3	Byte 4
`0x00000000 - 0x0000007F`	1	`0xxxxxxx`
`0x00000080 - 0x000007FF`	2	`110xxxxx`	`10xxxxxx`
`0x00000800 - 0x0000FFFF`	3	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
`0x00010000 - 0x001FFFFF`	4	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`

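In Table 16-2, the x positions represent bits used for actual character data. The least significant bit is the rightmost bit in the rightmost byte. In multibyte characters, the number of leading 1 bits in the leftmost byte is the same as the number of bytes in the character, and every following byte begins with the bits 10. Although the four-byte pattern has room for codes up to 0x001FFFFF, Unicode assigns code points only up to 0x10FFFF.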
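The bit layout in Table 16-2 can be applied by hand with shifts and masks. The function below is a hypothetical helper written for illustration (codepoint_to_utf8( ) is not a PHP built-in), and it is only a sketch: it assumes PHP 7 or later for the type declarations, and it does not reject the surrogate range 0xD800-0xDFFF, which a strict encoder would:

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Code point out of range');
    }
    if ($cp <= 0x7F) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

print bin2hex(codepoint_to_utf8(0xF6)) . "\n";   // c3b6, the ö from the Solution
```

Each branch corresponds to one row of the table: the first byte carries the length marker plus the highest data bits, and each continuation byte carries the next six bits behind a 10 prefix.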

16.12.4. See Also

Documentation on utf8_encode( ) at http://www.php.net/utf8-encode and utf8_decode( ) at http://www.php.net/utf8-decode; more information on Unicode is available at the Unicode Consortium's home page, http://www.unicode.org; the UTF-8 and Unicode FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html is also helpful.

Copyright © 2003 O'Reilly & Associates. All rights reserved.