Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
see Wiki
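To make the bit layout concrete, here is a minimal Python sketch of an encoder for the original scheme of up to six bytes. The function name `encode_utf8_original` and the `LAYOUTS` table are made up for illustration; this is not a standard library API, and it does not reject surrogates or anything above U+10FFFF, which a modern UTF-8 encoder would have to do.

```python
# Hypothetical helper: encodes one code point with the original
# (pre-RFC 3629) UTF-8 layout of up to six bytes.

# (last code point, lead-byte prefix bits, number of continuation bytes)
LAYOUTS = [
    (0x7FF, 0xC0, 1),       # 110xxxxx
    (0xFFFF, 0xE0, 2),      # 1110xxxx
    (0x1FFFFF, 0xF0, 3),    # 11110xxx
    (0x3FFFFFF, 0xF8, 4),   # 111110xx
    (0x7FFFFFFF, 0xFC, 5),  # 1111110x
]


def encode_utf8_original(cp: int) -> bytes:
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point does not fit into 31 bits")
    if cp <= 0x7F:
        return bytes([cp])  # 0xxxxxxx, a single byte
    for last, prefix, n_cont in LAYOUTS:
        if cp <= last:
            # The lead byte carries the topmost payload bits.
            lead = prefix | (cp >> (6 * n_cont))
            # Each continuation byte is 10xxxxxx with the next six bits.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(n_cont - 1, -1, -1)]
            return bytes([lead] + tail)
    raise AssertionError("unreachable: range was checked above")
```

For example, `encode_utf8_original(0x20AC)` (the euro sign) yields the bytes E2 82 AC, the same as `"€".encode("utf-8")`.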
How? Simple! You have to realize what the UTF-? encodings are and what they are not. They are a way of encoding the numeric codes of Unicode characters into bytes. And the Unicode codespace has the range 000000-10FFFF. Two things follow from that:
By the way, it is 32-bit, not 31-bit, as I already wrote. Even though the Unicode codespace was originally 31 bits wide, that does not change the fact that:
UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit code unit. (Unicode Standard 6.0)
Nevertheless:
As for all of the Unicode encoding forms, UTF-32 is restricted to representation of code points in the range 0-10FFFF - that is, the Unicode codespace. This guarantees interoperability with the UTF-16 and UTF-8 encoding forms. (Unicode Standard 6.0)
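The two quotes together can be sketched in a few lines of Python: the code unit is simply the code point stored as a 32-bit number, but only values up to 10FFFF are accepted. The name `encode_utf32_be` and the constant `UNICODE_MAX` are my own, chosen for illustration, and the big-endian byte order is an assumption. The sketch also skips the surrogate check that a strict encoder would add.

```python
import struct

UNICODE_MAX = 0x10FFFF  # upper end of the Unicode codespace


def encode_utf32_be(cp: int) -> bytes:
    """Store one code point as a single 32-bit big-endian code unit."""
    if not 0 <= cp <= UNICODE_MAX:
        # All 32 bits are available, but the codespace ends at 10FFFF.
        raise ValueError("outside the Unicode codespace 0-10FFFF")
    return struct.pack(">I", cp)  # ">I" = big-endian unsigned 32-bit
```

So the container is 32 bits wide, while the values that may go into it need only 21 bits.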
The original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003 UTF-8 was restricted by RFC 3629 to four bytes covering only the range U+0000 to U+10FFFF, in order to match the constraints of the UTF-16 character encoding.
see Wiki... :)
RFC 2044 (1996): 4 octets
RFC 2279 (1998): 6 octets
RFC 3629 (2003): 4 octets