Quick Guide to understanding Unicode Data Transfer Formats
by Paul Hsieh
What follows is the most concise way I can think of for presenting what one
needs to know to understand just the raw Unicode (aka ISO 10646) data formats.
This is where everyone needs to start with understanding Unicode, and for
people writing just text processing (as opposed to text viewing) tools, there
is no need to proceed further than this. So in the interest of conciseness,
let's just get right into it.
For the purposes of data transfer, the main thing that Unicode specifies is
"code points": values in the range from 0x0 to 0x10FFFF, with a couple of
missing holes. For the encoding of ordinary ASCII text, each code point
corresponds to a single character. However, for more complicated languages,
multiple code points may be used in combination to describe a text element
(such as a word or character.) Thus "code points" should not be thought of as
exactly synonymous with characters, though they are in many cases.
Although there is only one up-to-date Unicode standard, there are various
popular encoding formats. These are described in the table below.
32 bit value (hex) | UCS-4    | UTF-32   | UCS-2/BMP   | UTF-16      | Latin1      | 7-bit ASCII | UTF-8       | GB18030
-------------------+----------+----------+-------------+-------------+-------------+-------------+-------------+--------------
0:7F               | 4 octets | 4 octets | 2 octets    | 2 octets    | 1 octet     | 1 octet     | 1 octet     | 1 octet
80:FF              | 4 octets | 4 octets | 2 octets    | 2 octets    | 1 octet     | Invalid     | 2 octets    | 2 or 4 octets
100:7FF            | 4 octets | 4 octets | 2 octets    | 2 octets    | No encoding | No encoding | 2 octets    | 2 or 4 octets
800:D7FF           | 4 octets | 4 octets | 2 octets    | 2 octets    | No encoding | No encoding | 3 octets    | 2 or 4 octets
D800:DFFF          | 4 octets | Invalid  | Invalid     | No encoding | No encoding | No encoding | Invalid     | No encoding
E000:FFFD          | 4 octets | 4 octets | 2 octets    | 2 octets    | No encoding | No encoding | 3 octets    | 2 or 4 octets
FFFE:FFFF          | 4 octets | Invalid  | Invalid     | Invalid     | No encoding | No encoding | Invalid     | Invalid
10000:10FFFF       | 4 octets | 4 octets | No encoding | 4 octets    | No encoding | No encoding | 4 octets    | 4 octets
110000:16887F      | 4 octets | Invalid  | No encoding | No encoding | No encoding | No encoding | Invalid     | Invalid
168880:7FFFFFFF    | 4 octets | Invalid  | No encoding | No encoding | No encoding | No encoding | No encoding | No encoding
80000000:FFFFFFFF  | Invalid  | Invalid  | No encoding | No encoding | No encoding | No encoding | No encoding | No encoding
Note: "octet" is just a more specific way of saying "8 bit byte".
The cells marked Invalid are ranges whose values the encoding could
represent, but which are considered illegal in that encoding because they do
not correspond to valid code points. The cells giving an octet count are
ranges for which the encoding has a valid representation of a valid value.
UCS-4 allows for values which are not valid Unicode code points, while all of
the UTF formats shown above precisely cover the valid Unicode range. The
cells marked No encoding are ranges whose values are impossible to encode in
that format at all, making the question of their validity a moot point.
Note that there is no conflict between these mappings: each individual value
has a 1-1 mapping with any other format's encoding of that same value, if one
exists. This is not true of formats like Big5, which contains several
character pairs that map to a single Unicode character, as well as several
that do not map to any valid Unicode character. Other formats, like UTF-7,
have multiple ways of encoding the same values. For these reasons, and
because other encodings (like UTF-1) have not gained much traction, only the
encodings shown in the table above will be discussed.
Other things worth noting:
- 7-bit ASCII and UCS-2 are both insufficient for encoding the whole
valid Unicode range.
- 7-bit ASCII and UTF-8 are identical for values less than 0x80.
- UCS-2 and UTF-16 are identical for values less than 0x10000.
- UCS-4 and UTF-32 are identical encodings, whose differences are just in
interpretation of valid values.
- 7-bit ASCII, UCS-2, UCS-4 and UTF-32 are direct numerical encodings
of Unicode code point values into a space of one, two, four and four
octets respectively.
- The UTF-8 format could encode the entire legal range of UCS-4; however, it
is simply truncated to the legal Unicode range, which removes two of its
encoding modes (the 5 and 6 byte encodings.)
- The only reason the range D800:DFFF is invalid is UTF-16's inability to
encode it (see the validity-check sketch following this list).
- FFFE is an invalid code point because it is used to detect an
endianness (byte order) mismatch in UTF-16 (U-FEFF, the BOM (Byte
Order Mark) character, is expected at the beginning of a UTF-16
stream.)
- There is no currently necessary reason for FFFF to be an invalid
code point. Although it could be used as an escape character to extend UTF-16
in the future, should that become necessary, a larger value formed from a
currently unassigned surrogate pair could also serve that purpose.
- The range D800:DFFF consists of invalid code points because those values
are used as escape values in UTF-16 (called surrogates.)
- While 8-bit pseudo-ASCII has been used on various platforms and transfer
mechanisms, it is not considered standard. Only Latin1 maps directly onto
Unicode code points.
- While the bulk of the Unicode code point space is in the range
U-10000:U-10FFFF, the most commonly used characters are in the
U-0:U-FFFF range.
- GB18030 is a Chinese government format that extends the previous standards
of GB2312 and GBK to also include the rest of the Unicode code space. In
terms of encoding sizes it is comparable to UTF-8 on average; however, it is
a very complicated mapping. (It is not explained further here.)
- Each of these encoding formats satisfies properties other than raw
efficiency. It is otherwise fairly straightforward to make 21 bit packed
formats, or 8/16/24 bit variable length formats, which are more efficient in
terms of space and encoding/decoding effort.
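Here is the validity-check sketch referenced above: a minimal function in C
that follows this article's table (the function name is illustrative, not
from any standard library):

    #include <stdint.h>

    /* Returns 1 if cp is a valid Unicode code point for data transfer:
       in range 0:10FFFF, not a surrogate (D800:DFFF) and not FFFE or
       FFFF. (Illustrative helper, per the rules in this article.) */
    static int isValidCodePoint (uint32_t cp) {
        if (cp > 0x10FFFFUL) return 0;                  /* beyond the Unicode range */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0; /* UTF-16 surrogates        */
        if (cp == 0xFFFEUL || cp == 0xFFFFUL) return 0; /* BOM-related exclusions   */
        return 1;
    }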
Because of their heritage and their improved efficiency over UTF-32, UTF-8 and
UTF-16 are the most common Unicode encoding formats used for data transfer.
Between the two there are various arguments as to which is better. UTF-16 is
a little faster to decode and encode, it is directly backward compatible with
UCS-2, and it is more efficient at encoding most typical East Asian text.
UTF-8 does not have endianness issues, it is directly backward compatible
with ASCII, and it is more efficient at encoding Western text.
So, as one can easily calculate from the table above, there are 1,112,062
valid Unicode code points (0x110000 total values, minus the 2,048 surrogates
and the 2 values FFFE and FFFF: 1,114,112 - 2,048 - 2 = 1,112,062). As of the
most current Unicode standard, only roughly 100,000 of these have actually
been assigned to individually named universal values. Amongst the encodings
shown above, all are trivial except for UTF-8 and UTF-16, which we now go
into in more depth.
The UTF-8 mapping
A UTF-8 mapping takes valid Unicode code point values and translates them
into one or more octets. An encoder will simply write the octets in
sequential order, and a decoder will read the octets one at a time and try to
fit them to a reverse mapping. The mapping from a valid Unicode code point
value x (= x20 x19 x18 x17 x16 x15 x14 x13 x12 x11 x10 x9 x8 x7 x6 x5 x4 x3
x2 x1 x0 in binary notation, x20 being the most significant bit) to UTF-8 is
as follows:
if U-0 ≤ x ≤ U-7F then UTF-8(x) =
    0 x6 x5 x4 x3 x2 x1 x0
if U-80 ≤ x ≤ U-7FF then UTF-8(x) =
    1 1 0 x10 x9 x8 x7 x6 | 1 0 x5 x4 x3 x2 x1 x0
if U-800 ≤ x ≤ U-FFFF then UTF-8(x) =
    1 1 1 0 x15 x14 x13 x12 | 1 0 x11 x10 x9 x8 x7 x6 | 1 0 x5 x4 x3 x2 x1 x0
if U-10000 ≤ x ≤ U-10FFFF then UTF-8(x) =
    1 1 1 1 0 x20 x19 x18 | 1 0 x17 x16 x15 x14 x13 x12 | 1 0 x11 x10 x9 x8 x7 x6 | 1 0 x5 x4 x3 x2 x1 x0
(each | separates successive octets)
In the original UTF-8 encoding (which was intended to map the entire UCS-4
range), 26 bit -> 5 byte and 31 bit -> 6 byte encodings were also specified,
following the pattern above. This is important when considering the
resynch property.
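As a worked illustration of the mapping above, here is a minimal UTF-8
encoder sketch in C (the function name and interface are my own, not from any
particular library):

    #include <stdint.h>
    #include <stddef.h>

    /* Encodes the code point cp into out[] (at least 4 octets available).
       Returns the number of octets written, or 0 if cp is not a valid
       Unicode code point. (Illustrative sketch only.) */
    static size_t utf8Encode (uint32_t cp, unsigned char out[4]) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0;  /* surrogates          */
        if (cp == 0xFFFEUL || cp == 0xFFFFUL) return 0;  /* invalid code points */
        if (cp <= 0x7FUL) {                    /* 0 x6..x0                      */
            out[0] = (unsigned char) cp;
            return 1;
        }
        if (cp <= 0x7FFUL) {                   /* 110 x10..x6 | 10 x5..x0       */
            out[0] = (unsigned char) (0xC0 | (cp >> 6));
            out[1] = (unsigned char) (0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp <= 0xFFFFUL) {                  /* 1110 x15..x12 | 10 .. | 10 .. */
            out[0] = (unsigned char) (0xE0 | (cp >> 12));
            out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char) (0x80 | (cp & 0x3F));
            return 3;
        }
        if (cp <= 0x10FFFFUL) {                /* 11110 x20..x18 | 10 .. (x3)   */
            out[0] = (unsigned char) (0xF0 | (cp >> 18));
            out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char) (0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;                              /* beyond U-10FFFF               */
    }

For example, utf8Encode(0x10FFFF, out) produces the octets F4 8F BF BF.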
Properties of UTF-8
- Not all possible output octet combinations are valid. In particular,
any octet whose value is ≥ F8 is not possible. Furthermore, false
aliases, such as C1 BF, which uses the 80:7FF range encoding to encode
U-7F rather than the single octet 0:7F encoding, are not allowed. (In
general, only the shortest encoding amongst the various aliases is
allowed.) The only valid encoding outputs are those produced by the
mapping given above. Decoders must detect when an invalid encoding output
has been encountered.
- By examining the top two bits of each octet alone, it is possible to
determine which of 3 modes a UTF-8 octet is part of (00 and 01
are ASCII, 11 is the start of a multi-byte encoding, and 10 is a
non-starting octet of a multi-byte encoding.) When an error is
encountered while decoding UTF-8, if the policy is not to halt the decoding,
then a resynch can be performed by scanning up to 5 octets (allowing for
dealing with potentially obsolete encoders which mapped to values beyond the
Unicode code point range) after the point of the error until the top two bits
of the octet are not 10. (A decoder sketch demonstrating this follows this
list.)
- Detecting the invalid code point ranges D800:DFFF,
FFFE:FFFF and 110000:1FFFFF cannot be done from the UTF-8 mapping
by itself. The decoded code point value must be examined directly for these
potentially invalid values.
- Since the encoding is octet based there are no endianness issues. In
particular, leading BOM characters (U-FEFF) are unnecessary and do not
imply anything about the data stream.
- If the octets from a valid UTF-8 stream are viewed as unsigned 8 bit
values, then the lexical sorting order of UTF-8 is identical to the sorting
order of UTF-32. Note that this is not the same as collating.
- There is no special EOF character. In particular control characters in
the ASCII range don't specify anything other than raw code point data values.
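As referenced in the resynch note above, here is a minimal UTF-8 decoder
sketch in C with a non-halting, resynching error policy. The error sentinel
and names are my own, and the validity checks follow this article's rules:

    #include <stdint.h>
    #include <stddef.h>

    #define UTF8_ERROR 0xFFFFFFFFUL  /* illustrative error sentinel (not a code point) */

    /* Decodes one code point from s[0..len-1] (len >= 1 required), writing
       the number of octets consumed to *step. On an invalid sequence it
       resynchronizes by skipping the bad octet plus up to 5 trailing
       10xxxxxx octets, and returns UTF8_ERROR. */
    static uint32_t utf8Decode (const unsigned char *s, size_t len, size_t *step) {
        static const uint32_t minv[4] = { 0, 0x80, 0x800, 0x10000 }; /* shortest-form floors */
        uint32_t cp;
        size_t n, i;

        if (s[0] < 0x80) { *step = 1; return s[0]; }        /* 00/01: ASCII           */
        if      ((s[0] & 0xE0) == 0xC0) { n = 2; cp = s[0] & 0x1F; }
        else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; }
        else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; }
        else goto error;                    /* 10xxxxxx lead octet, or F8:FF  */

        if (n > len) goto error;                            /* truncated stream       */
        for (i = 1; i < n; i++) {
            if ((s[i] & 0xC0) != 0x80) goto error;          /* bad trailing octet     */
            cp = (cp << 6) | (s[i] & 0x3F);
        }
        if (cp < minv[n-1]) goto error;                     /* overlong (false alias) */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL) goto error;   /* surrogate range        */
        if (cp == 0xFFFEUL || cp == 0xFFFFUL) goto error;   /* invalid code points    */
        if (cp > 0x10FFFFUL) goto error;                    /* beyond Unicode range   */
        *step = n;
        return cp;

    error: /* resynch: skip octets whose top two bits are 10, up to 5 of them */
        for (i = 1; i < len && i <= 5 && (s[i] & 0xC0) == 0x80; i++) ;
        *step = i;
        return UTF8_ERROR;
    }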
The UTF-16 mapping
A UTF-16 mapping takes valid Unicode code point values and translates them
into one or two 16 bit values. Each 16 bit value is encoded as a pair of
octets. An encoder will simply write the 16 bit values in sequential order,
and a decoder will read the 16 bit values one at a time and try to fit them
to a reverse mapping.
The mapping from a valid Unicode code point value x
(= x20 x19 x18 x17 x16 x15 x14 x13 x12 x11 x10 x9 x8 x7 x6 x5 x4 x3 x2 x1 x0
in binary notation) to UTF-16 is as follows:
if U-0 ≤ x ≤ U-FFFF (x not in the unencodable range D800:DFFF) then UTF-16(x) =
    x15 x14 x13 x12 x11 x10 x9 x8 x7 x6 x5 x4 x3 x2 x1 x0
if U-10000 ≤ x ≤ U-10FFFF then UTF-16(x = y + 0x10000) =
    1 1 0 1 1 0 y19 y18 y17 y16 y15 y14 y13 y12 y11 y10
    1 1 0 1 1 1 y9 y8 y7 y6 y5 y4 y3 y2 y1 y0
Notice the value shift when encoding values in the 10000:10FFFF range
(i.e., a value y is first computed by subtracting 0x10000 from x, then the
bits of y are encoded as shown). This prevents redundant encodings.
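To make the shift and surrogate construction concrete, here is a minimal
UTF-16 encoder sketch in C producing 16 bit values (endianness and octet
serialization are left to the caller; the names are my own):

    #include <stdint.h>
    #include <stddef.h>

    /* Encodes the code point cp into one or two 16 bit values in out[].
       Returns the count of 16 bit values written, or 0 if cp cannot be
       validly represented. (Illustrative sketch only.) */
    static size_t utf16Encode (uint32_t cp, uint16_t out[2]) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0;  /* unencodable surrogate range */
        if (cp == 0xFFFEUL || cp == 0xFFFFUL) return 0;  /* invalid code points         */
        if (cp <= 0xFFFFUL) {
            out[0] = (uint16_t) cp;                      /* direct 16 bit value         */
            return 1;
        }
        if (cp <= 0x10FFFFUL) {
            uint32_t y = cp - 0x10000UL;                 /* the value shift             */
            out[0] = (uint16_t) (0xD800 | (y >> 10));    /* 110110 y19..y10             */
            out[1] = (uint16_t) (0xDC00 | (y & 0x3FF));  /* 110111 y9..y0               */
            return 2;
        }
        return 0;                                        /* beyond U-10FFFF             */
    }

For example, utf16Encode(0x10000, out) produces the surrogate pair D800 DC00,
and utf16Encode(0x10FFFF, out) produces DBFF DFFF.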
Properties of UTF-16
- There is no way to encode values from the range D800:DFFF, as these
bit patterns (called surrogates, used in pairs) are the escapes for encoding
the 10000:10FFFF range. When encoding to UTF-16, invalid code points in
the range D800:DFFF must be detected and (regardless of policy) such
values must not be output. (Editorial note: Interestingly, there is no
reuse of the underutilized top portion of the extended surrogate range
to map back to this missing hole. Doing so would have allowed UTF-8 to
recover 2048 useful values in its shorter encodings (remember, UTF-8, and
Unicode in general, threw out values solely in deference to the limitations
of the UTF-16 encoding). This should not be surprising, as UTF-8 and UTF-16
were developed independently; however, it does appear as though UTF-8
conceded part of its shorter mappings, while UTF-16 did nothing to facilitate
compatibility between them.)
- The surrogates always appear in pairs: one in the range D800:DBFF
followed immediately by one in the range DC00:DFFF. If an erroneous
leading DC00:DFFF value is encountered while decoding UTF-16, the decoder
may attempt to skip the error (policy allowing) by skipping this single
16 bit value. Likewise, if a D800:DBFF value is encountered without a
following DC00:DFFF surrogate, the decoder may attempt to skip the error
(policy allowing) by skipping this single 16 bit value.
- Any encountered value FFFF is invalid. Any encountered FFFE
is invalid if it does not appear at the start of the UTF-16 stream.
- Since the encoding is 16 bit based there is an endianness issue. Ordinary
UTF-16 streams start with the U-FEFF BOM character. If an apparent
FFFE value appears at the start, the decoder should assume that its
endianness is mismatched with the data stream. When the decoder encounters
an endian mismatch, it should reverse the endianness of each 16 bit octet
pair consumed. This endian state should be tracked in the decoding process.
If processing a multi-part UTF-16 stream, the BOM character may be omitted in
parts following the first one; however, the decoder should expect that the
endianness is consistent amongst the parts. (A BOM detection sketch follows
this list.)
A starting BOM character should be output only by a UTF-16 encoder; it has no
meaning in any of the other encodings. All UTF-16 encoders should output the
BOM character at the beginning of the first part of any UTF-16 stream, and
maintain a consistent endianness throughout the encoding process for a single
given stream.
- A UTF-16 stream should always be composed of an even number of octets. If
an odd number of octets is encountered, the last one should be considered an
erroneously encoded element (though multi-part semantics may complicate this;
that is beyond what UTF-16 specifies.)
- UTF-16 has no useful lexical sorting properties relative to the other
Unicode encoding formats.
- There is no special EOF character. In particular control characters in
the ASCII range don't specify anything other than raw code point data values.
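Here is the BOM detection sketch referenced above, in C. The names and the
choice of treating big endian octet pairs as the reader's native order are my
own assumptions for the sake of a self-contained example:

    #include <stdint.h>
    #include <stddef.h>

    /* Reads one 16 bit value from an octet pair, treating big endian pair
       order as this sketch's native order, swapping when the stream's
       endianness was detected as mismatched. */
    static uint16_t read16 (const unsigned char *s, int swap) {
        uint16_t v = (uint16_t) ((s[0] << 8) | s[1]);
        return swap ? (uint16_t) ((v >> 8) | (v << 8)) : v;
    }

    /* Inspects the start of a UTF-16 stream part. Sets *swap when an
       apparent FFFE indicates an endianness mismatch, and returns the
       number of octets to skip (2 if a BOM was present, else 0). */
    static size_t utf16DetectBom (const unsigned char *s, size_t len, int *swap) {
        *swap = 0;
        if (len < 2) return 0;
        if (s[0] == 0xFE && s[1] == 0xFF) return 2;                /* U-FEFF: matching endianness */
        if (s[0] == 0xFF && s[1] == 0xFE) { *swap = 1; return 2; } /* apparent FFFE: mismatched   */
        return 0;                                                  /* no BOM; assume matching     */
    }

Subsequent 16 bit values would then be consumed with read16(s + offset, swap),
keeping the detected endian state for the rest of the stream (and across parts
of a multi-part stream).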
Unicode private areas
In addition to specifying a number of convoluted ways in which code points
can be combined and sized, which direction they flow in visible
representations, how they are collated, and giving each defined character a
name, the Unicode standard specifies ranges for private use.
These ranges are Area-A, which is U-F0000:U-FFFFD, and
Area-B, which is U-100000:U-10FFFD. These 131,068 code points are
meant to be used by developers to encode application specific metadata.
I.e., ordinary text should never include values in these ranges. For example,
if encoders/decoders have a policy of tagging encountered errors rather than
simply halting, they can make a guess as to what character was likely
intended (if it is in a small enough range, which it typically will be), add
the base value U-F0000 to it, and insert that instead. The errors can
then still be dealt with in some subsequent step, after the rest of the
message has been decoded.
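One possible sketch of that tagging policy in C, building on the utf8Decode
sketch given earlier (the policy details here are an illustration, not
something the Unicode standard prescribes):

    /* On a decoding error, tag the suspect octet by mapping it into
       private use Area-A so that later processing steps can still see
       what was in the stream. (Illustrative policy only.) */
    static uint32_t decodeWithTagging (const unsigned char *s, size_t len, size_t *step) {
        uint32_t cp = utf8Decode (s, len, step);
        if (cp == UTF8_ERROR) return 0xF0000UL + s[0]; /* U-F0000 plus the suspect value */
        return cp;
    }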
Unicode
What I have presented above is not the meat of the content of Unicode. These
are merely commonly used transfer formats. If you wanted to, you could invent
your own encoding format, and it would not make any difference to the Unicode
Standard. The real guts of Unicode start from the mappings from code points
to actual glyphs and graphemes. Some graphemes (a generalized notion of a
character) can be encoded as a sequence of code points in multiple ways.
These different encodings of the same grapheme still need to be treated as
though they represent just that one grapheme. The standard also gives
widths, direction, and other character attributes.
Other Resources