What is UTF-8?
UTF-8 is a character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" in filenames, "\" in escape sequences, and "%" in printf.
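As a quick check of the ASCII-compatibility claim, the following sketch uses Python's built-in `str.encode`. The sample strings and variable names are chosen purely for illustration.

```python
# ASCII text encodes to identical bytes under ASCII and UTF-8.
ascii_text = "path/to/file %d"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Non-ASCII code points never produce bytes in the ASCII range (0x00-0x7F),
# so "/" (0x2F), "\" (0x5C) and "%" (0x25) cannot appear inside them by accident.
non_ascii_text = "\u00e9\u20ac\U0001F600"   # e-acute, Euro sign, an emoji
non_ascii_bytes = non_ascii_text.encode("utf-8")
assert all(b >= 0x80 for b in non_ascii_bytes)

print(non_ascii_bytes.hex(" "))   # c3 a9 e2 82 ac f0 9f 98 80
```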
Figure: Usage of the main encodings on the web from 2001 to 2012, as recorded by Google. UTF-8 overtook all others in 2008 and neared 50% of the web in 2012. The ASCII-only figure includes web pages with any declared header if they are restricted to ASCII characters.
UTF-8 has been the dominant character encoding for the World Wide Web since 2009, and as of November 2017 accounts for 90.1% of all Web pages. (The next-most popular multibyte encodings, Shift JIS and GB 2312, have 0.8% and 0.6% respectively.) The Internet Mail Consortium (IMC) recommended that all e-mail programs be able to display and create mail using UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and HTML.
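Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point. The following table shows the structure of the encoding. The x characters are replaced by the bits of the code point. If the number of significant bits is no more than seven, the first line applies; if no more than 11 bits, the second line applies, and so on.

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |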
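The byte layout described here can be sketched as a small encoder. This is a minimal illustration, not a reference implementation: the function name `utf8_encode` is hypothetical, and rejecting the surrogate range U+D800 to U+DFFF is an assumption based on the count of valid code points given earlier.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point using the one-to-four-byte scheme."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption: surrogates are excluded from the valid code points.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the sketch with Python's built-in encoder on a few sample code points.
for cp in (0x24, 0xA2, 0x20AC, 0x10348):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

Each branch corresponds to one byte count: the lead byte carries the length marker plus the top bits, and every continuation byte carries `10` plus the next six bits.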
The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
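The range boundaries and the figure of 1,920 two-byte characters can be checked with a short sketch. It relies only on Python's built-in encoder; the `ranges` list and the sample characters are illustrative choices.

```python
# (first code point, last code point, expected UTF-8 length in bytes)
ranges = [
    (0x0000, 0x007F, 1),
    (0x0080, 0x07FF, 2),
    (0x0800, 0xFFFF, 3),
    (0x10000, 0x10FFFF, 4),
]

# Check the encoded length at both ends of every range.
for first, last, expected_len in ranges:
    for cp in (first, last):
        assert len(chr(cp).encode("utf-8")) == expected_len

two_byte_count = 0x07FF - 0x0080 + 1
print(two_byte_count)   # 1920

# One sample character per length: "A", e-acute, Euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)          # [1, 2, 3, 4]
```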
Some of the important features of this encoding are as follows:
- Backward compatibility: Backwards compatibility with ASCII and the enormous amount of software designed to process ASCII-encoded text was the main driving force behind the design of UTF-8. In UTF-8, single bytes with values in the range of 0 to 127 map directly to Unicode code points in the ASCII range. Single bytes in this range represent characters, as they do in ASCII. Moreover, 7-bit bytes (bytes where the most significant bit is 0) never appear in a multi-byte sequence, and no valid multi-byte sequence decodes to an ASCII code-point. A sequence of 7-bit bytes is both valid ASCII and valid UTF-8, and under either interpretation represents the same sequence of characters. Therefore, the 7-bit bytes in a UTF-8 stream represent all and only the ASCII characters in the stream. Thus, many text processors, parsers, protocols, file formats, text display programs etc., which use ASCII characters for formatting and control purposes will continue to work as intended by treating the UTF-8 byte stream as a sequence of single-byte characters, without decoding the multi-byte sequences. ASCII characters on which the processing turns, such as punctuation, whitespace, and control characters will never be encoded as multi-byte sequences. It is therefore safe for such processors to simply ignore or pass-through the multi-byte sequences, without decoding them. For example, ASCII whitespace may be used to tokenize a UTF-8 stream into words; ASCII line-feeds may be used to split a UTF-8 stream into lines; and ASCII NUL characters can be used to split UTF-8-encoded data into null-terminated strings. Similarly, many format strings used by library functions like "printf" will correctly handle UTF-8-encoded input arguments.
- Fallback and auto-detection: UTF-8 provides backward compatibility for 7-bit ASCII, but much software and data uses 8-bit extended ASCII encodings designed prior to the adoption of Unicode to represent the character sets of European languages. Part of the popularity of UTF-8 is due to the fact that it provides a form of backward compatibility for these as well. A UTF-8 processor which erroneously receives an extended ASCII file as input can "fall back" or replace 8-bit bytes using the appropriate code point in the Unicode Latin-1 Supplement block, when the 8-bit byte appears outside a valid multi-byte sequence. Though it does happen, the 8-bit characters in extended ASCII encodings do not usually have the correct form for UTF-8 multi-byte sequences. This is because the 8-bit bytes which introduce multi-byte sequences in UTF-8 are primarily accented letters (mostly vowels) in the common extended ASCII encodings, and the UTF-8 continuation bytes are punctuation and symbol characters. To appear as a valid UTF-8 multi-byte sequence, a series of 2 to 4 extended ASCII 8-bit characters would have to be an unusual combination of symbols and accented letters. In short, extended ASCII character sequences which look like valid UTF-8 multi-byte sequences are unlikely. Fallback errors will be false negatives, and these will be rare. Moreover, in many applications, such as text display, the consequence of incorrect fallback is usually slight. Only legibility is affected, and not significantly. These two things make fallback feasible, if somewhat imperfect. Indeed, as discussed further below, the HTML5 standard requires that erroneous bytes in supposed UTF-8 data be replaced upon display on the assumption that they are Windows-1252 characters. The presence of invalid 8-bit characters outside valid multi-byte sequences can also be used to "auto-detect" that an encoding is actually an extended ASCII encoding rather than UTF-8, and decode it accordingly. A UTF-8 stream may simply contain errors, resulting in the auto-detection scheme producing false positives; but auto-detection is successful in the majority of cases, especially with longer texts, and is widely used.
- Prefix code: The first byte indicates the number of bytes in the sequence. A reader can therefore decode each sequence as soon as it has been fully received, without waiting for the first byte of the next sequence or an end-of-stream indication. The length of a multi-byte sequence is also easy for humans to determine, as it is simply the number of high-order 1s in the leading byte. If a stream ends mid-sequence, no incorrect character is decoded.
- Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10, while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
- Sorting order: The chosen values of the leading bytes, together with the fact that the continuation bytes carry the high-order bits first, mean that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.
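Several of the features listed above can be illustrated with a short sketch. It assumes well-formed UTF-8 input; the helper names `sequence_length` and `character_start` are hypothetical and not part of any library.

```python
def sequence_length(lead_byte: int) -> int:
    """Prefix code: derive the sequence length from the lead byte alone."""
    if lead_byte < 0x80:
        return 1                      # 0xxxxxxx
    if lead_byte >> 5 == 0b110:
        return 2                      # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3                      # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4                      # 11110xxx
    raise ValueError("not a lead byte")


def character_start(data: bytes, index: int) -> int:
    """Self-synchronization: back up over continuation bytes (10xxxxxx)."""
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


# "a", Euro sign, an emoji: 1 + 3 + 4 bytes.
data = "a\u20ac\U0001F600".encode("utf-8")
assert [sequence_length(data[i]) for i in (0, 1, 4)] == [1, 3, 4]
assert character_start(data, 3) == 1   # last byte of the Euro sign
assert character_start(data, 7) == 4   # last byte of the emoji

# Backward compatibility: splitting on an ASCII space never cuts a character.
tokens = "na\u00efve caf\u00e9".encode("utf-8").split(b" ")
assert [t.decode("utf-8") for t in tokens] == ["na\u00efve", "caf\u00e9"]

# Sorting order: byte-wise order matches code point order.
words = ["z", "\u00e9", "\u20ac", "\U0001F600", "a", "\ufb00"]
by_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]
assert by_bytes == sorted(words)      # Python compares str by code point
```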
Consider the encoding of the Euro sign, €.
- The Unicode code point for "€" is U+20AC.
- According to the scheme table above, this will take three bytes to encode, since it is between U+0800 and U+FFFF.
- Hexadecimal 20AC is binary 0010 0000 1010 1100. The two leading zeros are added because, as the scheme table shows, a three-byte encoding needs exactly sixteen bits from the code point.
- Because the encoding will be three bytes long, its leading byte starts with three 1s, then a 0 (1110...).
- The four most significant bits of the code point are stored in the remaining low order four bits of this byte (1110 0010), leaving 12 bits of the code point yet to be encoded (...0000 1010 1100).
- All continuation bytes contain exactly six bits from the code point. So the next six bits of the code point are stored in the low order six bits of the next byte, and 10 is stored in the high order two bits to mark it as a continuation byte (so 1000 0010).
- Finally the last six bits of the code point are stored in the low order six bits of the final byte, and again 10 is stored in the high order two bits (1010 1100).
The three bytes 1110 0010 1000 0010 1010 1100 can be more concisely written in hexadecimal, as E2 82 AC.
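The same steps can be reproduced with bit operations. The sketch below is illustrative only and covers just this three-byte case; the variable names are arbitrary.

```python
code_point = 0x20AC                                      # U+20AC, the Euro sign

byte1 = 0b1110_0000 | (code_point >> 12)                 # 1110 + top four bits
byte2 = 0b1000_0000 | ((code_point >> 6) & 0b11_1111)    # 10 + next six bits
byte3 = 0b1000_0000 | (code_point & 0b11_1111)           # 10 + last six bits

encoded = bytes([byte1, byte2, byte3])
binary = " ".join(f"{b:08b}" for b in encoded)
print(binary)                      # 11100010 10000010 10101100
print(encoded.hex(" ").upper())    # E2 82 AC

# Cross-check against Python's built-in encoder.
assert encoded == "\u20ac".encode("utf-8")
```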
Since UTF-8 uses groups of six bits, it is sometimes useful to use octal notation which uses 3-bit groups. With a calculator which can convert between hexadecimal and octal it can be easier to manually create or interpret UTF-8 compared with using binary.
- Octal 0200–3777 (hex 80–7FF) shall be coded with two bytes. xxyy will be 3xx 2yy.
- Octal 4000–177777 (hex 800–FFFF) shall be coded with three bytes. xxyyzz will be (340+xx) 2yy 2zz.
- Octal 200000–4177777 (hex 10000–10FFFF) shall be coded with four bytes. wxxyyzz will be 36w 2xx 2yy 2zz.
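As a worked instance of the three-byte octal rule, the following sketch applies it to the Euro sign (U+20AC, octal 020254). It is illustrative only and valid only for code points in the three-byte range.

```python
code_point = 0x20AC
digits = f"{code_point:06o}"     # six octal digits xxyyzz: "020254"

# Split into the pairs xx, yy, zz and read each pair as an octal number.
xx, yy, zz = (int(digits[i:i + 2], 8) for i in (0, 2, 4))

# Three-byte rule: (340+xx) 2yy 2zz, all in octal.
encoded = bytes([0o340 + xx, 0o200 + yy, 0o200 + zz])
octal_bytes = " ".join(f"{b:03o}" for b in encoded)
print(octal_bytes)               # 342 202 254

# Cross-check against Python's built-in encoder.
assert encoded == chr(code_point).encode("utf-8")
```

Octal 342 202 254 is the same byte sequence as hexadecimal E2 82 AC.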