UTF-1
UTF-1 is a method of transforming ISO 10646/Unicode into a stream of bytes. Its design is not self-synchronising: a decoder that starts in the middle of a character cannot reliably find the next character boundary, which makes truncation hard, among other things. Simple byte-oriented search routines cannot be used reliably with it either. UTF-1 is also fairly slow, because encoding and decoding require division and multiplication by 190, a number that is not a power of 2. Due to these issues, UTF-1 never gained wide acceptance and has been replaced by UTF-8.
Design
UTF-1 is a multi-byte encoding like UTF-8; a single Unicode code point can be encoded in one, two, three, or five octets. As in UTF-8, the ASCII range is encoded as one octet. However, the ASCII octets 0x21–0x7E (decimal 33–126) also appear as trailing octets of UTF-1 multi-byte sequences; therefore UTF-1 is unsuited for many Internet protocols, including MIME.
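The reuse of ASCII octets inside multi-byte sequences can be illustrated with a short Python sketch. The UTF-1 bytes for U+0100 are copied from the comparison table in this article; the variable names are illustrative only.

```python
# UTF-1 bytes for U+0100, copied from the comparison table (A1 21).
utf1_sample = bytes.fromhex("A1 21")
# The same code point in UTF-8 (C4 80), produced by Python's built-in codec.
utf8_sample = "\u0100".encode("utf-8")

needle = b"!"  # ASCII exclamation mark, octet 0x21

# A naive byte search reports a match in the UTF-1 data even though the
# text contains no "!": 0x21 is only the trailing octet of U+0100.
utf1_false_positive = needle in utf1_sample
# In UTF-8, every octet of a multi-byte sequence is >= 0x80, so no match.
utf8_false_positive = needle in utf8_sample

print(utf1_false_positive, utf8_false_positive)
```

A protocol that scans for ASCII delimiters at the byte level would therefore misinterpret parts of UTF-1 multi-byte characters as delimiters.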
UTF-1 does not use the C0 and C1 control codes, the space character, or DEL inside multi-byte encodings: any 0x00–0x20 or 0x7F–0x9F octet always stands for the corresponding code point in ISO-8859-1 (U+0000–0020 and U+007F–009F, respectively). This design, with its 66 protected octets, was intended to be compatible with ISO 2022.
The UTF-1 encoding scheme uses "modulo 190" arithmetic (256 - 66 = 190); it was designed to encode the complete 31 bits of the original Universal Character Set (UCS-4). For comparison, UTF-8 protects all 128 ASCII octets, and needs two bits in trailing bytes of multi-byte encodings for this purpose, resulting in "modulo 64" arithmetic (8 - 2 = 6; 2^6 = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 - 13 = 243).
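The "modulo 190" arithmetic can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the range boundaries (0xA0, 0x100, 0x4016, 0x38E2E) and lead octets (0xA0, 0xA1, 0xF6, 0xFC) are taken from the points where the sequence length changes in the comparison table in this article, and the function names `utf1_encode_cp` and `_digit_to_octet` are hypothetical.

```python
BASE = 190  # 256 octets minus the 66 protected ones


def _digit_to_octet(z: int) -> int:
    """Map a base-190 digit (0..189) to one of the 190 unprotected octets."""
    if z < 0x5E:
        return z + 0x21  # digits 0..93  -> octets 0x21..0x7E
    return z + 0x42      # digits 94..189 -> octets 0xA0..0xFF


def utf1_encode_cp(cp: int) -> bytes:
    """Encode a single code point (0..0x7FFFFFFF) as UTF-1 octets."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of the 31-bit UCS-4 range")
    if cp < 0xA0:                      # one octet, identical to ISO-8859-1
        return bytes([cp])
    if cp < 0x100:                     # two octets: A0 followed by the value
        return bytes([0xA0, cp])
    if cp < 0x4016:                    # two octets, one base-190 digit
        y = cp - 0x100
        return bytes([0xA1 + y // BASE, _digit_to_octet(y % BASE)])
    if cp < 0x38E2E:                   # three octets, two base-190 digits
        y = cp - 0x4016
        return bytes([
            0xF6 + y // BASE**2,
            _digit_to_octet(y // BASE % BASE),
            _digit_to_octet(y % BASE),
        ])
    y = cp - 0x38E2E                   # five octets, four base-190 digits
    return bytes([
        0xFC + y // BASE**4,
        _digit_to_octet(y // BASE**3 % BASE),
        _digit_to_octet(y // BASE**2 % BASE),
        _digit_to_octet(y // BASE % BASE),
        _digit_to_octet(y % BASE),
    ])


print(utf1_encode_cp(0xFEFF).hex(" ").upper())
```

Each step divides by 190 or a power of 190, which is the source of the performance cost compared with the shifts and masks that suffice for UTF-8.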
Code point | UTF-16BE | UTF-16LE | UTF-8 | UTF-1 |
---|---|---|---|---|
U+007F | 00 7F | 7F 00 | 7F | 7F |
U+0080 | 00 80 | 80 00 | C2 80 | 80 |
U+009F | 00 9F | 9F 00 | C2 9F | 9F |
U+00A0 | 00 A0 | A0 00 | C2 A0 | A0 A0 |
U+00BF | 00 BF | BF 00 | C2 BF | A0 BF |
U+00C0 | 00 C0 | C0 00 | C3 80 | A0 C0 |
U+00FF | 00 FF | FF 00 | C3 BF | A0 FF |
U+0100 | 01 00 | 00 01 | C4 80 | A1 21 |
U+015D | 01 5D | 5D 01 | C5 9D | A1 7E |
U+015E | 01 5E | 5E 01 | C5 9E | A1 A0 |
U+01BD | 01 BD | BD 01 | C6 BD | A1 FF |
U+01BE | 01 BE | BE 01 | C6 BE | A2 21 |
U+07FF | 07 FF | FF 07 | DF BF | AA 72 |
U+0800 | 08 00 | 00 08 | E0 A0 80 | AA 73 |
U+0FFF | 0F FF | FF 0F | E0 BF BF | B5 48 |
U+1000 | 10 00 | 00 10 | E1 80 80 | B5 49 |
U+4015 | 40 15 | 15 40 | E4 80 95 | F5 FF |
U+4016 | 40 16 | 16 40 | E4 80 96 | F6 21 21 |
U+D7FF | D7 FF | FF D7 | ED 9F BF | F7 2F C3 |
U+E000 | E0 00 | 00 E0 | EE 80 80 | F7 3A 79 |
U+F8FF | F8 FF | FF F8 | EF A3 BF | F7 5C 3C |
U+FDD0 | FD D0 | D0 FD | EF B7 90 | F7 62 BA |
U+FDEF | FD EF | EF FD | EF B7 AF | F7 62 D9 |
U+FEFF | FE FF | FF FE | EF BB BF | F7 64 4C |
U+FFFD | FF FD | FD FF | EF BF BD | F7 65 AD |
U+FFFE | FF FE | FE FF | EF BF BE | F7 65 AE |
U+FFFF | FF FF | FF FF | EF BF BF | F7 65 AF |
U+10000 | D8 00 DC 00 | 00 D8 00 DC | F0 90 80 80 | F7 65 B0 |
U+38E2D | D8 A3 DE 2D | A3 D8 2D DE | F0 B8 B8 AD | FB FF FF |
U+38E2E | D8 A3 DE 2E | A3 D8 2E DE | F0 B8 B8 AE | FC 21 21 21 21 |
U+FFFFF | DB BF DF FF | BF DB FF DF | F3 BF BF BF | FC 21 37 B2 7A |
U+100000 | DB C0 DC 00 | C0 DB 00 DC | F4 80 80 80 | FC 21 37 B2 7B |
U+10FFFF | DB FF DF FF | FF DB FF DF | F4 8F BF BF | FC 21 39 6E 6C |
U+7FFFFFFF | - | - | FD BF BF BF BF BF | FD BD 2B B9 40 |
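Decoding reverses the same arithmetic. The sketch below is an illustration under the assumption that the lead-octet ranges are those visible in the table above (0xA0, 0xA1–0xF5, 0xF6–0xFB, 0xFC and up); `utf1_decode` and `_octet_to_digit` are hypothetical names, and the error handling is deliberately minimal.

```python
BASE = 190  # number of unprotected octets


def _octet_to_digit(b: int) -> int:
    """Map a trailing octet back to its base-190 digit."""
    if 0x21 <= b <= 0x7E:
        return b - 0x21
    if b >= 0xA0:
        return b - 0x42
    raise ValueError(f"protected octet {b:#04x} inside a multi-byte sequence")


def utf1_decode(data: bytes) -> list:
    """Decode UTF-1 octets into a list of integer code points."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0xA0:                       # single octet
            out.append(lead)
            i += 1
            continue
        if lead == 0xA0:                      # A0 xx -> U+00A0..U+00FF
            length, first_cp, lead_base = 2, None, None
        elif lead < 0xF6:
            length, first_cp, lead_base = 2, 0x100, 0xA1
        elif lead < 0xFC:
            length, first_cp, lead_base = 3, 0x4016, 0xF6
        else:
            length, first_cp, lead_base = 5, 0x38E2E, 0xFC
        if i + length > len(data):
            raise ValueError("truncated multi-byte sequence")
        trail = data[i + 1:i + length]
        if first_cp is None:
            if trail[0] < 0xA0:
                raise ValueError("invalid octet after lead octet A0")
            cp = trail[0]
        else:
            value = lead - lead_base
            for b in trail:                   # accumulate base-190 digits
                value = value * BASE + _octet_to_digit(b)
            cp = first_cp + value
            if cp > 0x7FFFFFFF:
                raise ValueError("value outside the 31-bit range")
        out.append(cp)
        i += length
    return out


# Full sequence for U+FEFF, then the same data with the lead octet cut off.
print([hex(cp) for cp in utf1_decode(bytes.fromhex("F7 64 4C"))])
print([hex(cp) for cp in utf1_decode(bytes.fromhex("64 4C"))])
```

The second call shows the resynchronisation problem: when the lead octet F7 is lost, the remaining octets 64 4C decode without error as the two ASCII letters "d" and "L", so a decoder cannot tell that it started in the middle of a character.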
References
- ISO/IEC JTC 1/SC2/WG2 (1993-01-21). "ISO IR 178: UCS Transformation Format One (UTF-1)" (PDF, 256 KB) (1 ed.). Registration number 178. Archived from the original on 2015-03-18.
- Czyborra, Roman (1998-11-30). "Unicode Transformation Formats: UTF-8 & Co.". Archived from the original on 2016-06-07. Retrieved 2016-06-07.