UTF(6)                                                          UTF(6)

NAME
     UTF, Unicode, ASCII, rune - character set and format

DESCRIPTION
     The Inferno character set and representation are based on the
     Unicode Standard and on the ISO multibyte UTF-8 encoding
     (Universal Character Set Transformation Format, 8 bits wide).
     The Unicode Standard represents its characters in 21 bits;
     UTF-8 represents such values in an 8-bit byte stream.
     Throughout this manual, UTF-8 is shortened to UTF.

     Internally, programs store individual Unicode characters as
     32-bit integers, of which only 21 bits are currently used.
     Documentation often refers to them as `runes', following
     Plan 9. However, any external manifestation of textual
     information, in files or at the interface between programs,
     uses the machine-independent, byte-stream encoding called UTF.

     UTF is designed so that the characters of the 7-bit ASCII set
     (hexadecimal values 00 to 7F) appear only as themselves in the
     encoding. Characters with values above 7F appear as sequences
     of two or more bytes with values only from 80 to FF.

     The UTF encoding of the Unicode Standard is backward
     compatible with ASCII: programs presented only with ASCII work
     on Inferno even if not written to deal with UTF, as do
     programs that deal with uninterpreted byte streams. However,
     programs that perform semantic processing on characters must
     convert from UTF to runes in order to work properly with
     non-ASCII input. Normally, all necessary conversions are done
     by the Limbo compiler and execution environment when
     converting between array of byte and string, but sometimes
     more is needed, such as when a program receives UTF input one
     byte at a time; see sys-byte2char(2) for routines to handle
     such processing.

     Letting numbers be binary, a rune x is converted to a
     multibyte UTF sequence as follows:

     01.  x in [000000.00000000.0bbbbbbb] → 0bbbbbbb
     10.  x in [000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
     11.  x in [000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
     100. x in [0bbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb

     Conversion 01 provides a one-byte sequence that spans the
     ASCII character set in a compatible way. Conversions 10, 11
     and 100 represent higher-valued characters as sequences of
     two, three or four bytes with the high bit set. Inferno does
     not support the 5 and 6 byte sequences proposed by X-Open.
     When there are multiple ways to encode a value, for example
     rune 0, the shortest encoding is used.

     In the inverse mapping, any sequence except those described
     above is incorrect and is converted to the rune hexadecimal
     FFFD.

FILES
     /lib/unicode   table of characters and descriptions, suitable
                    for look(1).

SEE ALSO
     ascii(1), tcs(1), sys-byte2char(2), keyboard(6), The Unicode
     Standard.

Plan 9 (printed 11/23/24)
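
EXAMPLE
     The four conversions and the inverse mapping described under
     DESCRIPTION can be sketched as follows. This is an
     illustration only, written in Python rather than Limbo; it is
     not the Inferno implementation. The names rune_to_utf,
     utf_to_rune and ERROR_RUNE are hypothetical. Two details are
     assumptions not stated in this page: a value outside the
     21-bit range is encoded as FFFD, and an incorrect sequence
     consumes exactly one byte of input. The sketch follows the
     bit patterns literally and makes no further validity checks
     on the rune value.

```python
ERROR_RUNE = 0xFFFD  # rune substituted for any incorrect sequence


def rune_to_utf(x):
    """Encode one rune (an integer) as UTF, always in the shortest form."""
    if x < 0 or x > 0x1FFFFF:
        x = ERROR_RUNE  # assumption: values wider than 21 bits become FFFD
    if x <= 0x7F:  # conversion 01: 7 bits, one byte
        return bytes([x])
    if x <= 0x7FF:  # conversion 10: 11 bits, two bytes
        return bytes([0xC0 | (x >> 6), 0x80 | (x & 0x3F)])
    if x <= 0xFFFF:  # conversion 11: 16 bits, three bytes
        return bytes([0xE0 | (x >> 12),
                      0x80 | ((x >> 6) & 0x3F),
                      0x80 | (x & 0x3F)])
    # conversion 100: 21 bits, four bytes
    return bytes([0xF0 | (x >> 18),
                  0x80 | ((x >> 12) & 0x3F),
                  0x80 | ((x >> 6) & 0x3F),
                  0x80 | (x & 0x3F)])


def utf_to_rune(b, i=0):
    """Decode one rune from bytes b at index i.

    Returns (rune, number of bytes consumed). Any sequence not produced
    by rune_to_utf yields (ERROR_RUNE, 1); consuming one byte on error
    is an assumption of this sketch.
    """
    c = b[i]
    if c < 0x80:  # conversion 01
        return c, 1
    if 0xC0 <= c < 0xE0:  # lead byte 110bbbbb
        n, x, smallest = 2, c & 0x1F, 0x80
    elif 0xE0 <= c < 0xF0:  # lead byte 1110bbbb
        n, x, smallest = 3, c & 0x0F, 0x800
    elif 0xF0 <= c < 0xF8:  # lead byte 11110bbb
        n, x, smallest = 4, c & 0x07, 0x10000
    else:
        # stray continuation byte, or a 5 or 6 byte lead (unsupported)
        return ERROR_RUNE, 1
    if i + n > len(b):  # sequence cut short
        return ERROR_RUNE, 1
    for k in range(1, n):
        t = b[i + k]
        if t & 0xC0 != 0x80:  # continuation bytes must be 10bbbbbb
            return ERROR_RUNE, 1
        x = (x << 6) | (t & 0x3F)
    if x < smallest:  # a shorter encoding exists, so this one is incorrect
        return ERROR_RUNE, 1
    return x, n
```

     For instance, rune_to_utf(0x20AC) gives the three bytes E2 82
     AC, and utf_to_rune applied to those bytes gives back 20AC
     with a length of 3. The two-byte sequence C0 80 is a longer
     encoding of rune 0 and is therefore rejected as FFFD.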