UTF(6)                                                     UTF(6)

     NAME
          UTF, Unicode, ASCII, rune - character set and format

     DESCRIPTION
          The Inferno character set and representation are based on
          the Unicode Standard and on the ISO multibyte UTF-8 encoding
          (Universal Character Set Transformation Format, 8 bits
          wide).  The Unicode Standard represents its characters in 21
          bits; UTF-8 represents such values in an 8-bit byte stream.
          Throughout this manual, UTF-8 is shortened to UTF.

          Internally, programs store individual Unicode characters as
          32-bit integers, of which only 21 bits are currently used.
          Documentation often refers to them as `runes', following
          Plan 9.  However, any external manifestation of textual
          information, in files or at the interface between programs,
          uses the machine-independent, byte-stream encoding called
          UTF.

          UTF is designed so that the 7-bit ASCII set (values
          hexadecimal 00 to 7F) appears only as itself in the
          encoding.  Characters with values above 7F appear as
          sequences of two or more bytes with values only from 80 to
          FF.
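
          The property can be checked with a small sketch.  It is
          written in Python rather than Limbo, and the helper name
          utf_hex is hypothetical; it relies only on Python's built-in
          UTF-8 codec.

```python
def utf_hex(rune: int) -> str:
    """Return the UTF-8 bytes of one code point as upper-case hex."""
    return chr(rune).encode("utf-8").hex(" ").upper()


print(utf_hex(0x61))    # 61        (ASCII 'a' is a single byte, itself)
print(utf_hex(0xE9))    # C3 A9     (two bytes, both in 80..FF)
print(utf_hex(0x20AC))  # E2 82 AC  (three bytes, all in 80..FF)
```

          No byte of a multibyte sequence falls in the range 00 to 7F,
          so a byte-oriented search for an ASCII character cannot
          match the middle of another character.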

          The UTF encoding of the Unicode Standard is backward compat-
          ible with ASCII: programs presented only with ASCII work on
          Inferno even if not written to deal with UTF, as do programs
          that deal with uninterpreted byte streams.  However, pro-
          grams that perform semantic processing on characters must
          convert from UTF to runes in order to work properly with
          non-ASCII input.  Normally, all necessary conversions are
          done by the Limbo compiler and execution environment when
          converting between array of byte and string, but sometimes
          more is needed, such as when a program receives UTF input
          one byte at a time; see sys-byte2char(2) for routines to
          handle such processing.
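
          The byte-at-a-time case can be illustrated outside Limbo.
          The following Python sketch is not the sys-byte2char(2)
          interface; it uses Python's standard incremental UTF-8
          decoder, and decode_stream is a hypothetical name.  The
          decoder holds back an incomplete sequence until the
          remaining bytes arrive.

```python
import codecs


def decode_stream(chunks):
    """Yield text as byte chunks arrive, buffering partial sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush: a sequence still incomplete at end of input is replaced.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


data = "h\u00e9llo \u20ac".encode("utf-8")
one_byte_at_a_time = (data[i:i + 1] for i in range(len(data)))
pieces = list(decode_stream(one_byte_at_a_time))
```

          Here pieces holds one string per complete character, even
          though the two-byte and three-byte characters arrived split
          across several reads.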

          Letting numbers be binary, a rune x is converted to a multi-
          byte UTF sequence as follows:

          01.   x in [000000.00000000.0bbbbbbb] → 0bbbbbbb
          10.   x in [000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
          11.   x in [000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb,
                10bbbbbb
          100.  x in [0bbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb,
                10bbbbbb, 10bbbbbb
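
          The four conversions can be sketched with shifts and masks.
          The Python function below is an illustration, not Inferno's
          implementation; the name rune_to_utf and the upper limit of
          hexadecimal 10FFFF (the top of the Unicode range) are
          assumptions.  The four-byte form uses the UTF-8 lead byte
          11110bbb.

```python
def rune_to_utf(x: int) -> bytes:
    """Encode one rune as UTF, always choosing the shortest form."""
    if x < 0 or x > 0x10FFFF:          # assumed limit: Unicode range
        raise ValueError("rune out of range")
    if x < 0x80:                       # 01:  0bbbbbbb
        return bytes([x])
    if x < 0x800:                      # 10:  110bbbbb 10bbbbbb
        return bytes([0xC0 | (x >> 6),
                      0x80 | (x & 0x3F)])
    if x < 0x10000:                    # 11:  1110bbbb 10bbbbbb 10bbbbbb
        return bytes([0xE0 | (x >> 12),
                      0x80 | ((x >> 6) & 0x3F),
                      0x80 | (x & 0x3F)])
    return bytes([0xF0 | (x >> 18),    # 100: 11110bbb and three 10bbbbbb
                  0x80 | ((x >> 12) & 0x3F),
                  0x80 | ((x >> 6) & 0x3F),
                  0x80 | (x & 0x3F)])


print(rune_to_utf(0x20AC).hex())   # e282ac
print(rune_to_utf(0x1F600).hex())  # f09f9880
```

          Testing the ranges from smallest to largest is what selects
          the shortest sequence for each value.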

          Conversion 01 provides a one-byte sequence that spans the
          ASCII character set in a compatible way.  Conversions 10, 11
          and 100 represent higher-valued characters as sequences of
          two, three or four bytes with the high bit set.  Inferno
          does not support the 5 and 6 byte sequences proposed by X-
          Open.  When there are multiple ways to encode a value, for
          example rune 0, the shortest encoding is used.

          In the inverse mapping, any sequence except those described
          above is incorrect and is converted to the rune hexadecimal
          FFFD.
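
          The inverse mapping can be sketched in the same way.  This
          Python illustration is not Inferno's decoder: the name
          utf_to_runes is hypothetical, and the choice of how many
          bytes to skip after an error is an assumption, since the
          text above does not specify it.  Encodings longer than the
          shortest form are treated as incorrect.  Surrogate values
          are not checked here.

```python
BAD = 0xFFFD  # rune substituted for any incorrect sequence


def utf_to_runes(data: bytes) -> list[int]:
    """Decode UTF bytes to runes, mapping bad sequences to FFFD."""
    runes = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            need, x, lowest = 0, b, 0
        elif 0xC0 <= b < 0xE0:
            need, x, lowest = 1, b & 0x1F, 0x80
        elif 0xE0 <= b < 0xF0:
            need, x, lowest = 2, b & 0x0F, 0x800
        elif 0xF0 <= b < 0xF8:
            need, x, lowest = 3, b & 0x07, 0x10000
        else:
            # Stray continuation byte or unsupported lead byte.
            runes.append(BAD)
            i += 1
            continue
        tail = data[i + 1:i + 1 + need]
        if len(tail) < need or any((t & 0xC0) != 0x80 for t in tail):
            # Truncated or malformed: skip one byte (assumed policy).
            runes.append(BAD)
            i += 1
            continue
        for t in tail:
            x = (x << 6) | (t & 0x3F)
        # Reject values that a shorter sequence could have encoded,
        # and values beyond the Unicode range.
        runes.append(x if lowest <= x <= 0x10FFFF else BAD)
        i += 1 + need
    return runes


print([hex(r) for r in utf_to_runes(b"A\xe2\x82\xac")])  # ['0x41', '0x20ac']
print([hex(r) for r in utf_to_runes(b"\xc0\x80")])       # ['0xfffd']
```

          The second call shows a two-byte encoding of rune 0, which
          is rejected because the one-byte form exists.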

     FILES
          /lib/unicode   table of characters and descriptions, suit-
                         able for look(1).

     SEE ALSO
          ascii(1), tcs(1), sys-byte2char(2), keyboard(6), The Unicode
          Standard.

     Page 2                       Plan 9            (printed 11/23/24)