UTF(6) UTF(6)
NAME
UTF, Unicode, ASCII, rune - character set and format
DESCRIPTION
The Inferno character set and representation are based on
the Unicode Standard and on the ISO multibyte UTF-8 encoding
(Universal Character Set Transformation Format, 8 bits
wide). The Unicode Standard represents its characters in 21
bits; UTF-8 represents such values in an 8-bit byte stream.
Throughout this manual, UTF-8 is shortened to UTF.
Internally, programs store individual Unicode characters as
32-bit integers, of which only 21 bits are currently used.
Documentation often refers to them as `runes', following
Plan 9. However, any external manifestation of textual
information, in files or at the interface between programs,
uses the machine-independent, byte-stream encoding called
UTF.
UTF is designed so that the 7-bit ASCII set (values hexadecimal
00 to 7F) appears only as itself in the encoding. Characters
with values above 7F appear as sequences of two or
more bytes with values only from 80 to FF.
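As a small illustration of this property, the following Python sketch encodes a few characters and checks the two claims above. Python's built-in UTF-8 codec stands in for Inferno's here, and the sample characters are an arbitrary choice; none of this is part of Inferno itself.

```python
# Illustration only: Python's UTF-8 codec is used as a stand-in for Inferno's.
text = "a\u00e9\u20ac"  # U+0061 (ASCII), U+00E9 and U+20AC (both above 7F)

# Map each character to its UTF byte sequence.
encoded = {ch: ch.encode("utf-8") for ch in text}

# ASCII characters appear as themselves: one byte with the same value.
ascii_ok = all(
    len(b) == 1 and b[0] == ord(ch)
    for ch, b in encoded.items()
    if ord(ch) <= 0x7F
)

# Characters above 7F use two or more bytes, each in the range 80 to FF.
high_ok = all(
    len(b) >= 2 and all(0x80 <= byte <= 0xFF for byte in b)
    for ch, b in encoded.items()
    if ord(ch) > 0x7F
)

for ch, b in encoded.items():
    print("U+%04X -> %s" % (ord(ch), " ".join("%02X" % byte for byte in b)))
```

Because no byte of a multibyte sequence falls in the range 00 to 7F, a program scanning for an ASCII delimiter such as newline or slash cannot mistake part of a longer sequence for it.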
The UTF encoding of the Unicode Standard is backward compatible
with ASCII: programs presented only with ASCII work on
Inferno even if not written to deal with UTF, as do programs
that deal with uninterpreted byte streams. However, programs
that perform semantic processing on characters must
convert from UTF to runes in order to work properly with
non-ASCII input. Normally, all necessary conversions are
done by the Limbo compiler and execution environment when
converting between array of byte and string, but sometimes
more is needed, such as when a program receives UTF input
one byte at a time; see sys-byte2char(2) for routines to
handle such processing.
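The byte-at-a-time case can be sketched outside Inferno as follows. This is not the sys-byte2char(2) interface; it is a Python analogue built on the standard codecs module, and the function name runes_from_byte_stream is hypothetical. The point it illustrates is that a decoder fed single bytes must hold incomplete sequences until the remaining bytes arrive.

```python
import codecs


def runes_from_byte_stream(byte_iter):
    """Yield runes (integers) from an iterable of single byte values.

    Hypothetical helper for illustration; not an Inferno routine.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for value in byte_iter:
        # The decoder buffers an incomplete sequence and returns an
        # empty string until a whole character has been received.
        for ch in decoder.decode(bytes([value])):
            yield ord(ch)
    # Flush: a sequence cut off by end of input is reported as U+FFFD.
    for ch in decoder.decode(b"", final=True):
        yield ord(ch)


stream = "h\u00e9\u20ac".encode("utf-8")  # bytes 68 C3 A9 E2 82 AC
runes = list(runes_from_byte_stream(stream))
print(["%04X" % r for r in runes])
```

Iterating over a Python bytes object yields one integer per byte, so the example feeds the decoder exactly one byte per call.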
Letting numbers be binary, a rune x is converted to a multibyte
UTF sequence as follows:
01. x in [00000.00000000.0bbbbbbb] → 0bbbbbbb
10. x in [00000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
11. x in [00000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
100. x in [bbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
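The four conversions can be sketched in Python as shown here. The function name rune_to_utf is hypothetical, and the four-byte form uses the lead byte 11110bbb of standard UTF-8. The sketch covers the full 21-bit range and, as a simplification, does not reject surrogate values or values above hexadecimal 10FFFF.

```python
def rune_to_utf(x):
    """Encode one rune (an integer) as UTF bytes.

    Hypothetical helper for illustration; follows conversions 01 to 100.
    """
    if x < 0 or x > 0x1FFFFF:
        raise ValueError("rune does not fit in 21 bits")
    if x <= 0x7F:
        # 01: 7 bits -> 0bbbbbbb
        return bytes([x])
    if x <= 0x7FF:
        # 10: 11 bits -> 110bbbbb, 10bbbbbb
        return bytes([0xC0 | (x >> 6), 0x80 | (x & 0x3F)])
    if x <= 0xFFFF:
        # 11: 16 bits -> 1110bbbb, 10bbbbbb, 10bbbbbb
        return bytes([
            0xE0 | (x >> 12),
            0x80 | ((x >> 6) & 0x3F),
            0x80 | (x & 0x3F),
        ])
    # 100: 21 bits -> 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
    return bytes([
        0xF0 | (x >> 18),
        0x80 | ((x >> 12) & 0x3F),
        0x80 | ((x >> 6) & 0x3F),
        0x80 | (x & 0x3F),
    ])


for rune in (0x41, 0xE9, 0x20AC, 0x1F600):
    print("%06X -> %s" % (rune, " ".join("%02X" % b for b in rune_to_utf(rune))))
```

Testing the ranges in increasing order means each rune takes the first form that fits, which is also the shortest one.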
Conversion 01 provides a one-byte sequence that spans the
ASCII character set in a compatible way. Conversions 10, 11
and 100 represent higher-valued characters as sequences of
two, three or four bytes with the high bit set. Inferno
does not support the 5- and 6-byte sequences proposed by X-Open.
When there are multiple ways to encode a value, for
example rune 0, the shortest encoding is used.
In the inverse mapping, any sequence except those described
above is incorrect and is converted to the rune hexadecimal
FFFD.
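A minimal Python sketch of the inverse mapping follows. The names utf_to_runes, RUNE_ERROR and _FORMS are hypothetical. The text above does not say how many bytes an incorrect sequence consumes; this sketch assumes one byte is skipped per error, which is only one possible policy. Like the forward mapping it covers 21 bits and does not check for surrogates or values above hexadecimal 10FFFF.

```python
RUNE_ERROR = 0xFFFD  # rune produced for any incorrect sequence

# (lead-byte mask, lead-byte pattern, continuation bytes, smallest rune allowed)
_FORMS = [
    (0x80, 0x00, 0, 0x0),
    (0xE0, 0xC0, 1, 0x80),
    (0xF0, 0xE0, 2, 0x800),
    (0xF8, 0xF0, 3, 0x10000),
]


def utf_to_runes(data):
    """Decode UTF bytes to a list of runes; incorrect input yields FFFD.

    Assumption: each error consumes exactly one byte.
    """
    runes = []
    i = 0
    while i < len(data):
        lead = data[i]
        rune, width = RUNE_ERROR, 1
        for mask, pattern, ncont, smallest in _FORMS:
            if (lead & mask) != pattern:
                continue
            tail = data[i + 1 : i + 1 + ncont]
            if len(tail) == ncont and all((b & 0xC0) == 0x80 for b in tail):
                value = lead & ~mask & 0xFF
                for b in tail:
                    value = (value << 6) | (b & 0x3F)
                # Reject encodings longer than the shortest form.
                if value >= smallest:
                    rune, width = value, 1 + ncont
            break
        runes.append(rune)
        i += width
    return runes


print(["%04X" % r for r in utf_to_runes(b"A\xc3\xa9\xc0\x80")])
```

The two-byte sequence C0 80 is a longer-than-necessary encoding of rune 0, so it is rejected here, as are lead bytes that would begin a 5 or 6 byte sequence.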
FILES
/lib/unicode table of characters and descriptions, suitable
for look(1).
SEE ALSO
ascii(1), tcs(1), sys-byte2char(2), keyboard(6), The Unicode
Standard.
Page 2 Plan 9 (printed 10/28/25)