RUNECOMP(2) RUNECOMP(2)
NAME
norminit, normpull, runecomp, runedecomp, runegbreak,
runewbreak, utfcomp, utfdecomp, utfgbreak, utfwbreak -
multi-rune graphemes
SYNOPSIS
#include <u.h>
#include <libc.h>
typedef struct Norm Norm;
struct Norm {
... /* internals */
};
void norminit(Norm *n, int comp, void *ctx, long (*getrune)(void *ctx));
long normpull(Norm *n, Rune *dst, long max, int flush);
long runecomp(Rune *dst, long ndst, Rune *src, long nsrc);
long runedecomp(Rune *dst, long ndst, Rune *src, long nsrc);
long utfcomp(char *dst, long ndst, char *src, long nsrc);
long utfdecomp(char *dst, long ndst, char *src, long nsrc);
Rune* runegbreak(Rune *s);
Rune* runewbreak(Rune *s);
char* utfgbreak(char *s);
char* utfwbreak(char *s);
DESCRIPTION
These routines handle Unicode® abstract characters that span
more than one codepoint. Normalization can be used to turn
all codepoints into a consistent representation. This may be
useful if a specific protocol requires normalization, or if
the program is interested in semantically comparing irregu-
lar input.
The Norm structure is the core structure for all normaliza-
tion routines. Norminit initializes the structure. If the
comp argument is non-zero, the output will be normalized to
NFC (precomposed runes), otherwise it will be normalized to
NFD (decomposed runes). The getrune argument provides the
input for normalization, with each call returning the next
rune of input, and -1 on EOF. The ctx argument is stored
and passed on to the getrune function in every call.
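As an illustration of the getrune contract described above, the following is a minimal sketch of a callback that feeds runes from an in-memory array. The names RuneSrc and srcgetrune are hypothetical and not part of the library, and the Rune typedef is a stand-in so the sketch compiles outside Plan 9; on Plan 9 one would include <u.h> and <libc.h> instead.

```c
/* Stand-in for the Rune type from <libc.h>; drop this on Plan 9. */
typedef unsigned int Rune;

/* Hypothetical context: a cursor over an in-memory rune array. */
typedef struct RuneSrc RuneSrc;
struct RuneSrc {
	const Rune *p;	/* next rune to hand out */
	const Rune *e;	/* one past the last rune */
};

/*
 * Has the shape norminit expects for getrune:
 * returns the next rune of input, or -1 on EOF.
 * It keeps returning -1 once the input is exhausted.
 */
long
srcgetrune(void *ctx)
{
	RuneSrc *s = ctx;

	if(s->p >= s->e)
		return -1;
	return (long)*s->p++;
}

/*
 * On Plan 9 this would presumably be wired up as:
 *	Norm n;
 *	RuneSrc src = { buf, buf + nbuf };
 *	norminit(&n, 1, &src, srcgetrune);	(1 selects NFC)
 */
```

Keeping the cursor in ctx rather than in a global lets several Norm structures read from different inputs at the same time.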
Normpull provides the normalized output, writing at most max
elements into dst. To implement normalization the Norm
structure must buffer input until it knows that the context
for a given base rune is complete. In order to accommodate
callers which only have chunks of data to normalize at a
time, the Norm structure maintains runes within its buffer
even when getrune returns an EOF. The flush argument to
normpull changes this behavior, and will instead flush out
all runes within the structure's buffer when it receives an
EOF from getrune. The return value of normpull is the number
of runes written to the output. Normpull does not
null-terminate the output string; however, null bytes are
passed through untouched, so if the input is null
terminated, so is the output.
Runecomp, runedecomp, utfcomp, and utfdecomp are abstractions
on top of the Norm structure. They are designed to
normalize fixed-sized input in one go. In all functions src
and dst specify the source and destination strings respec-
tively. The nsrc and ndst arguments specify the number of
elements to process. Functions will never read more than
the specified input, and will never write more than the
specified output. If there is not enough room in the output
buffer, the result is truncated. The return value is like-
wise the number of elements written to the output string.
Like normpull, these functions do not explicitly null termi-
nate the output, and pass null bytes through untouched.
The standard for normalization does not specify a maximum
number of decomposed attaching runes that may follow a base
rune. To implement normalization within a bounded
amount of memory, these functions implement a subset of
normalization called Stream-Safe Text. This subset specifies
that one base rune may have no more than 30 attaching runes.
In order to break up input that contains runs of more than
30 attaching runes, these functions will insert the Combin-
ing Grapheme Joiner (U+034F) to provide a new base for the
remaining combining runes.
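The splitting rule above can be sketched as follows. This is an illustration only, not the library's code: streamsafe and isattaching are hypothetical names, the Rune typedef is a stand-in for the one in <libc.h>, the input is assumed to be already decomposed, and only the Combining Diacritical Marks block (U+0300 to U+036F, excluding the joiner itself) is treated as attaching, where a real implementation would consult the Unicode combining class tables.

```c
typedef unsigned int Rune;	/* stand-in for Rune from <libc.h> */

enum {
	Maxattach = 30,		/* limit on attaching runes per base */
	CGJ = 0x034F,		/* Combining Grapheme Joiner */
};

/*
 * Simplifying assumption: only U+0300-U+036F, minus CGJ,
 * counts as an attaching rune.
 */
static int
isattaching(Rune r)
{
	return r >= 0x0300 && r <= 0x036F && r != CGJ;
}

/*
 * Copy src to dst, inserting CGJ before any attaching rune
 * that would be the 31st in a row.  Returns the number of
 * runes written; the output is truncated if dst is too small.
 */
long
streamsafe(Rune *dst, long ndst, const Rune *src, long nsrc)
{
	long i, n, run;

	n = 0;
	run = 0;
	for(i = 0; i < nsrc; i++){
		if(!isattaching(src[i]))
			run = 0;	/* a base rune starts a new run */
		else if(++run > Maxattach){
			if(n >= ndst)
				break;
			dst[n++] = CGJ;	/* new base for the rest */
			run = 1;	/* current rune follows CGJ */
		}
		if(n >= ndst)
			break;
		dst[n++] = src[i];
	}
	return n;
}
```

Because the run counter resets after each inserted joiner, the amount of context that must be buffered for any one base rune stays bounded no matter how long the input run is.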
Runegbreak (runewbreak) returns the next grapheme (word)
break opportunity in s, or s if none is found. Utfgbreak
and utfwbreak are UTF variants of these routines.
SOURCE
/sys/src/libc/port/mkrunetype.c
/sys/src/libc/port/runenorm.c
/sys/src/libc/port/runebreak.c
SEE ALSO
Unicode® Standard Annex #15
Unicode® Standard Annex #29
rune(2), utf(6), tcs(1)
HISTORY
This implementation was first written for 9front (March,
2023). The implementation was rewritten (in part) for Uni-
code 16.0 (March, 2025).
Page 2 Plan 9 (printed 3/12/26)