RUNECOMP(2)                                           RUNECOMP(2)

     NAME
          norminit, normpull, runecomp, runedecomp, runegbreak,
          runewbreak, utfcomp, utfdecomp, utfgbreak, utfwbreak -
          multi-rune graphemes

     SYNOPSIS
          #include <u.h>
          #include <libc.h>

          typedef struct Norm Norm;
          struct Norm {
                 ...    /* internals */
          };

          void   norminit(Norm *n, int comp, void *ctx, long (*getrune)(void *ctx));
          long   normpull(Norm *n, Rune *dst, long max, int flush);

          long   runecomp(Rune *dst, long ndst, Rune *src, long nsrc);
          long   runedecomp(Rune *dst, long ndst, Rune *src, long nsrc);

          long   utfcomp(char *dst, long ndst, char *src, long nsrc);
          long   utfdecomp(char *dst, long ndst, char *src, long nsrc);

          Rune*  runegbreak(Rune *s);
          Rune*  runewbreak(Rune *s);

          char*  utfgbreak(char *s);
          char*  utfwbreak(char *s);

     DESCRIPTION
          These routines handle Unicode® abstract characters that span
          more than one codepoint.  Normalization can be used to turn
          all codepoints into a consistent representation. This may be
          useful if a specific protocol requires normalization, or if
          the program is interested in semantically comparing irregu-
          lar input.

          The Norm structure is the core structure for all normaliza-
          tion routines.  Norminit initializes the structure.  If the
          comp argument is non-zero, the output will be normalized to
          NFC (precomposed runes), otherwise it will be normalized to
          NFD (decomposed runes).  The getrune argument provides the
          input for normalization, with each call returning the next
          rune of input, and -1 on EOF.  The ctx argument is stored
          and passed on to the getrune function in every call.
          Normpull provides the normalized output, writing at most max
          elements into dst.  To implement normalization the Norm
          structure must buffer input until it knows that the context
          for a given base rune is complete.  In order to accommodate
          callers which only have chunks of data to normalize at a
          time, the Norm structure maintains runes within its buffer
          even when getrune returns an EOF.  The flush argument to
          normpull changes this behavior, and will instead flush out
          all runes within the structure's buffer when it receives an
          EOF from getrune.  The return value of normpull is the
          number of runes written to dst.  Normpull does not
          null-terminate the output; however, null runes in the input
          are passed through untouched, so if the input is null
          terminated, so is the output.
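
          The getrune contract above can be sketched with an in-memory
          source.  This is only an illustration, not code from the
          library: Runebuf, bufget and bufrefill are hypothetical
          names, and the Rune typedef is a stand-in so the fragment
          compiles outside Plan 9 (there, include <u.h> and <libc.h>
          instead).

```c
typedef unsigned int Rune;	/* assumption: stand-in for Rune from <u.h> */

/* hypothetical context: a window over an in-memory rune array */
typedef struct Runebuf Runebuf;
struct Runebuf {
	Rune *p;	/* next rune to hand out */
	Rune *e;	/* one past the last rune */
};

/* same shape as the getrune argument: next rune, or -1 on EOF */
long
bufget(void *ctx)
{
	Runebuf *b;

	b = ctx;
	if(b->p >= b->e)
		return -1;
	return *b->p++;
}

/* point the context at the next chunk of input */
void
bufrefill(Runebuf *b, Rune *chunk, long n)
{
	b->p = chunk;
	b->e = chunk + n;
}
```

          On Plan 9 such a callback would be handed over with
          norminit(&n, 1, &b, bufget) for NFC output.  Refilling the
          context after an EOF assumes, as the text above suggests,
          that getrune may be called again once more input arrives.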

          Runecomp, runedecomp, utfcomp, and utfdecomp are abstrac-
          tions on top of the Norm structure.  They are designed to
          normalize fixed-size input in one go.  In all functions src
          and dst specify the source and destination strings respec-
          tively.  The nsrc and ndst arguments specify the number of
          elements to process.  Functions will never read more than
          the specified input, and will never write more than the
          specified output. If there is not enough room in the output
          buffer, the result is truncated.  The return value is like-
          wise the number of elements written to the output string.
          Like normpull, these functions do not explicitly null termi-
          nate the output, and pass null bytes through untouched.

          The standard for normalization does not specify a maximum
          number of decomposed attaching runes that may follow a base
          rune.  To implement normalization within a bounded amount of
          memory, these functions implement a subset of normalization
          called Stream-Safe Text.  This subset specifies
          that one base rune may have no more than 30 attaching runes.
          In order to break up input that contains runs of more than
          30 attaching runes, these functions will insert the Combin-
          ing Grapheme Joiner (U+034F) to provide a new base for the
          remaining combining runes.
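
          The splitting rule can be sketched as follows.  This is a
          simplified illustration and not the library's code:
          streamsafe and isattach are hypothetical names, the Rune
          typedef is a stand-in for the one in <u.h>, and isattach
          only recognizes the Combining Diacritical Marks block,
          where a real implementation would consult the Unicode
          tables.

```c
typedef unsigned int Rune;	/* assumption: stand-in for Rune from <u.h> */

enum {
	Maxattach = 30,		/* limit on attaching runes per base */
	CGJ = 0x034F,		/* Combining Grapheme Joiner */
};

/* hypothetical, simplified test: U+0300..U+036F, except CGJ itself */
static int
isattach(Rune r)
{
	return r >= 0x0300 && r <= 0x036F && r != CGJ;
}

/* copy src to dst, inserting CGJ before the 31st attaching rune of a run */
long
streamsafe(Rune *dst, long ndst, Rune *src, long nsrc)
{
	long i, n, run;

	n = 0;
	run = 0;
	for(i = 0; i < nsrc; i++){
		if(!isattach(src[i]))
			run = 0;	/* a base rune starts a new run */
		else if(++run > Maxattach){
			if(n >= ndst)
				break;
			dst[n++] = CGJ;	/* new base for what follows */
			run = 1;
		}
		if(n >= ndst)
			break;		/* truncate when dst is full */
		dst[n++] = src[i];
	}
	return n;
}
```

          Given a base rune followed by 31 copies of U+0301, this
          sketch writes 33 runes, with U+034F placed before the last
          U+0301.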

          Runegbreak (runewbreak) returns the next grapheme (word)
          break opportunity in s, or s if none is found.  Utfgbreak
          and utfwbreak are UTF variants of these routines.
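
          A loop over break opportunities might look like the sketch
          below.  It assumes the returned pointer marks where the
          next segment starts, and treats a return value equal to s
          as "the rest is one segment".  Breakfn, countsegs and
          bytebreak are hypothetical names; bytebreak is only a
          stand-in that allows the loop to be tried outside Plan 9.

```c
/* same shape as utfgbreak and utfwbreak */
typedef char *(*Breakfn)(char *);

/* count the segments of s as delimited by brk */
long
countsegs(char *s, Breakfn brk)
{
	long n;
	char *e;

	n = 0;
	while(*s != '\0'){
		e = brk(s);
		n++;
		if(e == s)	/* no break found: the rest is one segment */
			break;
		s = e;
	}
	return n;
}

/* hypothetical stand-in: a break after every byte but the last */
char*
bytebreak(char *s)
{
	return s[0] != '\0' && s[1] != '\0' ? s + 1 : s;
}
```

          On Plan 9 the second argument would be utfgbreak or
          utfwbreak instead of the stand-in.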

     SOURCE
          /sys/src/libc/port/mkrunetype.c
          /sys/src/libc/port/runenorm.c
          /sys/src/libc/port/runebreak.c

     SEE ALSO
          Unicode® Standard Annex #15
          Unicode® Standard Annex #29
          rune(2), utf(6), tcs(1)

     HISTORY
          This implementation was first written for 9front (March,
          2023).  The implementation was rewritten (in part) for Uni-
          code 16.0 (March, 2025).

     Page 2                       Plan 9              (printed 7/4/25)