UCA 4.0 Proposal

SC22/WG20 N1018

L2/02-280

Re:	UCA4.0 Proposal
From:	Mark Davis
Date:	2002-08-06

This document describes proposals for UTS #10: Unicode Collation Algorithm in the Unicode 4.0 timeframe.

Synchronizing with the Unicode Standard. To get the same results in different implementations, one must specify both the version of the UCA and the Unicode version, since the UCA depends upon the latter. This dual version is difficult for users to understand, and supporting multiple combinations is a burden on implementers. In the past, it has been difficult to synchronize versions, but at the point it should not be. So the proposal is that starting with Unicode 4.0, that there be a single version for the UCA which is the same as the base version of the Unicode Standard that it requires. Thus in the 4.0 timeframe we would have UCA 4.0 as well. Thereafter, whenever we add characters to a new version of the Unicode Standard (e.g. major or minor version), we will also issue a new version of the UCA with the same version.
There are enough update versions of the Unicode Standard that adjustments can be made to the UCA if any are required for synchronization with ISO 14651 (although frankly, the same issues govern there: ISO 14651 should be synchronized with ISO 10646: when a new character is published in a new version of the 10646, it should be given collation weights in a simultaneous version of ISO 14651.)
Moving Section 5.17. The introductory material for collation is currently in Section 5.17 Sorting and Searching of the Unicode Standard. This bifurcation is clumsy; that material should be moved to be the introduction to UTS #10: Unicode Collation Algorithm. It could be updated with material from the IUC tutorial on collation by Cathy Wissink, Michael Kaplan, and Mark Davis if we have enough time.
SHY. The use of SHY to prevent character contraction should be documented.
Stability. A discussion of approaches to binary stability of sort keys across versions of the UCA and tailorings should be added.
Hangul. In the current version of UCA, we held off on changing weights to fix the Hangul problem, although we did reserve an area so that people can tailor Hangul to fix it. The following issues have come up with WG20.
1. How should the Compatibility Jamo and half-width Jamo be treated?
  The solution proposed in some WG20 documents is that a sequence of compatibility jamo can be preprocessed into a "canonical" sequence. This does sound reasonable, but we have to be careful of exactly what is involved in the preprocessing step. See Performance below.
  
  Moreover, unless there is syntax to precisely define what an optional preprocessing step must do, syntax that is incorporated into the standard, there is no way to indicated how the preprocessing is or is not to be invoked.
2. Should the Jamo themselves be further decomposed (in terms of weights)?
  
  This is possible. It would allow the sequence of, say:
  
  U+11A8 ( ᆨ ) HANGUL JONGSEONG KIYEOK
  U+11BA ( ᆺ ) HANGUL JONGSEONG SIOS
  
  and the character
  
  U+11AA ( ᆪ ) HANGUL JONGSEONG KIYEOK-SIOS
  
  to sort the same. However, there is also a significant Performance implication here. This may be best handled by special tailoring. (Note: characters like U+11AA ( ᆪ ) HANGUL JONGSEONG KIYEOK-SIOS used to have compatibility decompositions. These were withdrawn because normalization would cause the NFKC form to break up Hangul syllables.)
3. How do we solve the "Trailing_Weights" problem:
  
  This is the trickiest issue, and depends on whether we do #2 or not. Here are the basic premises:
  1. Syllables sort as primary units, meaning that if two syllables are different, no further characters in the string make any difference (in ordering).
  2. Syllables are of the form (L+ V+ T*) (one or more leads, one or more vowels, zero or more trails). I'll call each of the three a cluster: L+, V+ and T*
  3. Note: we do not care much about "degenerate" syllables, such as L alone, as long as the ordering is determinate.
  4. Within a syllable, clusters sort as primary units, meaning that if two clusters are different, no further characters in the syllable make any difference (in ordering).
  5. Within a cluster, any cluster that is a proper initial suffix of an other sorts first. That is, given two clusters (L_a,...,L_n) and (L_a,...,L_n, L_m...L_q), where the L_a...L_n weights are the same in both, the shorter will always sort first.
  6. Note: we'll let X stand for any character that is not a Hangul Jamo.
  Without any change in the architecture, this would have the following implications.
  1. Because L sorts before LL always -- no matter what is after either -- LV must sort before LL. So the weighting of all V's must always be less than all L's.
  2. Similarly, all T's must weight before all V's.
  3. Because all syllables sort as units, LVX and LVT must sort before LVVX, . Thus for all X, X < V and X < T.
  4. But also because syllables sort as units, LVL must sort before LVT. Thus L < T
  Thus to get the right ordering, we must have X < T < V < L, and also L < T: a contradiction. It is easy enough to make X < T < V be true by ordering the weight values. But T < V < L implies T < L, which contradicts L < T.
  
  To resolve this problem there are two possible solutions. There are pluses and minuses to both of these approaches.
  
  Break Insertion. In some ways, the easiest change would be to always insert a special weight ("S") after the end of every syllable. This special weight would always have to be less than every non-Hangul character. (I think this is close to one of Kent's original proposals.) It would involve some extra processing, and either some extra syntax to specify this relation, or a hard-coded algorithm step. Once this was done, we would never have a problem with iii, or iv, since we would always be comparing
  
  LVSX < LVT or
  LVSL < LVT
  Impact. A hard-coded algorithm step is not a good idea, since there may be other syllabic-sorting scripts that would need the same kind of solution. We could have some syntax like:
```
@Break Between [YYYY-ZZZZ,WWWW-MMMM...] And [JJJJ-KKKK,QQQQ-RRRR...] ;
```
  where [YYYY-ZZZZ...] are ranges of Unicode/10646 characters. We also define a reserved weight S which is lower than any letters. One can then use one such command to put S between V and anything but V or T, and another such command to put S between T and anything but T.
  
  Note: These would not be full regular expressions: simply a list of characters before and a list of characters after each possible insertion point.
  
  It would also involve some extra processing on every character in processing, since the before character list needs to be tested for (however, this can be reduced by tagging these characters in the CE table.
  
  Sort keys are also about 40% longer in primary weights (3.5 vs 2.5 weights), although the total length need not be 40% longer, since the special collation element could have zero level2 and level3 weights. So approximately 20% longer (and commensurately slower, see Performance).
  High Trailing Weights. If we always ensure that any LL (that is used in practice) always contracts to a single weight, then we never get condition (i). If so, then we don't get a contradiction; the condition turns into: X < T < V and L < T. This latter condition can be easily satisfied by putting all T and V weights higher than any other weight. This was the solution proposed and accepted by the UTC, but which we have put on hold until we could discuss it with the WG20 members. However, the UCA has been altered to reserve an area so that implementers can easily tailor Jamo to this area, resolving the problem in the tailoring.
  
  Impact. This approach works, but is not as general as Break Insertion, since any LL (that we care about) has to contract. Contractions cost in performance (although if this is only active in a special tailored table for ancient Hangul, it is not much of a concern).
  
  Also, having L in a different range than T and V would break at least some kinds of sort key compression, rendering the sort key longer. The advantage is that no new syntax needs to be added (even for use by other syllable-sorting languages), and no change to the fundamental algorithm is necessary.
  
  There are a number of choices here:
  1. Allow High Trailing Weights in tailoring (already available)
  2. Change the default table to use High Trailing Weights.
  3. Allow Break Insertion in tailoring
  4. Change the default table to use Break Insertion
  To resolve the problem in the default table, we must do either (II) or (IV). Notice that (I) and (III) can coexist.
  
  Performance
  
  We have to be careful of exactly what is involved in any change, especially involving a preprocessing step. Why is that? People around the world are extremely sensitive to the performance of collation. Unless "preprocessing" is designed correctly -- so that in practice it can actually be done incrementally, and without large changes in architecture -- it could slow down processing very significantly. And in practice, it would not be accepted by people in the market place.
  
  The incremental nature is very important. I'll give an example. In most production systems, there are two separate code paths: (1) sort key generation, and (2) direct string comparison. (I'm sure you are all aware of this, but I will reiterate just to make sure we are all in sync.)
  1. sort key: preprocess a string to produce a sequence of bytes. Later, such sequences of bytes can be compared using a binary compare.
  2. direct compare: process both strings incrementally; the comparison produces the same results as a binary compare of sort keys, but no sort key is produced.
  The important features are that on average, a direct string comparison over real data is something like 20 times faster than the production of a sort key (and may be even faster). This is because typical comparisons are actually decided in the first few characters, and the rest of the processing of the string is normally skipped. Unless a very large number of comparisons are made, the direct string comparison wins, and is commonly used. But incremental comparison is very performance sensitive; one can't even call a strlen() function on the text, since that screws up the performance.
  
  The other feature is sort key length. If we end up with sort keys that are very significantly longer than what is done right now, it becomes a significant performance drain: in terms of the generation of the sort key, the extra storage it uses, the increase in swapping to read sort keys in and out of memory. Since the same weight data is used in incremental processing, longer weights also mean slower incremental comparison.
  
  Anyway, we just have to be very careful of the implications for performance in everything we end up with for Hangul, since if it is too slow -- and only marginally better as far as most Koreans are concerned -- then it will end up being ignored (e.g. tailored away).

Performance