“Character”-by-“character” string processing is hard, people

I bet you did not believe me when I wrote in Swift thoughts about how hard it is to properly process strings when treating them as a sequence of Unicode code points, and that as a result text is better thought of as a media flow, with strings better handled through the few primitives I mentioned, which never treat strings through any individual entity (be this entity the byte, the UTF-16 “character”, the Unicode code point, or the grapheme cluster). I am exaggerating, of course: some of you probably did believe me. But given how I still see string processing being discussed among software developers, it is true enough.

So go ahead and read the latest post on the Swift blog, about how they changed the String type in Swift 2, and in particular the fact that it is no longer considered a collection (it no longer conforms to the CollectionType protocol): a collection where appending an element (a combining acute accent, or “´”) not only does not result in that element being considered part of the collection, but also causes an element previously considered part of it (the letter “e”) to no longer be, is a pretty questionable collection type. Oops.
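Here is what that looks like in practice (a minimal sketch, assuming Swift 2 syntax, where the characters view is what still behaves as a collection):

    var word = "cafe"
    print(word.characters.count)   // 4
    print(word.characters.last!)   // "e"
    word += "\u{301}"              // append U+0301 COMBINING ACUTE ACCENT
    print(word.characters.count)   // still 4: the accent did not become an element of its own
    print(word.characters.last!)   // "é": and the "e" that was there is no longer an element either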

But that is not the (most) interesting aspect of that blog post.

Look at the table towards the end, which is supposed to correspond to a string “comprised of the decomposed characters [ c, a, f, e ] and [ ´ ]”, and which I am reproducing here, as an actual HTML table as Tim Berners-Lee intended, for your benefit (and because I am pretty certain they are going to correct it after I post this):

| Character            | c (LATIN SMALL LETTER C, U+0063) | a (LATIN SMALL LETTER A, U+0061) | f (LATIN SMALL LETTER F, U+0066) | é (LATIN SMALL LETTER E WITH ACUTE, U+00E9) |
| Unicode Scalar Value | c | a | f | e | ´ |
| UTF-8 Code Unit      | 99 | 97 | 102 | 101 | 204 |
| UTF-16 Code Unit     | 99 | 97 | 102 | 769 |

The first thing you will notice is the last element of the Character view, the whole row in fact. Why are they described by a Unicode code point each? Indeed, each of these elements is an instance of the Swift Character type, i.e. a grapheme cluster, which can be made up of multiple code points, and this is particularly absurd in the case of the last one, which corresponds to two Unicode code points. True, it would compare equal with a Swift Character containing a single LATIN SMALL LETTER E WITH ACUTE, but that is not what it contains. And yet, this is only the start of the problems…
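To make this concrete, here is a quick sketch (Swift 2 syntax): the last element of the Character view is a single grapheme cluster built from two Unicode scalars, and while it compares equal to the precomposed LATIN SMALL LETTER E WITH ACUTE, that is not what it contains.

    let decomposed: Character = "e\u{301}"   // one grapheme cluster made of two Unicode scalars
    let precomposed: Character = "\u{E9}"    // LATIN SMALL LETTER E WITH ACUTE
    print(decomposed == precomposed)                 // true: Character comparison is canonically equivalent
    print(String(decomposed).unicodeScalars.count)   // 2
    print(String(precomposed).unicodeScalars.count)  // 1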

If we take the third row, its last element is incorrect. Indeed, 204, or 0xCC ($CC for the 68k assembly fans in the audience), is only the first byte of the UTF-8 serialization of U+0301 (COMBINING ACUTE ACCENT) that you see in the previous row (which is correct, amazingly), the second byte being $81.
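You can check this for yourself (a quick sketch, Swift 2 syntax):

    let accent = "\u{301}"        // COMBINING ACUTE ACCENT
    print(Array(accent.utf8))     // [204, 129], i.e. $CC $81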

And lastly, if the last two columns are two separate Unicode scalar values, how could they possibly be represented by a single UTF-16 code unit? Of course, they can’t: 769 is $0301, our friend the combining acute accent. The “e” is simply gone.

So out of 4 rows, 3 are wrong *slow clap*. Here is the correct table:

| Character            | c | a | f | é |
| Unicode Scalar Value | c (LATIN SMALL LETTER C, U+0063) | a (LATIN SMALL LETTER A, U+0061) | f (LATIN SMALL LETTER F, U+0066) | e (LATIN SMALL LETTER E, U+0065) | ´ (COMBINING ACUTE ACCENT, U+0301) |
| UTF-8 Code Unit      | $63 | $61 | $66 | $65 | $CC | $81 |
| UTF-16 Code Unit     | $0063 | $0061 | $0066 | $0065 | $0301 |
Note that with the example given, Unicode scalar values match one for one with the UTF-16 code units in the sequence. For a counterexample, the string would have to include Unicode code points beyond the Basic Multilingual Plane, a land populated by scripts no longer in general usage (hieroglyphs, Byzantine musical notation, etc.), extra compatibility ideographs, invented languages, and other esoteric entities; that place, by the way, is where emoji were (logically) put in Unicode.
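For instance (a sketch, Swift 2 syntax), a single Unicode scalar beyond the Basic Multilingual Plane, such as the emoji U+1F600, requires two UTF-16 code units, a surrogate pair:

    let beyondBMP = "\u{1F600}"                           // a single Unicode scalar beyond the BMP
    print(beyondBMP.unicodeScalars.count)                 // 1
    print(beyondBMP.utf16.map { String($0, radix: 16) })  // ["d83d", "de00"]: a surrogate pair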

Conclusion

If Apple can’t get the “Characters”, UTF-16 code units, and bytes of a seemingly simple string such as “café” straight in a blog post designed to show these very views of that string, what hope could you possibly have of getting “character”-wise text processing right?

Treat text as a media flow by only using string processing primitives, without ever directly caring about the individual constituents of these strings.
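For instance, something like this sketch (Swift 2 with Foundation; the strings are made up for illustration) stays entirely at the level of string primitives:

    import Foundation

    let order = "café, croissant, chocolatine"
    // Split on a whole-string delimiter and test membership with string equality;
    // at no point do we examine bytes, code units, scalars, or grapheme clusters ourselves.
    let items = order.componentsSeparatedByString(", ")
    print(items)                   // ["café", "croissant", "chocolatine"]
    print(items.contains("café"))  // true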

2 thoughts on ““Character”-by-“character” string processing is hard, people”

  1. Baking a canonicalization policy into a basic type is … interesting. I’m sure somewhere someone will be unhappy their password no longer hashes to the same value.

  2. To the best of my knowledge, neither the Swift String type nor NS/CFString enforces any canonicalization: they preserve the composed or decomposed state of diacritics, so any round-trip conversion through these types ought to preserve the sequence of Unicode scalar values; in other words, the type is composition-preserving, contrary to HFS+ when it comes to file names, for instance.

    But composition-insensitivity is nevertheless baked into the type, given that equality is composition-insensitive, which implies that hashing (not to be confused with password hashing) has to be, too. An interesting consequence is that with any conforming implementation of Swift, the resulting executable has to rely on or bundle a Unicode library such as ICU, which means that, contrary to C, Swift is most definitely not for toasters (e.g. on my Mac OS X install, libicucore.A.dylib weighs more than 4 MB).
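    A quick sketch of both properties (Swift 2 syntax):

        let decomposed = "cafe\u{301}"
        let precomposed = "caf\u{E9}"
        print(decomposed == precomposed)                      // true: equality is composition-insensitive
        print(decomposed.hashValue == precomposed.hashValue)  // true: hashing has to follow equality
        print(decomposed.unicodeScalars.count)                // 5: the decomposed spelling is preserved
        print(precomposed.unicodeScalars.count)               // 4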
