Automatic Esperanto transliteration

For a bit of fun the other week I created a software library for transliterating Esperanto text between its proper UTF-8 representation, the x-system and the h-system. If you don’t know what I’m talking about, the names of the systems offer a solid clue: each is a scheme for writing the accented letters that are difficult to type on regular keyboards.

For example the phrase “Tomorrow is Thursday. I should buy food.” could be written in UTF-8, the x-system and the h-system respectively, like so:

Morgaŭ estas ĵaŭdo. Mi devus aĉeti manĝaĵon.
Morgaux estas jxauxdo. Mi devus acxeti mangxajxon.
Morgau estas jhaudo. Mi devus acheti manghajhon.

At first blush this looks like a find-and-replace with delusions of grandeur. Am I simply creating another left-pad? Well, no, there were two somewhat interesting challenges.

The first is ambiguities. The x-system is not affected by this: every letter featuring a diacritic is converted to its ASCII form + x. The letter x does not occur in the Esperanto alphabet, so to do the opposite conversion you simply find all occurrences of x and work backwards.
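As a rough illustration, here is a minimal sketch of the x-system round trip using only the Rust standard library. The function names (`base_letter`, `accented_letter`, `to_x_system`, `from_x_system`) are made up for this example and are not taken from the library described in this post.

```rust
/// ASCII base letter for an accented Esperanto letter, or None for anything else.
fn base_letter(c: char) -> Option<char> {
    Some(match c {
        'ĉ' => 'c', 'ĝ' => 'g', 'ĥ' => 'h', 'ĵ' => 'j', 'ŝ' => 's', 'ŭ' => 'u',
        'Ĉ' => 'C', 'Ĝ' => 'G', 'Ĥ' => 'H', 'Ĵ' => 'J', 'Ŝ' => 'S', 'Ŭ' => 'U',
        _ => return None,
    })
}

/// Inverse of `base_letter`: the accented form of a base letter, if it has one.
fn accented_letter(c: char) -> Option<char> {
    Some(match c {
        'c' => 'ĉ', 'g' => 'ĝ', 'h' => 'ĥ', 'j' => 'ĵ', 's' => 'ŝ', 'u' => 'ŭ',
        'C' => 'Ĉ', 'G' => 'Ĝ', 'H' => 'Ĥ', 'J' => 'Ĵ', 'S' => 'Ŝ', 'U' => 'Ŭ',
        _ => return None,
    })
}

/// UTF-8 to x-system: every accented letter becomes its base letter plus x.
fn to_x_system(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match base_letter(c) {
            Some(base) => {
                out.push(base);
                out.push('x');
            }
            None => out.push(c),
        }
    }
    out
}

/// x-system to UTF-8: an x directly after c, g, h, j, s or u marks an accent.
fn from_x_system(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        let next_is_x = matches!(chars.peek().copied(), Some('x' | 'X'));
        match accented_letter(c) {
            Some(accented) if next_is_x => {
                out.push(accented);
                chars.next(); // swallow the x marker
            }
            _ => out.push(c),
        }
    }
    out
}
```

Because x never appears in native Esperanto words, the reverse direction needs no exception list: the marker is unambiguous.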

The h-system is far more fiddly. You can see in the above example that ŭ has been turned into u in both morgau and jhaudo. This u is not obviously any different from the one in devus, which is supposed to remain unchanged when we go back to the UTF-8 form.
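The forward direction is still easy, which is part of what makes the h-system deceptive. A small sketch, with a hypothetical function name `to_h_system` that is not necessarily what the library calls it:

```rust
/// UTF-8 to h-system. Accented consonants become base letter + h,
/// but ŭ simply drops its breve, which is where information gets lost.
fn to_h_system(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            'ŭ' => out.push('u'),
            'Ŭ' => out.push('U'),
            'ĉ' => out.push_str("ch"),
            'ĝ' => out.push_str("gh"),
            'ĥ' => out.push_str("hh"),
            'ĵ' => out.push_str("jh"),
            'ŝ' => out.push_str("sh"),
            'Ĉ' => out.push_str("Ch"),
            'Ĝ' => out.push_str("Gh"),
            'Ĥ' => out.push_str("Hh"),
            'Ĵ' => out.push_str("Jh"),
            'Ŝ' => out.push_str("Sh"),
            other => out.push(other),
        }
    }
    out
}
```

Feeding it either "aŭ" or "au" yields "au", so the output alone cannot tell you which one the writer meant.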

You can work around this by observing that ŭ never occurs by itself: it modifies the sound of an a (pronounced ah) and becomes the vowel pair aŭ (pronounced ow, more or less). So to convert the h-system back to UTF-8 we just have to look for au and turn that into aŭ, right?

Not so fast! There are some words occasionally used in Esperanto that get in the way because they naturally contain au. Some are imported like Nauro (Nauru, the country); others can be constructed. For example tra (through) + urbo (city) can form the valid expression traurba promeno (a walk through town). It would make no sense at all to write traŭrba.

The good news is that the list of words and word-fragments that are likely to contain a bare au is pretty short. I can actually list an array of them in my code and suppress the conversion back to aŭ in those cases. Currently I have 14 exceptions based on a word list I found online.

Phew. Are we done yet with ambiguities? Not quite: h is also a normal Esperanto letter and sometimes it follows a letter that could have a diacritic. Take flughaveno (airport). This is the combination of flugi (to fly) and haveno (port). The g and h should never be merged into a ĝ. Again the number of situations where this occurs is fairly small and I can hard-code it (34 exception patterns).
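One way to picture the exception handling is to treat each protected fragment as a pattern that maps to itself and let the longest match at each position win. The sketch below assumes that approach, which may well differ from what the library actually does. The name `from_h_system` and the `RULES` table are hypothetical, the three exceptions are a tiny illustrative sample rather than the real lists of 14 and 34, and upper case is ignored to keep things short.

```rust
/// (pattern, replacement) pairs. Protected fragments map to themselves,
/// so the shorter digraph rules hidden inside them never get a chance to fire.
/// The exceptions are illustrative samples only.
const RULES: &[(&str, &str)] = &[
    // Protected fragments (identity replacements).
    ("traurb", "traurb"),       // tra + urbo
    ("naur", "naur"),           // Nauro
    ("flughaven", "flughaven"), // flug + haveno
    // Ordinary h-system digraphs.
    ("ch", "ĉ"),
    ("gh", "ĝ"),
    ("hh", "ĥ"),
    ("jh", "ĵ"),
    ("sh", "ŝ"),
    // The breve that the h-system drops.
    ("au", "aŭ"),
];

/// h-system to UTF-8 (lower case only in this sketch).
fn from_h_system(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(c) = rest.chars().next() {
        // Longest rule whose pattern starts at the current position.
        let best = RULES
            .iter()
            .filter(|rule| rest.starts_with(rule.0))
            .max_by_key(|rule| rule.0.len());
        match best {
            Some(&(pattern, replacement)) => {
                out.push_str(replacement);
                rest = &rest[pattern.len()..];
            }
            None => {
                // No rule applies here: copy one character and move on.
                out.push(c);
                rest = &rest[c.len_utf8()..];
            }
        }
    }
    out
}
```

With these rules "jhaudo" comes back as "ĵaŭdo" while "traurba" and "flughaveno" pass through untouched. Note that this naive version tries every rule at every position, so its cost grows with the size of the rule table.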

The second challenge is how to avoid being mind-bogglingly inefficient. Even the most basic operation of converting UTF-8 to the x-system requires looking for 12 different letters with diacritics once you include lower and upper case. It kind of sucks if you have to do 12 passes of the entire text, juggling memory reallocations along the way, and it’s only going to get worse with the h-system.

In truth computers are pretty fast. The entire corpus of Esperanto text ever written can probably be transliterated naively by my laptop in a few seconds. (Complete guess, I may try it on Vikipedio someday.) But I still felt bad and went poking around on crates.io. Surely computer science offers a pre-baked solution for replacing multiple strings efficiently?

What I found was a marvellous crate called aho-corasick, named after the algorithm it implements for doing exactly that: efficient searches (and replacements) with many search patterns simultaneously. The name was already familiar to me because it ends up in the dependency graph of many Rust programs. I’d been using it indirectly for a long time. With this library I was able to supply my entire list of h-system exceptions and have it handle all the matches in a single pass over the input data. Very cool.

I don’t expect anybody to find this useful but it was a fun afternoon.