
Unigraph: Replacing Unicode with Grapheme Encoding

  • Espra introduces a new text encoding mechanism called Unigraph that fixes many of Unicode’s shortcomings.

  • Unlike Unicode, which is built around codepoints, Unigraph is built around the individual “characters” that people see, i.e. graphemes.

  • By massively simplifying the underlying model, Unigraph text can be much easier to use for both users and developers.

  • A Unigraph Registry on Espra provides an open and democratic way for registering Unigraph IDs.

  • Unigraph text is encoded into bytes using UEF, the Unigraph Encoding Format derived from the design of UTF-8.

What’s Wrong with Unicode?

Unicode [Uni15] is an amazing feat of collaboration. Unlike the early days of computing when we had to deal with hundreds of conflicting text encodings, Unicode:

  • Provides a single mechanism that underlies almost all digital text systems today.

  • Supports over 160 distinct scripts that are used to represent over 2,000 languages.

  • Keeps us entertained with more and more emojis every year.

But there is a fatal flaw at the heart of Unicode. It isn’t designed around what you and I see as individual characters. Instead, it is built around what it calls codepoints.

Each codepoint is just a number assigned to a concept in a particular writing system. For some letters, Unicode assigns a unique codepoint:

Codepoint Character
U+0041 A
U+00D8 Ø
U+0646 ن
U+1F60A 😊

But since Unicode was designed in the 1980s when storage was at a premium, its designers tried to save space by defining most characters as a sequence of existing codepoints:

Codepoints Character
U+0061 U+0301 á
U+0061 U+0323 U+0302 ậ
U+05E9 U+05C1 שׁ
U+0B95 U+0BBF கி
U+1F1EC U+1F1E7 🇬🇧

This has led to a host of problems, e.g.

  • A letter like कि is formed by combining the codepoint for क (U+0915, DEVANAGARI LETTER KA) with the codepoint for ि (U+093F, DEVANAGARI VOWEL SIGN I).

    What should happen when you press backspace after this letter? Some text editors will delete just the second codepoint, leaving just क, whilst others will delete the whole thing.

    This is one of the major reasons why you might see the cursor jumping around when editing text in non-Western languages.

  • To make matters worse, Unicode is not consistent about whether it assigns a character a single codepoint, or if it’s composed from multiple codepoints, or even both.

    For example, the letter é can be represented by composing e (U+0065, LATIN SMALL LETTER E) with ´ (U+0301, COMBINING ACUTE ACCENT).

    But Unicode also includes this letter as a single codepoint in a precomposed form, i.e. just é (U+00E9, LATIN SMALL LETTER E WITH ACUTE).

    So pressing backspace after it might sometimes result in just the accent being removed, or sometimes the whole character — even though the letter looks the same to the user!

    Allowing arbitrary elements to be combined also results in the nonsensical Zalgo text that you may have seen around the internet:

    
    
    
     z̵̧̧̻̤̭͇͚͙̺̺͈̰̠̪̮͒̓̽̈͝ͅͅa̸̦̙͙̘̦̅̊̋̄̿͋̎̀͑͒͊͘̕l̶̡̧̝͇̣̻͕̝̓̐͊g̶̘̈̀̓͑̏̅́̾́̑̐ő̴͈̯̊͌͊̿̒̄͗́́̽̂́̔̈́ ̵̨̨̣̟͉̣͓̘̰͕̺͚̺͖̟̆̍͂̚t̴̜̟̦̝̻̫͖̯͔̍̀̉̃͘͝e̶̙̫̝͔͈͐̐̀x̶̨̠͓̙̥̠̮̪͓͈͎̩̙̖͗́̽̈́̋̆͋̔̇͛͑͊͌t̴̠̐͝
    
    
    
    
  • All of this makes life hell for developers who care about supporting non-Western languages. Most programming languages can’t even count Unicode characters properly!

    For example, most people would reasonably imagine the length of the text string 🤦🏼‍♂️ to be just 1. But there’s no universal definition of this length that everyone agrees on.

    Python 3:

    len("🤦🏼‍♂️")
    5
    

    JavaScript:

    "🤦🏼‍♂️".length
    7
    

    Rust:

    "🤦🏼‍♂️".len()
    17
    

    In fact, Swift is one of the few languages that gets this right without having to resort to Unicode-specific libraries:

    "🤦🏼‍♂️".count
    1
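These discrepancies can all be reproduced from a single language; in Python, the same emoji yields each of the counts above depending on which unit you measure:

```python
# U+1F926 U+1F3FC U+200D U+2642 U+FE0F — one grapheme, many "lengths"
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(facepalm))                           # 5  codepoints (what Python counts)
print(len(facepalm.encode("utf-16-le")) // 2)  # 7  UTF-16 code units (what JavaScript counts)
print(len(facepalm.encode("utf-8")))           # 17 UTF-8 bytes (what Rust counts)
```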
    
  • All of this makes even basic things like finding matching text difficult:

    >>> "José" == "José"
    False
    
    >>> len("José")
    4
    
    >>> len("José")
    5
    

    This shows up as UX failures, e.g. search not showing expected results, database lookups missing records, form validation rejecting “identical” inputs, etc.
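In today’s Unicode world, the standard workaround is to normalize both sides before comparing; a quick illustration in Python using the stdlib unicodedata module:

```python
import unicodedata

# Precomposed "é" (U+00E9) vs decomposed "e" + combining acute (U+0301):
# the two strings render identically but compare unequal.
composed = "Jos\u00e9"
decomposed = "Jose\u0301"
print(composed == decomposed)              # False
print(len(composed), len(decomposed))      # 4 5

# Normalizing both sides to NFC before comparing fixes lookups and validation.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```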

  • But even if your language had perfect Unicode support, you’d still be screwed because the Unicode Consortium changes the rules every year.

    What counts as 3 separate grapheme clusters today could easily be redefined as a single grapheme cluster tomorrow.

    It’s mind-boggling that we allow such a foundational standard to constantly redefine how existing text can be interpreted!

  • The desire of the Unicode designers to minimize the number of codepoints has a noticeable negative impact on various cultures and languages.

    For example, Unicode rightly recognizes Gujarati and Devanagari as distinct scripts despite similarities, e.g. न (U+0928, DEVANAGARI LETTER NA) vs. ન (U+0AA8, GUJARATI LETTER NA).

    But for the representation of Han characters in Chinese (Hanzi), Japanese (Kanji), and Korean (Hanja), the Unicode designers decided to “unify” them with the same codepoints.

    This results in the loss of regional, historical, and stylistic distinctions, and can even result in unintelligible texts when the wrong font is used.

If the above wasn’t devastating enough, Unicode carries a lot of historical baggage that makes dealing with text a lot more painful than it needs to be:

  • Despite UTF-8 being the obviously superior encoding of Unicode text for decades, the Unicode Standard still defines UTF-16 and UTF-32.

    This results in complexities like surrogate pairs that are not correctly dealt with by most developers. For example, if you wanted to truncate some text, you might have a function like:

    function truncate(text, maxLength) {
      if (text.length <= maxLength) {
        return text;
      }
      return text.slice(0, maxLength) + '...';
    }
    

    But when called with the following text in JavaScript:

    truncate("Party Time 🎉🎉🎉", 12)
    

    It will give you text with the Unicode replacement character that indicates that you have invalid text:

    "Party Time �..."
    

    So, even though 🎉 is defined as a single codepoint U+1F389, since UTF-16 represents it as a surrogate pair, you end up slicing between the high and low surrogate.

    Working around all this results in lots of edge cases, e.g. the UTF-32-ish encoding in Python, using WTF-8 to deal with broken file names on Windows, etc.
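The JavaScript behaviour above can be reproduced in Python by slicing raw UTF-16 code units, which is effectively what .slice() operates on:

```python
# "Party Time " is 11 UTF-16 code units; unit 12 falls inside 🎉's surrogate pair.
text = "Party Time \U0001F389\U0001F389\U0001F389"
units = text.encode("utf-16-le")

# Slice at 12 code units (2 bytes each), splitting the high and low surrogates.
truncated = units[: 12 * 2].decode("utf-16-le", "replace") + "..."
print(truncated)  # Party Time �...
```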

  • Unicode defines not one, but four different normalization forms: NFC, NFD, NFKC, and NFKD. Most developers don’t even know that they should normalize, never mind which form to use!

  • Case folding will happily lose information, e.g. case folding converts the German ß to ss, but title casing will not restore it the other way:

    "Straße"  → "strasse"
    "strasse" → "Strasse"
    

    Even worse, since Unicode tries to save on codepoints, case folding in various languages can cause unexpected issues. For example, Turkish has 2 distinct forms of the letter I:

    "i" → "İ"
    "ı" → "I"
    

    But since Unicode decided to reuse i and I, upper casing the name of one of the most significant cities in history results in ISTANBUL instead of the expected İSTANBUL.
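This default behaviour is easy to observe in Python, whose str.upper() applies Unicode’s locale-independent casing rules:

```python
# Unicode's default casing maps i → I, losing the Turkish dotted İ.
print("istanbul".upper())  # ISTANBUL — Turkish readers expect İSTANBUL

# Getting İSTANBUL requires locale-aware tailoring (e.g. via an ICU binding);
# plain str.upper() cannot produce it.
print("istanbul".upper() == "İSTANBUL")  # False
```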

  • As a result of reusing codepoints across different languages, sorting Unicode strings will not put them into the typical dictionary order expected by most languages.

    For example, in French, é is sorted with e, but in Swedish, é comes after z. A generic mechanism like the Unicode Collation Algorithm cannot handle these conflicting expectations.

Finally, when it comes to custom codepoints that platforms might want to add, Unicode only defines certain codepoint ranges as Private Use Areas and doesn’t help to standardize them beyond that.

As a result:

  • You end up with characters like  (U+F8FF) which might show the Apple logo on some platforms, or show something else altogether on other platforms.

  • Apps like Discord, Slack, and Telegram don’t even use the Private Use Area for things like custom emojis, and instead end up implementing their own custom text handling.

    This results in text that can’t even be properly copy/pasted across systems!

In short, while Unicode has been immensely beneficial to humanity, we can do better.

Grapheme-Native Unigraph Approach

Operations on Unigraph text work exactly as users would expect. For example, the length of 🤦🏼‍♂️ is always 1. Pressing backspace deletes the entire character as seen by the user.

This is because Unigraph works at the “grapheme” level, where each grapheme represents a group of one or more written symbols that together should be treated as a single element by a reader:

Grapheme
A
é
கி
🇬🇧

Unlike Unicode where each grapheme can be composed of one or more codepoints, Unigraph assigns a single ID for each grapheme, shown here with a G+ prefix:

Grapheme Unicode Codepoints Unigraph ID
3 U+0033 G+0033
कि U+0915 U+093F G+02D4
🤦🏼‍♂️ U+1F926 U+1F3FC U+200D U+2642 U+FE0F G+20CB26

This allows us to avoid many of the shortcomings of Unicode:

  • Pressing backspace will delete exactly one grapheme. Delete कि and you get empty space, not क. Delete 🤦🏼‍♂️ and you get empty space, not broken emoji fragments.

    The cursor will be exactly where the user expects it to be. It’s also easy for developers to implement: they simply remove the Unigraph ID immediately before the cursor location.
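As an illustrative sketch (using the hypothetical G+ IDs from the table above), deleting the grapheme before the cursor becomes trivial once text is a sequence of Unigraph IDs:

```python
# Sketch: Unigraph text as a list of IDs, exactly one ID per grapheme.
def backspace(ids, cursor):
    """Delete the single grapheme immediately before the cursor, if any."""
    if cursor == 0:
        return ids, cursor
    return ids[:cursor - 1] + ids[cursor:], cursor - 1

text = [0x33, 0x2D4, 0x20CB26]  # "3", "कि", "🤦🏼‍♂️" per the example IDs above
text, cursor = backspace(text, len(text))
assert text == [0x33, 0x2D4] and cursor == 2  # the whole emoji is gone, no fragments
```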

  • As each Unigraph ID corresponds to a “precomposed” grapheme, it’d be impossible to create nonsensical combinations like Zalgo text.

    All text is guaranteed to be meaningful and properly rendered!

  • String operations, like getting the length of a piece of text or truncating it, will behave exactly as developers expect and be consistent across all programming languages.

    For example, the Unigraph text 👩🏽‍❤️‍💋‍👨🏻 will always have a length of 1 in all programming languages — not 10, 15, or 35 depending on your platform.

  • Since Unigraph is already built around graphemes, future additions to Unigraph will never change the rendering of valid Unigraph content from the past.

    Once an ID is assigned for a grapheme, its meaning is fixed forever. No more worrying about Unicode updates breaking your existing text processing.

  • Instead of trying to save space by reusing the same codepoints, Unigraph assumes that storage is relatively cheap. This allows us to give each language its own set of graphemes.

    This will make life easier for lots of people, e.g. Japanese authors writing in Kanji can now quote Chinese text without having to faff around with different fonts.

    After all, it’s not like Unicode is saving much space anyway — many emojis are 35 bytes long! These will be much shorter in Unigraph.

  • As each language gets its own set of graphemes in Unigraph, operations like case folding and collation will produce the results that people expect.

    For example, since the i in the Turkish word istanbul will not be reusing the same i from English, istanbul would be properly uppercased to İSTANBUL in Turkish.

    Likewise, a distinct Swedish é separate from the French one, will allow it to be sorted after z in Swedish text. All cultures can be first-class citizens under Unigraph!

  • In terms of normalization, simply lowercasing text will be enough for most use cases. As case folding will map graphemes within the same language, this will produce the expected results.

    But for use cases like search, a normalize_homoglyphs function will be provided that will normalize all graphemes to a “base” one if they look visually identical.

    This will take an intensity_level of low, medium, or high, e.g. to map across scripts like Cyrillic and Latin, or even across grapheme categories, e.g. 0 → O.
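As an illustrative sketch only (the mapping tables and exact API below are hypothetical, not the actual Unigraph definitions), the idea might look like:

```python
# Hypothetical homoglyph tables: Cyrillic lookalikes at low intensity,
# plus cross-category mappings like 0 → O at high intensity.
LOW = {"\u0430": "a", "\u0435": "e", "\u043e": "o"}  # Cyrillic а, е, о → Latin
HIGH = {**LOW, "0": "O", "1": "l"}

def normalize_homoglyphs(text, intensity_level="low"):
    table = HIGH if intensity_level == "high" else LOW
    return "".join(table.get(ch, ch) for ch in text)

print(normalize_homoglyphs("l\u043egin"))       # login — Cyrillic о mapped to Latin o
print(normalize_homoglyphs("R0BOT", "high"))    # ROBOT — digit zero mapped to letter O
```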

  • Since anyone can add their own grapheme to Unigraph by going through its open and democratic process, there will be no need for Private Use Areas.

    Apple can add their logo to Unigraph and expect it to render on all systems. All apps can support “custom” emojis that can be copy/pasted across platforms.

In short, by using graphemes as the base unit, Unigraph can provide a significantly better experience for both users and developers. A writing system fit for thousands of years.

The Unigraph Registry

The assignment of IDs to graphemes is done democratically and openly through the Unigraph Registry. Each ID is an arbitrarily sized integer and is assigned sequentially.

  • While there is no upper bound on the maximum possible ID, a soft_cap will be specified with every Unigraph version so that implementations can use more efficient data structures.

  • The initial soft_cap will be set to 0x7FFFFFFF so that IDs can fit within a uint32 with 1 bit to spare, e.g. to enable compilers to perform niche optimizations within enums, etc.

  • This initial soft_cap allows values to be encoded within 6 bytes while still supporting over 2 billion potential IDs.

  • The soft_cap will be increased over time as needed and implementations will be free to choose whatever makes sense for their platform.

The first 128 IDs will correspond to the characters in ASCII. Just like it did for Unicode, this will massively reduce friction in getting adoption for Unigraph.

For IDs beyond the first 128, anyone will be able to propose a new one by paying a small fee. This fee is there to:

  • Prevent abuse and spamming of the Registry.

  • Incentivize people to only add entities that are valuable to themselves or others.

Each proposal will need to define certain metadata that will be associated with the Unigraph ID, e.g.

  • Name of the character, e.g. TAMIL LETTER KI.

  • Grapheme Collection ID, e.g. Russian, Tamil, Alice’s Emojis, etc.

    • Each of these could theoretically correspond to a distinct “keyboard” on devices like mobile phones.
  • Default vector representation of the character in either TinyVG or, for animated emojis, Animated TinyVG format.

    • Having this definition will ensure that there’s always a fallback so that graphemes can be rendered even if specific fonts aren’t available — no more boxes!

    • There will be a size limit on these to ensure the Unigraph Database doesn’t get too bloated even if thousands of custom emojis get added.

  • General category, e.g. letter, number, emoji, etc.

  • Bidirectional category, e.g. left to right, right to left, etc.

  • Mappings for case folding and normalization — if applicable.

  • Relative ordering, when applicable, for collation purposes within the Grapheme Collection.

Each metadata field will use values that have been defined in secondary Unigraph Registries, e.g. Collection Registry, Category Registry, etc.

Once proposed, the new addition will be voted on by those who have previously contributed proposals that have been accepted. If a majority votes in favour, then the proposal will be accepted.

Every quarter, all updates to the Unigraph Registry will be packaged up for use by downstream services similar to the existing Olson/IANA Time Zone Database.

While entries added to the Unigraph Registry will never be deleted, a “renewal” fee will be required to keep entries alive over time:

  • If an entry is not renewed, its associated vector representation will be dropped from the quarterly snapshots, thus minimizing bloat from flash-in-the-pan meme emojis, etc.

  • Renderers will, of course, still be able to fetch the vector representation at runtime to ensure that legacy text still gets rendered properly.

Composition Elements

When composing graphemes in some scripts, IMEs (Input Method Editors) like keyboards often display composition elements, e.g. a diacritic like an acute accent, a particular stroke in Chinese, etc.

These composition elements are defined with their own IDs in a secondary Unigraph registry so that platforms can have consistent rendering and encodings of partial compositions during typing.

Unlike Unicode, these composition elements are not valid in Unigraph text, and are instead handled by a dedicated editable_text data type where they are encoded differently.

This allows for cross-platform IME consistency, while ensuring that you don’t get the craziness of arbitrary compositions like Zalgo text in Unicode.

UEF Encoding

Unlike Unicode’s multiple encoding formats (UTF-8, UTF-16, UTF-32) which create complexity and compatibility issues, Unigraph has just one format for encoding text: UEF (Unigraph Encoding Format).

UEF specifies how text, i.e. a sequence of Unigraph IDs, should be converted into bytes. It builds on many of the fantastic ideas behind UTF-8:

  • It is a variable-length encoding so that frequently used characters can use a lot less space, while not placing any arbitrary limit on the number of possible characters.

  • All ASCII characters are encoded identically to ASCII — preserving 100% compatibility with legacy outputs and helping to bootstrap adoption.

  • The first byte of every encoded ID is either 0xxxxxxx (ASCII) or 10xxxxxx followed by 11xxxxxx continuation bytes — enabling decoders to find their place, even if halfway through a stream.

  • The byte representations of the Unigraph IDs sort in the exact same order as the IDs themselves — making it easy to manipulate text for purposes like binary search, indexing, etc.

  • UEF encoded data will never contain the bytes 0xBE and 0xBF, so they can be used to safely delimit Unigraph text within other systems without any escapes or length prefixes.

  • No encoding of a Unigraph ID will be a substring of the encoding of any other ID — making it easy to search for text in byte streams without having to decode everything first.

UEF only differs from UTF-8 for the following reasons:

  • To support both Unigraph text, made up of just Unigraph IDs, as well as editable_text which has composition element IDs interspersed amongst Unigraph IDs.

  • To go beyond the max encodable value of 0x10FFFF in UTF-8 so that we can encode any Unigraph ID value without being constrained by an upper limit.

  • To ensure that there is only one possible encoding of any Unigraph ID. This will avoid the security issues caused by UTF-8 decoders not properly validating “overly long” encodings.

    • UEF achieves this by offsetting each multi-byte encoding by the number of values already encodable in fewer bytes, ensuring there’s only one way to encode any ID.

    • For example, in 2-byte encodings UEF will subtract the 128 values already encoded within single bytes. Thus, overlong encodings will be impossible.
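A minimal sketch of just the 2-byte case (using the 1000yyxx indicator pattern defined later in this section):

```python
def encode_2byte(uid):
    # 2-byte UEF covers IDs 0x80..0x47F: subtract the 128 single-byte values
    # first, so the all-zero payload maps to ID 0x80 and overlongs cannot exist.
    v = uid - 0x80
    if not 0 <= v < 1 << 10:
        raise ValueError("ID not encodable in 2 bytes")
    # High 4 payload bits in the 1000xxxx indicator, low 6 in the 11xxxxxx continuation.
    return bytes([0x80 | v >> 6, 0xC0 | v & 0x3F])

print(encode_2byte(0x80).hex())   # 80c0 — the smallest 2-byte ID
print(encode_2byte(0x47F).hex())  # 8fff — the largest 2-byte ID
```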

  • To simplify the implementation by not having to deal with edge cases like surrogate pairs, noncharacters, etc.

  • To make it easy to distinguish text written in UEF from UTF-8.

For encoding characters beyond ASCII, UEF takes a multi-byte approach similar to UTF-8:

  • That is, the first byte uses some bits as a “multi-byte indicator” and is followed by one or more continuation bytes that together encode the related value.

  • Multi-byte encodings of Unigraph IDs encode values from 128 upwards, while the multi-byte encodings of composition element IDs encode from 0 upwards.

  • The multi-byte indicator and continuation byte patterns are the opposites of each other in UEF and UTF-8:

    Byte Type UEF UTF-8
    Multi-byte Indicator 10xxxxxx 11xxxxxx
    Continuation Byte 11xxxxxx 10xxxxxx
  • The multi-byte indicator in UEF consists of:

    • Fixed bits like 0 and 1.

    • x bits which, together with the x bits in continuation bytes, represent the bits for the value being encoded, i.e. a Unigraph ID or composition element ID, in big endian order.

    • An optional c bit, which if set to 0 indicates that the value being encoded is a Unigraph ID, or if set to 1 indicates that it is a composition element ID.

      When the c bit is not present, then the value is always a Unigraph ID.

    • s bits which indicate the number of continuation bytes.

  • Multi-byte indicators for 2-4 byte encodings follow this pattern: 10ssxxxx, where the s bits indicate the total number of bytes:

    Bit Pattern Total Number of Bytes
    00 2
    01 3
    10 4
  • Multi-byte indicators for 5-7 byte encodings follow this pattern: 1011ssxx, where the s bits indicate the total number of bytes:

    Bit Pattern Total Number of Bytes
    00 5
    01 6
    10 7
  • Multi-byte indicators for 8-byte encodings follow this pattern: 1011110c, where the c bit indicates whether the continuation bytes encode a composition element ID or Unigraph ID.

    Since this is the first time that a c bit is present, composition element IDs occupy a minimum of 8 bytes. We believe this is fair as those elements will be relatively rare.

  • If, in an 8-byte encoding, all the continuation bytes are “filled”, i.e. are 11111111, then UEF switches into “extended” mode. The specification of this is left to the future, as:

    • At over 4.6 trillion Unigraph IDs and nearly 4.4 trillion composition element IDs, we’re unlikely to hit that limit anytime soon.

    • We may, at that point, want to include metadata bits for things like the character class and bidirectionality directly into the ID itself so as to minimize the need for lookups.

    The simplest specification for the extended mode would be that it signals an additional continuation byte of the format 10ssssss where the s bits indicate the number of extra bytes.

    This can then be indefinitely extended with the same mechanism, i.e. if the 10ssssss is 10111111 and all 64 continuation bytes are filled, i.e. 11111111.

    Although, after the first extension, we’d be looking at more characters than there are atoms in the currently observable universe, so it’d be rather unlikely. But one never knows.

To compare UEF against UTF-8, we can extend UTF-8 to 8 bytes based on its original design by Ken Thompson and Rob Pike:

Codepoint Range Codepoint Binary UTF-8 Bytes
0x00 to 0x7F 0xxxxxxx 0xxxxxxx
0x80 to 0x7FF 00000yyy xxxxxxxx 110yyyxx 10xxxxxx
0x800 to 0xFFFF yyyyyyyy xxxxxxxx 1110yyyy 10yyyyxx 10xxxxxx
0x10000 to 0x1FFFFF 000zzzzz yyyyyyyy xxxxxxxx 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
0x200000 to 0x3FFFFFF 000000ww zzzzzzzz yyyyyyyy xxxxxxxx 111110ww 10zzzzzz 10zzyyyy 10yyyyxx 10xxxxxx
0x04000000 to 0x7FFFFFFF 0wwwwwww zzzzzzzz yyyyyyyy xxxxxxxx 1111110w 10wwwwww 10zzzzzz 10zzyyyy 10yyyyxx 10xxxxxx
0x80000000 to 0xFFFFFFFFF 0000vvvv wwwwwwww zzzzzzzz yyyyyyyy xxxxxxxx 11111110 10vvvvww 10wwwwww 10zzzzzz 10zzyyyy 10yyyyxx 10xxxxxx
0x1000000000 to 0x3FFFFFFFFFF 000000uu vvvvvvvv wwwwwwww zzzzzzzz yyyyyyyy xxxxxxxx 11111111 10uuvvvv 10vvvvww 10wwwwww 10zzzzzz 10zzyyyy 10yyyyxx 10xxxxxx

And compare it to UEF encodings for the same number of bytes:

ID Range UEF Bytes
0x00 to 0x7F 0xxxxxxx
0x80 to 0x047F 1000yyxx 11xxxxxx
0x480 to 0x1047F 1001yyyy 11yyyyxx 11xxxxxx
0x10480 to 0x41047F 1010zzzz 11zzyyyy 11yyyyxx 11xxxxxx
0x410480 to 0x441047F 101100ww 11zzzzzz 11zzyyyy 11yyyyxx 11xxxxxx
0x4410480 to 0x10441047F 101101ww 11wwwwww 11zzzzzz 11zzyyyy 11yyyyxx 11xxxxxx
0x104410480 to 0x410441047F 101110vv 11vvvvww 11wwwwww 11zzzzzz 11zzyyyy 11yyyyxx 11xxxxxx
0x4104410480 to 0x4410441047E 1011110c 11uuvvvv 11vvvvww 11wwwwww 11zzzzzz 11zzyyyy 11yyyyxx 11xxxxxx

We can see that while UEF is not as space efficient for 2-byte encodings, it’s better for 3 bytes and above — with almost double the capacity at the 4-byte sweet spot:

Byte Count UEF Characters UTF-8 Characters
1 128 128
2 1,152 2,048
3 66,688 65,536
4 4,260,992 2,097,152
5 71,369,856 67,108,864
6 4,366,337,152 2,147,483,648
7 279,244,244,096 68,719,476,736
8 4,677,290,755,198 Unigraph IDs + 4,398,046,511,102 Composition Element IDs 4,398,046,511,103

This also gives us easily distinguishable multi-byte indicator bytes across UEF and UTF-8:

Number of Bytes UEF Prefix UTF-8 Prefix
2 1000xxxx 110xxxxx
3 1001xxxx 1110xxxx
4 1010xxxx 11110xxx

Consider a rough implementation of a UTF-8 encoder:

func EncodeUTF8(codepoints []rune) ([]byte, error) {
	buf := make([]byte, 0, len(codepoints))
	for i, cp := range codepoints {
		if cp < 0 || cp > 0x10FFFF {
			return nil, fmt.Errorf("unicode: codepoint U+%04X at position %d is outside of the Unicode range", cp, i)
		}
		if cp < 0x80 {
			buf = append(buf, byte(cp))
		} else if cp < 0x800 {
			buf = append(buf, 0xC0|byte(cp>>6), 0x80|byte(cp&0x3F))
		} else if cp < 0x10000 {
			if cp >= 0xD800 && cp <= 0xDFFF {
				return nil, fmt.Errorf("unicode: surrogate codepoint U+%04X found at position %d", cp, i)
			}
			buf = append(buf, 0xE0|byte(cp>>12), 0x80|byte((cp>>6)&0x3F), 0x80|byte(cp&0x3F))
		} else {
			buf = append(buf, 0xF0|byte(cp>>18), 0x80|byte((cp>>12)&0x3F), 0x80|byte((cp>>6)&0x3F), 0x80|byte(cp&0x3F))
		}
	}
	return buf, nil
}

The UEF version is pretty similar, despite supporting a broader range and making it impossible to produce overlong encodings:

func EncodeUEF(ids []uint32) ([]byte, error) {
	buf := make([]byte, 0, len(ids))
	for i, id := range ids {
		if id < 0x80 {
			buf = append(buf, byte(id))
		} else if id < 0x480 {
			id -= 0x80
			buf = append(buf, 0x80|byte(id>>6), 0xC0|byte(id&0x3F))
		} else if id < 0x10480 {
			id -= 0x480
			buf = append(buf, 0x90|byte(id>>12), 0xC0|byte((id>>6)&0x3F), 0xC0|byte(id&0x3F))
		} else if id < 0x410480 {
			id -= 0x10480
			buf = append(buf, 0xA0|byte(id>>18), 0xC0|byte((id>>12)&0x3F), 0xC0|byte((id>>6)&0x3F), 0xC0|byte(id&0x3F))
		} else if id < 0x4410480 {
			id -= 0x410480
			buf = append(buf, 0xB0|byte(id>>24), 0xC0|byte((id>>18)&0x3F), 0xC0|byte((id>>12)&0x3F), 0xC0|byte((id>>6)&0x3F), 0xC0|byte(id&0x3F))
		} else if id <= 0x7FFFFFFF {
			id -= 0x4410480
			buf = append(buf, 0xB4|byte(id>>30), 0xC0|byte((id>>24)&0x3F), 0xC0|byte((id>>18)&0x3F), 0xC0|byte((id>>12)&0x3F), 0xC0|byte((id>>6)&0x3F), 0xC0|byte(id&0x3F))
		} else {
			return nil, fmt.Errorf("unigraph: ID G+%X at position %d exceeds the current soft_cap of 0x7FFFFFFF", id, i)
		}
	}
	return buf, nil
}
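As a quick check of the ordering property, here is a Python port (a sketch, not normative) of the 1-3 byte cases from the Go encoder above:

```python
def encode_uef_small(uid):
    # 1-3 byte UEF cases only, mirroring the EncodeUEF branches above.
    if uid < 0x80:
        return bytes([uid])
    if uid < 0x480:
        v = uid - 0x80
        return bytes([0x80 | v >> 6, 0xC0 | v & 0x3F])
    if uid < 0x10480:
        v = uid - 0x480
        return bytes([0x90 | v >> 12, 0xC0 | v >> 6 & 0x3F, 0xC0 | v & 0x3F])
    raise NotImplementedError("4+ byte encodings omitted in this sketch")

# Encoded byte strings sort in exactly the same order as the IDs themselves.
ids = [0x41, 0x7F, 0x80, 0x47F, 0x480, 0x1047F]
encoded = [encode_uef_small(i) for i in ids]
print(encoded == sorted(encoded))  # True
```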

Likewise, consider a rough implementation of a UTF-8 decoder:

func DecodeUTF8(buf []byte) ([]rune, error) {
	var runes []rune
	i := 0
	for i < len(buf) {
		b := buf[i]
		if b < 0x80 {
			runes = append(runes, rune(b))
			i++
		} else if b < 0xC0 {
			return nil, fmt.Errorf("unicode: unexpected continuation byte 0x%02X at position %d", b, i)
		} else if b < 0xE0 {
			if i+1 >= len(buf) {
				return nil, fmt.Errorf("unicode: incomplete 2-byte sequence at position %d", i)
			}
			b2 := buf[i+1]
			if b2&0xC0 != 0x80 {
				return nil, fmt.Errorf("unicode: invalid continuation byte 0x%02X at position %d", b2, i+1)
			}
			cp := rune(b&0x1F)<<6 | rune(b2&0x3F)
			if cp < 0x80 {
				return nil, fmt.Errorf("unicode: overlong encoding at position %d", i)
			}
			runes = append(runes, cp)
			i += 2
		} else if b < 0xF0 {
			if i+2 >= len(buf) {
				return nil, fmt.Errorf("unicode: incomplete 3-byte sequence at position %d", i)
			}
			b2, b3 := buf[i+1], buf[i+2]
			if b2&0xC0 != 0x80 || b3&0xC0 != 0x80 {
				return nil, fmt.Errorf("unicode: invalid continuation byte in 3-byte sequence at position %d", i)
			}
			cp := rune(b&0x0F)<<12 | rune(b2&0x3F)<<6 | rune(b3&0x3F)
			if cp < 0x800 {
				return nil, fmt.Errorf("unicode: overlong encoding at position %d", i)
			}
			if cp >= 0xD800 && cp <= 0xDFFF {
				return nil, fmt.Errorf("unicode: surrogate codepoint U+%04X at position %d", cp, i)
			}
			if (cp >= 0xFDD0 && cp <= 0xFDEF) || cp == 0xFFFE || cp == 0xFFFF {
				return nil, fmt.Errorf("unicode: noncharacter U+%04X at position %d", cp, i)
			}
			runes = append(runes, cp)
			i += 3
		} else if b < 0xF8 {
			if i+3 >= len(buf) {
				return nil, fmt.Errorf("unicode: incomplete 4-byte sequence at position %d", i)
			}
			b2, b3, b4 := buf[i+1], buf[i+2], buf[i+3]
			if b2&0xC0 != 0x80 || b3&0xC0 != 0x80 || b4&0xC0 != 0x80 {
				return nil, fmt.Errorf("unicode: invalid continuation byte in 4-byte sequence at position %d", i)
			}
			cp := rune(b&0x07)<<18 | rune(b2&0x3F)<<12 | rune(b3&0x3F)<<6 | rune(b4&0x3F)
			if cp < 0x10000 {
				return nil, fmt.Errorf("unicode: overlong encoding at position %d", i)
			}
			if cp > 0x10FFFF {
				return nil, fmt.Errorf("unicode: codepoint U+%06X exceeds Unicode range at position %d", cp, i)
			}
			if (cp & 0xFFFE) == 0xFFFE {
				return nil, fmt.Errorf("unicode: noncharacter U+%06X at position %d", cp, i)
			}
			runes = append(runes, cp)
			i += 4
		} else {
			return nil, fmt.Errorf("unicode: invalid start byte 0x%02X at position %d", b, i)
		}
	}
	return runes, nil
}

The UEF version is almost the same length, despite supporting a broader range of values, as it doesn’t need to check for surrogate pairs, noncharacters, etc.

func DecodeUEF(buf []byte) ([]uint32, error) {
	var ids []uint32
	i := 0
	for i < len(buf) {
		b := buf[i]
		if b < 0x80 {
			ids = append(ids, uint32(b))
			i++
		} else if b < 0x90 {
			if i+1 >= len(buf) {
				return nil, fmt.Errorf("unigraph: incomplete 2-byte sequence at position %d", i)
			}
			b2 := buf[i+1]
			if b2&0xC0 != 0xC0 {
				return nil, fmt.Errorf("unigraph: invalid continuation byte 0x%02X at position %d", b2, i+1)
			}
			id := (uint32(b&0x0F)<<6 | uint32(b2&0x3F)) + 0x80
			ids = append(ids, id)
			i += 2
		} else if b < 0xA0 {
			if i+2 >= len(buf) {
				return nil, fmt.Errorf("unigraph: incomplete 3-byte sequence at position %d", i)
			}
			b2, b3 := buf[i+1], buf[i+2]
			if b2&0xC0 != 0xC0 || b3&0xC0 != 0xC0 {
				return nil, fmt.Errorf("unigraph: invalid continuation byte in 3-byte sequence at position %d", i)
			}
			id := (uint32(b&0x0F)<<12 | uint32(b2&0x3F)<<6 | uint32(b3&0x3F)) + 0x480
			ids = append(ids, id)
			i += 3
		} else if b < 0xB0 {
			if i+3 >= len(buf) {
				return nil, fmt.Errorf("unigraph: incomplete 4-byte sequence at position %d", i)
			}
			b2, b3, b4 := buf[i+1], buf[i+2], buf[i+3]
			if b2&0xC0 != 0xC0 || b3&0xC0 != 0xC0 || b4&0xC0 != 0xC0 {
				return nil, fmt.Errorf("unigraph: invalid continuation byte in 4-byte sequence at position %d", i)
			}
			id := (uint32(b&0x0F)<<18 | uint32(b2&0x3F)<<12 | uint32(b3&0x3F)<<6 | uint32(b4&0x3F)) + 0x10480
			ids = append(ids, id)
			i += 4
		} else if b < 0xB4 {
			if i+4 >= len(buf) {
				return nil, fmt.Errorf("unigraph: incomplete 5-byte sequence at position %d", i)
			}
			b2, b3, b4, b5 := buf[i+1], buf[i+2], buf[i+3], buf[i+4]
			if b2&0xC0 != 0xC0 || b3&0xC0 != 0xC0 || b4&0xC0 != 0xC0 || b5&0xC0 != 0xC0 {
				return nil, fmt.Errorf("unigraph: invalid continuation byte in 5-byte sequence at position %d", i)
			}
			id := (uint32(b&0x03)<<24 | uint32(b2&0x3F)<<18 | uint32(b3&0x3F)<<12 | uint32(b4&0x3F)<<6 | uint32(b5&0x3F)) + 0x410480
			ids = append(ids, id)
			i += 5
		} else if b < 0xB8 {
			if i+5 >= len(buf) {
				return nil, fmt.Errorf("unigraph: incomplete 6-byte sequence at position %d", i)
			}
			b2, b3, b4, b5, b6 := buf[i+1], buf[i+2], buf[i+3], buf[i+4], buf[i+5]
			if b2&0xC0 != 0xC0 || b3&0xC0 != 0xC0 || b4&0xC0 != 0xC0 || b5&0xC0 != 0xC0 || b6&0xC0 != 0xC0 {
				return nil, fmt.Errorf("unigraph: invalid continuation byte in 6-byte sequence at position %d", i)
			}
			id := (uint32(b&0x03)<<30 | uint32(b2&0x3F)<<24 | uint32(b3&0x3F)<<18 | uint32(b4&0x3F)<<12 | uint32(b5&0x3F)<<6 | uint32(b6&0x3F)) + 0x4410480
			if id > 0x7FFFFFFF {
				return nil, fmt.Errorf("unigraph: decoded ID G+%X at position %d exceeds the current soft_cap of 0x7FFFFFFF", id, i)
			}
			ids = append(ids, id)
			i += 6
		} else {
			return nil, fmt.Errorf("unigraph: invalid start byte 0x%02X at position %d, possibly exceeds current soft_cap of 0x7FFFFFFF", b, i)
		}
	}
	return ids, nil
}

Thus, we expect the performance of UEF to be comparable to UTF-8, if not slightly faster, particularly for operations like parsing, validation, iteration, length calculation, etc.

Thoughts on Adoption

To get Unigraph off the ground, we’ll need to first port over characters from Unicode:

  • We can take advantage of the existing corpus of data on the Web to help us assign precomposed forms of commonly used characters to lower Unigraph IDs.

  • We can work with communities to establish commonly agreed upon mechanisms for things like case folding and collation order for popular scripts.

  • We can provide libraries in various languages to convert between Unicode and Unigraph text, using the replacement character when a matching character doesn’t exist.

We expect Espra to be the primary driver of Unigraph adoption. This will be helped by the string type in Iroh, Espra’s programming language, which will be based entirely on top of Unigraph.

Outside of Espra, we expect relatively minimal adoption for a long time as Unicode is extremely well embedded, e.g. in file names, native apps, web standards, etc.

Over time, Unigraph should roughly follow the same trajectory as UTF-8 over the last 30 years, i.e. where usage was close to 0% in 1995 but is now used by over 98% of websites.

References

Uni15 Unicode Consortium. The Unicode Standard, Version 15.1. 2023.

To cite:

@misc{tav2025unigraph,
  title={Unigraph: Replacing Unicode with Grapheme Encoding},
  author={Tav},
  year={2025},
  primaryClass={cs.CR}
}