Demystifying Encoding and Length of String

· 4 min read
ymdarake

Hi👋 How you doin'?

When I text somebody, I write something like this at the beginning.

Do you do this too sometimes?

I like to use emojis as punctuation characters, pretty much.

Web input validation

So, what if you have to check the length of a message like that?

I know that I gotta be careful about stuff like .length, input[0] and so on,

but this time, I went a step further, into the internals.

Let's count string length in JavaScript

If you want to make sure users receive an error message when they input too many characters, you may write:

const input = "some random input";
if (input.length > 30) {
  alert("Sorry, your profile must be 30 characters or fewer!");
}

But what exactly is input.length counting? In the example above,

> const input = "some random input"; console.log(input.length);

17

Yeah, no problem there. But hey, what if your user writes something like:

> const input = "Hi👋 How you doin'?"; console.log(input.length);

19

> "👋".length

2

// ok, I know it's a multibyte character, that's fine ofc.
// let's try another lovely sentence
> const input = "I love my girlfriend👩‍❤️‍💋‍👨"; console.log(input.length);

31

// huh, this is getting wild...

> "👩‍❤️‍💋‍👨".length
11

// ok, fair enough...?

Hmm, OK I know something happens with emojis, but why?

Internal representation of strings in JavaScript

When you write JavaScript code, I assume you save the source file as UTF-8.

But JavaScript uses UTF-16 as its internal representation of strings, and .length counts UTF-16 code units.

First of all, in UTF-16 a character is represented in 2 bytes (one 16-bit code unit), not always but pretty much.

So, as you know, A is U+0041: 4 hex digits, so it fits in 2 bytes.

If you want to check the Unicode tables, see https://unicode.org/charts/

But let's think about it: how many characters can we deal with in this way?

Yes, it's 2 bytes, which means 2^16 = 65536 values (0 to 65535).

Is that enough? No.

So if you want to use less common characters like 👋, which live outside that range, you have to use surrogates,

roughly meaning you have to treat 2 code units as 1 character.

OK, then what is the Unicode code point of 👋? Let's check at https://emojipedia.org/emoji/%F0%9F%91%8B/ (emojipedia comes in handy)

So 👋's code point is U+1F44B. (Is that codepoints btw? Help from English speakers wanted...)
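To see concretely why one 16-bit unit is not enough for this code point, here is a small sketch. String.fromCharCode works on 16-bit code units and keeps only the low 16 bits of its argument, while String.fromCodePoint accepts the full code point. The variable names are just for illustration.

```javascript
// 16 bits give 2^16 = 65536 distinct values: 0x0000 through 0xFFFF.
const codeUnitValues = 2 ** 16;

// U+1F44B (waving hand) is above 0xFFFF, so it does not fit in one code unit.
const wave = 0x1f44b;
const fitsInOneUnit = wave <= 0xffff;

// fromCharCode truncates to 16 bits: 0x1F44B becomes 0xF44B, a different character.
const truncated = String.fromCharCode(wave);

// fromCodePoint handles the full range and produces two code units.
const correct = String.fromCodePoint(wave);

console.log(codeUnitValues);                       // 65536
console.log(fitsInOneUnit);                        // false
console.log(truncated.charCodeAt(0).toString(16)); // f44b
console.log(correct.length);                       // 2
```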

As you've noticed, U+1F44B is 5 hex digits and greater than 65535 (0xFFFF).

With that pointed out, in JavaScript (UTF-16):

> "👋".split("")

(2) ['\uD83D', '\uDC4B']

Voila! (?)

High surrogate U+D83D + low surrogate U+DC4B

become 1 character with the power, or cost, of 2 characters. What a story!
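Where do D83D and DC4B come from? As far as I understand the UTF-16 rules, you subtract 0x10000 from the code point and split the remaining 20 bits into two halves of 10 bits. A minimal sketch, where toSurrogatePair is a hypothetical helper name and not a built-in:

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper name, not a built-in.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20 bits remain
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f44b);
console.log(high.toString(16), low.toString(16)); // d83d dc4b

// The built-in methods agree with the manual calculation.
const wave = String.fromCodePoint(0x1f44b);
console.log(wave.charCodeAt(0) === high);      // true
console.log(wave.charCodeAt(1) === low);       // true
console.log(wave.codePointAt(0).toString(16)); // 1f44b
```

In everyday code you would just use codePointAt and String.fromCodePoint; the manual version is only there to show the arithmetic.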

Zero Width Joiner

Next, what about 👩‍❤️‍💋‍👨? Isn't it weird that its .length is 11??

Let's unpack, or split it:

> "👩‍❤️‍💋‍👨".split("")

(11) ['\uD83D', '\uDC69', '‍', '❤', '️', '‍', '\uD83D', '\uDC8B', '‍', '\uD83D', '\uDC68']

It looks something like emoji + empty char + heart + empty char + empty char + emoji + empty char + emoji.

You want more organized information? Look here! https://emojipedia.org/emoji/%F0%9F%91%A9%E2%80%8D%E2%9D%A4%EF%B8%8F%E2%80%8D%F0%9F%92%8B%E2%80%8D%F0%9F%91%A8/

It says Codepoints U+1F469, U+200D, U+2764, U+FE0F, U+200D, U+1F48B, U+200D, U+1F468,

which is 👩 + something + ❤ + something + something + 💋 + something + 👨.

Kinda makes sense.
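You can get the same list without leaving the console. A small sketch: Array.from iterates a string by code point, so surrogate pairs stay together. The emoji is written with escapes so the invisible characters are explicit, and toUPlus is a hypothetical helper name.

```javascript
// The kiss emoji written with escapes so the invisible characters are explicit.
const kiss = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F468}";

// Format one code point as "U+XXXX". toUPlus is a hypothetical helper name.
const toUPlus = (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");

// Array.from iterates by code point, so surrogate pairs stay together.
const codePoints = Array.from(kiss, toUPlus);

console.log(kiss.length);       // 11 (UTF-16 code units)
console.log(codePoints.length); // 8 (code points)
console.log(codePoints.join(" "));
// U+1F469 U+200D U+2764 U+FE0F U+200D U+1F48B U+200D U+1F468
```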

Given U+1F469 is equivalent to U+D83D + U+DC69, U+200D is one of the empty chars.

What is this?

Let's take a look at https://www.compart.com/en/unicode/U+200D this time.

Ah, your name is Zero Width Joiner (ZWJ), nice to SEE you finally.

It's used to join characters. It might sound a bit weird, but you can join characters. Huh!

You can see examples here: https://unicode.org/emoji/charts/emoji-zwj-sequences.html

and the Wikipedia article is here if you want: https://en.wikipedia.org/wiki/Zero-width_joiner
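Here is a small sketch of joining and un-joining with ZWJ. The woman + laptop pair is one of the sequences on the chart linked above, as far as I know; whether it renders as one glyph depends on your platform and font. The constant names are my own.

```javascript
const ZWJ = "\u200D";

// Woman + ZWJ + laptop renders as a single "woman technologist" emoji
// on platforms that support the sequence.
const technologist = ["\u{1F469}", "\u{1F4BB}"].join(ZWJ);
console.log(technologist.length); // 5 (2 + 1 + 2 code units)

// Splitting the kiss emoji on ZWJ recovers its four building blocks.
const kiss = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F468}";
const parts = kiss.split(ZWJ);
console.log(parts.length);               // 4
console.log(parts.map((p) => p.length)); // [ 2, 2, 2, 2 ]
```

The second element has length 2 because the heart still carries its trailing U+FE0F.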

Variation Selector

Another empty char is U+FE0F Variation Selector-16 (VS16): https://www.compart.com/en/unicode/U+FE0F

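You can read the details on the Unicode website but, in short, a variation selector asks for a particular presentation of the character right before it:

with U+2665 U+FE0E you get the text-style (black) heart, while with U+2665 U+FE0F you get the emoji-style (red) heart.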
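The same pair of selectors works on U+2764, the heart inside the kiss emoji. A minimal sketch, with constant names of my own choosing; how each variant is actually drawn depends on the font:

```javascript
const HEART = "\u2764"; // U+2764, the heart used in the kiss emoji above
const VS15 = "\uFE0E";  // Variation Selector-15: request text presentation
const VS16 = "\uFE0F";  // Variation Selector-16: request emoji presentation

const textHeart = HEART + VS15;
const emojiHeart = HEART + VS16;

// Both are two code units long; only the invisible selector differs.
console.log(textHeart.length, emojiHeart.length); // 2 2
console.log(textHeart === emojiHeart);            // false

// Removing the selectors leaves the same base character.
const stripSelectors = (s) => s.replace(/[\uFE0E\uFE0F]/g, "");
console.log(stripSelectors(textHeart) === stripSelectors(emojiHeart)); // true
```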

Now, if you write 👩‍❤️‍💋‍👨 and hit the backspace key (in VSCode, I checked), what do you get and why?

> "👩‍❤️‍💋‍👨"
// then hit backspace to get
> "👩‍❤️‍💋"

// to see what you deleted, you do:
> "👩‍❤️‍💋".split("")

(8) ['\uD83D', '\uDC69', '‍', '❤', '️', '‍', '\uD83D', '\uDC8B']

Ah, I deleted the man (and the ZWJ before him) and split them up... Sad but now it's explainable to some extent...🫣

(You will probably get 👩‍❤️ in Chrome's devtools console)
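Coming back to the validation question from the top: one way to count what a user perceives as characters is Intl.Segmenter with grapheme granularity, which recent browsers and Node.js ship. This is only a sketch under assumptions: countGraphemes and isTooLong are hypothetical helper names, the limit of 30 comes from the first example, and the exact segmentation depends on the Unicode data in your runtime.

```javascript
// Count user-perceived characters (grapheme clusters) rather than code units.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// Same rule as the first example, but measured in grapheme clusters.
function isTooLong(text, limit = 30) {
  return countGraphemes(text) > limit;
}

const kiss = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F468}";
const message = "I love my girlfriend" + kiss;

console.log(message.length);          // 31 code units
console.log([...message].length);     // 28 code points
console.log(countGraphemes(message)); // 21 on runtimes with current Unicode data
console.log(isTooLong(message));      // false, even though message.length > 30
```

Whether the limit should be in code units, code points, or grapheme clusters depends on what the limit protects (storage size or what the user sees), so treat this as one option rather than the answer.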

OK, I have to dig into this more, but that's it for today. Cool!

I hope it was clear, and hopefully not wrong at least. (corrections welcome!🙏)

Appendix: Things I want to talk about later

Grapheme Cluster

Japanese text encoding in Golang

MySQL

  • The well-known "Sushi-Beer" problem: is utf8mb4_bin a silver bullet?

To Be Continued...