Demystifying Encoding and Length of String

Hi👋 How you doin'?

When I text somebody, I write something like this at the beginning.

Do you do this too sometimes?

I like to use emojis as punctuation characters, pretty much.

WEB input validation

So, what if you have to check the length of a message like that?

I know that I gotta be careful about stuff like .length, input[0] and so on,

but this time, I dug further into the internals.

Let's count string length in JavaScript

If you want to make sure users receive an error message when they input too many characters, you may write:

const input = "some random input";
// .length is the number of UTF-16 code units, not what the user sees as characters
if (input.length > 30) {
    alert("sorry, your profile must be 30 characters or fewer!");
}

But what exactly is input.length counting? In the example above,

> const input = "some random input"; console.log(input.length);

17

Yeah, no problem there. But hey, what if your user writes something like:

> const input = "Hi👋 How you doin'?"; console.log(input.length);

19

> "👋".length

2

// ok, I know it's a multibyte character, that's fine ofc.
// let's try another lovely sentence
> const input = "I love my girlfriend👩‍❤️‍💋‍👨"; console.log(input.length);

31

// huh, this is getting wild...

> "👩‍❤️‍💋‍👨".length
11

// ok, fair enough...?

Hmm, OK I know something happens with emojis, but why?

Internal representation of string in JavaScript

When you write JavaScript code, I assume you set the encoding to UTF-8.

But JavaScript uses UTF-16 as its internal representation of strings.

First of all, when you use UTF-16 you have to represent a character in 2 bytes, not always but pretty much.

So, as you know, it's like A is U+0041, 4 digits of hex, ok it's 2 bytes.
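To make the "2 bytes" point concrete, here is a minimal sketch comparing byte sizes in UTF-16 and UTF-8. The helper names utf16Bytes and utf8Bytes are made up for this post, and it assumes a runtime where TextEncoder is available as a global (modern browsers and recent Node.js).

```javascript
// Hypothetical helpers for illustration only.
// .length counts UTF-16 code units, and each code unit is 2 bytes.
function utf16Bytes(text) {
  return text.length * 2;
}

// TextEncoder always encodes to UTF-8.
function utf8Bytes(text) {
  return new TextEncoder().encode(text).length;
}

console.log("A".charCodeAt(0).toString(16));               // 41
console.log(utf16Bytes("A"), utf8Bytes("A"));               // 2 1
console.log(utf16Bytes("\u{1F44B}"), utf8Bytes("\u{1F44B}")); // 4 4
```

So A really is one 2-byte unit in UTF-16, while 👋 (written as \u{1F44B} above) already needs 4 bytes.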

If you want to check the Unicode charts, see https://unicode.org/charts/

But let's think about it, how many characters can we deal with in this way?

Yes, it's 2 bytes, which means 2^16 = 65536 values (U+0000 to U+FFFF).

Is that enough? No.

So if you want to use characters that don't fit in that range, like 👋, you have to use surrogates,

roughly meaning you have to treat 2 UTF-16 code units (the things .length counts) as 1 character.

OK, then what is the Unicode code point of 👋? Let's check at https://emojipedia.org/emoji/%F0%9F%91%8B/ (emojipedia comes in handy)

So 👋's code point is U+1F44B. (Is that codepoints btw? Help from English speakers wanted...)

As you've noticed, U+1F44B is 5 digits of hex and over 65535,

so it doesn't fit in a single 2-byte unit. With that in mind, in JavaScript (UTF-16):

> "👋".split("")

(2) ['\uD83D', '\uDC4B']

Voila! (?)

High surrogate 'U+D83D' + low surrogate 'U+DC4B'

become 1 character with power, or cost, of 2 characters. What a story!
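If you wonder where D83D and DC4B come from, here is a small sketch of the arithmetic UTF-16 uses to split a code point above U+FFFF into two surrogates. The function name toSurrogatePair is my own, not a built-in.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("this code point does not use a surrogate pair");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits go to the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go to the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F44B);
console.log(high.toString(16), low.toString(16));            // d83d dc4b
console.log(String.fromCharCode(high, low) === "\u{1F44B}"); // true

// Going the other way, the built-ins work per code point instead of per code unit:
console.log("\u{1F44B}".codePointAt(0).toString(16)); // 1f44b
console.log([..."\u{1F44B}"].length);                 // 1
```

Spreading a string (or using Array.from) iterates by code point, so it doesn't cut a surrogate pair in half the way split("") does.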

Zero Width Joiner

Next, what about 👩‍❤️‍💋‍👨? Isn't it weird that its .length is 11??

Let's unpack, or split it:

> "👩‍❤️‍💋‍👨".split("")

(11) ['\uD83D', '\uDC69', '‍', '❤', '️', '‍', '\uD83D', '\uDC8B', '‍', '\uD83D', '\uDC68']

It looks something like emoji + empty char + heart + empty char + empty char + emoji + empty char + emoji.

You want more organized information? Look here! https://emojipedia.org/emoji/%F0%9F%91%A9%E2%80%8D%E2%9D%A4%EF%B8%8F%E2%80%8D%F0%9F%92%8B%E2%80%8D%F0%9F%91%A8/

It says Codepoints U+1F469, U+200D, U+2764, U+FE0F, U+200D, U+1F48B, U+200D, U+1F468,

which is 👩 + something + ❤ + something + something + 💋 + something + 👨.

Kinda makes sense.

Given U+1F469 is equivalent to U+D83D + U+DC69, U+200D is one of the empty chars.

What is this?

Let's take a look at https://www.compart.com/en/unicode/U+200D this time.

Ah, your name is Zero Width Joiner (ZWJ), nice to SEE you finally.

It's used to join characters. It might sound a bit weird, but you can join characters. Huh!

You can see examples here: https://unicode.org/emoji/charts/emoji-zwj-sequences.html

and Wiki is here if you want: https://en.wikipedia.org/wiki/Zero-width_joiner
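As a rough sketch, you can build the same sequence yourself by joining the pieces with ZWJ, using the code points from the emojipedia list above. The variable names are mine, and whether the result renders as a single glyph depends on your font and platform.

```javascript
const ZWJ = "\u200D";  // Zero Width Joiner
const VS16 = "\uFE0F"; // the other "empty char" from the code point list

// woman + ZWJ + heart (with VS16) + ZWJ + kiss mark + ZWJ + man
const kiss = ["\u{1F469}", "\u2764" + VS16, "\u{1F48B}", "\u{1F468}"].join(ZWJ);

console.log(kiss);        // should render as one emoji where the font supports it
console.log(kiss.length); // 11 (UTF-16 code units)

// Array.from walks the string by code point, so surrogate pairs stay together.
const codePoints = Array.from(
  kiss,
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);
console.log(codePoints.length);     // 8
console.log(codePoints.join(", "));
// U+1F469, U+200D, U+2764, U+FE0F, U+200D, U+1F48B, U+200D, U+1F468
```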

Variation Selector

Another empty char is U+FE0F Variation Selector-16 (VS16): https://www.compart.com/en/unicode/U+FE0F

You can read the details on the Unicode website but, in short,

with U+2665 U+FE0E (Variation Selector-15) you get a black, text-style heart, while with U+2665 U+FE0F you get a red, emoji-style heart.
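Here is a tiny sketch of that difference. How the two hearts actually look depends on your font, and stripVariationSelectors is a made-up helper that only handles these two selectors.

```javascript
const heart = "\u2665";
const textStyle = heart + "\uFE0E";  // VS15: asks for text presentation
const emojiStyle = heart + "\uFE0F"; // VS16: asks for emoji presentation

console.log(textStyle, emojiStyle);               // rendering depends on the font
console.log(textStyle.length, emojiStyle.length); // 2 2
console.log(textStyle === emojiStyle);            // false

// Hypothetical helper: drop VS15 / VS16 before comparing two strings.
function stripVariationSelectors(text) {
  return text.replace(/[\uFE0E\uFE0F]/g, "");
}

console.log(
  stripVariationSelectors(textStyle) === stripVariationSelectors(emojiStyle)
); // true
```

So the selector is invisible, but it still counts toward .length and makes the two strings unequal.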

Now, if you write 👩‍❤️‍💋‍👨 and hit the backspace key (in VSCode, I checked), what do you get and why?

> "👩‍❤️‍💋‍👨"
// then hit backspace to get
> "👩‍❤️‍💋"

// to see what you deleted, you do:
> "👩‍❤️‍💋".split("")

(8) ['\uD83D', '\uDC69', '‍', '❤', '️', '‍', '\uD83D', '\uDC8B']

Ah, I deleted the man and split them up... Sad, but now it's explainable to some extent...🫣

(You will probably get 👩‍❤️ in Chrome's DevTools console)
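Coming back to the input validation from the beginning: one possible approach (a sketch, not the definitive answer) is to count what the user perceives as characters with Intl.Segmenter. This assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js do), and countGraphemes / isProfileTooLong are names I made up.

```javascript
// Count user-perceived characters (grapheme clusters) instead of UTF-16 code units.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) count++;
  return count;
}

// Hypothetical validator; the limit of 30 comes from the earlier example.
function isProfileTooLong(text, limit = 30) {
  return countGraphemes(text) > limit;
}

const greeting = "Hi\u{1F44B} How you doin'?";
console.log(greeting.length);          // 19 (UTF-16 code units)
console.log([...greeting].length);     // 18 (code points)
console.log(countGraphemes(greeting)); // 18 (grapheme clusters)

const kiss = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F468}";
console.log(kiss.length);          // 11
console.log([...kiss].length);     // 8
console.log(countGraphemes(kiss)); // should be 1 with up-to-date Unicode data

console.log(isProfileTooLong("I love my girlfriend" + kiss)); // false (21 visible characters)
```

Keep in mind that the server or database may count bytes or code points instead, so the limit you show to users and the limit you store with can still disagree.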

OK, I have to dig into this more, but that's it for today. Cool!

I hope it was clear, and hopefully not wrong at least. (corrections welcome!🙏)

Appendix: Things I want to talk about later

Grapheme Cluster

Japanese text encoding in Golang

MySQL
