Hi👋 How you doin'?
When I text somebody, I write something like this at the beginning.
Do you do this too sometimes?
I like to use emojis as punctuation characters, pretty much.
Web input validation
So, what if you have to check the length of a message like that?
I know that I gotta be careful about stuff like .length, input[0], and so on,
but this time, I dug a bit deeper into the internals.
Let's count string length in JavaScript
If you want to make sure users receive an error message when they input too many characters, you may write:
const input = "some random input";
if (input.length > 30) {
  alert("sorry, your profile must be 30 characters or fewer!");
}
But what exactly is input.length counting? In the example above,
> const input = "some random input"; console.log(input.length);
17
Yeah, no problem there. But hey, what if your user writes something like:
> const input = "Hi👋 How you doin'?"; console.log(input.length);
19
> "👋".length
2
// ok, I know it's a multibyte character, that's fine ofc.
// let's try another lovely sentence
> const input = "I love my girlfriend👩‍❤️‍💋‍👨"; console.log(input.length);
31
// huh, this is getting wild...
> "👩‍❤️‍💋‍👨".length
11
// ok, fair enough...?
Hmm, OK I know something happens with emojis, but why?
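Before getting to the why, here is a small sketch that counts the two strings above in three different ways. The helper names (countCodeUnits, countCodePoints, countGraphemes) are made up for illustration, and the last one assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js do). The emojis are written as escapes so the snippet survives copy and paste.

```javascript
// Hypothetical helpers for illustration; the names are not from any library.

// What .length reports: the number of UTF-16 code units.
function countCodeUnits(s) {
  return s.length;
}

// String iteration walks by code point, so a surrogate pair counts once.
function countCodePoints(s) {
  return [...s].length;
}

// Assumes Intl.Segmenter exists; counts user-perceived characters
// (grapheme clusters), so a whole ZWJ emoji sequence counts once.
function countGraphemes(s) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(s)].length;
}

// "Hi" + waving hand + " How you doin'?"
const wave = "Hi\u{1F44B} How you doin'?";
// "I love my girlfriend" + the kiss emoji (woman, ZWJ, heart, VS16, ZWJ, kiss mark, ZWJ, man)
const kiss =
  "I love my girlfriend" +
  "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F468}";

for (const s of [wave, kiss]) {
  console.log(countCodeUnits(s), countCodePoints(s), countGraphemes(s));
}
// prints "19 18 18" and then "31 28 21"
```

Which of the three numbers is the "right" one for a 30-character limit depends on what the limit is for (storage size, display width, and so on), so treat this as a way to see the difference rather than as a recommendation.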
Internal representation of string in JavaScript
When you write JavaScript code, I assume you save your source files as UTF-8.
But JavaScript uses UTF-16 as its internal representation of strings.
First of all, when you use UTF-16 you represent a character in 2 bytes, not always but pretty much.
So, as you know, it's like A is U+0041: 4 digits of hex, so OK, it's 2 bytes.
If you want to check the Unicode tables, see https://unicode.org/charts/
But let's think about it, how many characters can we deal with in this way?
Yes, it's 2 bytes, which means 2^16 = 65536 values (0 to 65535).
Is that enough? No.
So if you want to use characters outside that range like 👋, you have to use surrogates,
roughly meaning you have to treat 2 UTF-16 code units (a surrogate pair) as 1 character.
OK, then what is the Unicode code point of 👋? Let's check at https://emojipedia.org/emoji/%F0%9F%91%8B/ (Emojipedia comes in handy)
So 👋's code point is U+1F44B. (Or is that "codepoints", btw? Help from English speakers wanted...)
As you've noticed, U+1F44B is 5 digits of hex and over 65535.
With that pointed out, in JavaScript (UTF-16):
> "👋".split("")
(2) ['\uD83D', '\uDC4B']
Voila! (?)
High surrogate U+D83D + low surrogate U+DC4B
become 1 character with the power, or cost, of 2 characters. What a story!
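If you are curious where D83D and DC4B come from, the arithmetic is short enough to sketch. The function name toSurrogatePair is made up for this post; the steps follow the standard UTF-16 rule: subtract 0x10000, then put the top 10 bits on 0xD800 and the bottom 10 bits on 0xDC00.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point must be in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;      // a 20-bit value
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f44b);
console.log(high.toString(16), low.toString(16)); // d83d dc4b

// The built-ins agree in both directions:
console.log(String.fromCharCode(high, low) === "\u{1F44B}");  // true
console.log("\u{1F44B}".codePointAt(0).toString(16));         // 1f44b
```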
Zero Width Joiner
Next, what about 👩‍❤️‍💋‍👨? Isn't it weird that its .length is 11??
Let's unpack, or split it:
> "👩‍❤️‍💋‍👨".split("")
(11) ['\uD83D', '\uDC69', '‍', '❤', '️', '‍', '\uD83D', '\uDC8B', '‍', '\uD83D', '\uDC68']
It looks something like emoji + empty char + heart + empty char + empty char + emoji + empty char + emoji.
You want more organized information? Look here! https://emojipedia.org/emoji/%F0%9F%91%A9%E2%80%8D%E2%9D%A4%EF%B8%8F%E2%80%8D%F0%9F%92%8B%E2%80%8D%F0%9F%91%A8/
It says Codepoints U+1F469, U+200D, U+2764, U+FE0F, U+200D, U+1F48B, U+200D, U+1F468,
which is 👩 + something + ❤️ + something + 💋 + something + 👨.
Kinda makes sense.
Given U+1F469 is equivalent to U+D83D + U+DC69, U+200D is one of the empty chars.
What is this?
Let's take a look at https://www.compart.com/en/unicode/U+200D this time.
Ah, your name is Zero Width Joiner (ZWJ), nice to SEE you finally.
It's used to join characters. It might sound a bit weird, but you can join characters. Huh!
You can see examples here: https://unicode.org/emoji/charts/emoji-zwj-sequences.html
and the Wikipedia article is here if you want: https://en.wikipedia.org/wiki/Zero-width_joiner
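To see the joining from the other direction, here is a sketch that builds the kiss emoji from the code points listed on Emojipedia and then takes it apart again. The variable names are just for illustration; String.fromCodePoint and codePointAt are standard built-ins.

```javascript
// Build the emoji from its 8 code points (3 of them are ZWJ, U+200D).
const kiss = String.fromCodePoint(
  0x1f469, 0x200d, // woman + ZWJ
  0x2764, 0xfe0f,  // heart + VS16
  0x200d,          // ZWJ
  0x1f48b, 0x200d, // kiss mark + ZWJ
  0x1f468          // man
);

console.log(kiss.length);      // 11 (UTF-16 code units)
console.log([...kiss].length); // 8  (code points)

// List each code point in U+XXXX form.
const labels = [...kiss].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(labels.join(", "));
// U+1F469, U+200D, U+2764, U+FE0F, U+200D, U+1F48B, U+200D, U+1F468

// Splitting on the joiner gives back the pieces that were glued together.
console.log(kiss.split("\u200D").length); // 4
```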
Variation Selector
Another empty char is U+FE0F Variation Selector-16 (VS16): https://www.compart.com/en/unicode/U+FE0F
You can read the details on the Unicode website but, in short, a variation selector picks how the character before it is displayed:
with U+2665 U+FE0E you get a black (text-style) heart, while with U+2665 U+FE0F you get a red (emoji-style) heart.
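A quick sketch of what that means for string handling: the two hearts look different on screen, but in code they are the same base character plus one invisible extra code unit. The helper stripVariationSelectors is a made-up name, and it only covers the U+FE00 to U+FE0F block, which is an assumption about what you would want to strip.

```javascript
const heart = "\u2665";              // base character
const textStyle = heart + "\uFE0E";  // + VS15: text presentation
const emojiStyle = heart + "\uFE0F"; // + VS16: emoji presentation

console.log(textStyle.length, emojiStyle.length); // 2 2
console.log(textStyle === emojiStyle);            // false

// Hypothetical helper: drop variation selectors U+FE00..U+FE0F so that
// both variants compare equal to the bare base character.
function stripVariationSelectors(s) {
  return s.replace(/[\uFE00-\uFE0F]/g, "");
}

console.log(stripVariationSelectors(textStyle) === heart);  // true
console.log(stripVariationSelectors(emojiStyle) === heart); // true
```

Whether stripping is a good idea depends on the use case: it makes comparisons simpler but throws away how the user wanted the character to look.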
Now, if you write 👩‍❤️‍💋‍👨 and hit the backspace key (in VSCode, I checked), what do you get and why?
> "👩‍❤️‍💋‍👨"
// then hit backspace to get
> "👩‍❤️‍💋"
// to see what you deleted, you do:
> "👩‍❤️‍💋".split("")
(8) ['\uD83D', '\uDC69', '‍', '❤', '️', '‍', '\uD83D', '\uDC8B']
Ah, I deleted the man and split them up... Sad but now it's explainable to some extent...🫣
(You will probably get 👩‍❤️ in Chrome's DevTools console)
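The result can be reproduced in code by dropping the last two code points (the final ZWJ and the man). To be clear, this is only a sketch that matches the observed output; how VSCode or Chrome actually decides what a backspace deletes is not something this post has verified.

```javascript
const kiss = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F468}";

// Spread iterates by code point: 8 entries for this emoji.
const codePoints = [...kiss];

// Drop the trailing ZWJ + man, which matches what was left in VSCode.
const afterVSCode = codePoints.slice(0, -2).join("");
console.log(afterVSCode.length); // 8
console.log(afterVSCode === "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}"); // true

// Keeping only woman + ZWJ + heart + VS16 matches the Chrome result.
const afterChrome = codePoints.slice(0, 4).join("");
console.log(afterChrome.length); // 5
```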
OK, I have to dig into this but that's it for today. Cool!
I hope it was clear, and hopefully not wrong at least. (corrections welcome!)
Appendix: Things I want to talk about later
Grapheme Cluster
Japanese text encoding in Golang
- https://cs.opensource.google/go/x/text/+/master:encoding/japanese/maketables.go
- https://encoding.spec.whatwg.org/#shift_jis
- https://encoding.spec.whatwg.org/index-jis0208.txt
MySQL
- The well-known "Sushi-Beer" problem: is utf8mb4_bin a silver bullet?
To Be Continued...