Question 1

What is the difference between a Unicode code point, a character, and a glyph?

Accepted Answer

Code point: a number assigned by Unicode (e.g., U+0041). Character: the abstract unit of text that number identifies (the letter A). Glyph: the visual shape a font draws for it. One code point usually corresponds to one character. One character can have many glyphs (A looks different in different fonts). One glyph can represent several code points: the emoji sequence 👨‍💻 is three code points (U+1F468 man, U+200D zero-width joiner, U+1F4BB laptop) rendered as one glyph. This is why `"👨‍💻".length === 5` in JavaScript (2 + 1 + 2 UTF-16 code units, since each emoji is a surrogate pair and the joiner is a single unit) but `[..."👨‍💻"].length === 3` (3 code points).
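The three counts above can be checked directly. The following is a minimal sketch, assuming a runtime that provides `Intl.Segmenter` (recent browsers and Node.js versions do); the variable names are illustrative only.

```javascript
// Man (U+1F468) + zero-width joiner (U+200D) + laptop (U+1F4BB),
// written with escapes so the individual code points stay visible.
const technologist = "\u{1F468}\u200D\u{1F4BB}";

// UTF-16 code units: 2 (surrogate pair) + 1 (joiner) + 2 (surrogate pair).
const codeUnits = technologist.length; // 5

// Code points: string iteration walks by code point, not by code unit.
const codePoints = [...technologist].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
); // ["U+1F468", "U+200D", "U+1F4BB"]

// User-perceived characters (grapheme clusters), which is the closest
// string-level approximation of "what gets drawn as one glyph".
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(technologist)].length; // 1

console.log({ codeUnits, codePoints, graphemes });
```

Grapheme segmentation is only an approximation of glyph count: whether the sequence is actually drawn as a single glyph depends on the font in use.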

Question 2

What is the difference between UTF-8, UTF-16, and UTF-32?

Accepted Answer

All three encode the same Unicode code points; they differ in how many bytes each code point takes. UTF-32 uses exactly 4 bytes per code point, which makes indexing by code point trivial but wastes space for most text. UTF-16 uses 2 bytes for BMP characters (U+0000–U+FFFF) and 4 bytes (a surrogate pair) for characters above U+FFFF. UTF-8 uses 1 byte for ASCII (U+0000–U+007F), 2 bytes for U+0080–U+07FF, 3 bytes for U+0800–U+FFFF, and 4 bytes for U+10000–U+10FFFF. UTF-8 is the dominant web encoding (backward-compatible with ASCII, space-efficient for Latin text).
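The byte counts above can be compared for a few sample characters. This is a small sketch, assuming the standard `TextEncoder` global is available; `encodedSizes` is a hypothetical helper name, and the UTF-16 and UTF-32 sizes are computed arithmetically rather than by actually encoding.

```javascript
// Returns the encoded size in bytes of `text` in each encoding form.
// Assumes `text` is well-formed (no lone surrogates): TextEncoder would
// replace a lone surrogate with U+FFFD, which changes the UTF-8 count.
function encodedSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // actual UTF-8 bytes
    utf16: text.length * 2,                      // 2 bytes per code unit
    utf32: [...text].length * 4,                 // 4 bytes per code point
  };
}

console.log(encodedSizes("A"));         // { utf8: 1, utf16: 2, utf32: 4 }
console.log(encodedSizes("\u00E9"));    // é: { utf8: 2, utf16: 2, utf32: 4 }
console.log(encodedSizes("\u20AC"));    // €: { utf8: 3, utf16: 2, utf32: 4 }
console.log(encodedSizes("\u{1F600}")); // 😀: { utf8: 4, utf16: 4, utf32: 4 }
```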

Question 3

What are Unicode categories?

Accepted Answer

Unicode assigns every code point a general category: L (Letter: Ll lowercase, Lu uppercase, Lt titlecase, Lm modifier, Lo other), M (Mark: Mn non-spacing, Mc spacing combining, Me enclosing), N (Number: Nd decimal digit, Nl letter number, No other number), P (Punctuation), S (Symbol), Z (Separator), C (Other: Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned). In regex engines that support Unicode property escapes, `\p{L}` matches any Unicode letter and `\p{N}` any number; in JavaScript the pattern also needs the `u` flag, as in `/\p{L}/u`.
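As an illustration of the category escapes, the sketch below maps a character to its major category by trying each `\p{...}` class in turn. `majorCategory` and `MAJOR_CATEGORIES` are hypothetical names chosen for this example, not part of any library.

```javascript
// One pattern per major general category. The `u` flag is required
// for \p{...} property escapes in JavaScript.
const MAJOR_CATEGORIES = [
  ["L", /^\p{L}$/u], // Letter
  ["M", /^\p{M}$/u], // Mark
  ["N", /^\p{N}$/u], // Number
  ["P", /^\p{P}$/u], // Punctuation
  ["S", /^\p{S}$/u], // Symbol
  ["Z", /^\p{Z}$/u], // Separator
  ["C", /^\p{C}$/u], // Other (control, format, surrogate, private use, unassigned)
];

// Returns the major category of the first code point in `text`,
// or undefined for an empty string.
function majorCategory(text) {
  const first = [...text][0];
  if (first === undefined) return undefined;
  for (const [name, pattern] of MAJOR_CATEGORIES) {
    if (pattern.test(first)) return name;
  }
  return undefined;
}

console.log(majorCategory("A"));      // "L"
console.log(majorCategory("\u0301")); // "M" (combining acute accent)
console.log(majorCategory("\u0663")); // "N" (Arabic-Indic digit three)
console.log(majorCategory("\n"));     // "C" (control)
```

Subcategories work the same way, for example `/\p{Lu}/u` for uppercase letters or `/\p{Nd}/u` for decimal digits.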

Question 4

What are Unicode normalization forms?

Accepted Answer

The same visual character can have multiple Unicode representations: "é" can be U+00E9 (precomposed: a single code point) or U+0065 + U+0301 (decomposed: e followed by a combining acute accent). These are canonically equivalent but differ byte for byte. Normalization forms: NFC (canonical decomposition followed by canonical composition; the most compact form and the one commonly used on the web), NFD (canonical decomposition only), NFKC and NFKD (compatibility normalization, which also collapses variants like ﬁ → fi). Before comparing strings for equality, normalize both to the same form, typically NFC.
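The equivalence described above can be demonstrated with the built-in `String.prototype.normalize`. This is a minimal sketch; `canonicallyEqual` is a hypothetical helper name used only for illustration.

```javascript
const precomposed = "\u00E9"; // é as a single code point
const decomposed = "e\u0301"; // e + combining acute accent

// Compares two strings after bringing both to NFC.
function canonicallyEqual(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(precomposed === decomposed);                // false: different code units
console.log(canonicallyEqual(precomposed, decomposed)); // true

// NFC composes, NFD decomposes.
console.log(decomposed.normalize("NFC").length);  // 1
console.log(precomposed.normalize("NFD").length); // 2

// Compatibility forms also fold variants such as the "fi" ligature (U+FB01);
// canonical forms leave it untouched.
console.log("\uFB01".normalize("NFKC") === "fi");    // true
console.log("\uFB01".normalize("NFC") === "\uFB01"); // true
```

Because NFKC and NFKD discard distinctions such as ligatures, they are better suited to search and matching than to storing text that must round-trip unchanged.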
