Unicode Inspector

Inspect each character in a piece of text to see its Unicode code point and HTML entity.

What is it and how does it work?

Unicode is a universal character encoding standard that assigns a unique code point to every character across all writing systems: over 149,000 characters as of Unicode 15.1, covering 161 scripts plus symbols, emoji, and control characters. A Unicode code point is written as U+ followed by 4 to 6 hexadecimal digits: U+0041 is "A", U+00E9 is "é", U+1F600 is 😀, U+4E2D is "中". Code points are organised into 17 planes of 65,536 code points each. The Basic Multilingual Plane (BMP, U+0000–U+FFFF) covers most modern scripts; the supplementary planes cover historic scripts, emoji, and CJK extension characters.
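The notation and plane layout described above can be sketched in a few lines of JavaScript. The helper names `formatCodePoint` and `planeOf` are made up for this illustration; only `codePointAt`, `toString`, and `padStart` are standard.

```javascript
// Format a code point in U+XXXX notation: uppercase hex, at least 4 digits.
function formatCodePoint(cp) {
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

// Each plane holds 0x10000 (65,536) code points, so the plane index
// is simply the code point shifted right by 16 bits.
function planeOf(cp) {
  return cp >> 16;
}

// "A", "é" (precomposed), "中", and 😀, written as escapes to avoid ambiguity.
for (const ch of ["A", "\u00E9", "\u4E2D", "\u{1F600}"]) {
  const cp = ch.codePointAt(0);
  console.log(ch, formatCodePoint(cp), "plane", planeOf(cp));
}
```

The first three characters report plane 0 (the BMP); 😀 at U+1F600 falls in plane 1.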

This tool inspects text at the Unicode level: showing the code point, official name, Unicode category, script assignment, block, and UTF-8/UTF-16 byte sequences for every character. This is invaluable for debugging encoding issues, understanding why text renders differently in different fonts, identifying invisible characters, and exploring the Unicode standard.
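As a rough idea of what such an inspection involves, here is a minimal sketch, not this tool's actual implementation. The `inspect` function and its field names are hypothetical. It covers only what JavaScript can compute without a data table (code point, numeric HTML entity, UTF-8 and UTF-16 sequences); official names, scripts, and blocks would need the Unicode Character Database.

```javascript
// Render a number as uppercase hex padded to a minimum width.
const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");

function inspect(text) {
  const encoder = new TextEncoder(); // TextEncoder always produces UTF-8
  const rows = [];
  for (const ch of text) { // for...of iterates by code point, not code unit
    const cp = ch.codePointAt(0);
    const utf16 = [];
    for (let i = 0; i < ch.length; i++) {
      utf16.push(hex(ch.charCodeAt(i), 4)); // one entry per UTF-16 code unit
    }
    rows.push({
      char: ch,
      codePoint: "U+" + hex(cp, 4),
      htmlEntity: "&#x" + hex(cp, 1) + ";",
      utf8: Array.from(encoder.encode(ch), (b) => hex(b, 2)).join(" "),
      utf16: utf16.join(" "),
    });
  }
  return rows;
}

console.table(inspect("A\u00E9\u{1F600}"));
```

Note that `TextEncoder` replaces lone surrogates with U+FFFD, so malformed input would need separate handling.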

Common use cases

Frequently asked questions

What is the difference between a Unicode code point, a character, and a glyph?

Code point: a number assigned by Unicode (e.g., U+0041). Character: the abstract meaning (the letter A). Glyph: the visual representation drawn by a font. One code point usually corresponds to one character. One character can have many glyphs (A looks different in different fonts). One glyph can combine multiple code points: the emoji sequence 👨‍💻 is three code points (U+1F468 man, U+200D zero width joiner, U+1F4BB personal computer) rendered as one visual glyph. This is why `"👨‍💻".length === 5` in JavaScript (2 + 1 + 2 UTF-16 code units, because each of the two emoji needs a surrogate pair) but `[..."👨‍💻"].length === 3` (3 code points).
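The three levels can be counted separately. The sketch below assumes a runtime that provides `Intl.Segmenter` (current browsers and recent Node.js versions); the variable names are illustrative only.

```javascript
// 👨‍💻 spelled out as escapes: MAN + ZERO WIDTH JOINER + PERSONAL COMPUTER.
const technologist = "\u{1F468}\u200D\u{1F4BB}";

// UTF-16 code units: what String.prototype.length reports.
const codeUnits = technologist.length;

// Code points: the string iterator walks code point by code point.
const codePoints = [...technologist].length;

// User-perceived characters (grapheme clusters), via Intl.Segmenter.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(technologist)].length;

console.log({ codeUnits, codePoints, graphemes });
```

Counting grapheme clusters is usually what a "character count" shown to users should mean, while `length` matters for storage and indexing.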

What is the difference between UTF-8, UTF-16, and UTF-32?

All three encode the same Unicode code points. UTF-32 uses exactly 4 bytes per code point — simple but wasteful. UTF-16 uses 2 bytes for BMP characters (U+0000–U+FFFF) and 4 bytes (surrogate pairs) for characters above U+FFFF. UTF-8 uses 1 byte for ASCII (U+0000–U+007F), 2 bytes for U+0080–U+07FF, 3 bytes for U+0800–U+FFFF, and 4 bytes for U+10000–U+10FFFF. UTF-8 is the dominant web encoding (backward-compatible with ASCII, space-efficient for Latin text).
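The UTF-8 byte ranges above map directly onto a small encoder. This is a teaching sketch with a hypothetical `encodeUtf8` helper; in practice the built-in `TextEncoder` does the same job.

```javascript
// Encode a single code point as an array of UTF-8 bytes.
function encodeUtf8(cp) {
  // Surrogates (U+D800–U+DFFF) and values above U+10FFFF cannot be encoded.
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError("not a Unicode scalar value");
  }
  if (cp <= 0x7F) return [cp]; // 1 byte: ASCII
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 2 bytes
  }
  if (cp <= 0xFFFF) {
    return [
      0xE0 | (cp >> 12),
      0x80 | ((cp >> 6) & 0x3F),
      0x80 | (cp & 0x3F),
    ]; // 3 bytes
  }
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ]; // 4 bytes
}

// Cross-check against the built-in encoder for one character per length class.
for (const ch of ["A", "\u00E9", "\u4E2D", "\u{1F600}"]) {
  const manual = encodeUtf8(ch.codePointAt(0));
  const builtin = Array.from(new TextEncoder().encode(ch));
  console.log(ch, manual, manual.join() === builtin.join());
}
```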

What are Unicode categories?

Unicode assigns every code point a general category: L (Letter: Ll lowercase, Lu uppercase, Lt titlecase, Lm modifier, Lo other), M (Mark: Mn non-spacing, Mc spacing combining, Me enclosing), N (Number: Nd decimal digit, Nl letter number, No other number), P (Punctuation), S (Symbol), Z (Separator), C (Other: Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned). In regular expression engines with Unicode property support, `\p{L}` matches any Unicode letter and `\p{N}` any number; JavaScript requires the `u` flag for these escapes.
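One way to look up the major category class of a character in JavaScript is to test it against each property escape in turn. The `majorClass` helper below is a hypothetical sketch; it relies only on the standard `\p{...}` syntax with the `u` flag.

```javascript
// One anchored regex per major general-category class, compiled once.
const MAJOR_CLASSES = ["L", "M", "N", "P", "S", "Z", "C"].map((cls) => ({
  cls,
  pattern: new RegExp(`^\\p{${cls}}$`, "u"),
}));

// Return the major class letter for a single code point, or undefined
// if the input is not exactly one code point.
function majorClass(ch) {
  const match = MAJOR_CLASSES.find(({ pattern }) => pattern.test(ch));
  return match ? match.cls : undefined;
}

for (const ch of ["A", "5", "\u0301", "!", "+", " ", "\u200D"]) {
  console.log(JSON.stringify(ch), majorClass(ch));
}
```

Two-letter subcategories work the same way, for example `/\p{Lu}/u` for uppercase letters.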

What are Unicode normalization forms?

The same visual character can have multiple Unicode representations: "é" can be U+00E9 (precomposed: é as a single code point) or U+0065 + U+0301 (decomposed: e followed by a combining acute accent). These are canonically equivalent but byte-different. Normalization forms: NFC (canonical decomposition followed by canonical composition; usually the most compact, and the form commonly used on the web), NFD (canonical decomposition only), NFKC and NFKD (compatibility normalization, which also collapses variants such as the ligature ﬁ (U+FB01) → fi). String equality checks should normalize both sides to the same form, typically NFC, first.
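The effect is easy to see with the standard `String.prototype.normalize` method. The `equalsNormalized` helper is a made-up name for this sketch.

```javascript
const precomposed = "\u00E9"; // é as a single code point
const decomposed = "e\u0301"; // e + COMBINING ACUTE ACCENT

// Compare two strings after bringing both to NFC.
function equalsNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(precomposed === decomposed); // raw comparison of code units
console.log(equalsNormalized(precomposed, decomposed)); // canonical comparison
console.log(decomposed.normalize("NFC").length); // composed form
console.log(precomposed.normalize("NFD").length); // decomposed form

// Compatibility normalization also rewrites the ﬁ ligature (U+FB01).
const ligature = "\uFB01";
console.log(ligature.normalize("NFC") === ligature); // canonical forms keep it
console.log(ligature.normalize("NFKC")); // compatibility forms expand it
```

NFKC and NFKD change the text's appearance, so they suit search and identifier matching better than storage of user-visible content.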
