Inspect each character in a piece of text to see its Unicode code point and HTML entity
Unicode is a universal character encoding standard that assigns a unique code point to every character across all writing systems: over 149,000 characters as of Unicode 15.1, covering 161 scripts plus symbols, emoji, and control characters. A Unicode code point is written as U+ followed by 4 to 6 hexadecimal digits: U+0041 is "A", U+00E9 is "é", U+1F600 is 😀, U+4E2D is "中". Code points are organised into 17 planes of 65,536 code points each. The Basic Multilingual Plane (BMP, U+0000–U+FFFF) covers most modern scripts; supplementary planes cover historic scripts, emoji, and CJK extension characters.
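As a minimal sketch of the notation and plane arithmetic described above, the following formats a code point as U+XXXX and derives its plane. The helper names `formatCodePoint` and `planeOf` are illustrative, not part of any standard API.

```javascript
// Format a code point in U+XXXX notation: uppercase hex, padded to at least 4 digits.
function formatCodePoint(cp) {
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

// Each plane holds 0x10000 (65,536) code points, so the plane index is cp >> 16.
function planeOf(cp) {
  return cp >> 16;
}

// "A", "é" (precomposed), "中", and 😀, written as escapes to avoid ambiguity.
const sample = "A\u00E9\u4E2D\u{1F600}";

// Iterating a string with for...of yields whole code points, not UTF-16 code units.
for (const ch of sample) {
  const cp = ch.codePointAt(0);
  console.log(ch, formatCodePoint(cp), "plane", planeOf(cp));
}
```

Plane 0 is the BMP; under this sketch 😀 (U+1F600) lands in plane 1, one of the supplementary planes.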
This tool inspects text at the Unicode level: showing the code point, official name, Unicode category, script assignment, block, and UTF-8/UTF-16 byte sequences for every character. This is invaluable for debugging encoding issues, understanding why text renders differently in different fonts, identifying invisible characters, and exploring the Unicode standard.
Code point: a number assigned by Unicode (e.g., U+0041). Character: the abstract meaning (the letter A). Glyph: the visual representation drawn by a font. One code point = one character (usually). One character can have many glyphs (A looks different in different fonts). One glyph can combine multiple code points: the emoji sequence 👨‍💻 is three code points (U+1F468 man, U+200D zero width joiner, U+1F4BB personal computer) rendered as one visual glyph. This is why `"👨‍💻".length === 5` in JavaScript (2 + 1 + 2 UTF-16 code units, because each of the two emoji needs a surrogate pair) but `[..."👨‍💻"].length === 3` (3 code points).
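The three ways of counting can be compared side by side. This is a sketch that assumes a runtime providing `Intl.Segmenter` (recent browsers and Node.js versions); the emoji is written with escapes so the zero width joiner cannot be lost in copy and paste.

```javascript
// Man technologist: U+1F468 MAN + U+200D ZERO WIDTH JOINER + U+1F4BB PERSONAL COMPUTER.
const technologist = "\u{1F468}\u200D\u{1F4BB}";

// .length counts UTF-16 code units.
const codeUnits = technologist.length;

// Spreading a string iterates by code point.
const codePoints = [...technologist].length;

// Intl.Segmenter groups code points into user-perceived characters (grapheme clusters).
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(technologist)].length;

console.log({ codeUnits, codePoints, graphemes });
```

Which count is appropriate depends on the task: code units for JavaScript string indexing, code points for Unicode-level inspection, grapheme clusters for anything a user would call "a character".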
UTF-8, UTF-16, and UTF-32 all encode the same Unicode code points. UTF-32 uses exactly 4 bytes per code point: simple but wasteful. UTF-16 uses 2 bytes for BMP characters (U+0000–U+FFFF) and 4 bytes (a surrogate pair) for characters above U+FFFF. UTF-8 uses 1 byte for ASCII (U+0000–U+007F), 2 bytes for U+0080–U+07FF, 3 bytes for U+0800–U+FFFF, and 4 bytes for U+10000–U+10FFFF. UTF-8 is the dominant web encoding (backward-compatible with ASCII, space-efficient for Latin text).
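The byte ranges above translate directly into a lookup. The sketch below uses hypothetical helpers `utf8Length` and `utf16Length`, and cross-checks the UTF-8 result against the standard `TextEncoder`, which always encodes to UTF-8.

```javascript
// Bytes needed to encode one code point in UTF-8, by range.
function utf8Length(cp) {
  if (cp <= 0x7f) return 1;    // ASCII
  if (cp <= 0x7ff) return 2;
  if (cp <= 0xffff) return 3;  // rest of the BMP
  return 4;                    // supplementary planes
}

// Bytes needed in UTF-16: one 2-byte unit in the BMP, a 4-byte surrogate pair above it.
function utf16Length(cp) {
  return cp <= 0xffff ? 2 : 4;
}

// UTF-32 is fixed width.
const UTF32_LENGTH = 4;

const encoder = new TextEncoder();
for (const ch of "A\u00E9\u4E2D\u{1F600}") {
  const cp = ch.codePointAt(0);
  console.log(ch, {
    utf8: utf8Length(cp),
    utf8Actual: encoder.encode(ch).length, // should agree with utf8Length
    utf16: utf16Length(cp),
    utf32: UTF32_LENGTH,
  });
}
```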
Unicode assigns every code point a general category: L (Letter: Ll lowercase, Lu uppercase, Lt titlecase, Lm modifier, Lo other), M (Mark: Mn non-spacing, Mc spacing combining, Me enclosing), N (Number: Nd decimal digit, Nl letter number, No other number), P (Punctuation), S (Symbol), Z (Separator), C (Other: Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned). In regex engines with Unicode property support (for example JavaScript with the `u` flag), `\p{L}` matches any Unicode letter and `\p{N}` any number.
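One way to look up the major category of a single code point in JavaScript is to test it against each `\p{…}` property escape in turn. This is a sketch; `majorCategory` is a hypothetical helper, and it expects a string holding exactly one code point.

```javascript
// One anchored regex per major general category; the u flag enables \p{...}.
const CATEGORY_PATTERNS = ["L", "M", "N", "P", "S", "Z", "C"].map((name) => ({
  name,
  pattern: new RegExp(`^\\p{${name}}$`, "u"),
}));

// Return the major category letter for a one-code-point string,
// or undefined if the input is not exactly one code point.
function majorCategory(ch) {
  const match = CATEGORY_PATTERNS.find(({ pattern }) => pattern.test(ch));
  return match ? match.name : undefined;
}

console.log(majorCategory("A"));        // letter
console.log(majorCategory("5"));        // number
console.log(majorCategory("\u0301"));   // combining acute accent: mark
console.log(majorCategory("\u200D"));   // zero width joiner: format character, under C
```

Subcategories work the same way, for example `/\p{Lu}/u` for uppercase letters or `/\p{Nd}/u` for decimal digits.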
The same visual character can have multiple Unicode representations: "é" can be U+00E9 (precomposed: é as a single code point) or U+0065 + U+0301 (decomposed: e followed by the combining acute accent). These are canonically equivalent but byte-different. Normalization forms: NFC (canonical decomposition followed by canonical composition; usually the most compact, and the form used on the web), NFD (canonical decomposition only), NFKC and NFKD (compatibility normalization, which also collapses variants such as the ligature ﬁ, U+FB01, into the two letters fi). String equality checks should normalize both strings to the same form, typically NFC, first.
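The effect of normalization on comparison can be seen with the built-in `String.prototype.normalize`. In this sketch, `equalsNormalized` is a hypothetical helper name, and escapes are used so that the precomposed and decomposed forms stay distinct in the source.

```javascript
const precomposed = "\u00E9";   // é as one code point
const decomposed = "e\u0301";   // e + combining acute accent

// Compare two strings after bringing both to NFC.
function equalsNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(precomposed === decomposed);                 // raw comparison sees different code units
console.log(equalsNormalized(precomposed, decomposed));  // canonical equivalents compare equal

// NFC composes, NFD decomposes.
console.log([...decomposed.normalize("NFC")].length);
console.log([...precomposed.normalize("NFD")].length);

// Compatibility normalization also folds the ligature U+FB01 into "fi"; NFC leaves it alone.
const ligature = "\uFB01";
console.log(ligature.normalize("NFKC"), ligature.normalize("NFC") === ligature);
```

NFKC and NFKD discard distinctions that can be meaningful (ligatures, superscripts, full-width forms), so they tend to suit search and identifier matching rather than storage of user text.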