JavaScript by Example

Strings

JavaScript strings are sequences of UTF-16 code units. Template literals, surrogate pairs, and grapheme clusters each affect how you build, measure, and compare them.

JavaScript strings are sequences of UTF-16 code units, not characters. For ASCII text this distinction is invisible, but it surfaces immediately when you handle emoji, other characters outside the Basic Multilingual Plane, or accented letters built from combining marks.

Template literals use backticks and ${expression} interpolation. They support multi-line strings and embed any expression - no concatenation required.

const name = "world";
const greeting = `Hello, ${name}!`;
console.log(greeting); // Hello, world!
 
const multiline = `Line one
Line two`;
console.log(multiline);
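The interpolation slot accepts more than a variable name. As a minimal sketch (the items array and its fields are hypothetical, made up for illustration), a function call result, a ternary, and a nested template literal can all sit inside ${...}:

```javascript
// Hypothetical cart data, used only for illustration.
const items = [
  { label: "pen", price: 1.5, qty: 4 },
  { label: "notebook", price: 3.25, qty: 2 },
];

// A method call, a ternary, and arithmetic results all work inside ${...}.
const total = items.reduce((sum, item) => sum + item.price * item.qty, 0);
const summary = `${items.length} item${items.length === 1 ? "" : "s"}, total ${total.toFixed(2)}`;
console.log(summary); // 2 items, total 12.50

// Multi-line output built with a nested template literal.
const receipt = `Receipt
${items.map((item) => `- ${item.label} x${item.qty}`).join("\n")}`;
console.log(receipt);
```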

An emoji like "😀" is encoded as a surrogate pair - two UTF-16 code units. The .length property counts code units, not visible characters, so it returns 2 for this one emoji.

const emoji = "😀";
console.log(emoji.length); // 2 - two UTF-16 code units
 
// Spread iterates code points, not code units
console.log([...emoji].length); // 1 - one code point
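To see the two halves of the pair directly, here is a small sketch that compares index-based access with code-point-aware access. The sample string is an assumption chosen for illustration, and the emoji is written as an escape so the file encoding cannot alter it:

```javascript
const text = "a\u{1F600}b"; // "a", U+1F600 (grinning face), "b"

// Index-based access works on code units, so it sees both surrogate halves.
console.log(text.length);                      // 4
console.log(text.charCodeAt(1).toString(16));  // d83d (high surrogate)
console.log(text.charCodeAt(2).toString(16));  // de00 (low surrogate)

// codePointAt reads the whole pair when the index points at the high surrogate.
console.log(text.codePointAt(1).toString(16)); // 1f600

// Array.from walks code points, just like spread and for...of.
const codePoints = Array.from(text, (ch) => ch.codePointAt(0).toString(16));
console.log(codePoints); // [ '61', '1f600', '62' ]
```

Note that codePointAt still takes a code-unit index: calling it with index 2 here would return the lone low surrogate, not the emoji.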

String.prototype.normalize converts between Unicode normalization forms (NFC, NFD, NFKC, NFKD). Two strings that look identical can compare unequal if one uses a precomposed character and the other uses a base + combining mark.

const a = "\u00E9";  // é - precomposed (U+00E9)
const b = "e\u0301"; // é - e + combining acute accent (U+0301)
 
console.log(a === b); // false
console.log(a.normalize() === b.normalize()); // true
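Calling normalize() with no argument uses NFC. The sketch below shows how the forms differ. The sameText helper is a hypothetical name introduced for illustration, and the ligature is only one example of a compatibility character:

```javascript
const precomposed = "\u00E9"; // é as one code point
const decomposed = "e\u0301"; // e followed by U+0301 combining acute accent

// NFC composes where possible; NFD decomposes.
console.log(decomposed.normalize("NFC").length);  // 1
console.log(precomposed.normalize("NFD").length); // 2

// NFKC and NFKD also fold compatibility characters, which changes the text.
const ligature = "\uFB01"; // latin small ligature fi
console.log(ligature.normalize("NFC") === ligature); // true - NFC leaves it alone
console.log(ligature.normalize("NFKC"));             // fi

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(x, y) {
  return x.normalize("NFC") === y.normalize("NFC");
}
console.log(sameText(precomposed, decomposed)); // true
```

NFC is usually the safer choice for comparison and storage, because the compatibility forms discard distinctions (ligatures, superscripts, full-width letters) that may matter to the user.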

In production

Any string you truncate for a UI label, store in a database column with a character limit, or split into "characters" should go through Intl.Segmenter or a grapheme-cluster library. A naive .slice(0, 10) on a user's display name can cut a surrogate pair in half, writing a lone surrogate to the database that typically comes back as a replacement character (U+FFFD) or an encoding error. A family emoji such as 👨‍👩‍👧 is a single grapheme cluster spanning 8 code units, and longer sequences exist - .length will not tell you what you think.
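One way to truncate safely is to count grapheme clusters instead of code units. This is a minimal sketch, assuming a runtime that provides Intl.Segmenter; truncateGraphemes and the sample display name are hypothetical, introduced only for illustration:

```javascript
// Hypothetical helper: keep at most `max` grapheme clusters of `text`.
// Assumes the runtime provides Intl.Segmenter.
function truncateGraphemes(text, max) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === max) break;
    result += segment;
    count += 1;
  }
  return result;
}

// Man + ZWJ + woman + ZWJ + girl, written with escapes to be encoding-proof.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
const displayName = `Ann${family}!`;

console.log(family.length);      // 8 code units
console.log([...family].length); // 5 code points
console.log(truncateGraphemes(displayName, 4) === `Ann${family}`); // true

// A code-unit slice of the same length ends on a lone high surrogate.
const sliced = displayName.slice(0, 4);
console.log(sliced.charCodeAt(3).toString(16)); // d83d
```

If the limit you must respect is measured in bytes or code units (as some database columns are), the same loop can stop when adding the next segment would exceed that budget, so a cluster is never split.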
