Python by Example

Strings

Python str is an immutable sequence of Unicode code points, not bytes. Encoding, f-strings, and grapheme-aware slicing.

A Python str is an immutable sequence of Unicode code points - not bytes. The separate bytes type holds raw binary data. Converting between them requires an explicit .encode() call (str to bytes) or .decode() call (bytes to str), and you must name the encoding each time.

f-strings (formatted string literals) embed expressions directly inside {}. They support Python's full format-spec syntax for alignment, padding, and number formatting - the same spec used by format() and str.format().

name = "Ada"
f"Hello, {name}!"         # 'Hello, Ada!'
 
value = 3.14159
f"{value:.2f}"            # '3.14'  (2 decimal places)
 
n = 42
f"{n:>10}"                # '        42'  (right-aligned in 10 chars)
f"{n:0>10}"               # '0000000042'  (zero-padded)

str and bytes are separate types and Python never converts between them automatically. .encode(encoding) produces bytes; .decode(encoding) produces str. If the bytes do not match the encoding you name, Python raises UnicodeDecodeError - pass errors="replace" to substitute replacement characters instead of crashing.

s = "héllo"
b = s.encode("utf-8")       # b'h\xc3\xa9llo'  (5 chars -> 6 bytes)
 
b.decode("utf-8")            # 'héllo'
 
# Wrong codec raises UnicodeDecodeError
b.decode("ascii")            # UnicodeDecodeError: invalid start byte
 
# Safety valve: replace undecodable bytes with the replacement character
b.decode("ascii", errors="replace")   # 'h��llo'

len(s) counts Unicode code points, not characters as humans perceive them (graphemes). A single visible character can be multiple code points when combining marks are used - for example, the letter a followed by a combining accent. Naive slicing can split a grapheme in the middle. Normalising with unicodedata.normalize("NFC", s) folds as many combining sequences as possible into single composed code points before slicing or comparing.

import unicodedata
 
# 'a' + combining acute accent = two code points, one visible character
composed = "á"          # á  (precomposed, 1 code point)
decomposed = ""       # a + combining acute, 2 code points
 
len(composed)                # 1
len(decomposed)              # 2
 
# Normalize to NFC before comparing or slicing user input
unicodedata.normalize("NFC", decomposed) == composed   # True

In production

f-strings evaluate arbitrary expressions inside {...} - never format untrusted input as if it were a template string; use str.format_map with a restricted mapping, or better, keep user input as data rather than embedding it in format strings. len(s) counts code points, not graphemes - naive slicing of user input with emoji or combining marks produces mojibake; normalise with unicodedata.normalize("NFC", s) or reach for the regex package before truncating display strings.

Enjoyed this? Get more essays on software craft delivered to your inbox.

Subscribe free