Base64, URL-encoding, and friends: when each one is right
A working tour of the eight encodings developers hit weekly. Base64 vs Base64URL, percent-encoding vs form-encoding, HTML entities vs JSON escapes, quoted-printable vs hex. Which one is right for which job — and the bugs that come from mixing them up.
There are at least eight encodings a typical developer hits in a normal week: Base64, Base64URL, percent-encoding, HTML entities, JSON string escapes, quoted-printable, hex, and URL form-encoding. They all do approximately the same thing (turn arbitrary bytes or characters into a restricted set of safe characters) for completely different reasons, and mixing them up is behind a large share of "why is this string broken" bug reports.
This post is a working tour of which encoding is right for which job, what each one trades off, and the failure modes that show up when you reach for the wrong one.
The underlying problem every encoding solves
Computer systems frequently need to transport arbitrary binary data through channels designed for text. The channel might be a URL (where slashes, question marks, and ampersands have special meaning), an email body (where some bytes can't appear), a JSON string (where quotes and backslashes need escaping), or an HTML attribute (where angle brackets and ampersands are reserved).
Every encoding picks a "safe" alphabet and a mapping from input bytes to characters in that alphabet. The differences come down to:
- What characters are considered safe? URL-safe vs. ASCII-safe vs. printable-7-bit-safe vs. JSON-string-safe.
- Is the encoding reversible byte-for-byte? All of these are — but only for properly-encoded input.
- How much overhead does it add? Base64 is ~33% bigger. Percent-encoding can be up to 200% bigger.
- Is the result fixed-size or variable? Most are variable; some have padding rules to make them fixed-block.
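To make the overhead figures above concrete, here is a small sketch that measures them in Node.js. It assumes a worst-case input for percent-encoding (every byte is 0xFF, so every byte needs escaping); the names raw and overheadPercent are made up for this illustration.

```javascript
// Illustrative only: compare encoded length against raw byte length.
const raw = new Uint8Array(300).fill(0xff); // 0xFF is never a URL-safe byte

const base64Length = Buffer.from(raw).toString('base64').length;
const hexLength = Buffer.from(raw).toString('hex').length;
// Worst-case percent-encoding: every byte becomes "%XX" (three characters).
const percentLength = Array.from(
  raw,
  (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')
).join('').length;

// Overhead as a whole-number percentage of the raw size.
function overheadPercent(encodedLength, rawLength) {
  return Math.round(((encodedLength - rawLength) / rawLength) * 100);
}

console.log(overheadPercent(base64Length, raw.length));  // 33
console.log(overheadPercent(hexLength, raw.length));     // 100
console.log(overheadPercent(percentLength, raw.length)); // 200
```

For mostly-ASCII text the percent-encoding overhead is usually far lower, since safe characters pass through unchanged.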
Base64: arbitrary bytes → ASCII letters and digits
Base64 takes binary input and produces a string using 64 characters: A-Z, a-z, 0-9, and two extras (+ and / in standard Base64). Plus = for padding.
The algorithm: take 3 bytes (24 bits) and split into 4 groups of 6 bits. Each 6-bit group indexes into the 64-character alphabet. Pad with = when the input length isn't a multiple of 3.
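The 3-bytes-to-4-characters step can be sketched by hand as follows. This is only meant to show the bit arithmetic; in real code the built-in functions are the better choice. ALPHABET and base64Encode are names invented for this sketch.

```javascript
// Standard Base64 alphabet: index 0-63 maps to one character.
const ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hand-written encoder for illustration; expects a Uint8Array or byte array.
function base64Encode(bytes) {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const hasSecond = i + 1 < bytes.length;
    const hasThird = i + 2 < bytes.length;
    // Pack up to three bytes into one 24-bit number (missing bytes count as 0).
    const chunk =
      (bytes[i] << 16) |
      ((hasSecond ? bytes[i + 1] : 0) << 8) |
      (hasThird ? bytes[i + 2] : 0);
    // Slice the 24 bits into four 6-bit indexes.
    out += ALPHABET[(chunk >> 18) & 63];
    out += ALPHABET[(chunk >> 12) & 63];
    // Positions with no input bits behind them become "=" padding.
    out += hasSecond ? ALPHABET[(chunk >> 6) & 63] : '=';
    out += hasThird ? ALPHABET[chunk & 63] : '=';
  }
  return out;
}

console.log(base64Encode(new TextEncoder().encode('Hello, world!')));
// "SGVsbG8sIHdvcmxkIQ=="
```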
// JavaScript
btoa('Hello, world!') // "SGVsbG8sIHdvcmxkIQ=="
atob('SGVsbG8sIHdvcmxkIQ==') // "Hello, world!"
// btoa/atob only work for Latin-1 strings. For Unicode, go through UTF-8 first.
// (escape/unescape are deprecated, but this idiom is still widely seen.)
function utf8Base64(s) {
  return btoa(unescape(encodeURIComponent(s)));
}
function base64Utf8(s) {
  return decodeURIComponent(escape(atob(s)));
}
utf8Base64('héllo 🌍') // "aMOpbGxvIPCfjI0="
// In Node.js
Buffer.from('Hello, world!').toString('base64')
// "SGVsbG8sIHdvcmxkIQ=="
# In Python
import base64
base64.b64encode(b'Hello, world!').decode()
# "SGVsbG8sIHdvcmxkIQ=="
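The escape/unescape helpers above lean on deprecated functions. A sketch of the same round trip using TextEncoder and TextDecoder might look like this; utf8ToBase64 and base64ToUtf8 are hypothetical names, and the sketch assumes an environment where btoa, atob, TextEncoder, and TextDecoder are globals (browsers and recent Node.js).

```javascript
// Unicode string -> UTF-8 bytes -> Base64 text.
function utf8ToBase64(text) {
  const bytes = new TextEncoder().encode(text);
  // btoa wants a "binary string": one character per byte.
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Base64 text -> UTF-8 bytes -> Unicode string.
function base64ToUtf8(b64) {
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64('h\u00e9llo'));               // "aMOpbGxv"
console.log(base64ToUtf8(utf8ToBase64('h\u00e9llo'))); // "héllo"
```

For very large inputs, building the binary string one character at a time can be slow; that trade-off is ignored here.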
Use Base64 for: embedding binary data in JSON, encoding images for data URLs, transferring opaque tokens (API keys, session IDs), email attachments. Don't use Base64 for: URLs (use Base64URL instead), anything where size matters (a 33% bloat is significant for large payloads).
Base64URL: the URL-safe variant
Standard Base64 uses + and /, both of which have special meaning in URLs. Base64URL (RFC 4648 §5) substitutes - for + and _ for /, and typically omits the = padding:
// Standard Base64
"SGVsbG8/IHdvcmxkKw=="
// the / reads as a path separator, and the + can be decoded as a space in a query string
// Base64URL
"SGVsbG8_IHdvcmxkKw"
// safe to drop into a URL path or query value as-is
// Converting between them
function toBase64Url(b64) {
  return b64.replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}
function fromBase64Url(b64u) {
  const pad = (4 - (b64u.length % 4)) % 4;
  return b64u.replace(/-/g, '+').replace(/_/g, '/') + '='.repeat(pad);
}
Use Base64URL for: JWTs (the entire JWT format is Base64URL of the header/payload/signature), URL-embedded tokens, query parameter values that contain binary data. The JWT spec mandates Base64URL specifically because the values appear in HTTP headers and URLs.
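As an illustration of the JWT case, the sketch below builds a throwaway unsigned token and reads its header and payload back. It does not verify the signature, so it is only suitable for inspection, never for authentication. All function names and the demo token are assumptions made up for this example.

```javascript
// Text -> UTF-8 bytes -> Base64URL (no padding).
function textToBase64Url(text) {
  const bytes = new TextEncoder().encode(text);
  const binary = Array.from(bytes, (b) => String.fromCharCode(b)).join('');
  return btoa(binary).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

// Base64URL -> restore alphabet and padding -> UTF-8 text.
function base64UrlToText(segment) {
  const pad = (4 - (segment.length % 4)) % 4;
  const b64 = segment.replace(/-/g, '+').replace(/_/g, '/') + '='.repeat(pad);
  const bytes = Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

// Reads header and payload WITHOUT checking the signature.
function decodeJwtUnverified(token) {
  const parts = token.split('.');
  if (parts.length !== 3) {
    throw new Error('expected three dot-separated segments');
  }
  return {
    header: JSON.parse(base64UrlToText(parts[0])),
    payload: JSON.parse(base64UrlToText(parts[1])),
  };
}

// Hypothetical demo token; the third segment is a placeholder, not a real signature.
const demoToken = [
  textToBase64Url(JSON.stringify({ alg: 'none', typ: 'JWT' })),
  textToBase64Url(JSON.stringify({ sub: 'user-42' })),
  textToBase64Url('placeholder'),
].join('.');

console.log(decodeJwtUnverified(demoToken).payload.sub); // "user-42"
```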
Percent-encoding: %xx for unsafe characters in URLs
Percent-encoding (RFC 3986) is the URL-specific encoding. It replaces unsafe characters with a % followed by two hex digits representing the byte. Safe characters (A-Z, a-z, 0-9, and a handful of others) pass through unchanged.
// JavaScript: two different functions for different scopes
encodeURI('https://example.com/path with spaces?q=hello&world')
// "https://example.com/path%20with%20spaces?q=hello&world"
// preserves URL structure (?, &, /, =)
encodeURIComponent('hello&world?q=42')
// "hello%26world%3Fq%3D42"
// encodes everything that has URL meaning
// Use encodeURI for whole URLs (rare).
// Use encodeURIComponent for the PARTS of a URL (the common case).
// Modern API (browsers and Node): the URL object encodes query values for you
const u = new URL('https://example.com');
u.searchParams.set('q', 'hello&world?q=42');
u.toString();
// "https://example.com/?q=hello%26world%3Fq%3D42"
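To see where the two hex digits come from, the sketch below escapes every UTF-8 byte of a string by hand and compares the result with encodeURIComponent. percentEncodeAllBytes is a name invented for this illustration, and unlike the real functions it escapes safe characters too.

```javascript
// Illustration only: escape EVERY byte, including ones that are safe in URLs.
function percentEncodeAllBytes(text) {
  return Array.from(
    new TextEncoder().encode(text),
    (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

console.log(percentEncodeAllBytes('\u00e9')); // "%C3%A9" (the two UTF-8 bytes of "é")
console.log(encodeURIComponent('\u00e9'));    // "%C3%A9"

// Unreserved characters pass through the built-in untouched.
console.log(encodeURIComponent('Az09-_.~'));  // "Az09-_.~"
```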
The biggest pitfall with percent-encoding is when to apply it. Encoding twice produces garbage (% becomes %25 the second time). Common bug: server-side code that re-encodes already-encoded query params, producing %2520 where %20 was intended.
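The double-encoding bug is easy to reproduce. In this sketch the variable names and the example.com URL are placeholders:

```javascript
const once = encodeURIComponent('a b');  // "a%20b"
const twice = encodeURIComponent(once);  // "a%2520b"

// One decode only undoes one layer, leaving a literal "%20" in the value.
console.log(decodeURIComponent(twice));  // "a%20b"

// The same bug via the URL API: passing an already-encoded value.
const buggy = new URL('https://example.com/search');
buggy.searchParams.set('q', once);
console.log(buggy.toString()); // "https://example.com/search?q=a%2520b"

// Fix: hand the raw value to exactly one encoder.
const fixed = new URL('https://example.com/search');
fixed.searchParams.set('q', 'a b');
console.log(fixed.toString()); // "https://example.com/search?q=a+b"
```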
Use percent-encoding for: anywhere in a URL — path segments, query parameter names and values, fragment identifiers. Use the standard library function (don't roll your own). For query strings, prefer URLSearchParams (browser) or your language's equivalent — it handles the encoding automatically.
application/x-www-form-urlencoded: similar but different
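Form data submitted with default HTML form encoding uses a slightly different convention: spaces become + instead of %20. The other rules are close to percent-encoding but not identical; for example, URLSearchParams also escapes !, ', (, ), and ~, which encodeURIComponent leaves alone.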
// Same input, two outputs:
encodeURIComponent('hello world')
// "hello%20world"
new URLSearchParams({ q: 'hello world' }).toString()
// "q=hello+world"
// Decoding a real form-urlencoded body:
const body = 'name=Jean+Dupont&city=Saint-%C3%89tienne';
const params = new URLSearchParams(body);
params.get('name') // "Jean Dupont" (note: + decoded to space)
params.get('city') // "Saint-Étienne" (UTF-8 percent-decoded)
This +-for-space distinction is the source of countless bugs when developers manually URL-encode data and then send it as a form body, or vice versa. Use URLSearchParams (or your language's form-encoding library) for form bodies; use encodeURIComponent for URL components.
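A short sketch of how the mismatch bites in each direction; the sample values are made up:

```javascript
// A form-encoded value decoded with the wrong function keeps its "+".
console.log(decodeURIComponent('Jean+Dupont'));        // "Jean+Dupont"

// A literal plus sign pasted unencoded into a query string turns into a space.
console.log(new URLSearchParams('q=1+1').get('q'));    // "1 1"

// Letting the encoder do its job preserves the plus sign on both paths.
const formBody = new URLSearchParams({ q: '1+1' }).toString();
console.log(formBody);                                 // "q=1%2B1"
console.log(new URLSearchParams(formBody).get('q'));   // "1+1"
console.log(encodeURIComponent('1+1'));                // "1%2B1"
```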
HTML entity encoding
HTML reserves five characters: <, >, &, ", '. To include them as literal text in HTML, escape them as named or numeric entities:
// Minimum safe encoding for HTML text content
function escHtml(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
escHtml('<script>alert("xss")</script>')
// "&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;"
// Note: order matters. & MUST be escaped first, or you'll double-escape:
// "<" → "&lt;" → "&amp;lt;" (wrong)
Use HTML entity encoding for: any user-controlled content rendered into HTML. This is the primary defense against XSS in server-rendered HTML. Don't roll your own — every framework has a battle-tested version (React JSX, EJS <%= %>, Handlebars {{ }} all escape by default).
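To show what "escape by default" means in practice, here is a toy tagged template that escapes every interpolated value. It is a sketch of the idea only, not how any particular framework implements it, and the names escapeHtml, html, and userInput are invented for the example.

```javascript
// Escapes the five reserved characters; & goes first to avoid double-escaping.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Tagged template: literal chunks pass through, interpolated values are escaped.
function html(strings, ...values) {
  return strings.reduce(
    (out, chunk, i) =>
      out + chunk + (i < values.length ? escapeHtml(values[i]) : ''),
    ''
  );
}

const userInput = '<img src=x onerror="alert(1)">';
const markup = html`<p title="${userInput}">${userInput}</p>`;
console.log(markup);
```

Because the quotes are escaped as well, the same value is safe in both the text content and the double-quoted attribute here. Other contexts (script blocks, URLs, unquoted attributes) need different handling.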
The dangerous escape hatches: v-html (Vue), dangerouslySetInnerHTML (React), {{{ }}} (Handlebars), <%- %> (EJS). These bypass escaping. Use them only with content from a sanitizer (DOMPurify in JavaScript), never with raw user input.
JSON string escapes
Inside JSON string values, certain characters must be escaped: \", \\, \n, \r, \t, \b, \f, plus arbitrary Unicode via \uXXXX:
JSON.stringify('She said "hello"')
// '"She said \\"hello\\""'
JSON.stringify('first line\nsecond line')
// '"first line\\nsecond line"'
JSON.stringify({ greeting: 'héllo 🌍' })
// '{"greeting":"héllo 🌍"}'
// (JSON.stringify leaves non-ASCII characters as-is rather than \u-escaping them)
JSON.parse('"\\u00e9"') // "é"
JSON.parse('"\\n"') // newline character
You should never have to hand-write JSON escape sequences. Use JSON.stringify; use your language's JSON library. Common bug: trying to "manually" build JSON by concatenating strings, then forgetting to escape backslashes or quotes inside values.
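The concatenation bug can be reproduced in a few lines. The sample name is made up and contains both quotes and a backslash:

```javascript
const name = 'Ann "Ace" O\\Neil'; // the actual text is: Ann "Ace" O\Neil

// Hand-built JSON: the inner quotes end the string early.
const handBuilt = '{"name":"' + name + '"}';
let handBuiltParses = true;
try {
  JSON.parse(handBuilt);
} catch {
  handBuiltParses = false;
}
console.log(handBuiltParses); // false

// JSON.stringify escapes the quotes and the backslash for you.
const safe = JSON.stringify({ name });
console.log(safe);                           // {"name":"Ann \"Ace\" O\\Neil"}
console.log(JSON.parse(safe).name === name); // true
```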
Quoted-printable: email's compromise
SMTP was historically limited to 7-bit ASCII, with constraints on line length. Quoted-printable encoding (RFC 2045 §6.7) keeps printable ASCII characters as-is and encodes everything else as =XX hex sequences:
// "héllo wörld" in quoted-printable
"h=C3=A9llo w=C3=B6rld"
// Soft line break: encoded lines are limited to 76 characters, so a longer line is split with a trailing "="
"This is a very long line that exceeds the seventy-six character limit and =
gets a soft line break."
Quoted-printable is human-readable for mostly-ASCII text (unlike Base64). It's a common email Content-Transfer-Encoding for text that is mostly ASCII with some non-ASCII characters. You probably won't generate it manually, but you'll see it when inspecting raw email; see our email headers analyzer.
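When reading raw email by hand, a rough decoder can help. The sketch below handles only the two rules described above (=XX escapes and soft line breaks) and assumes the text is UTF-8 unless told otherwise; a real message states its charset in the Content-Type header. decodeQuotedPrintable is a hypothetical helper, not a library function.

```javascript
// Minimal quoted-printable decoder for inspection purposes.
function decodeQuotedPrintable(input, charset = 'utf-8') {
  // A trailing "=" before a line break is a soft break: join the lines.
  const unfolded = input.replace(/=\r?\n/g, '');
  const bytes = [];
  for (let i = 0; i < unfolded.length; i++) {
    const hex = unfolded.slice(i + 1, i + 3);
    if (unfolded[i] === '=' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // "=C3" -> byte 0xC3
      i += 2;
    } else {
      bytes.push(unfolded.charCodeAt(i) & 0xff); // literal ASCII character
    }
  }
  return new TextDecoder(charset).decode(new Uint8Array(bytes));
}

console.log(decodeQuotedPrintable('h=C3=A9llo w=C3=B6rld')); // "héllo wörld"
console.log(decodeQuotedPrintable('limit and =\r\ngets a soft line break.'));
// "limit and gets a soft line break."
```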
Hex encoding
The simplest encoding: each byte becomes two hex digits. Twice the size, but easy to read and easy to transcribe:
// Node.js
Buffer.from('Hello').toString('hex')
// "48656c6c6f"
Buffer.from('48656c6c6f', 'hex').toString()
// "Hello"
// Browser:
[...new TextEncoder().encode('Hello')].map(b => b.toString(16).padStart(2, '0')).join('')
// "48656c6c6f"
Use hex for: cryptographic hashes (SHA-256 outputs are conventionally hex), color codes, byte-level debugging, anywhere small size doesn't matter and human readability does. Avoid hex for large payloads — Base64 is more compact.
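The browser snippet above only covers encoding. A decoding counterpart that avoids Buffer might look like this sketch; hexToBytes is an invented name, and rejecting odd-length or non-hex input is a design choice made for the example.

```javascript
// Hex string -> Uint8Array, rejecting malformed input up front.
function hexToBytes(hex) {
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) {
    throw new Error('not a valid hex string');
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    // Each pair of hex digits is one byte.
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

console.log(new TextDecoder().decode(hexToBytes('48656c6c6f'))); // "Hello"
```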
Choosing the right one
| Where the data lives | Use this encoding |
|---|---|
| JSON property value (binary data) | Base64 |
| URL path or query | Percent-encoding (encodeURIComponent) |
| application/x-www-form-urlencoded body | URLSearchParams |
| JWT header/payload | Base64URL |
| HTML text content | HTML entity escape (framework default) |
| HTML attribute value | HTML entity escape including quotes |
| JSON string value | JSON escape (JSON.stringify) |
| Email body (text part) | Quoted-printable |
| Email body (binary attachment) | Base64 |
| Cryptographic hashes | Hex |
| Color codes | Hex |
| Data URL (data: scheme) | Base64 (or percent-encoding for text) |
The patterns that cause bugs
- Double-encoding. A value encoded twice doesn't come back after a single decode. %2520 isn't a space; it's a literal %20. Check what your framework is doing before you also encode.
- Mixing percent-encoding with form-encoding. Spaces are %20 in one, + in the other. URLSearchParams treats them interchangeably when reading but emits + when writing.
- Forgetting Base64URL. Putting standard Base64 in a URL or HTTP header can cause silent breakage when / or + means something to the receiver.
- Hand-rolling HTML escape. Forgetting to escape & first, or missing one of the five reserved characters. Use a library.
- Hand-rolling JSON. Forgetting to escape backslashes inside strings. Use JSON.stringify.
- Confusing character encoding (UTF-8) with byte encoding (Base64). They're different layers. UTF-8 turns characters into bytes; Base64 turns bytes into characters. To Base64-encode a Unicode string, you first UTF-8-encode it to bytes, then Base64 those bytes.
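The last point in the list, the two layers, shows up clearly when the same character is Base64-encoded with and without the UTF-8 step. This sketch assumes btoa and TextEncoder are available as globals; the variable names are illustrative.

```javascript
const text = '\u00e9'; // "é" as a single code point

// Layer 1 (character encoding): characters -> bytes.
const utf8Bytes = new TextEncoder().encode(text); // 0xC3 0xA9

// Layer 2 (byte encoding): bytes -> safe characters.
const viaUtf8 = btoa(String.fromCharCode(...utf8Bytes));
console.log(viaUtf8);   // "w6k="

// Skipping layer 1: btoa treats the code unit as the single Latin-1 byte 0xE9.
const viaLatin1 = btoa(text);
console.log(viaLatin1); // "6Q=="
```

Both outputs are valid Base64, which is why this mistake tends to surface only when the receiver decodes the bytes with a different character encoding than the sender assumed.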
The general rule: use your standard library, never hand-roll an encoder. The standard libraries handle all the edge cases (line breaks, padding, character sets, escape sequences) correctly. Custom encoding code is a CVE waiting to happen.