Understanding Text Encoding: Base64, URL, UTF-8 & More
Every piece of text you see on a computer is encoded as numbers. This guide explains how character encoding works, the differences between UTF-8, Base64, URL encoding, HTML entities, and Punycode, and when to use each one.
1. Character Sets & Unicode
A character set is the complete collection of characters that an encoding system supports. A code point is the unique number assigned to each character in that set. Understanding the difference between these two concepts is the foundation of understanding text encoding.
ASCII
ASCII (American Standard Code for Information Interchange) was published in 1963 and defines 128 characters using 7 bits. It covers the English alphabet (uppercase and lowercase), digits 0-9, common punctuation, and control characters like newline and tab. ASCII was sufficient for early computing but could not represent accented characters, non-Latin scripts, or symbols used outside the English-speaking world.
```
A = 65 (0x41)    a = 97 (0x61)        0 = 48 (0x30)
! = 33 (0x21)    newline = 10 (0x0A)  space = 32 (0x20)
```

Unicode
Unicode is the Universal Character Set. It assigns a unique code point to every character from every writing system in the world. As of Unicode 16.0, it defines nearly 155,000 characters across 168 scripts, plus emoji, symbols, and historical writing systems. Unicode code points are written as U+ followed by hexadecimal digits:
- U+0041 = Latin capital letter A
- U+00E9 = Latin small letter e with acute
- U+4E16 = CJK Unified Ideograph (Chinese character for "world")
- U+1F600 = Grinning face emoji
Unicode organizes its code points into 17 planes of 65,536 characters each. The Basic Multilingual Plane (BMP, Plane 0) contains the most commonly used characters including almost all modern scripts. Supplementary planes hold emoji, rare historic scripts, and specialized symbols.
It is important to understand that Unicode defines which characters exist and what their code points are. It does not define how those code points are stored as bytes. That is the job of an encoding scheme like UTF-8, UTF-16, or UTF-32.
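The split between code points and byte encodings is easy to see in JavaScript, where `codePointAt()` reports the abstract number and `TextEncoder` produces the concrete UTF-8 bytes:

```javascript
// A code point is just a number; the bytes depend on the encoding.
const ch = "é"; // U+00E9

// The code point, independent of any encoding:
console.log(ch.codePointAt(0).toString(16)); // "e9"

// The same code point stored as UTF-8 bytes:
const utf8 = new TextEncoder().encode(ch); // TextEncoder always emits UTF-8
console.log([...utf8].map(b => b.toString(16))); // ["c3", "a9"]
```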
2. UTF-8 Deep Dive
UTF-8 (Unicode Transformation Format, 8-bit) is the dominant encoding on the web. Over 98% of all web pages use UTF-8. It is a variable-width encoding, meaning different characters use different numbers of bytes (1 to 4). UTF-8 is fully backward-compatible with ASCII, which means any valid ASCII file is also a valid UTF-8 file.
Variable-Length Encoding
- 1 byte (code points U+0000 to U+007F): ASCII characters (A-Z, a-z, 0-9, basic punctuation). The high bit is always 0.
- 2 bytes (U+0080 to U+07FF): Most Latin-based scripts with diacritics, Cyrillic, Arabic, and Hebrew.
- 3 bytes (U+0800 to U+FFFF): Most East Asian characters (Chinese, Japanese, Korean) and many additional symbols.
- 4 bytes (U+10000 to U+10FFFF): Rare scripts, historic characters, and most emoji, which live in the supplementary planes.
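These width classes are easy to verify with the `TextEncoder` API, which always produces UTF-8:

```javascript
// Byte length of one sample character from each UTF-8 width class.
const enc = new TextEncoder(); // TextEncoder always emits UTF-8
for (const sample of ["A", "é", "世", "😀"]) {
  console.log(sample, enc.encode(sample).length); // 1, 2, 3, 4 bytes respectively
}
```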
Byte Order Mark (BOM)
The BOM is a special Unicode character (U+FEFF) that can appear at the start of a text stream to signal the byte order and encoding. In UTF-8, the BOM is represented as the byte sequence EF BB BF. While the BOM is unnecessary for UTF-8 (since UTF-8 has a defined byte order and is self-synchronizing), some editors (notably Windows Notepad) add it automatically. This can cause problems in scripts, configuration files, and concatenated string outputs.
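A small sketch of detecting and removing a BOM from raw bytes; `stripBom` is a hypothetical helper name, not a standard API:

```javascript
// Detect and strip a UTF-8 BOM (EF BB BF) from the start of a byte array.
function stripBom(bytes) {
  const hasBom = bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
  return hasBom ? bytes.subarray(3) : bytes;
}

const withBom = Uint8Array.of(0xEF, 0xBB, 0xBF, 0x68, 0x69); // BOM + "hi"
console.log(new TextDecoder().decode(stripBom(withBom))); // "hi"
```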
Why UTF-8 Dominates the Web
UTF-8 has several advantages that make it the preferred encoding for web content, APIs, and modern applications. It is compact for English and Western European text (1 byte per character, same as ASCII), handles all Unicode characters, and is self-synchronizing, meaning you can always determine character boundaries from any position in a byte stream.
UTF-8 vs UTF-16 vs UTF-32
| Property | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Min bytes per character | 1 | 2 | 4 |
| Max bytes per character | 4 | 4 | 4 |
| ASCII compatibility | Yes | No | No |
| Variable width | Yes (1-4 bytes) | Yes (2-4 bytes) | No (fixed 4 bytes) |
| Endianness | N/A | BOM or BE/LE label | BOM or BE/LE label |
| Best for | Web, APIs, files | OS internals (Windows, Java) | Internal processing |
3. Base64 Encoding
Base64 is a binary-to-text encoding scheme that represents binary data using a set of 64 ASCII characters (A-Z, a-z, 0-9, +, and /). It increases data size by approximately 33% because every 3 bytes of input become 4 characters of output.
How It Works
- Input bytes are divided into groups of 3 bytes (24 bits).
- Each 24-bit group is split into four 6-bit chunks.
- Each 6-bit chunk is mapped to one of 64 ASCII characters using a lookup table.
- If the input length is not a multiple of 3, the output is padded with = characters (one or two = signs).
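The steps above can be exercised with the built-in btoa() function, available in browsers and modern Node.js:

```javascript
// btoa() encodes a binary string to Base64.
console.log(btoa("Man")); // "TWFu" - 3 bytes become 4 chars, no padding
console.log(btoa("Ma"));  // "TWE=" - 2 bytes leave one "=" pad
console.log(btoa("M"));   // "TQ==" - 1 byte leaves two "=" pads
```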
```
Input:  "Man"
Bytes:  77 97 110
Binary: 01001101 01100001 01101110
Groups: 010011 010110 000101 101110
Index:  19 22 5 46
Base64: T W F u
Result: TWFu
```

When to Use Base64
- Data URIs: Embedding small images directly in HTML or CSS using the data: URI scheme (e.g., data:image/png;base64,iVBOR...).
- Email attachments: MIME encoding uses Base64 to send binary files like images and documents in email.
- API authentication: HTTP Basic Auth sends credentials as base64(username:password) in the Authorization header.
- JSON Web Tokens (JWT): The header and payload segments of a JWT are Base64URL-encoded.
- Binary data in JSON: Since JSON cannot represent raw binary, Base64 is used to embed binary data within JSON strings.
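As a quick illustration of the Basic Auth case, using placeholder credentials:

```javascript
// HTTP Basic Auth: the Authorization header carries base64(username:password).
// "user" and "s3cret" are placeholder credentials for illustration.
const header = "Basic " + btoa("user:s3cret");
console.log(header); // "Basic dXNlcjpzM2NyZXQ="
```

Note that Base64 is an encoding, not encryption: anyone can decode the header, so Basic Auth is only safe over HTTPS.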
4. URL Encoding (Percent Encoding)
URL encoding (also called percent encoding) converts characters into a format that can be safely transmitted over the Internet. URLs can only contain a limited set of ASCII characters. Any character outside this safe set must be encoded as a percent sign followed by two hexadecimal digits.
Reserved vs Unreserved Characters
Unreserved characters (A-Z, a-z, 0-9, hyphen, period, underscore, tilde) never need to be encoded in URLs. Reserved characters like /, ?, &, =, #, and % have special meaning in URL syntax and must be encoded when used as data rather than delimiters.
```
Space → %20 (or + in query strings)
é     → %C3%A9 (two bytes in UTF-8)
&     → %26
你好  → %E4%BD%A0%E5%A5%BD (six bytes in UTF-8)
```

Encoding Spaces: + vs %20
The + sign is an application/x-www-form-urlencoded convention for spaces in query strings only. In the path component of a URL, spaces must be encoded as %20. When decoding, always use decodeURIComponent() for path and parameter values, and handle + separately if dealing with form data.
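A minimal sketch of that rule; `decodeFormValue` is a hypothetical helper name for handling form-encoded values:

```javascript
// decodeURIComponent() does NOT treat "+" as a space, so form-encoded
// query values need the "+" translated first.
function decodeFormValue(value) {
  return decodeURIComponent(value.replace(/\+/g, "%20"));
}

console.log(decodeURIComponent("a%20b")); // "a b"
console.log(decodeURIComponent("a+b"));   // "a+b" (the plus survives)
console.log(decodeFormValue("a+b"));      // "a b"
```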
When to Encode or Decode
- Query parameter values that may contain special characters
- Path segments with reserved or unsafe characters
- Form data submitted with application/x-www-form-urlencoded
- Any data being placed in a URL that is not guaranteed to be URL-safe
In JavaScript, use encodeURIComponent() for individual parameter values and encodeURI() for full URLs (which leaves characters like /, :, and ? unencoded since they are URL delimiters).
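The difference is easy to see side by side (example.com and the query string below are placeholders):

```javascript
const url = "https://example.com/a b?q=1&lang=fr"; // placeholder URL

// encodeURI keeps URL delimiters (/, :, ?, &, =) intact:
console.log(encodeURI(url));
// "https://example.com/a%20b?q=1&lang=fr"

// encodeURIComponent escapes everything reserved; use it for single values only:
console.log(encodeURIComponent("1&lang=fr"));
// "1%26lang%3Dfr"
```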
5. HTML Entities
HTML entities are special sequences used to represent characters that have reserved meaning in HTML markup, or characters that cannot be easily typed on a keyboard. They are essential for displaying special characters inside HTML content without the browser interpreting them as markup.
Named Entities
The most commonly used named entities are:
| Character | Entity | Description |
|---|---|---|
| & | &amp; | Ampersand |
| < | &lt; | Less than (left angle bracket) |
| > | &gt; | Greater than (right angle bracket) |
| " | &quot; | Double quotation mark |
| ' | &apos; | Single quotation mark (apostrophe) |
| (space) | &nbsp; | Non-breaking space |
| © | &copy; | Copyright symbol |
| — | &mdash; | Em dash |
Numeric Entities
Any Unicode character can be represented using numeric entities, either in decimal or hexadecimal form:
```
&#60;     → < (decimal)
&#x3C;    → < (hexadecimal)
&#8364;   → € (Euro sign)
&#x1F600; → 😀 (grinning face emoji)
```

When Are HTML Entities Needed?
- Escaping markup in content: When you need to display HTML code examples on a web page, angle brackets and ampersands must be escaped as entities.
- XSS prevention: User-generated content must be entity-encoded before being inserted into HTML to prevent cross-site scripting attacks. Modern frameworks like React handle this automatically.
- Special whitespace: The &nbsp; entity creates a non-breaking space that prevents line breaks, useful for formatting.
- Typographic characters: Named entities like &mdash; (em dash) and &ndash; (en dash) ensure correct rendering across all platforms.
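A minimal escaping helper for the markup and XSS cases above; `escapeHtml` is an illustrative name, not a built-in API:

```javascript
// Replace the five characters with reserved meaning in HTML by their entities.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")  // must run first, or it would double-escape
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeHtml('<a href="x">Fish & Chips</a>'));
// &lt;a href=&quot;x&quot;&gt;Fish &amp; Chips&lt;/a&gt;
```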
6. Punycode & Internationalized Domain Names
Domain names have historically been limited to ASCII characters (letters, digits, hyphens). Punycode is an encoding scheme defined in RFC 3492 that converts Unicode strings into ASCII, enabling non-English domain names while staying compatible with the existing DNS infrastructure.
The Problem Punycode Solves
Without Punycode, a domain like münchen.de could not be registered or resolved because the ü character is not valid in DNS. Punycode encodes the Unicode characters into ASCII and prefixes the result with xn-- (the ASCII Compatible Encoding, or ACE, prefix).
```
münchen.de → xn--mnchen-3ya.de
日本.jp     → xn--wgv71a.jp
中国.cn     → xn--fiqs8s.cn
café.com   → xn--caf-dma.com
```

How IDN Works
- A user types a Unicode domain name (e.g., münchen.de) in the browser address bar.
- The browser converts it to Punycode (e.g., xn--mnchen-3ya.de) using the IDNA (Internationalized Domain Names in Applications) protocol.
- The DNS query is sent using the ASCII Punycode representation.
- The DNS server resolves the domain normally. The user sees the original Unicode domain in the browser.
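You can observe the conversion step directly, because the WHATWG URL parser used by browsers and Node.js applies IDNA when parsing a hostname:

```javascript
// Parsing a Unicode hostname yields its Punycode (ACE) form.
console.log(new URL("https://münchen.de/").hostname); // "xn--mnchen-3ya.de"
console.log(new URL("https://日本.jp/").hostname);     // "xn--wgv71a.jp"
```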
Security Considerations
Punycode introduces a homograph attack risk: visually similar characters from different scripts can be used to create phishing domains. For example, the Cyrillic letter "а" looks identical to the Latin "a", allowing an attacker to register a domain that looks legitimate. Modern browsers mitigate this by displaying the Punycode form when a domain mixes scripts from different languages.
7. Common Encoding Issues
Mojibake
Mojibake (from Japanese, meaning "changed characters") occurs when text is decoded using the wrong character encoding. The classic example is UTF-8 text being read as Latin-1 (ISO-8859-1):
```
Intended:        Café
UTF-8 bytes:     43 61 66 C3 A9
Read as Latin-1: CafÃ©
```

To prevent mojibake, always ensure that the encoding declared in HTTP headers, HTML meta tags, and your database configuration all match the actual encoding of the content.
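This failure mode can be reproduced in a few lines by decoding UTF-8 bytes with the wrong decoder:

```javascript
// Encode as UTF-8, then (incorrectly) decode the bytes as Latin-1.
const bytes = new TextEncoder().encode("Café");        // 43 61 66 C3 A9
const wrong = new TextDecoder("latin1").decode(bytes); // wrong decoder
console.log(wrong); // "CafÃ©"
```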
BOM Issues
A UTF-8 BOM (EF BB BF) at the start of a file can cause subtle bugs: PHP output buffering errors, XML parser failures, and invisible characters at the beginning of JSON responses. If you see unexpected behavior at the start of a file, check for a BOM using a hex editor.
Double Encoding
Double encoding happens when already-encoded text is encoded again. For example, encoding the string %20 a second time produces %2520. This is a common bug in web applications that encode user input multiple times through different processing layers.
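In JavaScript, the bug looks like this:

```javascript
const once = encodeURIComponent("a b"); // "a%20b"
const twice = encodeURIComponent(once); // "a%2520b" - the "%" itself got re-encoded
console.log(once, twice);
```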
Replacement Character (�)
The Unicode replacement character U+FFFD (�) appears when a byte sequence cannot be decoded as valid UTF-8. This typically means the data was encoded in a different format (like Windows-1252 or ISO-8859-1) but is being read as UTF-8. The only reliable fix is to re-encode the original data using the correct encoding.
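For example, decoding a Latin-1 byte as UTF-8:

```javascript
// 0xE9 is "é" in Latin-1 but an invalid, truncated sequence in UTF-8.
const text = new TextDecoder("utf-8")
  .decode(Uint8Array.of(0x43, 0x61, 0x66, 0xE9)); // "Caf" + bad byte
console.log(text); // "Caf�" - U+FFFD replaces the undecodable byte
```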
8. Encoding Utilities & Resources
KnowKit provides several free browser-based encoding utilities to help you work with different formats. All processing happens client-side, so your data never leaves your browser.
- Base64 Encoder/Decoder: Convert between plain text and Base64 encoding instantly.
- URL Encoder/Decoder: Encode and decode URL components safely.
- HTML Entities Encoder: Convert special characters to HTML entities and back.
- Punycode Converter: Convert internationalized domain names between Unicode and Punycode (ACE) format.
- URL Parser: Break down URLs into their components and inspect encoded segments.
Understanding text encoding is fundamental for anyone working with web technologies. Character encoding issues are among the most common and frustrating bugs in software development, but they become straightforward once you understand the underlying principles. Whether you are building APIs, handling user input, or working with international content, knowing how UTF-8, Base64, URL encoding, HTML entities, and Punycode work will save you hours of debugging.
Nelson
Developer and creator of KnowKit. Building browser-based tools since 2024.