Understanding Text Encoding: Base64, URL, UTF-8 & More
Every piece of text you see on a computer is encoded as numbers. This guide explains how character encoding works, the differences between UTF-8, Base64, URL encoding, HTML entities, and Punycode, and when to use each one.
1. Character Sets & Unicode
A character set is the complete collection of characters that an encoding system supports. A code point is the unique number assigned to each character in that set. Understanding the difference between these two concepts is the foundation of understanding text encoding.
ASCII
ASCII (American Standard Code for Information Interchange) was published in 1963 and defines 128 characters using 7 bits. It covers the English alphabet (uppercase and lowercase), digits 0-9, common punctuation, and control characters like newline and tab. ASCII was sufficient for early computing but could not represent accented characters, non-Latin scripts, or symbols used outside the English-speaking world.
```
A = 65 (0x41)    a = 97 (0x61)        0 = 48 (0x30)
! = 33 (0x21)    newline = 10 (0x0A)  space = 32 (0x20)
```

Unicode
Unicode is the Universal Character Set. It assigns a unique code point to every character from every writing system in the world. As of Unicode 16.0, it defines nearly 155,000 characters across 168 scripts, plus emoji, symbols, and historical writing systems. Unicode code points are written as U+ followed by hexadecimal digits:
- U+0041 = Latin capital letter A
- U+00E9 = Latin small letter e with acute
- U+4E16 = CJK Unified Ideograph (Chinese character for "world")
- U+1F600 = Grinning face emoji
Unicode organizes its code points into 17 planes of 65,536 characters each. The Basic Multilingual Plane (BMP, Plane 0) contains the most commonly used characters including almost all modern scripts. Supplementary planes hold emoji, rare historic scripts, and specialized symbols.
It is important to understand that Unicode defines which characters exist and what their code points are. It does not define how those code points are stored as bytes. That is the job of an encoding scheme like UTF-8, UTF-16, or UTF-32.
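The split between code points and byte encodings is easy to see in JavaScript, where `codePointAt()` reports the abstract number and `TextEncoder` produces the concrete UTF-8 bytes:

```javascript
// A code point is just a number; the bytes depend on the encoding.
const ch = "é"; // U+00E9

// The code point, independent of any encoding:
console.log(ch.codePointAt(0).toString(16)); // "e9"

// The same code point stored as UTF-8 bytes:
const utf8 = new TextEncoder().encode(ch); // TextEncoder always emits UTF-8
console.log([...utf8].map(b => b.toString(16))); // ["c3", "a9"]
```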
2. UTF-8 Deep Dive
UTF-8 (Unicode Transformation Format, 8-bit) is the dominant encoding on the web. Over 98% of all web pages use UTF-8. It is a variable-width encoding, meaning different characters use different numbers of bytes (1 to 4). UTF-8 is fully backward-compatible with ASCII, which means any valid ASCII file is also a valid UTF-8 file.
Variable-Length Encoding
- 1 byte (code points U+0000 to U+007F): ASCII characters (A-Z, a-z, 0-9, basic punctuation). The high bit is always 0.
- 2 bytes (U+0080 to U+07FF): Most Latin-based scripts with diacritics, Cyrillic, Arabic, and Hebrew.
- 3 bytes (U+0800 to U+FFFF): Most East Asian characters (Chinese, Japanese, Korean) and many additional symbols.
- 4 bytes (U+10000 to U+10FFFF): Rare scripts, historic characters, and most emoji, which live in the supplementary planes.
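These width classes are easy to verify with the `TextEncoder` API, which always produces UTF-8:

```javascript
// Byte length of one sample character from each UTF-8 width class.
const enc = new TextEncoder(); // TextEncoder always emits UTF-8
for (const sample of ["A", "é", "世", "😀"]) {
  console.log(sample, enc.encode(sample).length); // 1, 2, 3, 4 bytes respectively
}
```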
Byte Order Mark (BOM)
The BOM is a special Unicode character (U+FEFF) that can appear at the start of a text stream to signal the byte order and encoding. In UTF-8, the BOM is represented as the byte sequence EF BB BF. While the BOM is unnecessary for UTF-8 (since UTF-8 has a defined byte order and is self-synchronizing), some editors (notably Windows Notepad) add it automatically. This can cause problems in scripts, configuration files, and concatenated string outputs.
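A small sketch of detecting and removing a BOM from raw bytes; `stripBom` is a hypothetical helper name, not a standard API:

```javascript
// Detect and strip a UTF-8 BOM (EF BB BF) from the start of a byte array.
function stripBom(bytes) {
  const hasBom = bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
  return hasBom ? bytes.subarray(3) : bytes;
}

const withBom = Uint8Array.of(0xEF, 0xBB, 0xBF, 0x68, 0x69); // BOM + "hi"
console.log(new TextDecoder().decode(stripBom(withBom))); // "hi"
```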
Why UTF-8 Dominates the Web
UTF-8 has several advantages that make it the preferred encoding for web content, APIs, and modern applications. It is compact for English and Western European text (1 byte per character, same as ASCII), handles all Unicode characters, and is self-synchronizing, meaning you can always determine character boundaries from any position in a byte stream.
UTF-8 vs UTF-16 vs UTF-32
| Property | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Min bytes per character | 1 | 2 | 4 |
| Max bytes per character | 4 | 4 | 4 |
| ASCII compatibility | Yes | No | No |
| Variable width | Yes (1-4 bytes) | Yes (2-4 bytes) | No (fixed 4 bytes) |
| Endianness | N/A | BOM or BE/LE label | BOM or BE/LE label |
| Best for | Web, APIs, files | OS internals (Windows, Java) | Internal processing |
3. Base64 Encoding
Base64 is a binary-to-text encoding scheme that represents binary data using a set of 64 ASCII characters (A-Z, a-z, 0-9, +, and /). It increases data size by approximately 33% because every 3 bytes of input become 4 characters of output.
How It Works
- Input bytes are divided into groups of 3 bytes (24 bits).
- Each 24-bit group is split into four 6-bit chunks.
- Each 6-bit chunk is mapped to one of 64 ASCII characters using a lookup table.
- If the input length is not a multiple of 3, the output is padded with = characters (one or two = signs).
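The steps above can be exercised with the built-in btoa() function, available in browsers and modern Node.js:

```javascript
// btoa() encodes a binary string to Base64.
console.log(btoa("Man")); // "TWFu" - 3 bytes become 4 chars, no padding
console.log(btoa("Ma"));  // "TWE=" - 2 bytes leave one "=" pad
console.log(btoa("M"));   // "TQ==" - 1 byte leaves two "=" pads
```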
```
Input:  "Man"
Bytes:  77 97 110
Binary: 01001101 01100001 01101110
Groups: 010011 010110 000101 101110
Index:  19 22 5 46
Base64: T W F u
Result: TWFu
```

When to Use Base64
- Data URIs: Embedding small images directly in HTML or CSS using the data: URI scheme (e.g., data:image/png;base64,iVBOR...).
- Email attachments: MIME encoding uses Base64 to send binary files like images and documents in email.
- API authentication: HTTP Basic Auth sends credentials as base64(username:password) in the Authorization header.
- JSON Web Tokens (JWT): The header and payload segments of a JWT are Base64URL-encoded.
- Binary data in JSON: Since JSON cannot represent raw binary, Base64 is used to embed binary data within JSON strings.
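As a quick illustration of the Basic Auth case, using placeholder credentials:

```javascript
// HTTP Basic Auth: the Authorization header carries base64(username:password).
// "user" and "s3cret" are placeholder credentials for illustration.
const header = "Basic " + btoa("user:s3cret");
console.log(header); // "Basic dXNlcjpzM2NyZXQ="
```

Note that Base64 is an encoding, not encryption: anyone can decode the header, so Basic Auth is only safe over HTTPS.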
4. URL Encoding (Percent Encoding)
URL encoding (also called percent encoding) converts characters into a format that can be safely transmitted over the Internet. URLs can only contain a limited set of ASCII characters. Any character outside this safe set must be encoded as a percent sign followed by two hexadecimal digits.
Reserved vs Unreserved Characters
Unreserved characters (A-Z, a-z, 0-9, hyphen, period, underscore, tilde) never need to be encoded in URLs. Reserved characters like /, ?, &, =, #, and % have special meaning in URL syntax and must be encoded when used as data rather than delimiters.
```
Space → %20 (or + in query strings)
é     → %C3%A9 (two bytes in UTF-8)
&     → %26
你好  → %E4%BD%A0%E5%A5%BD (six bytes in UTF-8)
```

Encoding Spaces: + vs %20
The + sign is an application/x-www-form-urlencoded convention for spaces in query strings only. In the path component of a URL, spaces must be encoded as %20. When decoding, always use decodeURIComponent() for path and parameter values, and handle + separately if dealing with form data.
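A minimal sketch of that rule; `decodeFormValue` is a hypothetical helper name for handling form-encoded values:

```javascript
// decodeURIComponent() does NOT treat "+" as a space, so form-encoded
// query values need the "+" translated first.
function decodeFormValue(value) {
  return decodeURIComponent(value.replace(/\+/g, "%20"));
}

console.log(decodeURIComponent("a%20b")); // "a b"
console.log(decodeURIComponent("a+b"));   // "a+b" (the plus survives)
console.log(decodeFormValue("a+b"));      // "a b"
```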
When to Encode or Decode
- Query parameter values that may contain special characters
- Path segments with reserved or unsafe characters
- Form data submitted with application/x-www-form-urlencoded
- Any data being placed in a URL that is not guaranteed to be URL-safe
In JavaScript, use encodeURIComponent() for individual parameter values and encodeURI() for full URLs (which leaves characters like /, :, and ? unencoded since they are URL delimiters).
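The difference is easy to see side by side (example.com and the query string below are placeholders):

```javascript
const url = "https://example.com/a b?q=1&lang=fr"; // placeholder URL

// encodeURI keeps URL delimiters (/, :, ?, &, =) intact:
console.log(encodeURI(url));
// "https://example.com/a%20b?q=1&lang=fr"

// encodeURIComponent escapes everything reserved; use it for single values only:
console.log(encodeURIComponent("1&lang=fr"));
// "1%26lang%3Dfr"
```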
5. HTML Entities
HTML entities are special sequences used to represent characters that have reserved meaning in HTML markup, or characters that cannot be easily typed on a keyboard. They are essential for displaying special characters inside HTML content without the browser interpreting them as markup.
Named Entities
The most commonly used named entities are:
| Character | Entity | Description |
|---|---|---|
| & | &amp; | Ampersand |
| < | &lt; | Less than (left angle bracket) |
| > | &gt; | Greater than (right angle bracket) |
| " | &quot; | Double quotation mark |
| ' | &apos; | Single quotation mark (apostrophe) |
| (space) | &nbsp; | Non-breaking space |
| © | &copy; | Copyright symbol |
| — | &mdash; | Em dash |
Numeric Entities
Any Unicode character can be represented using numeric entities, either in decimal or hexadecimal form:
```
&#60;     → < (decimal)
&#x3C;    → < (hexadecimal)
&#8364;   → € (Euro sign)
&#x1F600; → 😀 (grinning face emoji)
```

When Are HTML Entities Needed?
- Escaping markup in content: When you need to display HTML code examples on a web page, angle brackets and ampersands must be escaped as entities.
- XSS prevention: User-generated content must be entity-encoded before being inserted into HTML to prevent cross-site scripting attacks. Modern frameworks like React handle this automatically.
- Special whitespace: The &nbsp; entity creates a non-breaking space that prevents line breaks, useful for formatting.
- Typographic characters: Named entities like &mdash; (em dash) and &ndash; (en dash) ensure correct rendering across all platforms.
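A minimal escaping helper for the markup and XSS cases above; `escapeHtml` is an illustrative name, not a built-in API:

```javascript
// Replace the five characters with reserved meaning in HTML by their entities.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")  // must run first, or it would double-escape
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeHtml('<a href="x">Fish & Chips</a>'));
// &lt;a href=&quot;x&quot;&gt;Fish &amp; Chips&lt;/a&gt;
```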
6. Punycode & Internationalized Domain Names
Domain names have historically been limited to ASCII characters (letters, digits, hyphens). Punycode is an encoding scheme defined in RFC 3492 that converts Unicode strings into ASCII, enabling non-English domain names while staying compatible with the existing DNS infrastructure.
The Problem Punycode Solves
Without Punycode, a domain like münchen.de could not be registered or resolved because the ü character is not valid in DNS. Punycode encodes the Unicode characters into ASCII and prefixes the result with xn-- (the ASCII Compatible Encoding, or ACE, prefix).
```
münchen.de → xn--mnchen-3ya.de
日本.jp     → xn--wgv71a.jp
中国.cn     → xn--fiqs8s.cn
café.com   → xn--caf-dma.com
```

How IDN Works
- A user types a Unicode domain name (e.g., münchen.de) in the browser address bar.
- The browser converts it to Punycode (e.g., xn--mnchen-3ya.de) using the IDNA (Internationalized Domain Names in Applications) protocol.
- The DNS query is sent using the ASCII Punycode representation.
- The DNS server resolves the domain normally. The user sees the original Unicode domain in the browser.
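You can observe the conversion step directly, because the WHATWG URL parser used by browsers and Node.js applies IDNA when parsing a hostname:

```javascript
// Parsing a Unicode hostname yields its Punycode (ACE) form.
console.log(new URL("https://münchen.de/").hostname); // "xn--mnchen-3ya.de"
console.log(new URL("https://日本.jp/").hostname);     // "xn--wgv71a.jp"
```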
Security Considerations
Punycode introduces a homograph attack risk: visually similar characters from different scripts can be used to create phishing domains. For example, the Cyrillic letter "а" looks identical to the Latin "a", allowing an attacker to register a domain that looks legitimate. Modern browsers mitigate this by displaying the Punycode form when a domain mixes scripts from different languages.
7. Common Encoding Issues
Mojibake
Mojibake (from Japanese, meaning "changed characters") occurs when text is decoded using the wrong character encoding. The classic example is UTF-8 text being read as Latin-1 (ISO-8859-1):
```
Intended:        Café
UTF-8 bytes:     43 61 66 C3 A9
Read as Latin-1: CafÃ©
```

To prevent mojibake, always ensure that the encoding declared in HTTP headers, HTML meta tags, and your database configuration all match the actual encoding of the content.
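This failure mode can be reproduced in a few lines by decoding UTF-8 bytes with the wrong decoder:

```javascript
// Encode as UTF-8, then (incorrectly) decode the bytes as Latin-1.
const bytes = new TextEncoder().encode("Café");        // 43 61 66 C3 A9
const wrong = new TextDecoder("latin1").decode(bytes); // wrong decoder
console.log(wrong); // "CafÃ©"
```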
BOM Issues
A UTF-8 BOM (EF BB BF) at the start of a file can cause subtle bugs: PHP output buffering errors, XML parser failures, and invisible characters at the beginning of JSON responses. If you see unexpected behavior at the start of a file, check for a BOM using a hex editor.
Double Encoding
Double encoding happens when already-encoded text is encoded again. For example, encoding the string %20 a second time produces %2520. This is a common bug in web applications that encode user input multiple times through different processing layers.
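In JavaScript, the bug looks like this:

```javascript
const once = encodeURIComponent("a b"); // "a%20b"
const twice = encodeURIComponent(once); // "a%2520b" - the "%" itself got re-encoded
console.log(once, twice);
```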
Replacement Character (�)
The Unicode replacement character U+FFFD (�) appears when a byte sequence cannot be decoded as valid UTF-8. This typically means the data was encoded in a different format (like Windows-1252 or ISO-8859-1) but is being read as UTF-8. The only reliable fix is to re-encode the original data using the correct encoding.
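For example, decoding a Latin-1 byte as UTF-8:

```javascript
// 0xE9 is "é" in Latin-1 but an invalid, truncated sequence in UTF-8.
const text = new TextDecoder("utf-8")
  .decode(Uint8Array.of(0x43, 0x61, 0x66, 0xE9)); // "Caf" + bad byte
console.log(text); // "Caf�" - U+FFFD replaces the undecodable byte
```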
8. Encoding Utilities & Resources
KnowKit provides several free browser-based encoding utilities to help you work with different formats. All processing happens client-side, so your data never leaves your browser.
- Base64 Encoder/Decoder: Convert between plain text and Base64 encoding instantly.
- URL Encoder/Decoder: Encode and decode URL components safely.
- HTML Entities Encoder: Convert special characters to HTML entities and back.
- Punycode Converter: Convert internationalized domain names between Unicode and Punycode (ACE) format.
- URL Parser: Break down URLs into their components and inspect encoded segments.
Understanding text encoding is fundamental for anyone working with web technologies. Character encoding issues are among the most common and frustrating bugs in software development, but they become straightforward once you understand the underlying principles. Whether you are building APIs, handling user input, or working with international content, knowing how UTF-8, Base64, URL encoding, HTML entities, and Punycode work will save you hours of debugging.
Nelson
Developer and creator of KnowKit. Building browser-based tools since 2024.