How Character Encoding of URL-Encoded HTTP Query Parameters Matters
When data is sent via HTTP query parameters (the part of a URL after "?"), it is typically URL-encoded to make it safe for transmission over the internet. This encoding converts characters into a format that can be safely included in a URL.
URL Encoding Basics
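URL encoding (also called percent-encoding) replaces each unsafe character with a "%" followed by two hexadecimal digits for every byte the character occupies in the chosen character encoding. An ASCII character is a single byte; a non-ASCII character in UTF-8 takes two or more. For example: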
- Space (" ") → "%20"
- Exclamation mark ("!") → "%21"
- Unicode character ("✓") → "%E2%9C%93"
Original: Hello World!
URL Encoded: Hello%20World%21
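The mappings above can be reproduced with Python's standard library. This is a minimal sketch: the helper name percent_encode is only illustrative, and it assumes UTF-8, which is the default for urllib.parse.quote.

```python
from urllib.parse import quote, unquote

def percent_encode(text: str) -> str:
    """Percent-encode text as UTF-8 bytes, leaving only unreserved characters."""
    # safe="" makes quote() encode "/" too, which it leaves alone by default.
    return quote(text, safe="")

for sample in (" ", "!", "✓", "Hello World!"):
    encoded = percent_encode(sample)
    # Decoding with the same charset restores the original text.
    assert unquote(encoded) == sample
    print(f"{sample!r} -> {encoded}")
```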
Character Encoding Matters
The character encoding (charset) used to transform text into bytes directly affects the URL encoding:
- UTF-8: Variable-width encoding (1 to 4 bytes per character), most common for web applications.
- ISO-8859-1 (Latin-1): Single-byte encoding, sometimes used in older systems.
- UTF-16: Rare for URLs, but possible; it creates larger URL-encoded values.
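To get a feel for the size difference, the sketch below compares the length of the percent-encoded result for one word under each charset. The helper encoded_length is hypothetical, and "utf-16-le" is chosen here only to keep a byte-order mark out of the result.

```python
from urllib.parse import quote

def encoded_length(text: str, charset: str) -> int:
    """Length of the percent-encoded form of `text` under `charset`."""
    # quote() accepts bytes and escapes every byte that is not unreserved.
    return len(quote(text.encode(charset)))

for charset in ("utf-8", "iso-8859-1", "utf-16-le"):
    print(charset, quote("café".encode(charset)), encoded_length("café", charset))
```

UTF-16 pads every ASCII letter with a zero byte, and each of those becomes "%00", which is why its result is the longest of the three.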
Example URL encoding the word "café"
    from urllib.parse import quote

    # UTF-8 encoding
    utf8_encoded = quote("café".encode('utf-8'))
    print(utf8_encoded)  # Output: caf%C3%A9

    # ISO-8859-1 encoding
    iso_encoded = quote("café".encode('iso-8859-1'))
    print(iso_encoded)  # Output: caf%E9
- UTF-8 → "café" becomes "caf%C3%A9" (two bytes for "é": "C3 A9")
- ISO-8859-1 → "café" becomes "caf%E9" (one byte for "é": "E9")
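If the server expects UTF-8 but receives ISO-8859-1, it will misinterpret the bytes: "%E9" on its own is not a valid UTF-8 sequence, so the decoder either rejects the value or substitutes the replacement character (U+FFFD).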
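A short sketch of the mismatch in both directions, using urllib.parse.unquote. The variable names are illustrative; errors="replace" is unquote's default and is spelled out here for clarity.

```python
from urllib.parse import unquote

latin1_encoded = "caf%E9"     # "café" percent-encoded from ISO-8859-1 bytes
utf8_encoded = "caf%C3%A9"    # "café" percent-encoded from UTF-8 bytes

# A UTF-8 decoder cannot interpret the lone byte 0xE9, so it is replaced
# with U+FFFD (the replacement character).
wrong_utf8 = unquote(latin1_encoded, encoding="utf-8", errors="replace")

# The opposite mismatch: each UTF-8 byte is read as a separate Latin-1 character.
wrong_latin1 = unquote(utf8_encoded, encoding="iso-8859-1")

print(repr(wrong_utf8))    # 'caf\ufffd'
print(repr(wrong_latin1))  # 'cafÃ©'
```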
UTF-8 is the Typical Charset for Query Parameters
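RFC 3986 (the Uniform Resource Identifier specification) does not mandate a single charset for existing URI schemes, but it recommends that new schemes convert text to UTF-8 before percent-encoding, and both RFC 3987 (IRIs) and the WHATWG URL Standard are built around UTF-8. Modern browsers (Chrome, Firefox, Safari) always use UTF-8 in encodeURIComponent and URLSearchParams; form submissions and query strings in links follow the page's encoding, which is UTF-8 on most modern sites. APIs and RESTful services generally expect UTF-8 unless specified otherwise.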
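Python's standard library follows the same default. As a small sketch, urlencode and parse_qs both use UTF-8 unless an encoding argument says otherwise; the parameter names "q" and "mark" are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Build a query string; non-ASCII values are converted to UTF-8 bytes first.
query = urlencode({"q": "café", "mark": "✓"})
print(query)  # q=caf%C3%A9&mark=%E2%9C%93

# Parse it back; parse_qs also assumes UTF-8 by default.
params = parse_qs(query)
print(params)  # {'q': ['café'], 'mark': ['✓']}
```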
Problems if the Charset Doesn't Match
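If the server expects UTF-8 but receives a different encoding, non-ASCII characters may be corrupted, while plain ASCII text passes through unchanged because its bytes are the same in UTF-8 and ISO-8859-1. Accented letters, symbols like "€" or "✓", and non-Latin scripts could all be misread. For example, an API expecting "café" in UTF-8 sees garbled text if the client sent it in ISO-8859-1; in the opposite direction, UTF-8 bytes read as ISO-8859-1 show up as "cafÃ©".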
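One way a server might guard against this is to decode strictly and reject anything that is not valid UTF-8, rather than silently accepting garbled text. The function below is a hypothetical sketch of that idea, not a prescribed approach; it handles a single raw value and does not treat "+" as a space.

```python
from urllib.parse import unquote_to_bytes

def decode_query_value(raw: str) -> str:
    """Percent-decode `raw` and require the result to be valid UTF-8."""
    data = unquote_to_bytes(raw)
    try:
        # Strict decoding raises instead of inserting replacement characters.
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"query value is not valid UTF-8: {raw!r}") from exc

print(decode_query_value("caf%C3%A9"))  # café
```

Calling decode_query_value("caf%E9") raises ValueError, which a handler could turn into a 400 response.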
Specifying Character Encoding
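The Content-Type header describes the request body, so a different charset can only be declared there for form data sent in the body (for example, in a POST request):
Content-Type: application/x-www-form-urlencoded; charset=ISO-8859-1
Not every server honors this parameter, and no header declares the charset of the query string in the URL itself, so client and server have to agree on it in advance. When in doubt, use UTF-8.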
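A sketch of a client that encodes a form body in ISO-8859-1 and declares it in the header. The endpoint URL and the field name are placeholders, the request is built but never sent, and whether the receiving server actually honors the charset parameter is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

charset = "ISO-8859-1"

# Encode the form fields with the same charset that the header declares.
body = urlencode({"name": "café"}, encoding=charset).encode("ascii")

request = Request(
    "https://example.com/submit",  # hypothetical endpoint
    data=body,
    headers={"Content-Type": f"application/x-www-form-urlencoded; charset={charset}"},
    method="POST",
)

print(body)  # b'name=caf%E9'
```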