How Character Encoding (Charset) Affects Compression and Decompression

When you compress or decompress text, the character encoding (charset) of that text plays a crucial role because compression algorithms work on raw bytes, not on human-readable characters. A different encoding produces a different byte sequence, which changes the compressed output and determines how the decompressed bytes must be decoded back into text.


Text to Bytes Conversion

Compression formats and algorithms such as ZIP, GZIP, BZip2, and LZ77 operate on byte data. Text must therefore be converted to bytes with a character encoding before it can be compressed. Common encodings include:

  • UTF-8: Variable length (1 to 4 bytes per character; 1 byte for ASCII).
  • UTF-16: Variable length (2 bytes for characters in the Basic Multilingual Plane, 4 bytes for all others, which are stored as surrogate pairs).
  • ISO-8859-1: Single-byte encoding, limited to 256 characters.
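
As a rough illustration of the list above, the sketch below encodes a few sample characters and reports how many bytes each encoding needs. The sample characters and the helper name byte_lengths are chosen here for illustration only. The 'utf-16-le' codec is used instead of 'utf-16' so that no byte order mark is counted.

```python
def byte_lengths(ch):
    """Return the encoded size of one character in several encodings."""
    sizes = {}
    for codec in ("utf-8", "utf-16-le", "iso-8859-1"):
        try:
            sizes[codec] = len(ch.encode(codec))
        except UnicodeEncodeError:
            # The character has no representation in this encoding.
            sizes[codec] = None
    return sizes

# ASCII letter, Latin-1 letter, BMP symbol, character outside the BMP
for ch in ("A", "é", "€", "😀"):
    print(repr(ch), byte_lengths(ch))
```

The characters "€" and "😀" cannot be encoded in ISO-8859-1 at all, which is why the helper records None for them rather than a size.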

Why Character Encoding Matters

The choice of encoding changes the byte representation of the text. Since compression works at the byte level, the compressed size and the compressed content are directly affected.

Example: Compressing the string "Hello, World!"
import zlib
# UTF-8 encoding
utf8_bytes = "Hello, World!".encode('utf-8')
compressed_utf8 = zlib.compress(utf8_bytes)
# UTF-16 encoding
utf16_bytes = "Hello, World!".encode('utf-16')
compressed_utf16 = zlib.compress(utf16_bytes)
print("UTF-8 Compressed Size:", len(compressed_utf8))
print("UTF-16 Compressed Size:", len(compressed_utf16))
Output (exact sizes may vary slightly between zlib builds)
UTF-8 Compressed Size: 21
UTF-16 Compressed Size: 33

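* UTF-8 is more compact in this case because it uses a single byte for each ASCII character, so the input is only 13 bytes.

* UTF-16 uses two bytes per character here, one of them a null byte ("\x00"), and Python's 'utf-16' codec also prepends a 2-byte byte order mark. The input is 28 bytes, which results in a larger compressed size.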
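
A 13-character string is too short to show compression working well: zlib adds a 2-byte header and a 4-byte checksum, so very short inputs usually come out larger than they went in. The sketch below compares the short string with a longer, repetitive one. The repeated sample text and the helper name sizes are assumptions made for illustration, and no exact compressed sizes are claimed.

```python
import zlib

short_text = "Hello, World!"
long_text = "Hello, World! " * 200  # repetitive sample input (assumption)

def sizes(text, codec):
    """Return (raw byte length, zlib-compressed byte length) for text."""
    raw = text.encode(codec)
    return len(raw), len(zlib.compress(raw))

for label, text in (("short", short_text), ("long", long_text)):
    for codec in ("utf-8", "utf-16"):
        raw_len, comp_len = sizes(text, codec)
        print(f"{label:5} {codec:6} raw={raw_len:5} compressed={comp_len}")
```

For the long input, both encodings should shrink substantially, and the gap between their compressed sizes is typically much smaller than the gap between their raw sizes, because the extra null bytes in UTF-16 are highly regular.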


Decompression and Charset Consistency

When you decompress, the original bytes are restored exactly, but you still need to decode those bytes back to text. The charset used to decode after decompression must match the one used to encode before compression.

# Decompress
decompressed_utf8 = zlib.decompress(compressed_utf8).decode('utf-8')
decompressed_utf16 = zlib.decompress(compressed_utf16).decode('utf-16')
print(decompressed_utf8)   # Output: Hello, World!
print(decompressed_utf16)  # Output: Hello, World!

If you try to use the wrong charset:

# This will raise a UnicodeDecodeError
zlib.decompress(compressed_utf16).decode('utf-8')
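
A mismatched charset does not always raise an exception. ISO-8859-1 accepts every possible byte value, so decoding UTF-8 bytes with it silently yields garbled text instead of an error. The following sketch shows this with the sample word "café", which is chosen here purely for illustration.

```python
import zlib

original = "café"
compressed = zlib.compress(original.encode("utf-8"))
raw = zlib.decompress(compressed)  # identical to the bytes that were compressed

wrong = raw.decode("iso-8859-1")  # no exception, but the text is garbled
right = raw.decode("utf-8")       # matches the encoding used before compression

print(repr(wrong))  # 'cafÃ©'
print(repr(right))  # 'café'
```

The two-byte UTF-8 sequence for "é" (c3 a9) is read as two separate ISO-8859-1 characters, which is why the failure is easy to miss in testing with ASCII-only data.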

Why?

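Because the byte structure is different:

* UTF-8: "Hello, World!" → "48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21"

* UTF-16: "Hello, World!" → "ff fe 48 00 65 00 6c 00 6c 00 6f 00 ..."

The leading "ff fe" is the byte order mark that Python's 'utf-16' codec writes on little-endian platforms (it is "fe ff" on big-endian ones). Neither 0xff nor 0xfe can appear anywhere in valid UTF-8, so the UTF-8 decoder fails on the very first byte.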
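
To inspect these byte sequences directly, a short sketch like the following can be used. It assumes Python 3.8 or later, since the separator argument of bytes.hex() was added in that version.

```python
text = "Hello, World!"

# 'utf-16' prepends a byte order mark in the platform's native byte order;
# 'utf-16-le' and 'utf-16-be' fix the byte order and write no mark.
for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-16"):
    print(f"{codec:9}", text.encode(codec).hex(" "))
```

Naming the byte order explicitly ('utf-16-le' or 'utf-16-be') gives the same bytes on every platform, which is usually what you want for data that is compressed on one machine and decompressed on another.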


Summary

Operation     | Encoding matters? | Why?
Compression   | Yes               | Compression works on bytes, and different charsets produce different byte streams.
Decompression | Yes               | Decompression itself only restores bytes, but those bytes must be decoded with the same charset that was used to encode the text before compression.
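
One way to keep the two sides consistent is to store the charset name next to the compressed data, so the reader never has to guess. The sketch below is a minimal illustration under that assumption. The container layout (one length byte, the codec name in ASCII, then the zlib stream) and the function names pack_text and unpack_text are hypothetical and not part of zlib or any standard format.

```python
import zlib

def pack_text(text, codec="utf-8"):
    """Compress text and prefix it with the codec name (hypothetical layout)."""
    name = codec.encode("ascii")
    if len(name) > 255:
        raise ValueError("codec name too long for a one-byte length field")
    return bytes([len(name)]) + name + zlib.compress(text.encode(codec))

def unpack_text(blob):
    """Reverse pack_text: read the codec name, decompress, then decode."""
    name_len = blob[0]
    codec = blob[1:1 + name_len].decode("ascii")
    return zlib.decompress(blob[1 + name_len:]).decode(codec)

blob = pack_text("Hello, World!", "utf-16")
print(unpack_text(blob))  # Hello, World!
```

In practice, many systems simply standardize on UTF-8 for all text instead of carrying the charset along, which avoids the extra header entirely.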