Hashing Text

Hashing algorithms work on bytes, not text. To hash text, you must first convert it to bytes. The character encoding used for that conversion determines the byte sequence and therefore the hash.

How Hashing Operates on Bytes

Hashing transforms an input (a message, file, or piece of text) into a fixed-size sequence of bytes called a hash value or digest, usually displayed as a hexadecimal string. Hashing algorithms like SHA-256, SHA-1, and MD5 perform mathematical operations directly on byte data, not on text.
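
As a brief illustration of the point above, the following sketch uses Python's standard hashlib module, which accepts only bytes-like input. It assumes Python 3; the variable names are illustrative only.

```python
import hashlib

# A hash function takes bytes and returns a fixed-size digest.
digest = hashlib.sha256(b"Hello").digest()
digest_size = len(digest)  # SHA-256 digests are always 32 bytes long

# Passing a str instead of bytes is rejected with a TypeError.
try:
    hashlib.sha256("Hello")
    str_rejected = False
except TypeError:
    str_rejected = True

print(digest_size, str_rejected)
```

The digest length stays the same no matter how long the input is; only the input type is restricted.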


Text to Bytes Conversion

To hash text (such as a string), first convert it into a sequence of bytes, because hashing algorithms work on raw byte data.

The conversion uses a character encoding (charset). Common choices:

* UTF-8: Variable-width (1 to 4 bytes per character); the most widely used encoding.

* UTF-16: Variable-width (2 or 4 bytes per character); used by Windows internally.

* ISO-8859-1: Single-byte; for Western European characters.
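
To make the differences between these encodings concrete, here is a small sketch that counts the bytes each one produces for a few sample characters. The helper name encoded_length and the sample characters are chosen for illustration; 'utf-16-le' is used so that no byte order mark is added to the count.

```python
samples = ["A", "é", "€", "😀"]
encodings = ["utf-8", "utf-16-le", "iso-8859-1"]

def encoded_length(text, encoding):
    """Return the encoded byte count, or None if the encoding cannot represent the text."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

for text in samples:
    lengths = {enc: encoded_length(text, enc) for enc in encodings}
    print(repr(text), lengths)
```

A result of None means the encoding has no representation for that character, which is the case for ISO-8859-1 with "€" and "😀".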


Why Character Encoding Matters

The charset you use to convert text into bytes directly affects the hash output. This is because the byte representation of a string changes with different encodings.

Example: Hashing the string "Hello"
import hashlib
# UTF-8 encoding
utf8_bytes = "Hello".encode('utf-8')
hash_utf8 = hashlib.sha256(utf8_bytes).hexdigest()
# UTF-16 encoding
utf16_bytes = "Hello".encode('utf-16')
hash_utf16 = hashlib.sha256(utf16_bytes).hexdigest()
print("UTF-8 Hash: ", hash_utf8)
print("UTF-16 Hash:", hash_utf16)
Output
UTF-8 Hash:  185f8db32271fe25f561a6fc938b2e264306ec304eda518007d1764826381969
UTF-16 Hash: 89709e7b9c4f6d357c4057fd083d65a45fe10fd119c39f31d669a9a76eb06d1e

Notice how the hash values are completely different? This is because:

* UTF-8 represents "Hello" as:

48 65 6c 6c 6f
  

* UTF-16 represents "Hello" as follows, where the leading ff fe is the byte order mark that Python's 'utf-16' codec prepends:

ff fe 48 00 65 00 6c 00 6c 00 6f 00
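
The byte sequences above can be reproduced with a short sketch like the one below. It assumes Python 3.8 or later (for the separator argument of bytes.hex) and uses the explicit 'utf-16-le' and 'utf-16-be' codecs, which fix the byte order and add no byte order mark.

```python
import hashlib

text = "Hello"

# Hex dump of the encoded bytes, one space between bytes.
utf8_hex = text.encode("utf-8").hex(" ")
utf16_le_hex = text.encode("utf-16-le").hex(" ")
utf16_be_hex = text.encode("utf-16-be").hex(" ")

print(utf8_hex)      # 48 65 6c 6c 6f
print(utf16_le_hex)  # 48 00 65 00 6c 00 6c 00 6f 00
print(utf16_be_hex)  # 00 48 00 65 00 6c 00 6c 00 6f

# Each distinct byte sequence yields its own SHA-256 digest.
digests = {
    enc: hashlib.sha256(text.encode(enc)).hexdigest()
    for enc in ("utf-8", "utf-16-le", "utf-16-be")
}
distinct_digests = len(set(digests.values()))
print(distinct_digests)
```

Even the two UTF-16 variants hash differently, so the byte order is as much a part of the choice as the encoding itself.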
  

Hash Consistency

To get consistent hash values, use the same character encoding when converting text to bytes. If you hash "Hello" with UTF-8 today, it should always be hashed with UTF-8 for future comparisons.
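
One way to enforce this is to route every text hash through a single helper that fixes the encoding in one place. The sketch below is only a suggestion; the function name hash_text and its UTF-8 default are assumptions made for illustration.

```python
import hashlib

def hash_text(text: str, encoding: str = "utf-8") -> str:
    """Encode text with a fixed, explicit encoding and return its SHA-256 hex digest."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()

# Hash stored earlier, computed with the default encoding (UTF-8).
stored = hash_text("Hello")

# Same text, same encoding: the hashes match.
matches = hash_text("Hello") == stored

# Same text, different encoding: the comparison fails.
mismatch = hash_text("Hello", encoding="utf-16-le") == stored

print(matches, mismatch)
```

Keeping the encoding as a named default makes the choice visible to callers while preventing accidental drift between the code that stores a hash and the code that later compares it.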


Hashing Files (Already Byte Data)

Files are already stored as bytes, so no encoding conversion is required. Open the file in binary mode ('rb') and hash its contents directly:

import hashlib
# Binary mode ('rb') returns the raw bytes with no decoding step
with open('example.txt', 'rb') as f:
    file_hash = hashlib.sha256(f.read()).hexdigest()
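
Reading the whole file with f.read() loads it into memory at once. For large files, a common alternative is to feed the hash object in chunks. The following is a minimal sketch of that approach; the function name hash_file and the 64 KiB chunk size are arbitrary choices for illustration.

```python
import hashlib

def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # f.read() returns b"" at end of file, which stops the iteration.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling hash_file('example.txt') gives the same digest as hashing the full contents in one call, because update() processes the chunks as one continuous byte stream.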