January 15, 2008

Non-US-ASCII Characters in Filenames within a Zip Archive

The Zip file format specification can be found at this URL:
http://www.pkware.com/documents/casestudies/APPNOTE.TXT

It has this to say about the character set used for filenames stored within a .zip:

The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.

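In practice, the code page typically used when storing filenames within a .zip is the OEM code page for the locale of the computer that created the archive. For the USA, this is code page 437. Western European Windows systems commonly default to code page 850 instead, which keeps the common accented letters (such as ü at 0x81) at the same byte values as 437. Note that the OEM code page is not the same as the ANSI code page.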

Here is a list of OEM code pages:

        437 OEM - United States
        737 OEM - Greek (formerly 437G)
        775 OEM - Baltic
        850 OEM - Multilingual Latin I
        852 OEM - Latin II
        855 OEM - Cyrillic (primarily Russian)
        857 OEM - Turkish
        858 OEM - Multilingual Latin I + Euro symbol
        860 OEM - Portuguese
        861 OEM - Icelandic
        862 OEM - Hebrew
        863 OEM - Canadian - French
        864 OEM - Arabic
        865 OEM - Nordic
        866 OEM - Russian
        869 OEM - Modern Greek
        874 ANSI/OEM - Thai (Windows-874)
        932 ANSI/OEM - Japanese, Shift-JIS
        936 ANSI/OEM - Simplified Chinese (PRC, Singapore)
        949 ANSI/OEM - Korean (Unified Hangeul Code)
        950 ANSI/OEM - Traditional Chinese (Taiwan; Hong Kong SAR, PRC)

If a .zip was written using some other code page, set the Zip.OemCodePage property to that code page. For example, one Chilkat customer tried to open a .zip whose filenames had been saved using Windows-1252. The problem was that Chilkat interpreted the bytes of each filename according to code page 437, when the bytes were really Windows-1252. To see the difference, compare the Character Chart for Code Page 437 with the Character Chart for Windows-1252.

The German lowercase u with umlaut (ü) has the byte value 0xFC (decimal 252) in Windows-1252, but in code page 437 the same character is 0x81 (decimal 129).
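
The byte values above can be checked with a short Python sketch. It is only an illustration using Python's built-in cp437 and cp1252 codecs, not Chilkat code, and the filename Müller.txt is an invented example.

```python
# Encode the same character under both code pages and compare the bytes.
u_umlaut = "\u00fc"  # lowercase u with umlaut

oem_bytes = u_umlaut.encode("cp437")    # OEM code page 437
ansi_bytes = u_umlaut.encode("cp1252")  # Windows-1252 (ANSI)
print(oem_bytes.hex(), ansi_bytes.hex())  # prints: 81 fc

# A filename stored as Windows-1252 bytes but read as code page 437
# comes out with a different character in place of the umlaut.
stored = "M\u00fcller.txt".encode("cp1252")  # hypothetical filename
misread = stored.decode("cp437")
print(ascii(misread))  # ascii() keeps the output safe on any console
```

The misread name contains the character that code page 437 assigns to 0xFC, which is exactly the kind of corruption a wrong code page produces.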

NOTE: The Chilkat .NET 2.0 pre-release (http://www.chilkatsoft.com/preRelease.asp) has been updated so that if a filename contains byte values in the decimal ranges 166-223 or 239-255 while the code page is 437, the component assumes that it should really use code page 1252. The reason is that in code page 437 these byte values map to punctuation and fraction symbols, box-drawing characters, and mathematical signs, which are highly unlikely to be part of a filename.
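
The fallback rule in the note can be sketched as follows. This is a minimal illustration of the rule as described, not Chilkat's actual implementation; the function name guess_filename_code_page and its parameters are hypothetical.

```python
def guess_filename_code_page(raw_name: bytes, configured: int = 437) -> int:
    """Pick a code page for decoding a stored filename (hypothetical helper)."""
    # The fallback only applies when the configured code page is 437.
    if configured != 437:
        return configured
    for b in raw_name:
        # Byte ranges named in the note: 166-223 and 239-255 (decimal).
        if 166 <= b <= 223 or 239 <= b <= 255:
            return 1252
    return 437


print(guess_filename_code_page(b"M\xfcller.txt"))  # 0xFC = 252 is inside 239-255
print(guess_filename_code_page(b"M\x81ller.txt"))  # 0x81 = 129 is outside both ranges
```

A name containing 0xFC is treated as Windows-1252, while a name containing 0x81 stays on code page 437, which matches the ü example given earlier.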

A .zip should use code page 437 for Western-European characters. If it does not, a reader that assumes code page 437 interprets the filename bytes incorrectly. Chilkat Zip allows you to remedy the situation by explicitly telling the component which code page to use for filenames. That is the purpose of the OemCodePage property.

So, if your .zip was created using Windows-1252 encoding for filenames, set the OemCodePage prior to opening the .zip:

...
zip.OemCodePage = 1252;
bool success = zip.OpenZip("myZip.zip");
...
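
For readers without the component at hand, the effect of choosing the code page can be imitated in Python. This is a rough analogue and not Chilkat's implementation: decode_zip_name and repair_misread_name are hypothetical helpers, and building the codec name as "cp" plus the number is an assumption that relies on Python shipping a codec under that name.

```python
def decode_zip_name(raw_name: bytes, oem_code_page: int = 437) -> str:
    """Decode stored filename bytes with an explicit code page (hypothetical helper)."""
    return raw_name.decode(f"cp{oem_code_page}")


def repair_misread_name(name: str, assumed: int = 437, actual: int = 1252) -> str:
    """Undo a wrong decode: recover the original bytes, then decode them again.

    Raises UnicodeDecodeError if a byte is undefined in the actual code page.
    """
    return name.encode(f"cp{assumed}").decode(f"cp{actual}")


raw = b"M\xfcller.txt"              # filename bytes written as Windows-1252
right = decode_zip_name(raw, 1252)  # analogous to setting OemCodePage = 1252
wrong = decode_zip_name(raw)        # default 437 gives the wrong character
repaired = repair_misread_name(wrong)
print(repaired == right)
```

The repair step works here because Python's cp437 codec defines all 256 byte values, so the original bytes can be recovered from the misread string without loss.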
