
Recover Text File w/ Mixed utf-8 and ANSI Characters


Question:
I have a text file that contains a mixture of utf-8 character data and ANSI character data. How can I convert just the utf-8 bytes to ANSI? (or instead convert just the ANSI bytes to utf-8?)

Answer:
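There is no perfect solution. The best you can do is scan the bytes one by one and pick off the sequences that are most likely to be utf-8. If the utf-8 characters are typically Western European characters with diacritics (i.e. accent marks), the best 2-byte sequence to look for is a 0xC2 or 0xC3 followed by a byte in the range 0x80 to 0xBF. Those two lead bytes encode the code points U+0080 through U+00FF, which is where the accented Western European letters live, and a utf-8 continuation byte is never outside 0x80 to 0xBF. As an example: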

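        (consider this to be pseudo-code; hData is an unsigned char buffer holding numBytes bytes)
                    int ii;
                    for (ii=0; ii<(numBytes-1); ii++)
                        {
                        if ((hData[ii] == 0xC3) || (hData[ii] == 0xC2))
                            {
                            // A utf-8 continuation byte is always in the range 0x80..0xBF
                            if ((hData[ii+1] >= 0x80) && (hData[ii+1] <= 0xBF))
                                {
                                // This is *probably* a utf-8 character
                                ii++;   // skip the continuation byte so it is not examined again
                                }
                            }
                        }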
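
Once a probable utf-8 pair is found, turning it into ANSI is a matter of folding the two bytes into one. The following is only a minimal sketch of that idea, not a definitive implementation. The function name mixed_to_ansi and its parameters are made up for illustration. It assumes "ANSI" means ISO-8859-1 or Windows-1252 (the two agree for 0xA0 to 0xFF, where the accented letters are; they differ for 0x80 to 0x9F), and it only handles the 0xC2/0xC3 sequences described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions for this sketch).
   Copies src to dst, collapsing each probable 2-byte utf-8 sequence
   (lead byte 0xC2 or 0xC3, continuation byte 0x80..0xBF) into the single
   byte with the same code point. Every other byte is assumed to be ANSI
   already and is copied unchanged.
   dst must have room for numBytes bytes. Returns the number of bytes written. */
size_t mixed_to_ansi(const unsigned char *src, size_t numBytes, unsigned char *dst)
{
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);

    while (in < numBytes) {
        unsigned char b = src[in];

        if ((b == 0xC2 || b == 0xC3) && in + 1 < numBytes &&
            src[in + 1] >= 0x80 && src[in + 1] <= 0xBF) {
            /* 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy (0x80..0xFF here) */
            dst[out++] = (unsigned char)(((b & 0x1F) << 6) | (src[in + 1] & 0x3F));
            in += 2;   /* consume both bytes of the pair */
        } else {
            dst[out++] = b;
            in++;
        }
    }
    return out;
}
```

The output is never longer than the input, so a destination buffer of the same size is enough.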

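It won't be perfect, but my guess is that it will be right about 98% of the time. A false positive needs ANSI text in which Â or Ã (0xC2 or 0xC3) is immediately followed by a character in the 0x80 to 0xBF range, and that combination is rare in ordinary text.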
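
For the other direction asked about in the question (converting just the ANSI bytes to utf-8), the same test can be used to decide which bytes to leave alone. Again, this is only a sketch under stated assumptions: the name mixed_to_utf8 is hypothetical, the stray high bytes are assumed to be ISO-8859-1 (Windows-1252 bytes in 0x80 to 0x9F would need a lookup table instead), and only the 0xC2/0xC3 utf-8 sequences are recognized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions for this sketch).
   Copies src to dst as utf-8. Probable utf-8 pairs (lead byte 0xC2 or 0xC3,
   continuation byte 0x80..0xBF) are copied as they are. Any other byte
   >= 0x80 is treated as an ISO-8859-1 character and expanded to two bytes.
   dst must have room for 2 * numBytes bytes. Returns the number of bytes written. */
size_t mixed_to_utf8(const unsigned char *src, size_t numBytes, unsigned char *dst)
{
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);

    while (in < numBytes) {
        unsigned char b = src[in];

        if ((b == 0xC2 || b == 0xC3) && in + 1 < numBytes &&
            src[in + 1] >= 0x80 && src[in + 1] <= 0xBF) {
            /* Probably utf-8 already: keep both bytes. */
            dst[out++] = b;
            dst[out++] = src[in + 1];
            in += 2;
        } else if (b >= 0x80) {
            /* ANSI byte: code point b becomes 110000xx 10yyyyyy. */
            dst[out++] = (unsigned char)(0xC0 | (b >> 6));
            dst[out++] = (unsigned char)(0x80 | (b & 0x3F));
            in++;
        } else {
            /* Plain us-ascii is identical in both encodings. */
            dst[out++] = b;
            in++;
        }
    }
    return out;
}
```

In the worst case every input byte doubles, which is why the destination buffer is sized at twice the input length.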