
FTPAPI - char set conversions, status report



Sender: Christian <chrisv5@xxxxxx>

Hello everyone,

I have some good news for you. I successfully managed to write a "proof of concept" program (which simply copies files with CCSID conversion), which flawlessly converts any SBCS (single byte character set) to/from UTF8 (1208) and UCS2 BE (13488). This despite the fact that the IBM documentation of iconv() is INCOMPLETE *and* WRONG and that there's a bug in the implementation.

The bug occurs when you use a three-byte input buffer, which is not very likely in practice but helped to test border cases. For anyone who wants to know: it'll treat the UTF8 representation of the "zero width no-break space" as an illegal character when converting to SBCS. It's fine when converting to UCS2 (though the character gets incorrectly dropped, which is very sad, because UCS2 BE would really benefit from a "byte order mark"). It's also fine if the input buffer is four bytes or larger.

It's actually two bugs in one. When it is supposed to convert the character, it actually drops it (which happens regardless of buffer size), and when it would be perfectly legal to drop the character (hey, a "zero width space" is quite similar to "no character"), it throws an error if, and only if, it is the only character to convert (it's hex EFBBBF, which just fits into a three-byte buffer by itself).
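
For reference, hex EFBBBF is simply U+FEFF (the "zero width no-break space", also used as the byte order mark) pushed through the ordinary three-byte UTF-8 pattern. A minimal sketch of that encoding follows; the function name utf8_encode is hypothetical, and it does not reject surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns how many were written.
   Surrogates (U+D800..U+DFFF) are not rejected in this sketch. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);

    if (cp < 0x80) {                     /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                  /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Calling utf8_encode(0xFEFF, buf) fills buf with EF BB BF and returns 3, which is why the character exactly fills a three-byte input buffer on its own.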

Why did I use such small input buffers? Well, actually I used tiny input buffers vs. large output (3 vs. 30000) and vice versa in order to test the handling of truncated multi-byte characters. Which works just fine!
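
The tiny-input-buffer test can be sketched roughly as below. This is not the proof-of-concept program itself, only an illustration under assumptions: the helper name convert_chunked is hypothetical, and the codeset names in the usage note ("UTF-8", "UCS-2BE") are the ones glibc and libiconv accept, whereas IBM i names its codesets differently. The idea is that when a multi-byte character is cut off at the end of the input buffer, iconv() fails with EINVAL and leaves the incomplete bytes unconsumed, so they are carried over to the front of the next chunk.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert src[0..srclen) from codeset `from` to
   codeset `to`, handing iconv() at most `chunk` input bytes per call.
   Returns the number of output bytes, or (size_t)-1 on any failure. */
size_t convert_chunked(const char *from, const char *to,
                       const char *src, size_t srclen, size_t chunk,
                       char *dst, size_t dstcap)
{
    char buf[16];               /* the deliberately tiny input buffer */
    size_t carry = 0;           /* bytes of a truncated character kept from last call */
    size_t pos = 0;             /* read position in src */
    char *out = dst;
    size_t outleft = dstcap;
    int failed = 0;

    assert(chunk > 0 && chunk <= sizeof buf);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    while (!failed && (pos < srclen || carry > 0)) {
        /* Top the buffer up behind the carried-over bytes. */
        size_t n = chunk - carry;
        if (n > srclen - pos)
            n = srclen - pos;
        memcpy(buf + carry, src + pos, n);
        pos += n;

        char *in = buf;
        size_t inleft = carry + n;
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
            /* EINVAL means "truncated character at end of input": fine,
               unless no new bytes could be added (buffer too small for the
               character, or the source ended mid-character).
               EILSEQ and E2BIG are treated as fatal in this sketch. */
            if (errno != EINVAL || n == 0)
                failed = 1;
        }

        /* Keep whatever iconv() left unconsumed for the next round. */
        memmove(buf, in, inleft);
        carry = inleft;
    }

    iconv_close(cd);
    return failed ? (size_t)-1 : dstcap - outleft;
}
```

With a chunk size of 3, convert_chunked("UTF-8", "UCS-2BE", ...) on the bytes 41 E2 82 AC 42 splits the three-byte Euro sign across two calls and should still produce 00 41 20 AC 00 42. With a chunk size of 2 the Euro sign can never fit, so the helper reports an error instead of looping forever.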

Anyway, the next step will be testing EBCDIC DBCS & Mixed Byte character sets, but it is damn tough as the IBM documentation on that topic really sucks. They got language IDs, code pages, language names, keyboard settings and stuff, but hardly ever mention CCSIDs. Oh well, I won't give up.

And now comes the bad news: apparently the IBM implementation of iconv() does not support (according to their shabby docs) non-EBCDIC Mixed Byte character sets on input. If this is true, this plainly sucks. It would mean that you cannot reliably GET a Japanese/Korean/Chinese file from an FTP server (if it is not stored as UTF8 or UCS2). It also means you cannot reliably PUT the same file from your local IFS to the server.

Here's the stanza which makes me wince:

"The only state-dependent encodings in which iconv() supports the updating of the conversion descriptor shift state is mixed-byte EBCDIC"

Please note that the official guides to iconv() *require* that the shift state be preserved for *any* mixed-byte code page between subsequent calls of iconv(). But then, we all know IBM...
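
To make "preserving shift state between calls" concrete, here is a small sketch. It says nothing about how the IBM implementation behaves; it assumes a glibc-style iconv() that knows the codeset name "ISO-2022-JP", which is picked here only as a familiar example of a non-EBCDIC, state-dependent mixed-byte encoding. The helper name decode_in_two_calls is hypothetical. The first call feeds only the escape sequence that switches to double-byte mode; the second call feeds the character bytes. Whether those bytes are read as one double-byte character or as two single-byte ones depends entirely on the state remembered in the conversion descriptor.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: decode `first` and then `second` with two separate
   iconv() calls, ISO-2022-JP to UTF-8. If keep_state is non-zero both calls
   share one descriptor; otherwise the second call gets a fresh descriptor,
   i.e. the shift state set up by `first` is thrown away.
   Returns output bytes written, -1 if the codeset is unavailable,
   -2 on a conversion error. */
long decode_in_two_calls(const char *first, size_t firstlen,
                         const char *second, size_t secondlen,
                         int keep_state, char *dst, size_t dstcap)
{
    assert(dst != NULL);

    iconv_t cd = iconv_open("UTF-8", "ISO-2022-JP");
    if (cd == (iconv_t)-1)
        return -1;

    char *out = dst;
    size_t outleft = dstcap;

    /* Call 1: typically just a shift (escape) sequence, no visible output. */
    char *in = (char *)first;
    size_t inleft = firstlen;
    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);

    if (rc != (size_t)-1 && !keep_state) {
        /* Simulate an implementation that forgets the shift state. */
        iconv_close(cd);
        cd = iconv_open("UTF-8", "ISO-2022-JP");
        if (cd == (iconv_t)-1)
            return -1;
    }

    /* Call 2: the bytes whose meaning depends on the shift state. */
    if (rc != (size_t)-1) {
        in = (char *)second;
        inleft = secondlen;
        rc = iconv(cd, &in, &inleft, &out, &outleft);
    }

    iconv_close(cd);
    return rc == (size_t)-1 ? -2 : (long)(dstcap - outleft);
}
```

For example, feeding the escape ESC $ B (hex 1B 24 42) and then the bytes 24 22: with the state kept, the second call should yield the three UTF-8 bytes of a single hiragana character (E3 81 82); with the state discarded, the same two bytes come out as the two ASCII characters $ and ". That difference is exactly what is lost if the shift state is not updated on input.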

I'll continue testing now and will keep you informed of the progress. As soon as my tests are completed (and I do not need to be too embarrassed about my code), I'll share my test program and ask everyone capable to test it with odd character sets. I am especially thinking of Frank Kolmann and Xue Jiguang here.

Cheers,
Christian
-----------------------------------------------------------------------
This is the FTPAPI mailing list.  To unsubscribe from the list send mail
to majordomo@xxxxxxxxxxxxx with the body: unsubscribe ftpapi mymailaddr
-----------------------------------------------------------------------