FTPAPI - char set conversions, status report
Sender: Christian <chrisv5@xxxxxx>
Hello everyone,
I have some good news for you. I managed to write a "proof of concept"
program that simply copies files with CCSID conversion, and it
flawlessly converts any SBCS (single byte character set) to/from
UTF8 (1208) and UCS2 BE (13488). This despite the fact that the IBM
documentation of iconv() is INCOMPLETE *and* WRONG and that there's a
bug in the implementation.
The bug occurs when you use a three-byte input buffer, which is not very
likely in practice but helped to test border cases. For anyone who wants
to know: it'll treat the UTF8 representation of the "zero width no-break
space" as an illegal character when converting to SBCS. It's fine when
converting to UCS2 (though the character gets incorrectly dropped, which
is very sad, because UCS2 BE would really benefit from a "byte order
mark"). It's also fine if the input buffer is four bytes or larger.
It's actually two bugs in one. When it is supposed to convert the
character, it actually drops it (which happens regardless of buffer
size), and when it would be perfectly legal to drop the character (hey,
a "zero width space" is quite similar to "no character"), it throws an
error if, and only if, it is the only character to convert (it's hex
EFBBBF, which just fits into a three-byte buffer by itself).
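For reference, here is how those three bytes decode. This is only a
minimal sketch in C; the helper name decode_utf8_3 is made up for
illustration and is not part of iconv() or FTPAPI. A three-byte UTF8
sequence carries 4 + 6 + 6 payload bits, and EF BB BF works out to
U+FEFF, the zero width no-break space (also used as byte order mark).

```c
#include <assert.h>
#include <stddef.h>

/* Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
 * Returns the code point, or -1 if the bytes are not a well-formed
 * three-byte sequence. Hypothetical helper, for illustration only. */
long decode_utf8_3(const unsigned char *p)
{
    assert(p != NULL);

    /* Check the lead byte and the two continuation bytes. */
    if ((p[0] & 0xF0) != 0xE0 ||
        (p[1] & 0xC0) != 0x80 ||
        (p[2] & 0xC0) != 0x80)
        return -1;

    /* Concatenate the 4 + 6 + 6 payload bits. */
    long cp = ((long)(p[0] & 0x0F) << 12) |
              ((long)(p[1] & 0x3F) << 6)  |
               (long)(p[2] & 0x3F);

    /* Reject overlong encodings and UTF-16 surrogates. */
    if (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

    return cp;
}
```

Passing the bytes EF BB BF to decode_utf8_3() yields 0xFEFF.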
Why did I use such small input buffers? Well, actually I used tiny input
buffers vs. large output (3 vs. 30000) and vice versa in order to test
the handling of truncated multi-byte characters. Which works just fine!
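The small-input side of such a test can be sketched roughly like this.
It is not the test program described above, just a minimal illustration
assuming a POSIX-style iconv() that reports a multi-byte character cut
off at the end of the input with errno EINVAL and leaves the unconsumed
bytes for the caller to carry over. The function name convert_chunked
and the 16-byte scratch buffer are assumptions, and so are any converter
names you pass in; IBM i identifies converters by CCSID rather than by
names like "UTF-8".

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert src (srclen bytes) from encoding `from` to encoding `to`,
 * handing iconv() at most `chunk` input bytes per call. Returns the
 * number of bytes written to dst, or (size_t)-1 on any failure.
 * `chunk` must be at least as long as the longest character in src. */
size_t convert_chunked(const char *to, const char *from,
                       const char *src, size_t srclen,
                       char *dst, size_t dstcap, size_t chunk)
{
    char buf[16];                 /* tiny input buffer */
    assert(chunk >= 1 && chunk <= sizeof buf);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *out = dst;
    size_t outleft = dstcap;
    size_t held = 0;              /* bytes carried over from last call */
    size_t pos = 0;               /* read position in src */
    int ok = 1;

    while (ok && (pos < srclen || held > 0)) {
        /* Top the buffer up behind the carried-over bytes. */
        size_t take = chunk - held;
        if (take > srclen - pos)
            take = srclen - pos;
        memcpy(buf + held, src + pos, take);
        pos += take;

        char *in = buf;
        size_t inleft = held + take;
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
            if (errno != EINVAL)
                ok = 0;           /* EILSEQ, E2BIG, ... */
            else if (take == 0)
                ok = 0;           /* truncated and no more input to add */
        }
        /* Keep the unconsumed tail (a truncated character, if any). */
        memmove(buf, in, inleft);
        held = inleft;
    }

    /* Flush a pending shift sequence; a no-op for stateless encodings. */
    if (ok && iconv(cd, NULL, NULL, &out, &outleft) == (size_t)-1)
        ok = 0;

    iconv_close(cd);
    return ok ? dstcap - outleft : (size_t)-1;
}
```

With a chunk size of 3, a two-byte or three-byte UTF8 character that
straddles a chunk boundary is simply completed on the next call. The
opposite case (large input, tiny output, signalled by E2BIG) is not
handled here and is treated as a failure.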
Anyway, the next step will be testing EBCDIC DBCS & Mixed Byte character
sets, but it is damn tough, as the IBM documentation on that topic really
sucks. They've got language IDs, code pages, language names, keyboard
settings and stuff, but hardly ever mention CCSIDs. Oh well, I won't
give up.
And now comes the bad news: apparently the IBM implementation of iconv()
does not support (according to their shabby docs) non-EBCDIC Mixed Byte
character sets on input. If this is true, this plainly sucks. It would
mean that you cannot reliably GET a Japanese/Korean/Chinese file from
an FTP server (if it is not stored as UTF8 or UCS2). It also means you
cannot reliably PUT the same file from your local IFS to the server.
Here's the stanza which makes me wince:
"The only state-dependent encodings in which iconv() supports the
updating of the conversion descriptor shift state is mixed-byte EBCDIC"
Please note that the official guides to iconv() *require* that shift
state be preserved for *any* mixed-byte code page between subsequent
calls of iconv(). But then, we all know IBM...
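To make it concrete why the shift state matters, here is a small sketch
in C. It is an illustration under assumptions, not something verified
on IBM i: it assumes a POSIX-style iconv() that offers a stateful,
non-EBCDIC encoding under the name "ISO-2022-JP" (glibc-style naming),
and the function name convert_pieces is made up. The input is fed in
several pieces, either through one conversion descriptor (shift state
carried over) or through a fresh descriptor per piece (shift state
lost).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert src piece by piece; lens[i] is the length of piece i.
 * reuse_cd != 0: one descriptor for all pieces (state preserved).
 * reuse_cd == 0: a new descriptor per piece (state thrown away).
 * Returns bytes written to dst, -1 on a conversion error, or -2 if
 * the converter could not be opened on this platform. */
long convert_pieces(const char *to, const char *from,
                    const char *src, const size_t *lens, size_t npieces,
                    int reuse_cd, char *dst, size_t dstcap)
{
    char buf[32];
    char *out = dst;
    size_t outleft = dstcap;
    iconv_t cd = (iconv_t)-1;
    long rc = 0;

    for (size_t i = 0; i < npieces && rc == 0; i++) {
        assert(lens[i] <= sizeof buf);
        if (cd == (iconv_t)-1) {
            cd = iconv_open(to, from);
            if (cd == (iconv_t)-1)
                return -2;
        }

        /* iconv() wants a non-const input pointer, so copy the piece. */
        memcpy(buf, src, lens[i]);
        src += lens[i];

        char *in = buf;
        size_t inleft = lens[i];
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
            rc = -1;

        if (!reuse_cd) {          /* forget the shift state */
            iconv_close(cd);
            cd = (iconv_t)-1;
        }
    }

    if (cd != (iconv_t)-1)
        iconv_close(cd);
    return rc == 0 ? (long)(dstcap - outleft) : -1;
}
```

Take the eight bytes 1B 24 42 30 21 1B 28 42 split as 3 + 2 + 3: a
shift into JIS X 0208, one two-byte character, and a shift back to
ASCII. With the state preserved, the middle piece is one Japanese
character. Without it, the same two bytes are read as the ASCII
characters "0!", which is exactly the kind of silent corruption to
expect if an implementation does not keep the shift state between
calls.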
I'll continue testing now and will keep you informed of the progress. As
soon as my tests are completed (and I do not need to be too embarrassed
about my code) I'll share my test program and ask everyone capable to
test it with odd character sets. I am especially thinking of Frank
Kolmann and Xue Jiguang here.
Cheers,
Christian
-----------------------------------------------------------------------
This is the FTPAPI mailing list. To unsubscribe from the list send mail
to majordomo@xxxxxxxxxxxxx with the body: unsubscribe ftpapi mymailaddr
-----------------------------------------------------------------------