RE: Problem parsing XML with EXPAT



Thank you Scott,
I appreciate you taking the time to explain that and I'm glad you did. I
thought the problem was with the CCSID I was using. The fact that my
receiver routines were being called should have been my first clue that
the data being passed to EXPAT was in an acceptable format and that I
just wasn't handling the data being returned correctly.
You were right about me using %str() to decode the data being returned.
Changing my program to receive the string in the C (UCS-2) data type
solved my problem.
Thanks again, I really appreciate your help.
Griz


-----Original Message-----
From: ftpapi-bounces@xxxxxxxxxxxxxxxxxxxxxx
[mailto:ftpapi-bounces@xxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Scott
Klement
Sent: Wednesday, November 28, 2007 11:19 PM
To: HTTPAPI and FTPAPI Projects
Subject: Re: Problem parsing XML with EXPAT

Hi Griz,

Grab a sandwich and relax, while I give you the long-winded
explanation...

It's important to understand that XML supports many different character 
sets, and that XML is not specifically designed for i5/OS.

i5/OS has this really neat feature where every file in its file systems
has a "ccsid" in the object description.  This way, when you save a file
to disk, you can store the CCSID that represents its character set with
the file, and other applications/users/utilities, etc. can read that
CCSID and know what character set the data is in.

However, AFAIK, no other computer system has this feature.  Windows, 
Unix, Mac, etc... none of them have this CCSID feature.  Therefore, it 
can't be used for the XML standard.

So the XML standard says that XML documents will designate their 
character set by putting an "encoding" in the opening <?xml> tag.  And 
if there's no encoding there, a parser should just ASSUME that the data 
is in UTF-8 format.

<?xml version="1.0" encoding="iso-8859-1"?>

So if your document starts with something containing "encoding", like 
the preceding example, then that encoding tells the XML parser what 
format it's in.  If not, then it's assumed to be in the UTF-8 flavor of 
unicode.
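
As a rough illustration of that rule (not part of the original exchange),
here is a minimal Python sketch. It assumes Python's standard
xml.parsers.expat module, which wraps Expat; the helper name parse_text
and the sample documents are made up for the example. The same x'e9' byte
is accepted when the document declares ISO-8859-1 and rejected when
nothing is declared, because the parser then falls back to UTF-8.

```python
import xml.parsers.expat


def parse_text(xml_bytes):
    """Parse raw bytes with Expat and return the character data found."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()  # no encoding override
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)  # True = this is the final chunk
    return "".join(chunks)


# x'e9' is an accented e in ISO-8859-1, but is not valid UTF-8 on its own.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?><name>Ren\xe9</name>'
undeclared = b'<name>Ren\xe9</name>'

# The declaration tells Expat how to decode x'e9'.
print(parse_text(declared))

try:
    parse_text(undeclared)  # no declaration, so UTF-8 is assumed
except xml.parsers.expat.ExpatError as err:
    print("rejected:", err)
```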

But -- there's still a problem.   Did you spot it?

HOW THE HECK CAN AN XML PARSER READ THE "ENCODING" TAG IF IT DOESN'T 
ALREADY KNOW WHAT CHARACTER SET THE DATA IS IN?  Think about that.  It's
a catch-22.  The parser has to know which character set it's reading in 
order to be able to understand the "encoding" attribute, and therefore 
discover the character set.

So the symbols that make up the opening XML tag must always have 
particular hex codes.

Fortunately, it's pretty easy.  The four basic encodings of XML are 
US-ASCII, ISO-8859-1, UTF-8 and UTF-16.   In the first three (US-ASCII, 
ISO-8859-1 and UTF-8) the hex code of the < character is always x'3c'. 
The hex code of the ? character is always x'3f'.  So there's no 
conflict.  In UTF-16, the < character is always either x'003c' (big 
endian) or x'3c00' (little endian).  So it's pretty easy for a program 
to read the first two bytes of a file and determine enough about the 
encoding to be able to read the opening <?xml> tag to determine what the
actual encoding is.
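
The first-two-bytes check can be sketched roughly as follows. This is a
simplified illustration in Python, not how Expat itself is written: the
function name sniff_encoding_family is made up, and the sketch assumes
the document starts directly with the < character (no byte order mark,
no leading whitespace).

```python
def sniff_encoding_family(first_bytes):
    """Guess the encoding family from the first bytes of an XML document.

    Simplified: assumes the document starts with '<' and has no
    byte order mark in front of it.
    """
    if first_bytes[:2] == b"\x00\x3c":
        return "UTF-16 big endian"
    if first_bytes[:2] == b"\x3c\x00":
        return "UTF-16 little endian"
    if first_bytes[:1] == b"\x3c":
        return "ASCII-compatible (US-ASCII, ISO-8859-1 or UTF-8)"
    return "unknown"


decl = '<?xml version="1.0"?>'
for codec in ("ascii", "latin-1", "utf-8", "utf-16-be", "utf-16-le"):
    print(codec, "->", sniff_encoding_family(decl.encode(codec)))
```

Once the family is known, the parser can read the rest of the opening
tag and pick up the exact encoding name from it.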

With those rules, the XML standard will work with any flavor of ASCII or
single or double-byte Unicode without having to know the encoding before
opening the file.   However, it'll NEVER work with EBCDIC.  A proper XML
parser does not work with EBCDIC data.
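
To make the EBCDIC point concrete, here is a small sketch using Python's
xml.parsers.expat module (a wrapper around Expat). Code page cp037 is
only a stand-in for whatever EBCDIC CCSID a job might use, and the
helper parse_text and the sample document are invented for the example.
The EBCDIC bytes are rejected, and the same document parses once it has
been translated first.

```python
import xml.parsers.expat


def parse_text(xml_bytes):
    """Parse raw bytes with Expat and return the character data found."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)  # True = this is the final chunk
    return "".join(chunks)


doc = '<?xml version="1.0"?><greeting>Hello</greeting>'

# cp037 is an assumption: one common EBCDIC code page, used as a stand-in.
ebcdic = doc.encode("cp037")
print(ebcdic[:2].hex())  # 4c6f, not the 3c3f an XML parser looks for

try:
    parse_text(ebcdic)
except xml.parsers.expat.ExpatError as err:
    print("EBCDIC input rejected:", err)

# Translate to an ASCII-compatible encoding before handing it to Expat.
translated = ebcdic.decode("cp037").encode("utf-8")
print(parse_text(translated))  # Hello
```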

Whew.  Like I said... that was long winded.  But it explains why you 
have to translate your data to ASCII to parse it with Expat.

Note that IBM's XML parser that's built into ILE RPG (V5R4 feature) does
work with EBCDIC.  I talked to Barbara Morris about this, and she said 
that this built-in XML parser always uses the CCSID of your job for 
alphanumeric fields, UCS2 for UCS-2 data type fields, and the CCSID that
the stream file is tagged with when reading a file.  Technically, this 
behavior is wrong, because it completely ignores the "encoding" 
attribute -- which violates the XML spec.  It also means that if you 
transfer a file from another system, you have to make darned sure that 
you set the CCSID correctly on the file.  And I have no idea how you'd 
be sure of that without writing your own code to interrogate the file! 
But all of this is moot since you're using Expat, and Expat does 
respect the XML standard. (It's just the built-in one in RPG that does 
not.)

Okay... your 2nd question about getting blanks back from Expat... that 
one I don't understand.  I guess it could potentially be because you 
haven't specified an encoding, so it'll default to UTF-8.  But if your 
data is in ASCII rather than UTF-8, you might have problems with some 
characters.   However, this probably ISN'T it, since most characters in 
ASCII are the same as UTF-8, and therefore this wouldn't happen on every
element.

The only other thing I can think of is that your handler procedures are 
written incorrectly, and are using %STR() to decode the data sent to 
them.  This build of Expat returns 2-byte Unicode, where the first byte 
of each ordinary character is x'00', so only less common strings (such 
as some accented characters, Asian alphabets, etc.) would start with 
anything other than x'00'.  And %STR() uses x'00' to denote 
"end-of-data"... so your data would always come back as zero-length 
strings.

To fix it, you'll either have to compile Expat to output UTF-8 (which I 
don't recommend, since UTF-8 is not easy to deal with in RPG), or 
you'll have to receive the strings using RPG's C (UCS-2) data type like 
I do in the sample programs included with the Expat download.
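
The zero-length symptom and the UCS-2 fix can be imitated outside RPG.
The Python sketch below is only an analogy: c_string is a made-up helper
that reads up to the first x'00' byte, roughly the way %STR() does, and
the received bytes are a hypothetical example of what a handler might be
given by an Expat build that outputs 2-byte Unicode (big endian is
assumed here).

```python
def c_string(buffer):
    """Imitate a null-terminated read: stop at the first x'00' byte."""
    return buffer.split(b"\x00", 1)[0]


# Hypothetical handler input: "Hello" as 2-byte big-endian Unicode.
received = "Hello".encode("utf-16-be")

print(received.hex())                # 00480065006c006c006f
print(len(c_string(received)))       # 0, the read stops at the first byte
print(received.decode("utf-16-be"))  # Hello, read as 2-byte characters
```

Treating the buffer as 2-byte characters is the Python counterpart of
receiving it into a UCS-2 field instead of running it through %STR().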


-----------------------------------------------------------------------
This is the FTPAPI mailing list.  To unsubscribe, please go to:
http://www.scottklement.com/mailman/listinfo/ftpapi
-----------------------------------------------------------------------