Re: Problem parsing XML with EXPAT



Hi Griz,

Grab a sandwich and relax, while I give you the long-winded explanation...

It's important to understand that XML supports many different character 
sets, and that XML is not specifically designed for i5/OS.

i5/OS has this really neat feature where every file in its file systems 
has a "ccsid" in the object description.  This way, when you save a file 
to disk, you can store the CCSID that represents its character set with 
the file, and other applications, users, utilities, etc. can read that 
CCSID and know what character set the data is in.

However, AFAIK, no other computer system has this feature.  Windows, 
Unix, Mac, etc... none of them have this CCSID feature.  Therefore, it 
can't be used for the XML standard.

So the XML standard says that XML documents will designate their 
character set by putting an "encoding" in the opening <?xml> tag.  And 
if there's no encoding there, a parser should just ASSUME that the data 
is in UTF-8 format.

<?xml version="1.0" encoding="iso-8859-1"?>

So if your document starts with something containing "encoding", like 
the preceding example, then that encoding tells the XML parser what 
format it's in.  If not, then it's assumed to be in the UTF-8 flavor of 
Unicode.
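
To make that concrete, here is a minimal sketch using Python's standard 
library binding to Expat (Python is only used for illustration here; it 
is not what you'd run from RPG).  The helper name parse_text and the 
sample documents are made up for this example.  It parses the same bytes 
twice, once with an "encoding" and once without:

```python
import xml.parsers.expat


def parse_text(data: bytes) -> str:
    """Parse an XML document with Expat and return all of its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat may deliver text in several pieces, so collect them all.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


# x'E9' is e-acute in ISO-8859-1, but on its own it is not valid UTF-8.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?><name>Ren\xe9</name>'
undeclared = b'<?xml version="1.0"?><name>Ren\xe9</name>'

# The declaration tells Expat how to interpret x'E9'.
print(ascii(parse_text(declared)))

try:
    # No declaration, so Expat reads the bytes as UTF-8 and rejects x'E9'.
    parse_text(undeclared)
except xml.parsers.expat.ExpatError as err:
    print("parse failed:", err)
```

The first document parses cleanly; the second is rejected, because 
without an "encoding" the parser falls back to UTF-8.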

But -- there's still a problem.   Did you spot it?

HOW THE HECK CAN AN XML PARSER READ THE "ENCODING" TAG IF IT DOESN'T 
ALREADY KNOW WHAT CHARACTER SET THE DATA IS IN?  Think about that.  It's 
a catch-22.  The parser has to know which character set it's reading in 
order to be able to understand the "encoding" attribute, and therefore 
discover the character set.

The way out is that the symbols that make up the opening XML tag must 
always have particular, predictable hex codes, so a parser can recognize 
them before it knows the encoding.

Fortunately, it's pretty easy.  The four basic encodings of XML are 
US-ASCII, ISO-8859-1, UTF-8 and UTF-16.   In the first three (US-ASCII, 
ISO-8859-1 and UTF-8) the hex code of the < character is always x'3c'. 
The hex code of the ? character is always x'3f'.  So there's no 
conflict.  In UTF-16, the < character is always either x'003c' (big 
endian) or x'3c00' (little endian).  So it's pretty easy for a program 
to read the first two bytes of a file and determine enough about the 
encoding to be able to read the opening <?xml> tag to determine what the 
actual encoding is.
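
A rough sketch of that "peek at the first two bytes" idea, in Python. 
The function sniff_encoding_family is hypothetical (it is not part of 
Expat or any library), it assumes the document begins with <?xml, and it 
ignores byte order marks, which a real parser would also look at:

```python
def sniff_encoding_family(data: bytes) -> str:
    """Guess the encoding family from the first two bytes of an XML document.

    Hypothetical helper: it only recognizes a document that starts with
    the '<' (x'3c') of an opening <?xml tag, and it ignores byte order marks.
    """
    head = data[:2]
    if head == b"\x00\x3c":
        return "UTF-16, big endian"
    if head == b"\x3c\x00":
        return "UTF-16, little endian"
    if head == b"\x3c\x3f":
        return "ASCII-compatible (US-ASCII, ISO-8859-1 or UTF-8)"
    return "unknown"


doc = '<?xml version="1.0"?><CustomerRequest/>'

# cp037 is one EBCDIC code page; it is used here only as an example.
for codec in ("ascii", "utf-16-be", "utf-16-le", "cp037"):
    print(codec, "->", sniff_encoding_family(doc.encode(codec)))
```

Note that the EBCDIC bytes fall through to "unknown": in that code page 
the < character is not x'3c', so none of the patterns match.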

With those rules, the XML standard will work with any flavor of ASCII or 
single or double-byte Unicode without having to know the encoding before 
opening the file.   However, it'll NEVER work with EBCDIC.  A proper XML 
parser does not work with EBCDIC data.

Whew.  Like I said... that was long winded.  But it explains why you 
have to translate your data to ASCII to parse it with Expat.
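
You can see the same thing happen outside of RPG.  This is only a sketch 
using Python's binding to Expat; the helper start_tags is made up, and I'm 
assuming CCSID 37 (Python's "cp037" codec) for the EBCDIC data, which may 
not match your job's CCSID:

```python
import xml.parsers.expat

xml_text = ('<?xml version="1.0"?><CustomerRequest>'
            '<Customer>6515555555</Customer></CustomerRequest>')


def start_tags(data: bytes) -> list:
    """Parse a document with Expat and return the element names it reports."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(data, True)
    return names


# Assumption: the data arrives in EBCDIC code page 37.
ebcdic = xml_text.encode("cp037")

try:
    start_tags(ebcdic)
except xml.parsers.expat.ExpatError as err:
    print("EBCDIC input rejected:", err)

# Translate EBCDIC to ASCII first, then hand it to Expat.
ascii_bytes = ebcdic.decode("cp037").encode("ascii")
print(start_tags(ascii_bytes))
```

The EBCDIC bytes are rejected, and the translated copy parses normally.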

Note that IBM's XML parser that's built into ILE RPG (V5R4 feature) does 
work with EBCDIC.  I talked to Barbara Morris about this, and she said 
that this built-in XML parser always uses the CCSID of your job for 
alphanumeric fields, UCS2 for UCS-2 data type fields, and the CCSID that 
the stream file is tagged with when reading a file.  Technically, this 
behavior is wrong, because it completely ignores the "encoding" 
attribute -- which violates the XML spec.  It also means that if you 
transfer a file from another system, you have to make darned sure that 
you set the CCSID correctly on the file.  And I have no idea how you'd 
be sure of that without writing your own code to interrogate the file! 
But all of this is moot since you're using Expat, and Expat does 
respect the XML standard. (It's just the built-in one in RPG that does 
not.)

Okay... your 2nd question about getting blanks back from Expat... that 
one I don't understand.  I guess it could potentially be because you 
haven't specified an encoding, so it'll default to UTF-8.  But if your 
data is in ASCII rather than UTF-8, you might have problems with some 
characters.   However, this probably ISN'T it, since most characters in 
ASCII are the same as UTF-8, and therefore this wouldn't happen on every 
element.

The only other thing I can think of is that your handler procedures are 
written incorrectly, and are using %STR() to decode the data sent to 
them.  This build of Expat passes strings to your handlers as double-byte 
Unicode (UCS-2), so for ordinary characters the first byte is x'00'.  
Only less common strings (such as Asian alphabets and other characters 
beyond the Latin-1 range) would start with anything other than x'00'.  
And %STR() uses x'00' to denote "end-of-data"... so your data would 
always come back as zero-length strings.

To fix it, you'll either have to compile Expat to output UTF-8 (which I 
don't recommend, since UTF-8 is not easy to deal with in RPG), or 
you'll have to receive the strings using RPG's C (UCS-2) data type like 
I do in the sample programs included with the Expat download.
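
Here's a small illustration of the x'00' problem, again in Python rather 
than RPG.  The function c_string is a hypothetical stand-in for a 
null-terminated read like %STR(), and I'm assuming the parser returns 
big-endian double-byte Unicode:

```python
def c_string(buffer: bytes) -> bytes:
    """Return the bytes before the first x'00', like a null-terminated read."""
    return buffer.split(b"\x00", 1)[0]


# Assumption: element names come back as big-endian double-byte Unicode.
element_name = "Customer".encode("utf-16-be")

# Every character is two bytes, and the first byte of each one is x'00'.
print(element_name.hex())

# A null-terminated read stops at the very first byte.
print(c_string(element_name))              # b''

# Treating the buffer as double-byte data recovers the name.
print(element_name.decode("utf-16-be"))    # Customer
```

That's the zero-length string: the read stops before the first 
character.  Receiving the data as UCS-2 avoids it.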


Grizzly Malchow wrote:
> Forgive me if I'm not posting this to the correct list, but I figured I
> would start here since I downloaded EXPAT with HTTPAPI.
> 
> I am having trouble parsing some real simple XML. 
> I have a web page that posts the following XML to my RPG CGI program.
> 
> <?xml version="1.0"?>
> <CustomerRequest>
> <Customer>6515555555</Customer>
> </CustomerRequest>
> 
> When I read STDIN I can see that the XML is being posted as follows: 
> <?xml version="1.0"?><CustomerRequest><Customer>6515555555</Customer></CustomerRequest>
> 
> When I passed the XML string to XML_Parse I receive the error: not
> well-formed (invalid token)
> 
> I changed my program to convert the data coming from STDIN to ASCII
> prior to calling XML_Parse and I am able to get to the procedure I
> specified to handle the start of the element, but the element being
> returned is blank. 
> Does anyone have an idea as to what could be causing this? I'm sure I'm
> missing something, but I don't know what.
> Thanks in advance,
> Griz
> 
> 
> -----------------------------------------------------------------------
> This is the FTPAPI mailing list.  To unsubscribe, please go to:
> http://www.scottklement.com/mailman/listinfo/ftpapi
> -----------------------------------------------------------------------
