Monday, February 21, 2005

An old XML friend

A coworker of mine noted that we had this bug: our system would collapse runs of multiple embedded spaces in a filename into a single space. The culprit? XML. We pass the filenames to a backend server via XML, and XML is not space-preserving for ordinary parsed character data (PCDATA) inside elements (between start and end tags). There are three solutions:

  1. Escape the spaces
  2. Use CDATA
  3. Write a space-preserving DTD

Of these, I prefer the second, using CDATA. It is the simplest to implement and produces the most readable XML, since one does not need to mentally translate the escaped characters. However, it does require that the receiving XML parser understand more than just tags and attributes (and some hand-written parsers do not, in fact, do any more than just this). The first, escaping the spaces, is probably the most portable but requires work in properly escaping anything in the data that might cause trouble.
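As a rough illustration of the first two options, here is a minimal Python sketch. The helper names `as_escaped` and `as_cdata` and the `<file>` element are hypothetical, chosen only for this example, and the standard-library parser stands in for whatever parser sits on the receiving end.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_escaped(name: str) -> str:
    """Option 1: escape markup characters and write each space as &#32;."""
    return "<file>" + escape(name, {" ": "&#32;"}) + "</file>"


def as_cdata(name: str) -> str:
    """Option 2: wrap the name in CDATA, splitting any literal ']]>'."""
    safe = name.replace("]]>", "]]]]><![CDATA[>")
    return "<file><![CDATA[" + safe + "]]></file>"


name = "my   report  & notes.txt"
for xml_text in (as_escaped(name), as_cdata(name)):
    # Parsing either form should hand back the name with every space intact.
    assert ET.fromstring(xml_text).text == name
```

The escaped form is harder to read on the wire, which is the readability cost mentioned above. The CDATA form stays readable but needs the one special case for `]]>` inside the data.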

The third, writing a space-preserving DTD, is the most interesting. In fact, it is probably the most correct solution of all from the perspective of elegance and clarity, but requires the most support from the receiving end. Caveat emptor.

8 comments:

Anonymous said...

Are you sure about this?
XML parsers are required to normalise spaces, tabs, CR and LF in attribute values but should not do so in elements. Non-validating parsers are required to report all whitespace as significant. The parser must normalise end-of-line sequences but should not modify spaces or tabs in elements.
Validating parsers can report spaces between elements as insignificant but should not normalise them.
What parser are you using?
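One way to check the behaviour this comment describes is to run a small test against a known parser. The sketch below uses Python's standard-library `xml.etree.ElementTree` (built on expat) purely as a stand-in; the `<file>` element and `name` attribute are made up for the example, and other parsers may behave differently.

```python
import xml.etree.ElementTree as ET

# Element content: runs of spaces and tabs should come back untouched.
elem = ET.fromstring("<file>  a   b\t.txt </file>")
assert elem.text == "  a   b\t.txt "

# Attribute value: a literal tab and newline are each normalised to a space.
attr = ET.fromstring('<file name="a\tb\nc.txt"/>')
assert attr.get("name") == "a b c.txt"
```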

Brian Oxley said...

I was less exact in my terminology than I should have been. I was referring to parsed character data. The code I was helping with passed argument strings as PCDATA, and had a problem with multiple spaces being compressed into a single space.

Anonymous said...

Yes, I understood that. However, the parser is not allowed to normalise white space in an element (it is required to normalise line endings, expand CDATA sections and expand entity references but it must not mess with spaces).

It sounds like you have a broken parser (it's not a Windows system by any chance is it?).

Brian Oxley said...

Ah, yes I see. I understand from the developer who asked me about the problem that he is using a Microsoft parser. However, we also need to preserve whitespace at the beginnings and endings of data as well as internally. You ask interesting questions.

Anonymous said...

The parser should preserve spaces at the beginning and end of the element as well. However, Microsoft has a history of taking a rather relaxed view of the XML spec rules on whitespace handling. I'm rather surprised that a modern Microsoft parser would be this much out of kilter. Are you sure that you are not using an attribute to hold the file path?

Brian Oxley said...

Yep, not an attribute. Why would a parser preserve space for plain text between tags?

Anonymous said...

A non-validating parser has to treat all spaces as significant. A validating parser is told by the DTD whether spaces are significant or not. Both have to report all the spaces (unless they are in attributes), but a validating parser reports spaces the DTD says are not significant as ignorable (e.g. it uses the ignorableWhitespace callback in SAX).
Microsoft has a history of ignoring white space between tags. They got it wrong early on and didn't want to break code which depended on their mistake.
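The non-validating case can be sketched with Python's `xml.sax`, assuming its bundled expat-based reader, which does not validate. The `WhitespaceLog` handler and the `<list>`/`<item>` markup are hypothetical names used only for this illustration.

```python
import xml.sax


class WhitespaceLog(xml.sax.ContentHandler):
    """Record what arrives through each of the two whitespace callbacks."""

    def __init__(self):
        super().__init__()
        self.chars = []
        self.ignorable = []

    def characters(self, content):
        self.chars.append(content)

    def ignorableWhitespace(self, whitespace):
        self.ignorable.append(whitespace)


log = WhitespaceLog()
xml.sax.parseString(b"<list> <item>a  b</item> </list>", log)

# Spaces between tags and inside the item all arrive as ordinary characters.
assert "".join(log.chars) == " a  b "
```

With no DTD in play, nothing is routed to `ignorableWhitespace`; a validating parser given a suitable DTD would be expected to use that callback for the spaces between `<list>` and `<item>`.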

Brian Oxley said...

That's quite interesting, John. Thanks for the information!

Sadly, our front end is talking to an old hand-crafted XML parser on the back end that does not support DTDs, CDATA or even escapes. Slow death by debugging for the author of that code, I tell you.