How not to do XML
Imagine for a moment there was a PHP blog software that could dump its blog posts into some sort of extended RSS 2 feed and import them again later, possibly into a different installation. That's nice: XML is a flexible format and RSS allows extensions via namespaces. Even better, there are XML parsers for all major programming languages, and from Python working with XML is especially cool because of lxml and ElementTree. But there is a problem with that...
...that XML, is not XML. It's called WordPress eXtended RSS (WXR) but it's not XML? And why in god's name did nobody notice so far? I mean, WordPress must have an importer for that.
Why is it not XML? It has XML syntax and XML namespace declarations, but what doesn't it have? A doctype. What's the problem? It's referencing HTML entities! So step one for parsing: inject an inline DTD that defines those entities. Great fun, isn't it? Then it parses. I was happy and finished my work. That XML doesn't have HTML entities is something PHP developers probably don't know, and their parser isn't resolving any entities during the parsing process. Or worse, their XML parser expands HTML entities.
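A minimal sketch of that DTD injection in Python, using the standard library's ElementTree rather than lxml. The `parse_with_entities` helper, the `INLINE_DTD` constant and the two entities it declares are made up for illustration; a real dump would need a declaration for every HTML entity it actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline DTD: declares just two HTML entities as an example.
INLINE_DTD = (
    "<!DOCTYPE rss [\n"
    '  <!ENTITY nbsp "&#160;">\n'
    '  <!ENTITY copy "&#169;">\n'
    "]>\n"
)


def parse_with_entities(xml_text):
    """Inject INLINE_DTD in front of the root element, then parse."""
    if xml_text.startswith("<?xml"):
        # The DOCTYPE has to come after the XML declaration.
        end = xml_text.index("?>") + 2
        head, body = xml_text[:end] + "\n", xml_text[end:]
    else:
        head, body = "", xml_text
    return ET.fromstring(head + INLINE_DTD + body)


# A tiny stand-in for a dump that references HTML entities.
dump = (
    '<?xml version="1.0"?>\n'
    "<rss><channel><title>a&nbsp;b &copy;</title></channel></rss>"
)
root = parse_with_entities(dump)
title = root.find("channel/title").text
```

Without the injected DOCTYPE the same input is rejected with a parse error, because `&nbsp;` and `&copy;` are not defined anywhere.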
But it gets worse! I loaded another dump that happened to have some broken HTML in comments (could happen, does happen, thanks to broken trackback support). What happens next? THE XML DOESN'T PARSE ANY MORE! Why? Because comments are neither escaped nor marked as CDATA. I wonder why, especially because it's so much easier to dump embedded HTML/XHTML as CDATA than as XML, especially if you are working with PHP.
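For comparison, here is a rough Python sketch of the two ways an exporter could make arbitrary comment text safe. The helper names `as_escaped` and `as_cdata` are my own, and the sample comment is invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_escaped(text):
    """Escape &, < and > so the text is safe as element content."""
    return escape(text)


def as_cdata(text):
    """Wrap text in CDATA, splitting any ']]>' that would end the section early."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Broken HTML of the kind a trackback might leave behind (invented sample).
broken = "<p>unclosed <b>tag & a stray ]]> marker"

escaped_doc = "<comment>" + as_escaped(broken) + "</comment>"
cdata_doc = "<comment>" + as_cdata(broken) + "</comment>"

# Both documents are well-formed and carry the same text.
from_escaped = ET.fromstring(escaped_doc).text
from_cdata = ET.fromstring(cdata_doc).text
```

Either variant round-trips the broken markup as plain text, so a conformant parser never sees the stray tags.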
But WordPress was able to import that.... so I looked at their parser.... WORDPRESS PARSES THAT WXR FILE USING REGULAR EXPRESSIONS!!! Argharhgarhghargh. That's not XML what you are doing there, that's nothing. WordPress can't even parse its own file if you bind the WordPress exporter namespace to a different prefix! WordPress can't handle its own file if you replace their CDATA foobar with properly escaped stuff. Dammit!
I can't even write a proper exporter using XML tools because what my XML tools generate is not compatible with WordPress. And what tops it all?
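To show why the prefix should not matter, here is a small Python sketch. The namespace URI is a placeholder, not WordPress's real one, and the regular expression is my own illustration of prefix-bound matching, not code taken from the WordPress importer.

```python
import re
import xml.etree.ElementTree as ET

# Placeholder URI for illustration only; not the real WordPress export namespace.
EXPORT_NS = "http://example.org/export/"

# The same data twice, with the namespace bound to two different prefixes.
doc_wp = f'<item xmlns:wp="{EXPORT_NS}"><wp:post_id>7</wp:post_id></item>'
doc_other = f'<item xmlns:x="{EXPORT_NS}"><x:post_id>7</x:post_id></item>'


def post_id_regex(text):
    """Prefix-bound matching: only works when the prefix is literally 'wp'."""
    match = re.search(r"<wp:post_id>(.*?)</wp:post_id>", text)
    return match.group(1) if match else None


def post_id_xml(text):
    """Namespace-aware lookup: matches on the URI, whatever the prefix."""
    return ET.fromstring(text).findtext(f"{{{EXPORT_NS}}}post_id")
```

The XML-based lookup treats both documents as equivalent, while the regex version silently returns nothing for the second one.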
Reading that in the #wordpress channel:
<nickname_deleted> why does it matter what wp's xml format has flaws?
adapt your importer to the flaws
WordPress is a state-of-the-art semantic personal publishing platform with a focus on aesthetics, web standards, and usability.
Without further comment... I lost my faith in standards that moment. Wait a second, I lost it earlier. Still sad.
Those that do not understand XML will be forced to reimplement it badly, using regexps.
To be honest, RSS2.0 is notoriously badly specced, hence Atom. One of the issues is embedded HTML.
At least it's something that's close to XML, try looking at Movable Type's export "format" for something that will sear your soul.
— Gustaf Erikson on Monday, February 18, 2008 12:06 #
If you try and use a non-sloppy XML parser on something you snooped off the web, you are naive. At least, that's what I've learned over the past year.
Even if you pay a data provider for XML data (signed in the contract) you can not expect proper XML (also personal experience). You can yell and scream, complain, and refuse to pay - but in the end your users are waiting. You write that darn sloppy parser and get on with your life.
Sorry - but that is how it is.
— Mikkel KamstruErlandsen on Monday, February 18, 2008 13:35 #
php software, what do you expect?
— Florian on Monday, February 18, 2008 13:53 #
The real reason for this is that XML is so difficult to map onto most programming languages. The datatypes you end up working with are usually complicated.
It seems designed without any knowledge of programming languages and the common ADTs. I tend to use JSON for machine-talk (when I get to program both the client and server). It maps nicely onto most programming languages.
Ways XML is abused: 1) people treat it like XHTML (cause: the default escaping in XML sucks, example: RSS) 2) people map another structure onto XML (cause: it's a syntactic, not a semantic spec; it does not tell a programming language which types correspond to the parsed string, example: XML-RPC) 3) people use it as a database (cause: they follow the hype, example: rhythmbox, gconf)
People that do not understand XML are doomed to make the mistake of using it. We need a semantic interpretation. Such that a valid XML parser would turn a XML-integer into a signed integer, rather than returning only legal XML-integers as string which are then fed into some utility function of the programming language to be parsed. That utility function may accept a completely domain. Et voila, a nice, almost untraceble creepy bug introduced.
We need a new data-standard. One that also specifies the semantic interpretation of the data. We also need a database-query-language-standard (to replace SQL, another horrible standard) that uses that new data-standard as the communication language. Also it would be nice (read: extremely important) if the data-standard not only included things such as integers, floats, strings, unicode, but also anonymous functions. (Of course this requires some sort of minimal programming language model.)
The closest thing so far?
Data-standard: JSON,
Query-standard: CouchDB-style-queries
Language-standard: Javascript? Lambda-calculus?
Even javascript is already too complicated to easily create a parser and evaluator for. Perhaps just a lambda calculus? Just enough to communicate a piece of logic as data.
I can't wait to see XML die. It's the wrong solution to the wrong problem.
— Meneer R on Monday, February 18, 2008 15:09 #
Worse is better =)
— Shanti Braford on Monday, February 18, 2008 15:17 #
Typo, it should have been:
>> That utility function may accept a completely DIFFERENT domain. Et voila, a nice, almost untraceble creepy bug introduced.
I was talking about using a correct XML parser with a correct DOCTYPE that specifies legal integers. Yet the parser will return the legal integer as a STRING. This is not something easily overcome. It is an inherent aspect of the xml specification.
If anything doctypes only are like a parser specification, except you can only specify a subset of all parsers. Other than saying a document is legal, it does not associate any interpretation with the data at all.
— Meneer R on Monday, February 18, 2008 15:18 #
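Meneer R's point about validated integers arriving as strings can be sketched in Python. This assumes the XML Schema integer form (an optional sign followed by ASCII digits) as the "legal XML integer" and ignores whitespace normalization; `parse_xml_integer` and `XSD_INTEGER` are hypothetical names.

```python
import re

# Lexical form of an XML Schema integer: optional sign, then ASCII digits.
XSD_INTEGER = re.compile(r"[+-]?[0-9]+")


def parse_xml_integer(text):
    """Accept only strings in the XML Schema integer lexical space."""
    if not XSD_INTEGER.fullmatch(text):
        raise ValueError(f"not an XML integer: {text!r}")
    return int(text)


# Python's int() accepts a different domain than the XML side allows:
# underscores between digits and non-ASCII decimal digits both parse.
lenient_underscore = int("1_000")
lenient_unicode = int("\u0663")  # ARABIC-INDIC DIGIT THREE
```

Feeding the raw string straight into `int()` would therefore accept values that no validating XML tool would have let through, which is the mismatch the comment describes.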
Gustaf Erikson, please do not abuse or misreference quotes. The quote originally goes "those who do not understand UNIX are condemned to reimplement it poorly".
— Philluminati on Monday, February 18, 2008 15:20 #
Use the regular expressions as a cross-language library, they are a feature here!
— Ali on Monday, February 18, 2008 16:07 #
This type of moronic attitude favoring sloppiness over correctness is endemic in the WP developers. It explains why WP is always coming up with security vulnerabilities. It explains WP's choice of PHP as a language. It explains a lot.
Have you seen the internal WP machinery? Have you seen how slow it is? Serving a page is comparably slow to Plone, but the difference is Plone does so much, much more than WP.
— Rudd-O on Monday, February 18, 2008 16:07 #
@Philluminati: I'm quite aware of the source and provenance of the quote you mention, I was paraphrasing it in a humorous way :)
— Gustaf Erikson on Monday, February 18, 2008 16:22 #
@Meneer R "3) people use it as a database (cause: they follow the hype, example: rhythmbox, gconf)"
I didn't get you. Please explain. As I know XML is generally used for storage of data. And w3c website says, "XML was designed to transport and store data." Link: www.w3schools.com/xml/default.asp
— mridkash on Monday, February 18, 2008 16:49 #
Just to clarify something: I love XML, I just find it very sad that there are very popular applications which implement it so terribly wrong that it hurts.
— Armin Ronacher on Monday, February 18, 2008 17:52 #
"Also it would be nice (read: extremely important) if the data-standard not only included things such as integers, floats, strings, unicode, but also anynimous functions."
XML is not, and was never meant to be, a programming language.
— Biff on Monday, February 18, 2008 18:12 #
@Meneer R if you want lambda calculus - lisp has been there since ages - its parsable darn easy, and one of the most neat languages out there (simply cause its so extensible)
— Ronny Pfannschmidt on Monday, February 18, 2008 18:14 #
@Meneer R
XML was born out of SGML, which was created within the context of publishing. The impact then is that XML is well suited for document oriented processes. A feed is an excellent example of this in that it is unicode aware and can handle mixed content. The problem mentioned above is that the WXL (or whatever it is called) is not valid XML, which means that the wealth of tools available for working with (valid) XML are useless.
Your point regarding conversion from XML to native language types is somewhat valid, considering how most people use XML today. I would argue, though, that using XML as some sort of data conversion or serialization platform is not the intended use and as such is problematic. Really, the problem of transforming data is a difficult one even within a single language. Take Lisp for example. Its name is derived from list processing! It is not surprising then that folks would have a wealth of problems working with XML as a serialization tool. Even in a rather extensively spec'd technology like SOAP web services, translation of simple types like Strings is non-trivial between related languages like Java and C#.
I say all this because you are right in saying XML is bad for something like storing ints and strings for use by a programming language. It is good for exporting a series of entries as a single document. Having been bitten by the same issue in Wordpress, I don't put the blame on XML, but rather on Wordpress for not creating valid XML.
— Eric Larson on Monday, February 18, 2008 18:29 #
`nickname_deleted`'s advice is certainly the quick-and-dirty naive approach, but it makes one's toolchain tightly coupled to the almost-XML Wordpress format. If this format were produced as well-formed XML and processed using a conformant parser, then other tools could consume and produce this XML in a loosely coupled network, which would lead to a rich environment for manipulating your underlying blog data. Way to call the Wordpress folks on this, Armin!
— John L. Clark on Monday, February 18, 2008 22:11 #
In the long run (ie. you decide to scrape your own site and let Wordpress handle its own export format) you are using regexp anyway, but yeah, it's a shame people can't have fun parsing WXR in 'strict standards compliance mode' and have to overcome wordpress bugs. It is interesting to know that if I write a blog comment that is not XML compliant, then the resulting export file won't be, so I may be planting the seed for future problems for the blog author and/or the sysadmin.
However I am very happy with my sqladmin. I cannot do stuff like DTD or XSL (and it could be a cool thing to do over a WXR file) but it works flawlessly.
Somebody should refactor the whole WXR thing. In my POV it could be done in a compatible way and make everybody happy, but I am not a PHP hacker and I'm afraid I won't see that happening ever.
— eddie on Tuesday, February 19, 2008 0:17 #
Wordpress is not an example of good PHP programming. For the record, it is not difficult to generate valid XML in either PHP4 (Which Wordpress is still using) or PHP5. PHP is not the problem, lazy developers are.
— Michael Gauthier on Tuesday, February 19, 2008 2:48 #
Apostrophe's and they're use's.
— Grammar Nazi on Tuesday, February 19, 2008 5:25 #
@11) mridkash
"As I know XML is generally used for storage of data. And w3c website says, “XML was designed to transport and store data.”
Which seems an extremely broad, non-specific definition. What is not data? What about that definition is any different from the definition of a file? A file is also designed to transport and store data. Yet nobody ever confused a database with a file system. (Well, they did when they planned Vista, but they never got around to implementing that, not surprisingly.)
The question remains: what kind of data, for what purposes? How would you store audio in XML? Audio is data, right? Is it designed to store audio? The fact that you came up with this definition only illustrates the confusion.
Eric Larson @15 might provide an answer. He made the specific claim that it is well suited for document-based data. There is definitely something to be said for that. It seems well suited for a specific domain of data, especially where we need a tree-like structure. Those are much less verbose than, say, a relational mapping.
As to using it as a database is a mistake, I'm referring to these obvious facts: a) its space inefficient b) it has no defined query language, nor indexes c) it does not scale to large amounts of data
Perhaps point c is less obvious. It does not scale because you can't query or search an XML file without having to parse and deal with the whole thing. Consider the algorithmic complexity class of the search algorithm. Yeah, it's linear. That won't scale. Do not use it as a database.
You either have to go through the whole file over and over again for each query, or you load the file into memory completely and create indexes (read: your own crappy DBMS). Of course your custom hand-crafted memory-based database engine needs to map your XML correctly onto the ADT you use to store it in memory, or otherwise a bunch of hard-to-trace bugs are going to show up.
The thing is, either the performance is repulsive (rhythmbox?) or you are dealing with two different ADTs: one is used to store the data, the other to manage the data when you use it. In those situations it's just another layer of abstraction and conversion bugs.
But to respond some more to Eric Larson @15: it is even badly suited, as a standard, not as a technique, for document-based data. The thing is: the standard doesn't define any document. Rather, it provides a template to define a document standard. But without any default interpretation, XML is not a standard at all, any more than an ordinary file is. They all have names, for example. They all have MIME types too (say, the doctype of a file). If you want to organize them into a hierarchy you can use a directory. When you look at it like that, a normal Unix filesystem or a .tar archive provides the exact same abstraction. Except those formats are actually performant, unlike XML.
Now, that's the weird thing. I'm not saying a file-system or .tar file is a preferred abstraction for documents. I am saying that files are not a standard, nor is XML.
An ini file is a standard. A .odt file is a standard. But XML is not a standard at all. It needs interpretation; a mapping to something concrete. You can pinpoint almost all XML abuses back to this misconception that XML truly defines anything except a BNF grammar. It doesn't.
To Biff, @13, "XML is not, and was never meant to be, a programming language."
I wasn't claiming it was. But without default, useful datatypes that actually map onto the datatypes we find in 99% of the programming languages out there, the conversion into those types is going to introduce more bugs and problems than XML could possibly solve.
Perhaps we need a better definition of the problem we are trying to solve here. Here's the problem I think you would want XML to solve:
"The exchange of typed data (types suggesting not only the syntax, but also the semantics). In such a way you can easily create, move, import and work with that data from the ecosystem of programming languages."
Given that definition, the perfect data-exchange language I would come up with would also contain some sort of Turing machine, because exchanging behavior and logic in a modular, interchangeable way between programs is something we definitely need. But that was just dreaming out loud. At this point, if the world settled on a data-exchange language that didn't do this much harm, I would be a happier man.
To Ronny Pfannschmidt, @15. Although Lisp has some of the most intelligent primitives (that is, it is quite powerful using only a minimal set of operations), I doubt the syntax would make the format very human-friendly. I would rather go for a Haskell-style syntax, perhaps even without the default Polish notation. The perfect candidate, however, needs to be very easy to support from within another language (especially generating it) as well as being human-readable and usable. It would also be the default query language.
If anything I can understand the feeling some of you have to step up and defend XML. It's foundation and the reason people use it has so much idealism and good intentions behind it, it's hard to attack. I am however, also afraid that that is the reason why this wasn't shot down as much as should have been.
I dare to claim it is no accident that most XML parsers are actually not valid at all. The majority of people who could produce a valid XML parser with their hands tied behind their back would think twice before even considering using XML in the first place. There might be a negative correlation between skill and the probability of using XML, at least when you limit skill to those with an academic background.
— Meneer R on Tuesday, February 19, 2008 6:26 #
I would expect you didn't lose your faith in open standards but in PHP projects of decent quality. I do not want to say it's because of the language (...) but definitely because of the type of programmers who use it. We have seen such terrible PHP things (ranging from custom-made stuff to out-of-the-box open source packages), to such an extent that it builds the belief in your mind that PHP software equals crap software. I hope to be proved wrong in the future!
— Herman Bos on Tuesday, February 19, 2008 7:57 #
The general problem with the "web community" is that no one follows standards, and this has been ongoing since the beginning of the WWW. Everybody is looking for an easy way out (even those that implement browsers). The fact that WordPress does not use "XML" isn't really a shock; it's to be expected.
I have high doubts that this mentality will change. Even those that scream "web-standards" and implement their pages tableless use shit loads of hacks in order to render their page properly in different browsers.
— amix on Tuesday, February 19, 2008 12:22 #
> An ini file is a standard. A .odt file is a standard. But XML is not a standard at all. It needs > interpretation; a mapping to something concrete. You can pinpoint almost all XML abuses back > to this misconception that XML truly defines anything except a BNF grammer. It doesn’t.
XML: Extensible Markup Language. It's all very simple, you just don't get it. XML, in the end, is nothing more than a form of metadata. It just describes the data, so I'm sorry, it cannot be tied down to any one concrete thing or application, because data comes in all shapes and sizes, and one man's data is another man's rubbish.
— author on Tuesday, February 19, 2008 14:41 #
<strong>Migrated to WordPress...</strong>
First of all, apologies for any ‘planet spam’ caused the change to my feeds. After what seems like an eternity (but is actually just over a year) I’ve switched the backend of this site from Mephisto to WordPress. The main reason for t...
— schwuk.com on Thursday, February 21, 2008 23:02 #
All XML problems seem solved in the latest WordPress; now it's standards compliant, I guess
— Wajahat on Sunday, March 7, 2010 2:49 #