The Incredible C++ XML Parser and JSON Parser
Small, simple,
cross-platform, scalable and fast
C++ XML Parser
In 2003, I started working on XML technology and produced my
first XMLParser
library. This old library is now used in thousands of applications all around the
world (and also in space! 😲 ). The main objective of the old XMLParser library was to allow me to easily manipulate
input/output configuration files and XML data files. The old library was limited to relatively
small data files (typically, smaller than 10MB) because it's a pure DOM-style parser 😒 .
During the next 10 years, I received many emails from coders using the old XMLParser library to parse
larger and larger files (some individuals use it to parse 300MB XML files!). Although the old library managed to parse these larger files, it consumed a
very large amount of RAM (sometimes up to 10GB) and of CPU resources. Furthermore, I am now
manipulating (inside Anatella) terabyte-size XML files. In May 2013, I decided that it was
time for an "upgrade"! 😉 ...and the Incredible XML Parser was born! 😊
The Incredible XML Parser is composed of only 2 files: a .cpp file and a .h file.
The total size is 280 KB.
The Incredible XML Parser library includes three parsers:
- An ultra fast XML Pull Parser (that is named
"IXMLPullParser") that requires very little memory to run. The Pull
Parser is ultra fast but it does not offer the flexibility and the user-friendliness
of a full-fledged DOM parser.
- A very fast XML DOM parser (that is named "IXMLDomParser"), built "on top" of the Pull Parser,
that provides more comfort when manipulating XML
elements. It works by using recursion and building a node tree for breaking down the elements of an XML
document.
- An ultra fast JSON Pull Parser (that is named
"IJSONPullParser") that requires very little memory to run. The JSON Pull
Parser is ultra fast and is compatible with the Incredible XML DOM Parser
so that you can build (with the DOM Parser) a node tree in-memory that allows you to easily and quickly explore your JSON file (for
example, using advanced XPATH queries: see example 12!).
The Incredible XML DOM Parser, the Incredible XML Pull Parser and the Incredible JSON Pull Parser can all process terabyte-size
XML/JSON files in a few hours on commodity hardware with very low memory consumption
(i.e. less than a few megabytes).
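To give a feeling for why a pull-style approach needs so little memory, here is a minimal sketch in standard C++. It does not use the Incredible XML Parser API: the function name countStartTags is hypothetical, and the sketch deliberately ignores comments, CDATA sections and ">" characters inside attribute values. The point is only that the input is consumed as a stream and nothing but the current tag name is kept in memory.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (NOT part of the Incredible XML Parser API).
// Counts the start tags named `tag` while reading the stream one
// character at a time, so memory use does not grow with the input size.
// Simplifications: comments, CDATA sections and '>' characters inside
// attribute values are not handled.
std::size_t countStartTags(std::istream &in, const std::string &tag)
{
    assert(!tag.empty());
    std::size_t count = 0;
    std::string name;               // holds only the current tag name
    char c = '\0';
    while (in.get(c)) {
        if (c != '<') continue;     // skip text content
        name.clear();
        // Read the tag name up to whitespace, '/' or '>'.
        while (in.get(c) && c != '>' && c != '/' &&
               !std::isspace(static_cast<unsigned char>(c)))
            name.push_back(c);
        if (name == tag) ++count;   // closing tags yield an empty name
        // Skip the rest of the tag (attributes, "/>", ...).
        while (in && c != '>') in.get(c);
    }
    return count;
}
```

Because the function takes a std::istream, the same code works on a file, on a string held in memory, or on any other stream-like source. A real pull parser returns each event (start tag, attribute, text, end tag) to the caller instead of just counting.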
The objectives of the Incredible XML/JSON Parser are the same as the old XMLParser library:
- user-friendliness (i.e. it should be easy to use).
- Small footprint & no dependencies (i.e. this must remain a small library, easy to include & compile everywhere, on any platform).
And, in addition, it provides even more speed & scalability.
For the Incredible XML Parser, I kept all the
nice functionalities from the old XML Parser that made it so popular and I added the following:
- The Incredible XML Pull Parser has one of the lowest memory
consumption amongst all XML Pull parsers.
- The Incredible JSON Pull Parser has one of the lowest memory
consumption amongst all JSON Pull parsers.
- The Incredible XML DOM Parser has the lowest memory consumption amongst all XML DOM parsers.
- The Incredible XML Pull Parser is one of the fastest XML Pull parsers.
- The Incredible JSON Pull Parser is one of the fastest JSON Pull parsers.
- The Incredible XML DOM Parser is the fastest XML DOM parser.
- The Incredible XML DOM Parser is the only DOM parser able to work on UNLIMITED file size.
- The 2 Incredible XML Parsers are able to handle nearly any character encoding.
- The 3 Incredible Parsers fully support "char*" mode and "wchar_t*" mode.
- The 3 Incredible Parsers are able to handle stream-lined data. This has several advantages:
- you are not limited anymore by your RAM memory size.
- very reduced and (more or less) constant memory consumption.
- you can process very easily stream-lined data (such as data coming from an
HTTP connection or the data coming from the decompression of a ZIP file).
- The 3 Incredible Parsers are 100% thread-safe (more precisely: they are reentrant).
- The Incredible XML&JSON Pull Parsers are "in-place" parsers (they do not copy any strings internally, so that they are as fast as possible).
- The Incredible XML&JSON Pull Parsers are among the easiest-to-use Pull parsers (because they always return zero-terminated char* or wchar_t*, unlike other "in-place" parsers).
- The Incredible XML Dom Parser supports "hot starts" and is able to parse a sub-section of the
original XML file without doing any memory allocation at all. The "hot start" functionality is unique
and is very important because it allows us to use a very flexible DOM-style Parser on UNLIMITED XML&JSON file
size (see example 7 inside the documentation) using very little RAM memory.
- The Incredible XML Dom Parser provides an ultra fast XPATH support. With XPATH, you can find very easily inside any XML&JSON file the information that you need.
- The Incredible XML Parser has an extensive (doxygen) documentation
- The Incredible XML Dom Parser is a good replacement for the
old XMLParser library (The IXMLNode class from the Incredible XML Dom Parser is a direct replacement to the XMLNode class from the
old XMLParser library).
- The Incredible XML Parser is easy to customize: The code is concise, commented and written in a plain and
simple way. Thus, if you really need to change something (but I doubt it), it's easy.
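The "in-place" idea from the list above (no string copies, yet ordinary zero-terminated strings) can be illustrated with a small sketch. This is not the library's code: the names AttrView and parseAttrInPlace are hypothetical, and the sketch only handles one attribute written as name="value". One common way to get both properties is to overwrite the delimiters inside the caller's buffer with '\0' and return pointers into that same buffer.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical result type (NOT from the library): both pointers
// point into the caller's buffer; nothing is copied.
struct AttrView {
    char *name;
    char *value;
};

// Parses one attribute of the form  name="value"  starting at `p`.
// The '=' and the closing '"' are overwritten with '\0', so `name`
// and `value` become ordinary zero-terminated C strings.
// Returns {nullptr, nullptr} if the text does not have this shape.
AttrView parseAttrInPlace(char *p)
{
    assert(p != nullptr);
    AttrView out = {nullptr, nullptr};
    char *eq = std::strchr(p, '=');
    if (!eq || eq[1] != '"') return out;
    char *close = std::strchr(eq + 2, '"');
    if (!close) return out;
    *eq = '\0';        // terminates the name
    *close = '\0';     // terminates the value
    out.name = p;
    out.value = eq + 2;
    return out;
}
```

The trade-off of this technique is that the input buffer is modified, so it must be writable and must stay alive for as long as the returned pointers are used.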
To the best of my knowledge, there exists no other "non-validating C++ XML
parser" that is as simple and as powerful. 😄 This is especially true if you need to
parse large XML documents: In such a case, there is no parser that comes even close to the
Incredible XML Parser presented here.
I originally selected the name "Ultimate" for the XML Parser because I cannot see how it would be possible to improve on
the XML Parser Library presented here 😜. Of course, you can always add features such as "XML Validation", etc. but it
will only produce a slower, more "bloated" library. It's really the "Incredible XML parser" 😏 and if you are a
professional developer serious about your work, you should use the "Incredible XML parser" and no other parser 🙏.
License
The Incredible XML Parser is distributed under the Aladdin Free Public License (AFPL).
The old XMLParser library is completely free and will remain free forever.
The Incredible XML parser is also completely free in these situations:
- You only need the Aladdin Free Public License(AFPL).
- You need another license (e.g. a BSD license or a MIT license) but you'll use the Incredible XML Parser inside:
- a computer video game (or anything related to video games).
- a software for a charity organization.
If you are not in the situations described above, you can still buy a BSD license (or MIT license) to use
the XML Parser inside all your projects: Simply contact me
to request your license.
Download
If you like this library, you can create a link to this page from your
website (use this URL: http://www.applied-mathematics.net/tools/IXMLParser.html).
If you want to help other people to produce better software using XML technology,
you can increase the visibility of this library by adding a link to this
page (so that its Google ranking increases!).
If you like this library, please add
a message in the guestbook!
To obtain the library, simply contact me
and I will send you the Incredible XML Parser directly, the same day. You will receive a zip file by e-mail.
Inside the zip file, you will find 5 examples:
- ansi (char*) unix/solaris project example (makefile based)
- ansi (char*) windows project example (for Visual Studio .NET)
- ansi (char*) windows .dll project with a small test project to check the generated .dll
- wide char (wchar_t*) unix/solaris project example (makefile based)
- wide char (wchar_t*) windows project example (for Visual Studio .NET)
Log
Version changes:
- v3.01: May 19, 2013: initial version.
- v3.02: May 24, 2013: Various bug fixes & improvements.
- v3.03: May 24, 2013: Performed extensive testing on large documents and fixed some remaining small bugs.
- v3.04: May 28, 2013: changed the name from "Ultimate" to "Incredible" XMLParser.
- v3.05: May 30, 2013: 2 additions
- v3.06: June 12, 2013: 1 bug fix, 1 addition
- FIX: compilation with new gcc.
- added support for UTF32
- v3.07: June 26, 2013: 2 additions
- added better support for "Max Memory Reached" error inside DOM parser
- added better support to detect errors in JSON files
- v3.08: July 18, 2013: 2 additions
- added advanced support for XPATH inside the Pull Parser (the DOM parser already supported advanced XPATH).
- added primitive support for HTML: it's now possible to parse HTML documents!
- v3.09: September 17, 2013: 1 bug fix, 1 addition, 2 minor changes
- FIX: rendering of gb2312-IXMLNode-tree to strings
- added inside example 9 some code that shows how to handle gb2312 xml files
- changed the structure name "IXMLError" to "IXMLErrorInfo" to avoid collision with MSXML
- changed the structure name "IXMLAttribute" to "IXMLAttr" to avoid collision with MSXML
- v3.10: December 20, 2013: 1 bug fix
- FIX: parser sometimes stopped parsing on very long tags
- v3.11: February 21, 2014: 4 additions
- removed 64-bit compilation warnings
- added "getAttribute(String)" method inside PullParser
- various improvements to the "skipBranch()" method (e.g. support for the STRICT_PARSING option)
- improved html support ("input" tag support)
- v3.12: June 9, 2014: 1 bug fix, 1 addition
- fix compilation errors and memory-alignment error on WINCE+ARM platform: thanks to Marco Lizza for that!
- improved XPATH support to search inside HTML files: We now ignore all "tbody" tags: You must modify all
your XPATH expressions in the following way: Replace all "/TBODY/" with "/".
- v3.13: September 15, 2014: 1 bug fix, 1 addition
- removed un-necessary high memory consumption in IXMLPullParser
- added the option "indicesAreZeroBased" inside the methods findPath(), getChildNodeByPath(), getElementByPath()
- v3.14: March 4, 2015: 2 bug fixes
- FIX: now allows spaces between the = char and the " char inside attributes
- FIX: the end of the Clear tags was not properly found
- v3.15: September 8, 2015: 3 bug fixes, 2 additions
- FIX: last tag in JSON was sometimes detected incorrectly
- the empty JSON dictionary ("{}") is now handled properly
- getColumnNumber() now always returns a correct number
- improved support for HTML parsing (allow ? char anywhere)(allow non-balanced tags)
- added IPullParser::setReader()
- v3.16: September 20, 2015: 1 bug fix, 1 addition
- FIX: fix compilation errors on some compilers (private->protected)
- NaN, Inf and -Inf are now valid JSON numbers (to be compatible with the infamous MongoDB)
- v3.17: December 15, 2016: 1 bug fix, 1 addition
- FIX: the closing of the Cleartag "<Script" was leaving a "<" char behind (fixed by Greg Kochaniak)
- the HTML parser now recognizes (and skips) the closing tags: ,,,..: these are
invalid tags in HTML but they still occur quite often "in the wild" (addition by Greg Kochaniak)
- v3.18: March 20, 2017: 1 addition
- added the ITCXMLNode::isClosedNode() method
- v3.19: November 10, 2018: 1 bug fix, 1 addition
- FIX: small memory leak fixed in IJSONPullParser when it's used improperly
- better support for HTML files
A small tutorial
Let's assume that you want to parse the XML file "PMMLModel.xml"
that contains:
<?xml version="1.0" encoding="ISO-8859-1"?>
<PMML version="3.0"
xmlns="http://www.dmg.org/PMML-3-0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema_instance" >
<Header copyright="Frank Vanden Berghen">
Hello World!
<Application name="&lt;Condor&gt;" version="1.99beta" />
</Header>
<Extension name="keys"> <Key name="urn"> </Key> </Extension>
<DataDictionary>
<DataField name="persfam" optype="continuous" dataType="double">
<Value value="9.900000e+001" property="missing" />
</DataField>
<DataField name="prov" optype="continuous" dataType="double" />
<DataField name="urb" optype="continuous" dataType="double" />
<DataField name="ses" optype="continuous" dataType="double" />
</DataDictionary>
<RegressionModel functionName="regression" modelType="linearRegression">
<RegressionTable intercept="0.00796037">
<NumericPredictor name="persfam" coefficient="-0.00275951" />
<NumericPredictor name="prov" coefficient="0.000319433" />
<NumericPredictor name="ses" coefficient="-0.000454307" />
<NONNumericPredictor name="testXmlExample" />
</RegressionTable>
</RegressionModel>
</PMML>
Let's analyse line by line the following small example program:
#include <stdio.h>  // to get the "printf" function
#include <stdlib.h> // to get the "atof" function
#include "xmlParser.h"
int main(int argc, char **argv)
{
    // This creates a new Incredible XML DOM parser:
    IXMLDomParser iDom;
    // This opens and parses the XML file:
    ITCXMLNode xMainNode=iDom.openFileHelper("PMMLModel.xml","PMML");
    // This prints "<Condor>":
    ITCXMLNode xNode=xMainNode.getChildNode("Header");
    printf("Application Name is: '%s'\n", xNode.getChildNode("Application").getAttribute("name"));
    // This prints "Hello World!":
    printf("Text inside Header tag is :'%s'\n", xNode.getText());
    // This gets the number of "NumericPredictor" tags:
    xNode=xMainNode.getChildNode("RegressionModel").getChildNode("RegressionTable");
    int n=xNode.nChildNode("NumericPredictor");
    // This prints the "coefficient" value for all the "NumericPredictor" tags:
    for (int i=0; i<n; i++)
        printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",i).getAttribute("coefficient")));
    // This creates an IXMLRenderer object and uses this object to print a formatted XML string based on
    // the content of the first "Extension" tag of the XML file (more details below):
    printf("%s\n",IXMLRenderer().getString(xMainNode.getChildNode("Extension")));
    return 0;
}
To easily manipulate the data contained inside the XML file, the first operation is
to create an IXMLDomParser object (in the above example, it's named "iDom") and use it to get an instance of the class ITCXMLNode that represents the XML file in
memory. You can use:
ITCXMLNode xMainNode=iDom.openFileHelper("PMMLModel.xml","PMML");
or, if you use the UNICODE windows version of the library:
ITCXMLNode xMainNode=iDom.openFileHelper(L"PMMLModel.xml",L"PMML");
or, if the XML document is already in a memory buffer pointed to by the variable "char *xmlDoc":
ITCXMLNode xMainNode=iDom.parseString(xmlDoc,"PMML");
This will create an object called xMainNode
that represents the first tag named PMML
found inside the XML document. This object is the top of the tree structure representing
the XML file in memory. The following command creates a new object called xNode
that represents the "Header"
tag inside the "PMML"
tag.
ITCXMLNode xNode=xMainNode.getChildNode("Header");
The following command prints on the screen "<Condor>"
(note that the "&lt;"
character entity has been replaced by "<"):
printf("Application Name is: '%s'\n", xNode.getChildNode("Application").getAttribute("name"));
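XML requires a literal "<" inside an attribute value to be written as an entity, and the parser turns it back into the character when you read the attribute. As an illustration only (this is not the library's implementation, and the function name decodeEntitiesInPlace is hypothetical), the five predefined XML entities can be decoded in place like this. Numeric character references such as "&#60;" are not handled in this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (NOT part of the library): replaces the five
// predefined XML entities with their characters, in place. The decoded
// text is never longer than the input, so no allocation is needed.
void decodeEntitiesInPlace(char *s)
{
    assert(s != nullptr);
    static const struct { const char *entity; char ch; } table[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&amp;", '&'},
        {"&quot;", '"'}, {"&apos;", '\''}
    };
    char *out = s;                       // write position (out <= s)
    while (*s) {
        bool matched = false;
        if (*s == '&') {
            for (const auto &e : table) {
                std::size_t len = std::strlen(e.entity);
                if (std::strncmp(s, e.entity, len) == 0) {
                    *out++ = e.ch;       // emit the decoded character
                    s += len;            // skip the whole entity
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) *out++ = *s++;     // copy anything else unchanged
    }
    *out = '\0';
}
```

Unknown entities are left untouched, and each entity is decoded only once (the text "&amp;lt;" becomes "&lt;", not "<").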
The following command prints on the screen "Hello
World!":
printf("Text inside Header tag is :'%s'\n", xNode.getText());
Let's assume you want to "go to" the tag named "RegressionTable":
xNode=xMainNode.getChildNode("RegressionModel").getChildNode("RegressionTable");
Note that the previous value of the object named xNode
has been "garbage collected" so that no memory leak occurs. If you
want to know how many tags named "NumericPredictor"
are contained inside the tag named "RegressionTable":
int n=xNode.nChildNode("NumericPredictor");
The variable n now
contains the value 3. If you want to print the value of the coefficient
attribute for all the NumericPredictor
tags:
for (int i=0; i<n; i++)
printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",i).getAttribute("coefficient")));
Or equivalently, but faster at runtime:
int iterator=0;
for (int i=0; i<n; i++)
printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",&iterator).getAttribute("coefficient")));
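Why can the iterator form be faster? The sketch below is a guess at the general principle, not a description of the library's internals: it assumes that the children of a node are stored in one sequence, with different tag names mixed together. All names here (findByIndex, findNext) are hypothetical. Looking up "the i-th child with this name" restarts the scan from the first child on every call, whereas a cursor resumes where the previous call stopped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Index-based lookup: returns the position of the i-th child called
// `name`, or -1. Every call rescans from the first child.
int findByIndex(const std::vector<std::string> &children,
                const std::string &name, int i)
{
    assert(i >= 0);
    int seen = 0;
    for (std::size_t pos = 0; pos < children.size(); ++pos)
        if (children[pos] == name && seen++ == i)
            return static_cast<int>(pos);
    return -1;
}

// Cursor-based lookup: `*cursor` remembers where the previous call
// stopped, so a full loop over all matches visits each child once.
int findNext(const std::vector<std::string> &children,
             const std::string &name, int *cursor)
{
    assert(cursor != nullptr && *cursor >= 0);
    for (std::size_t pos = static_cast<std::size_t>(*cursor);
         pos < children.size(); ++pos) {
        if (children[pos] == name) {
            *cursor = static_cast<int>(pos) + 1;
            return static_cast<int>(pos);
        }
    }
    *cursor = static_cast<int>(children.size());
    return -1;
}
```

Under this assumption, looping over n matching children costs on the order of n*n comparisons with the index form and on the order of n with the cursor form. For the three "NumericPredictor" tags of the tutorial the difference is negligible; it matters for nodes with many children.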
If you want to generate and print on the screen the following XML formatted
text:
<Extension name="keys">
<Key name="urn" />
</Extension>
You can use:
IXMLRenderer iRenderer;
char *t=iRenderer.getString(xMainNode.getChildNode("Extension"),true);
printf("%s\n",t);
Note that you must NOT free the memory buffer containing the returned XML string yourself (You must NOT write
any "free(t);") : The memory buffer
containing the XML string is owned by the iRenderer
object and it will be freed when the iRenderer object
is destroyed (i.e. when it falls "out-of-scope").
The parameter true to
the function getString() means that we want formatted output.
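The ownership rule above (the renderer owns the buffer, the caller only borrows a pointer) is a common C++ pattern. Here is a minimal sketch of it in standard C++. The class TinyRenderer is hypothetical and much simpler than IXMLRenderer; it only shows who frees the memory and how long the returned pointer stays valid.

```cpp
#include <cassert>
#include <string>

// Hypothetical class (NOT the library's IXMLRenderer) showing the
// "renderer owns the output buffer" pattern.
class TinyRenderer {
public:
    // Returns a pointer into a buffer owned by this object. The pointer
    // stays valid until the next call to getString() or until the
    // object is destroyed. The caller must never free it.
    const char *getString(const std::string &tag, const std::string &text)
    {
        assert(!tag.empty());
        buffer_ = "<" + tag + ">" + text + "</" + tag + ">";
        return buffer_.c_str();
    }
private:
    std::string buffer_;   // released automatically by the destructor
};
```

With this pattern, a pointer obtained from a temporary object, as in printf("%s\n", TinyRenderer().getString("Key","urn")), is only valid until the end of that statement, because the temporary is destroyed there. If you need the string for longer, keep the renderer object alive in a named variable or copy the text.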
The Incredible XML Parser library contains many other small useful methods that are
not described here (The zip file contains some additional examples to explain
other functionalities and a complete Doxygen documentation about the IXMLParser).
That's all folks! With this basic knowledge, you should be able to easily retrieve
any data from any XML file!