IncredibleXMLParser
3.05
Copyright (c) 2013, Frank Vanden Berghen - All rights reserved.
See the file AFPL-license.txt about the licensing terms
The Incredible XML Parser library is an advanced non-validating XML parser written in ANSI C++ for portability.
The main objectives of the Incredible XML Parser library are:
The Incredible XML Parser library includes 2 parsers: It has:
The Incredible XML DOM Parser, the Incredible XML Pull Parser and the Incredible JSON Pull Parser can all process terabyte-size XML/JSON files in a few hours on commodity hardware with very low memory consumption (i.e. less than a few megabytes).
The three parsers (the two Pull Parsers and the DOM parser) generate strings either in "char*" or in "wchar_t*" mode. In "char*" mode, the Incredible XML Parser supports nearly all currently known character encodings (and it's very easy to add new ones if required). In "wchar_t*" mode, the Incredible XML Parser manipulates utf-16 strings. The Incredible XML Parser also automatically converts between character encodings (e.g. it automatically converts from "utf-8" to "utf-16" when using the "wchar_t*" version of the library). The Incredible XML Parser is the only small-footprint, non-validating XML parser that supports a very wide range of different character encodings.
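To make the "utf-8" to "utf-16" conversion mentioned above more concrete, here is a minimal, standalone sketch of such a conversion. It is not taken from the library: the function name utf8ToUtf16 is hypothetical, the code assumes a C++11 compiler (for std::u16string), and it does not reject overlong encodings as a production converter should.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 into UTF-16 code units.
// Returns false on a truncated or malformed sequence.
// Hypothetical helper for illustration; NOT part of the IXMLParser API.
// Simplification: overlong encodings are not rejected.
bool utf8ToUtf16(const std::string& in, std::u16string& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;      // code point being assembled
        std::size_t extra;     // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else return false;     // stray continuation byte or invalid lead byte

        if (in.size() - i <= extra) return false;   // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        // Reject values outside Unicode and lone surrogates.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;     // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    assert(i == in.size());    // every input byte was consumed
    return true;
}
```

The output buffer is the "one inevitable memory copy" that any encoding conversion implies: the converted characters cannot live in the original input buffer because their size differs.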
The three parsers (the Incredible Pull Parsers and the Incredible DOM parser) work on a stream of data. This means that you don't need to load the complete XML/JSON file (or the complete XML/JSON string) into memory: you only need to provide a function (i.e. the "read" function of an IXMLReader object) that returns successive small "chunks" of the XML/JSON stream. This has several advantages:
The Incredible XML&JSON Pull parsers are 100% "in-place" parsers. This means that they do NOT copy strings: they only initialize different pointers to the memory buffer containing the XML/JSON data (there is however one inevitable memory copy when converting between different character encodings: for example when the Pull Parser is forced to convert the characters from "utf-16" to "utf-8"). "In-place" parsers are a lot faster because they do not require copying the whole data into separate buffers. The Incredible XML&JSON Pull parsers are thus among the fastest XML parsers available (and they might even be the fastest).
You can configure the size of all the memory buffers used inside the Incredible XML&JSON Parsers. Setting small buffer sizes reduces the memory consumption of the parser, but it usually also slightly increases the computation time. The default buffer sizes are chosen to give good speed on a typical PC.
...ensures that the Incredible XML/JSON Pull Parser is the XML/JSON Parser with the SMALLEST memory consumption amongst all parsers.
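The idea of feeding a parser through a "read" function that returns small chunks, with a configurable buffer size, can be sketched as follows. This is an illustration only: ChunkReader, StringReader and countOpeningTags are hypothetical names and do not reproduce the actual IXMLReader interface. The scanner is deliberately a toy (it ignores comments, CDATA sections and '<' inside attribute values).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical reader interface (NOT the real IXMLReader): fills 'buf' with
// up to 'cap' bytes and returns the number of bytes written; 0 means end of stream.
struct ChunkReader {
    virtual ~ChunkReader() {}
    virtual std::size_t read(char* buf, std::size_t cap) = 0;
};

// Reader backed by an in-memory string; a file or a socket would look the same
// from the consumer's point of view.
class StringReader : public ChunkReader {
public:
    explicit StringReader(const std::string& s) : data_(s), pos_(0) {}
    std::size_t read(char* buf, std::size_t cap) override {
        const std::size_t n = std::min(cap, data_.size() - pos_);
        std::memcpy(buf, data_.data() + pos_, n);
        pos_ += n;
        return n;
    }
private:
    std::string data_;
    std::size_t pos_;
};

// Count opening tags while holding only 'bufSize' bytes of the stream in memory.
// A single flag ('sawLt') carries the scanner state across chunk boundaries.
std::size_t countOpeningTags(ChunkReader& r, std::size_t bufSize)
{
    assert(bufSize > 0);
    std::vector<char> buf(bufSize);
    std::size_t count = 0;
    bool sawLt = false;
    for (std::size_t n; (n = r.read(buf.data(), buf.size())) != 0; ) {
        for (std::size_t i = 0; i < n; ++i) {
            const char c = buf[i];
            // '<' followed by anything other than '/', '?' or '!' opens an element.
            if (sawLt && c != '/' && c != '?' && c != '!') ++count;
            sawLt = (c == '<');
        }
    }
    return count;
}
```

The result does not depend on the buffer size: a 1-byte buffer and a 4096-byte buffer give the same count, and only the number of calls to read() changes. This is the memory versus speed trade-off described above.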
All the strings returned by the XML/JSON Pull Parser are zero-terminated so that you can directly and very easily use them. For example, you can write:
because "getName()" returns a zero-terminated char* (or wchar_t*). The Incredible XML/JSON Parser is the only "in-place" Pull parser that returns zero-terminated strings without a performance penalty (i.e. without copying the whole string into a separate buffer). It's thus a lot more "usable" than all other "in-place" Pull Parsers.
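One common way for an "in-place" parser to hand out zero-terminated strings without copying is to overwrite the delimiter that follows each token with '\0', after remembering which delimiter it was. The sketch below illustrates that general technique; the source does not state that the Incredible XML Parser works exactly this way, and the names endsName and tagNamesInPlace are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// True for characters that end an element name.
static bool endsName(char c)
{
    return c == '>' || c == '/' || c == ' ' || c == '\t' ||
           c == '\n' || c == '\r' || c == '\0';
}

// Collect pointers to the names of all opening tags found in 'xml'.
// The buffer itself is modified: the delimiter after each name becomes '\0',
// so every returned pointer is a zero-terminated string inside 'xml'.
// Toy scanner for illustration: comments and CDATA sections are not handled.
std::vector<const char*> tagNamesInPlace(char* xml)
{
    assert(xml != nullptr);
    std::vector<const char*> names;
    char* p = xml;
    while (*p) {
        // Skip text, closing tags, processing instructions and declarations.
        if (*p != '<' || p[1] == '/' || p[1] == '?' || p[1] == '!') { ++p; continue; }
        char* name = ++p;
        while (!endsName(*p)) ++p;
        const char delim = *p;          // remember what is about to be overwritten
        if (delim == '\0') break;       // truncated tag at end of buffer
        *p++ = '\0';                    // terminate the name in place, no copy
        names.push_back(name);
        if (delim != '>')               // skip attributes up to the end of the tag
            while (*p && *p != '>') ++p;
    }
    return names;
}
```

Each returned pointer can be passed directly to functions such as strcmp() or printf(), and it points into the original buffer rather than into a separate allocation.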
The XML DOM parser is able to "hot start" to create a node tree out of a sub-section of the original XML file. This means that, if you have an XML file such as this one:
...you will typically call the XML DOM parser 3 times (i.e. one time for each customer). When the DOM parser "hot starts", it always re-uses the same memory space as the previous call so that no additional memory allocations occur. It is thus extremely fast. Since we are building in memory an XMLNode structure that only contains ONE customer at a time, the memory consumption is very small (and independent of the total size of the XML file!). The "hot start" functionality is unique and very important because it allows us to use a very flexible DOM-style Parser on XML files of UNLIMITED size (see example7()).
The Incredible XML Parser is the only DOM-Style parser that is able to work on XML/JSON files of UNLIMITED size (all other DOM-Style parsers are always limited to files smaller than a few megabytes). The Incredible XML Parser is thus the only parser that allows you to very easily analyze very complex XML/JSON files (thanks to the easy-to-use DOM-style parser) of UNLIMITED size.
The main bottleneck in any DOM-Style parser is always memory allocation. If you remove this bottleneck (as the Incredible XML Parser does), you obtain a parser that is between 10 and 100 times faster (depending on the structure of the XML/JSON). This explains why the Incredible XML DOM parser is also the fastest DOM-Style parser currently available. Thanks to the "hot start" functionality, the Incredible XML DOM parser does not perform any memory allocations to build the different node trees (i.e. in the above example, there are no memory allocations to build the XMLNode structure for each customer). The extreme speed of the Incredible XML DOM parser makes it easy to manipulate extremely large XML files (i.e. terabyte-size XML files are processed in a few hours on commodity hardware).
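The principle behind re-using the same memory for every record can be sketched with a small fixed-capacity node pool that is reset, not freed, between records. This is only an illustration of the general idea. Node, NodePool and buildCustomer are hypothetical names, the child names are placeholders, and the library's internal memory management is not described in enough detail here to reproduce it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node type for this sketch (not the library's XMLNode classes).
struct Node {
    const char* name;
    Node* firstChild;
    Node* nextSibling;
};

// Fixed-capacity pool of nodes. All memory is allocated once, in the
// constructor, and is then re-used for every record.
class NodePool {
public:
    explicit NodePool(std::size_t capacity) : nodes_(capacity), used_(0) {
        assert(capacity > 0);
    }
    // Hand out the next free node, or nullptr if the pool is exhausted.
    Node* make(const char* name) {
        if (used_ == nodes_.size()) return nullptr;
        Node* n = &nodes_[used_++];
        n->name = name;
        n->firstChild = nullptr;
        n->nextSibling = nullptr;
        return n;
    }
    // "Hot start": forget the previous tree but keep the memory.
    void reset() { used_ = 0; }
    std::size_t used() const { return used_; }
private:
    std::vector<Node> nodes_;
    std::size_t used_;
};

// Build a tiny tree for ONE record. The previous tree becomes invalid.
// "child1" and "child2" are placeholder names.
Node* buildCustomer(NodePool& pool)
{
    pool.reset();
    Node* root = pool.make("customer");
    Node* a = pool.make("child1");
    Node* b = pool.make("child2");
    if (!root || !a || !b) return nullptr;
    root->firstChild = a;
    a->nextSibling = b;
    return root;
}
```

Calling buildCustomer() repeatedly returns a tree at the same address each time and never allocates after the pool is constructed. Memory consumption therefore depends on the size of one record and not on the size of the whole file.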
Points 8 to 13 above are very UNCOMMON in small-footprint, non-validating XML parsers.
You can follow a simple Tutorial to learn the basics...
By default, the Incredible XML DOM parser creates a tree of ITCXMLNode objects. Because of the "hot start" functionality, this tree will disappear at the next call to the DOM parser (because the Incredible DOM parser always re-uses the same memory space to store the tree to avoid any memory allocation). The name ITCXMLNode is the acronym of "Incredible Transient Constant XMLNode". "Transient" means that the tree disappears at each call to the DOM parser. "Constant" means that you cannot change the tree (i.e. it's read-only: for example, you cannot add or remove any child nodes).
You can always convert an ITCXMLNode to an ICXMLNode (note that the 'T' letter in ICXMLNode is missing because it's not "transient" anymore), so that the tree obtained with the DOM parser remains in memory after a new call to the DOM parser.
You can always convert an ITCXMLNode (or an ICXMLNode) to an IXMLNode (note that the 'T' and 'C' letters in IXMLNode are missing because this object is neither "transient" nor "constant" anymore). You can edit/update IXMLNode objects using the classical, well-known functions (e.g. the functions addChildNode(), addAttribute(), deleteNodeContent(), etc.). The IXMLNode class is 100% compatible with the old, well-known XMLNode class from the old XMLParser library.
When using the Incredible DOM parser, you have access to 3 types of XMLNodes:
For most operations, these 3 types of XMLNodes are interchangeable (however, only the IXMLNode supports "editing" operations). The main difference between these 3 XMLNode classes lies in the way they manage memory allocations.
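The difference between a transient, read-only tree and a persistent, editable one can be illustrated with a deep copy. The sketch below uses two hypothetical types, TransientNode and OwnedNode, as loose stand-ins for the two ends of the spectrum described above. It does not show the library's own conversion mechanism, which is documented in IXMLParser.h.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Read-only node living in memory owned by the parser (hypothetical stand-in
// for a "transient, constant" node). Its pointers become stale as soon as the
// parser re-uses its memory.
struct TransientNode {
    const char* name;
    const TransientNode* firstChild;
    const TransientNode* nextSibling;
};

// Node that owns its data and can therefore be kept and edited (hypothetical
// stand-in for an editable node). Every copy costs memory allocations.
struct OwnedNode {
    std::string name;
    std::vector<OwnedNode> children;
};

// Deep-copy a transient tree so that it survives re-use of the parser's memory.
OwnedNode deepCopy(const TransientNode& src)
{
    assert(src.name != nullptr);
    OwnedNode dst;
    dst.name = src.name;                       // allocates: copies the characters
    for (const TransientNode* c = src.firstChild; c; c = c->nextSibling)
        dst.children.push_back(deepCopy(*c));  // allocates: copies the subtree
    return dst;
}
```

The copy stays valid after the parser overwrites its buffers and it can be modified freely, at the price of the memory allocations that the transient tree avoids.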
The library is composed of only two files: IXMLParser.cpp and IXMLParser.h. These are the ONLY 2 files that you need when using the library inside your own projects.
All the functions of the library are documented inside the comments of the file IXMLParser.h. These comments can be transformed into full-fledged HTML documentation using the DOXYGEN software: simply type "doxygen doxy.cfg".
By default, the IXMLParser library uses (char*) for string representation. To use the (wchar_t*) version of the library, you need to define the "_UNICODE" preprocessor symbol (this is usually done inside your project definition file; Visual Studio defines it automatically when the project's character set is set to Unicode).
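The way a single preprocessor symbol can switch a whole code base between "char*" and "wchar_t*" strings can be sketched as follows. The names DemoChar, DEMO_TEXT and demoLen are hypothetical and chosen for this illustration only. The library defines its own character types and macros in IXMLParser.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Select the character type at compile time, depending on _UNICODE.
// All three names below are hypothetical and NOT part of IXMLParser.
#ifdef _UNICODE
typedef wchar_t DemoChar;
#define DEMO_TEXT(s) L##s      // turns "abc" into L"abc"
inline std::size_t demoLen(const DemoChar* s) { return std::wcslen(s); }
#else
typedef char DemoChar;
#define DEMO_TEXT(s) s         // leaves "abc" unchanged
inline std::size_t demoLen(const DemoChar* s) { return std::strlen(s); }
#endif

// The same source line compiles in both the "char*" and "wchar_t*" builds.
inline std::size_t greetingLength()
{
    const DemoChar* greeting = DEMO_TEXT("hello");
    assert(greeting != nullptr);
    return demoLen(greeting);
}
```

Compiling this file with and without -D_UNICODE (or /D_UNICODE with Visual C++) produces the wide and the narrow variant from identical source code.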
Some very small introductory examples are described inside the Tutorial file IXMLParser.html.
Some additional small examples are also inside the file IXMLTest.cpp (for the "char*" version of the library) and inside the file IXMLTestUnicode.cpp (for the "wchar_t*" version of the library). If you have a question, please review these additional examples before sending an e-mail to the author.
To build the examples:
The file IXML_Autoexp.txt contains some "tweaks" that substantially improve the display of the content of the ITCXMLNode, ICXMLNode & IXMLNode objects inside the Visual Studio Debugger. Believe me, once you have seen the "smooth" display of the ITCXMLNode objects inside the debugger, you cannot live without it anymore!
The Incredible XML Parser library is designed to minimize the number of memory allocations. As long as you are using ITCXMLNode objects or ICXMLNode objects, the number of memory allocations remains extremely low. However, manipulating IXMLNode objects requires many allocations or re-allocations. Inside Visual C++, the "debug versions" of the memory allocation functions are very slow: do not forget to compile in "release mode" to get maximum speed.
When I had to debug software that was using the IXMLNode objects, it was usually a nightmare because the library was really slow in debug mode (because of the slow memory allocations in debug mode). To solve this problem, during debugging sessions of code that includes IXMLNode, I am now using a very fast DLL version of the IXMLParser Library (the DLL is compiled in release mode). Using the DLL version of the IXMLParser Library allows me to have lightning-fast XML parsing speed even in debug! Other than that, the DLL version is useless: in the release version of my tool, I always use the normal, ".cpp"-based, IXMLParser Library (I simply include the IXMLParser.cpp and IXMLParser.h files into the project).