A blog about in-depth analysis and understanding of programming, with a focus on C++.

Thursday, September 20, 2007

Parsing, C++ Style. Part 1

I'm going to let you in on a secret: I hate rote busywork programming. I can't stand boring, trivial code. Oh, I know it needs to be done, but if I've done it 20 times before, odds are I'm going to find a way to keep from having to do it a 21st time. That's one of the things that's nice about using C++.

One task I detest with a passion is writing a parser. We've all had to do it. Load a text file, tokenize it based on some pattern or whatever, look for the data you're interested in, check for errors in formatting, etc. It's easy and boring. And, more importantly, highly error prone.

The latter is a real sticking point. The only thing I hate more than writing a praser is having to debug one. Adding to that is the fact that debugging a parser is not like unit testing or something of that nature; it's a totally input driven process. If you forget a particular case, or didn't anticipate something, it will come back to bite you.

Fact is, I shouldn't have to parse code manually. I should have some API or library or something that I can apply to a text file and read it. Worst case, I should be able to define the format of the file, run the file through the parser with that format, and get a parse tree representation of the file, or an error if the file is not formatted properly. Best case, instead of a parse tree, my code simply gets called during the act of parsing the file in question.

Now, there are two utilities in particular that I personally rely upon for 95% or more of my parsing needs. The first is LibXML2. It's a C-based library (though there's an unpleasant C++ front end with an obnoxious license available, if you really need it), so it doesn't qualify as C++ Limit Science.

So, I lean on XML a lot. However, though XML is a nice language for the person writing the document, it's not particularly nice for the person reading it. Even with LibXML2, you have to do a lot of rote parsing work.

For formats that tend to be generated from code rather than by an external user, I prefer a combination of Lua and Luabind. Lua is a very easy to use scripting language, but the origin of Lua was in configuration languages. As such, Lua makes a handy file format. All you do is have the Lua script store its data as a large table, likely with nesting. Imagine a gigantic arbitrary struct with any data you can imagine. It can be passed around as plain text, but it can also be compiled into Lua bytecode for size purposes. And thanks to Luabind (a C++ library), actually accessing the contents of the object is much easier than with the Lua C API. It is very easy to generate files and it is likewise easy to read them. And if someone wrote the file in a syntactically erroneous way, Lua will simply report an error at compile time.

Taken together, they cover most of my parsing needs. XML is used for things the user will modify or create, and Lua is used for most everything else.

But then comes the big problem. What do you do when you have no control over the file format?

This is more likely than you might think at first glance. HTML and XML may be derived from the same source (SGML), but they are not compatible. And HTML certainly isn't Lua. Or maybe you're trying to write a scripting system yourself. Unless you're going to force your users to script in XML (yuck), you will need to write a parser of some kind. Or maybe you need to be able to parse some simple strings that the user enters, such as for a text adventure or some such.

And that's where Boost comes in. Or rather, Boost.Spirit.

Spirit is a very large, header-only library, but it's founded on a simple predicate: make text parsing easy. It is designed for low-to-medium scale parsing; high scale parsing, as for a highly grammatically complex language like C++, is at the very far end of Spirit's capabilities. It can do it, but you will certainly not like what it does to your compile times.

This is to contrast Spirit with it's primary alternatives, the preprocessing-based Lex/Yacc and Flex/Bison. These alternatives require the user to write a grammar in a special syntax and add specialized preprocessing commands into their build pipeline to generate C files for doing the parsing. They're functional, but rather arcane, requiring external executables and so forth.

Spirit, by contrast, leverages a preprocessor that you already have: your C++ template preprocessor. Spirit is a template metaprogramming library that allows the user to define a grammar using a syntax that is not unlike Enhanced Backus Naur Form. It also allows the user to provide semantic actions, functions that will be called when a certain parsed element is reached.

Higher-level Spirit functions, built on the base functionality described above, allow for the creation of an entire parse tree of the document, defined by special semantic actions. And of course, Spirit has the Phoenix functional programming library in it (which it uses internally), which makes it easier to build semantic actions for various bits of functionality.

The core foundation of Spirit isn't really a parser per se: it is a tokenizer and recognizer. It is, in essence, a schema language. It takes a string and a grammar and returns true if the string fits the grammar and false otherwise. Along the way, it will call various user-defined functions on the parsed text if it matches.

Here is a very simple example of a Spirit parser:
using boost::spirit;

bool RecognizeNumberSequence(const char *strInput)
{
return parse(strInput, real_p >> *(',' >> real_p), space_p).full;
}

This will return true if strInput is a comma-separated list of real numbers.

The second input to the function is a Spirit parser. The expression real_p >> *(',' >> real_p) produces a parser object that can be used as input to the parse function. The variable real_p is a predefined Spirit parser that parses real numbers in all of their various forms.

The object space_p is a parser recognizing spaces. Used as the third argument to boost::spirit::parse(), it represents the "skip parser". This parser tells the parse function how to tokenize the input. That is, what to ignore when parsing. This keeps the main grammar cleaner.

Let us break down the main parser expression, "real_p >> *(',' >> real_p)". The '>>' operator means that the operand on the left will occur before the one on the right. The '()' means what it seems like; the grouping operator. The '*' operator (C++'s dereference, not multiply) means that the operand will be repeated zero or more times.

So, this parser means that there will be a real number, followed by zero or more repetitions of a comma followed by a real number. Basically, a comma-separated list of real numbers. And the space_p skip parser means that spaces can occur between any of these elements, but not within a real number.

As mentioned, this is not a terribly useful function, as it does not actually do anything with what it parses. It simply detects whether the input is well-formed as stipulated by the two parsers. In the next part, we will deal with actually doing something while parsing.

No comments: