A blog about in-depth analysis and understanding of programming, with a focus on C++.

Thursday, September 20, 2007

Parsing, C++ Style. Part 1

I'm going to let you in on a secret: I hate rote busywork programming. I can't stand boring, trivial code. Oh, I know it needs to be done, but if I've done it 20 times before, odds are I'm going to find a way to keep from having to do it a 21st time. That's one of the things that's nice about using C++.

One task I detest with a passion is writing a parser. We've all had to do it. Load a text file, tokenize it based on some pattern or whatever, look for the data you're interested in, check for errors in formatting, etc. It's easy and boring. And, more importantly, highly error prone.

The latter is a real sticking point. The only thing I hate more than writing a parser is having to debug one. Adding to that is the fact that debugging a parser is not like unit testing or something of that nature; it's a totally input-driven process. If you forget a particular case, or didn't anticipate something, it will come back to bite you.

Fact is, I shouldn't have to write parsing code manually. I should have some API or library or something that I can apply to a text file to read it. Worst case, I should be able to define the format of the file, run the file through the parser with that format, and get back a parse tree representation of the file, or an error if the file is not formatted properly. Best case, instead of a parse tree, my code simply gets called during the act of parsing the file in question.

Now, there are two utilities in particular that I personally rely upon for 95% or more of my parsing needs. The first is LibXML2. It's a C-based library (though there's an unpleasant C++ front end with an obnoxious license available, if you really need it), so it doesn't qualify as C++ Limit Science.

So, I lean on XML a lot. However, though XML is a nice format for the person writing the document, it's not particularly nice for the code that has to read it. Even with LibXML2, you have to do a lot of rote parsing work.

For formats that tend to be generated from code rather than by an external user, I prefer a combination of Lua and Luabind. Lua is a very easy-to-use scripting language, but its origins are in configuration languages. As such, Lua makes a handy file format. All you do is have the Lua script store its data as a large table, likely with nesting. Imagine a gigantic arbitrary struct with any data you can imagine. It can be passed around as plain text, but it can also be compiled into Lua bytecode for size purposes. And thanks to Luabind (a C++ library), actually accessing the contents of the table is much easier than with the Lua C API. It is very easy to generate such files, and it is likewise easy to read them. And if someone wrote the file in a syntactically erroneous way, Lua will simply report an error when it compiles the chunk.
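As a rough sketch of what I mean (assuming Lua 5.1 and Luabind; the table layout and names here are made up purely for illustration):

extern "C" {
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
}
#include <luabind/luabind.hpp>
#include <cstdio>
#include <string>

// A hypothetical config "file", inlined as a string for brevity.
const char *strConfig =
    "config = { width = 640, height = 480,"
    "           player = { name = 'Bob', speed = 4.5 } }";

void ReadConfig()
{
    lua_State *pState = luaL_newstate();
    luabind::open(pState);

    // In real code this would be luaL_dofile on the actual file.
    luaL_dostring(pState, strConfig);

    luabind::object config = luabind::globals(pState)["config"];
    int iWidth = luabind::object_cast<int>(config["width"]);

    luabind::object player = config["player"];
    std::string strName = luabind::object_cast<std::string>(player["name"]);

    std::printf("width = %d, player = %s\n", iWidth, strName.c_str());

    lua_close(pState);
}

The file is just Lua source; reading it is just indexing a table.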

Taken together, they cover most of my parsing needs. XML is used for things the user will modify or create, and Lua is used for most everything else.

But then comes the big problem. What do you do when you have no control over the file format?

This is more likely than you might think at first glance. HTML and XML may be derived from the same source (SGML), but they are not compatible. And HTML certainly isn't Lua. Or maybe you're trying to write a scripting system yourself. Unless you're going to force your users to script in XML (yuck), you will need to write a parser of some kind. Or maybe you need to be able to parse some simple strings that the user enters, such as for a text adventure or some such.

And that's where Boost comes in. Or rather, Boost.Spirit.

Spirit is a very large, header-only library, but it is founded on a simple premise: make text parsing easy. It is designed for low-to-medium scale parsing; large-scale parsing, as for a grammatically complex language like C++, is at the very far end of Spirit's capabilities. It can do it, but you will certainly not like what it does to your compile times.

This is to contrast Spirit with its primary alternatives, the parser generators Lex/Yacc and Flex/Bison. These tools require the user to write a grammar in a special syntax and to add specialized code-generation steps to their build pipeline to produce C files for doing the parsing. They're functional, but rather arcane, requiring external executables and so forth.

Spirit, by contrast, leverages a code generator that you already have: your C++ compiler's template machinery. Spirit is a template metaprogramming library that allows the user to define a grammar using a syntax that is not unlike Extended Backus-Naur Form (EBNF). It also allows the user to provide semantic actions: functions that will be called when a certain parsed element is reached.

Higher-level Spirit facilities, built on the base functionality described above, allow for the creation of an entire parse tree of the document, driven by special semantic actions. And of course, Spirit ships with the Phoenix functional programming library (which it uses internally), and Phoenix makes it easier to build semantic actions for various bits of functionality.

The core foundation of Spirit isn't really a parser per se: it is a tokenizer and recognizer. It is, in essence, a schema language. It takes a string and a grammar and returns true if the string fits the grammar and false otherwise. Along the way, it will call various user-defined functions on the parsed text if it matches.

Here is a very simple example of a Spirit parser:
#include <boost/spirit.hpp>

using namespace boost::spirit;

bool RecognizeNumberSequence(const char *strInput)
{
    return parse(strInput, real_p >> *(',' >> real_p), space_p).full;
}

This will return true if strInput is a comma-separated list of real numbers.

The second argument to parse() is a Spirit parser. The expression real_p >> *(',' >> real_p) produces a parser object that can be used as input to the parse function. The object real_p is a predefined Spirit parser that parses real numbers in all of their various forms.

The object space_p is a parser that recognizes whitespace. Used as the third argument to boost::spirit::parse(), it serves as the "skip parser". The skip parser tells the parse function how to tokenize the input; that is, what to ignore between the elements of the grammar. This keeps the main grammar cleaner.

Let us break down the main parser expression, real_p >> *(',' >> real_p). The '>>' operator means that the operand on the left must occur before the one on the right. The '()' means what it seems like: grouping. The '*' operator (C++'s unary dereference, not multiplication) means that its operand may be repeated zero or more times.

So, this parser means that there will be a real number, followed by zero or more repetitions of a comma followed by a real number. Basically, a comma-separated list of real numbers. And the space_p skip parser means that spaces can occur between any of these elements, but not within a real number.
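Just to make that concrete, here is a quick check one might write (the main function is mine, not part of the original example):

#include <cstdio>

int main()
{
    // A comma-separated list of reals, with optional spaces: matches.
    std::printf("%d\n", RecognizeNumberSequence("1.5, 2, -3.25e2"));  // prints 1

    // Numbers separated only by spaces: the parse stops early, so .full is false.
    std::printf("%d\n", RecognizeNumberSequence("1.5 2 3"));          // prints 0
}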

As mentioned, this is not a terribly useful function, as it does not actually do anything with what it parses. It simply detects whether the input is well-formed as stipulated by the two parsers. In the next part, we will deal with actually doing something while parsing.

Cast in Gold

Casting. Casts are that dirty little thing that everyone knows not to use, but that everyone uses more often than they ought to.

C introduced us to one kind of cast. You have a thing of type A, and you can try to turn it into a thing of type B. Simple construct for a simple language.

C++ created more casts. But there's a funny thing about C++ casts; nobody likes them. Articles have been written about why they should be avoided and why they were a bad idea for the language. Few C++ programmers prefer them over C-style casts.

This article will explain what these C++-style casts are and why you should use them a lot more often than you probably do.

All of these casts share a common syntax: cast_name<TypeToConvertTo>(expressionToCast). No matter how short "cast_name" is, it will always be longer than the nothing that C-style casts use. C++-style casts draw attention to themselves by being large and ugly.

This was supposedly done intentionally, as a way to discourage their use. Unfortunately, it worked; that's why most everybody just uses C-style casts instead. The idea was that the ugliness would discourage the use of casting in general, but that was madness.

There are four kinds of C++ casts: static_cast, const_cast, reinterpret_cast, and dynamic_cast. Each of them is meant to be used under specific circumstances.

The easiest two to defend are reinterpret_cast and dynamic_cast. This is because neither of them has a clean C-style equivalent.

If you have a 32-bit float value and you want to get its bit pattern as an unsigned 32-bit integer, in C you have to do this:
uint32_t iFloatBits = *(uint32_t *)&fValue;

Not only is this somewhat obtuse as to what's happening (you're pretending a float-pointer is an integer pointer), it requires that fValue have a memory address, since you have to dereference it. This is highly unnecessary; what you're really asking is for the compiler to bring fValue's bits from a float register into an int register. This is very simple and just about every CPU can do it. But in C, you can't actually say it like that.

In C++, you have:
uint32_t iFloatBits = reinterpret_cast<uint32_t &>(fValue);

The only downside is the sheer length of the expression (and that the cast has to go through a reference, since reinterpret_cast will not convert a float value directly into an integer value). But it's a lot more obvious what's going on, so long as the reader understands what C++ casts do.

As for dynamic_cast, there is no C analog because C doesn't have inheritance. The purpose of this operation is to take a pointer to a base class and turn it into a pointer to a derived class. Now, of course, that particular base class instance may not actually be an instance of that particular derived class. So a regular C-style cast operation is right out, since those are compile-time operations.

See, dynamic_cast isn't really a cast at all; it's a function call. One that can fail. That is, a particular cast can return NULL if the cast operation cannot work on the specific instance. A dynamic cast is what allows some degree of runtime type identification. You use it when you need to break the OOP contract and treat a base class instance as a derived class.
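A quick illustration (the Shape and Circle classes here are made up just for the example):

#include <iostream>

class Shape
{
public:
    virtual ~Shape() {}  // dynamic_cast needs a polymorphic base class
};

class Circle : public Shape
{
public:
    float GetRadius() const { return 1.0f; }
};

void Describe(Shape *pShape)
{
    // Returns NULL if pShape does not actually point at a Circle.
    Circle *pCircle = dynamic_cast<Circle *>(pShape);
    if(pCircle)
        std::cout << "A circle of radius " << pCircle->GetRadius() << "\n";
    else
        std::cout << "Not a circle\n";
}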

So, it is obvious that both of these are superior to their C analogs (and one doesn't even have a C analog, so it is a priori superior to nothing). However, what of the other two?

Well, the main advantage of using const_cast is just error checking. A const cast can only add or remove const-ness from a type. As such, you cannot accidentally convert the type to something else by entering the wrong type. This is useful.

Also, removing const-ness (which is 99.9% of all const_cast uses, since adding const-ness is implicit) is rather rare. It usually happens when interfacing with code that you did not write. Indeed, doing this kind of cast suggests a design flaw or that something ought to be made mutable. Outside of that, the times when this kind of cast is truly needed are few indeed. So the larger size of the conversion expression is not a substantial problem.
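For example (legacy_print is a made-up stand-in for the kind of third-party C function I mean, one that takes a char * but never actually modifies it):

#include <string>

extern "C" int legacy_print(char *strMessage);  // hypothetical legacy C API

void PrintMessage(const std::string &strMessage)
{
    // const_cast can only change const-ness; if the target type were wrong,
    // this would be a compile error rather than a silent reinterpretation.
    legacy_print(const_cast<char *>(strMessage.c_str()));
}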

Which brings us to static_cast. This is the cast that I can recommend the least. It gives protection from accidentally doing something that reinterpret or const casts should be used for, but that's about it. And so long as we assume that the programmer is intelligent, and therefore is not going to arbitrarily be doing these kinds of conversions, then we should assume that each static cast operation is actually necessary. And if the programmer actually is not intelligent, the ugliness of static_cast will not be a deterrent, since they can always use C-style casts anyway.

So, either way, it is generally better to use C-style casts where you might use a static_cast.

Wednesday, September 12, 2007

One RAII to Rule Them All

RAII: Resource Acquisition Is Initialization.

Perhaps the most useless acronym ever.

Why? Because it only tells you half of what it is. Perhaps even less.

The part of RAII that is in the name is the concept of wrapping all resources in objects. Specifically, acquiring them in object constructors.

What the name misses is the basic construct of RAII.

Consider the following code:
#include <stdio.h>

void ProcessData(char *pData);  // defined elsewhere

char *LoadAndProcessData(const char *strFilename)
{
    FILE *pFile = fopen(strFilename, "rb");
    if(!pFile)
        return NULL;

    char *pFileData = new char[50];

    fread(pFileData, 1, 50, pFile);  // fread takes (buffer, size, count, stream)
    fclose(pFile);

    ProcessData(pFileData);

    return pFileData;
}

The points of failure in that code are legion. 'fopen' could fail, returning NULL. 'fread' could fail, reading less data than was asked for. Even 'new' could fail, either returning NULL or doing everybody's favorite: throwing an exception.

The first point of failure, the function handles. Presumably, the calling function will look at the return value, check for NULL, and if it sees NULL, it will do something useful.

The second point, however, gives rise to an unhandled error. If 'fread' fails, the function 'ProcessData' will get uninitialized (or partially initialized) memory. It may fail, throw an exception, or silently succeed but have bad data. None of these are good answers.

If 'new' fails, throwing an exception means that we fail. Granted, 'new' will only throw if we are actually out of memory, which isn't a terribly recoverable error. But if it does throw, we will immediately leave the function, without calling 'fclose'. That's bad.

File handles from a process tend to be cleaned up OK by the OS, mainly because so many programs are bad about doing it themselves. But it could have been a global resource, like a piece of memory to be shared between two processes (say, the clipboard). Throwing an exception would keep us from properly filling that memory out. Or closing the handle correctly, resulting in a leak. And so on.

It is precisely here where C-in-C++ programmers give up on exceptions entirely. After all, if 'new' didn't throw, we wouldn't have a problem.

However, there is a much more useful solution to this, one that doesn't turn its back on useful language features. The reasoning behind wanting to preserve exceptions has been discussed. What we will focus on is the technique that allows us to avoid abandoning exceptions. As well as giving us so much in return.

The reason that the acronym RAII is wrongheaded is that it isn't really the resource acquisition that's the issue; it's controlling when the resource is released. After all, 'fopen' does its job; it creates a file handle well enough. The problem has always been that 'fclose' isn't guaranteed to be called.

The C++ specification dictates that, when an exception is thrown, the stack is unwound as the exception propagates to its handler. All stack variables in the unwound scopes will have their destructors called, and they will be called in the proper order. Since this coincides with the exact moment we want to destroy the file handle, it makes sense to put the file handle on the stack.
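To see that concretely (Tracer and MayThrow are just illustrative names):

#include <cstdio>
#include <stdexcept>

struct Tracer
{
    ~Tracer() { std::printf("Tracer destroyed\n"); }
};

void MayThrow()
{
    throw std::runtime_error("something broke");
}

void Demo()
{
    Tracer t;     // lives on the stack
    MayThrow();   // the exception propagates out of Demo...
}                 // ...but ~Tracer() still runs as the stack unwinds

int main()
{
    try { Demo(); }
    catch(const std::exception &e) { std::printf("caught: %s\n", e.what()); }
}

This prints "Tracer destroyed" before "caught: something broke".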

RAII principles are as follows:
  1. Constructors are required to do actual work. Any and all things that could be deemed resource acquisition will only be done in object constructors.
  2. Any and all resource releasing will be done in object destructors.
  3. RAII-style objects must have appropriate copy constructors and copy assignment operators, if the object is copyable. If it is not, then it should forcibly be made non-copyable (by deriving from boost::noncopyable, for example).
  4. If there is a resource that cannot be created directly by a constructor, it must be created and immediately stored in an object with a destructor that will release the resource properly.
A few simple rules create a fundamentally different coding style.

What qualifies as "resource acquisition"? At a bare minimum, a resource is something you need to have cleaned up. Hence the backwards acronym; we define resources by wanting to get rid of them, not by how we acquire them.

Immediately, it becomes obvious that the value returned by operator 'new' qualifies. After all, memory is most definitely something we need to have cleaned up. So, as per rule #4, the result of 'new' under the RAII programming style must be stored in something that is guaranteed to call 'delete' when it is finished. We have already seen such an object: boost::shared_ptr. Shared pointers are reference counted, which doesn't strictly qualify in terms of the guarantee, since circular references will prevent the resource from being collected. But they're close enough. There are also std::auto_ptr and a few others.
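In code, rule #4 looks something like this (Widget is just a placeholder type; note that array allocations need boost::shared_array, or a custom deleter, so that delete[] is what actually gets called):

#include <boost/shared_ptr.hpp>
#include <boost/shared_array.hpp>

struct Widget {};

void Example()
{
    // The result of 'new' goes straight into an owning object.
    boost::shared_ptr<Widget> pWidget(new Widget);

    // Arrays need shared_array so that delete[] is called, not delete.
    boost::shared_array<char> pBuffer(new char[50]);
}   // Both are released here, even if something above threw.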

So dynamically allocated memory is a resource. Files are resources. The GUI system windows, drawing tools, etc, are resources. And so on.

However, RAII is actually quite viral. Unless your entire codebase operates under RAII principles (or the non-RAII portions exist in isolated chunks at the bottom of the call stack, i.e. leaf code), exceptions cannot be entirely safe. For our simple example of file opening, we can easily make it perfectly safe by creating a File object that acts as a RAII wrapper over the C FILE handle. The constructor opens the file, throwing an exception on failure, and the destructor closes it. For the simple case, it works just fine.
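A minimal sketch of such a wrapper (the exact interface and error reporting are up to you):

#include <cstdio>
#include <stdexcept>
#include <string>

class File
{
public:
    File(const char *strFilename, const char *strMode)
        : m_pFile(std::fopen(strFilename, strMode))
    {
        if(!m_pFile)
            throw std::runtime_error(std::string("Could not open ") + strFilename);
    }

    ~File() { std::fclose(m_pFile); }

    std::FILE *Get() const { return m_pFile; }

private:
    // Rule #3: copying would close the handle twice, so forbid it.
    File(const File &);
    File &operator=(const File &);

    std::FILE *m_pFile;
};

With this in place, the earlier LoadAndProcessData could simply declare File theFile(strFilename, "rb"); on the stack, and the handle would be closed no matter how the function exits.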

However, if you use that object in a non-RAII fashion (allocate it dynamically and store a pointer on the stack), you lose the guarantee. Which means that your entire codebase must hold everything allocated in a RAII fashion to be perfectly safe in throwing exceptions.

There is an implicit 5th rule as well. Because constructors are the only place where resource acquisition is to happen, they also must throw exceptions if the acquisition fails. That is the only way to signal the non-creation of an object. Generally, releasing a resource can't provoke a failure, so it is not much of an issue.

This is generally where the C-in-C++ programmers who stuck around decide to leave, as it requires a new coding style.

What they're giving up are the benefits. After all, if you're being forced to use shared_ptr everywhere, memory leaks will be a lot less likely. Circular references may happen, but that is what weak_ptr is for. Your code will generally be very difficult to break from a resource perspective.
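For instance, a doubly linked structure can break its ownership cycle like this (Node is just an illustrative name):

#include <boost/shared_ptr.hpp>
#include <boost/weak_ptr.hpp>

struct Node
{
    boost::shared_ptr<Node> pNext;  // owning link
    boost::weak_ptr<Node>   pPrev;  // non-owning back link; breaks the cycle
};

void Link(const boost::shared_ptr<Node> &a, const boost::shared_ptr<Node> &b)
{
    a->pNext = b;
    b->pPrev = a;   // a shared_ptr converts to weak_ptr; this does not keep 'a' alive
}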

Also, exceptions are a very useful piece of functionality. They can be used as a complex piece of flow control (but only sparingly, and only when it really matters). And, as mentioned earlier, they have a vital role in reporting certain kinds of errors.

In general, it is a more rigid programming style, requiring, particularly for shared_ptr relationships, a degree of forethought in the design phase. But it can save a lot of time dealing with errors, memory leaks, and other concerns.