Semantic Whitespace

No, this isn’t a blog posting about Python and its use of whitespace to denote statement blocks. This is actually a post about when whitespace matters in C++. It doesn’t happen particularly often, but whitespace can be important.

To understand why it may be important, you need to have an understanding of how lexers typically work. The general purpose of a lexer is to take a stream of source code as input, and spit out a stream of tokens as output. These tokens are a generalized representation of the source code, and they serve as the input to a parser. The parser takes these tokens and matches them against patterns (the grammar) to further compile your program. At least, this is the 50,000 foot overview of how a compiler works. Armed with this basic knowledge, we can continue the discussion.

Generally speaking, whitespace isn’t particularly important to C++: it gets eaten by the lexer, so the parser never sees it. However, in order for whitespace to be ignored by the lexer, the language needs to follow some simple rules. The rule of thumb is: the lexer is greedy (a behavior often called “maximal munch”). That means the lexer will prefer longer patterns to shorter ones when attempting to tokenize text, so the single token “aa” is generated instead of the two tokens “a” and “a”. This makes sense, if you think about it. Otherwise, how would the parser ever be able to make sense of a++ + ++b? Without the lexer being greedy, that expression would end up being seven tokens long (identifier,+,+,+,+,+,identifier) instead of five (identifier,++,+,++,identifier) and the parser’s job would be significantly more difficult.
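You can watch the greedy rule at work in real C++ by removing the whitespace entirely. The following is a small sketch, with a helper name (plus_plus_plus) made up for illustration: the expression a+++b has no spaces to guide the lexer, so it takes the longest token it can first, which yields a ++ + b rather than a + ++ b.

```cpp
#include <utility>

// Hypothetical helper: evaluates a+++b and returns
// {value of the expression, value of a afterwards}.
std::pair<int, int> plus_plus_plus(int a, int b) {
    // The lexer grabs "++" first, so this is (a++) + b, not a + (++b).
    int result = a+++b;
    return {result, a};
}
```

Calling plus_plus_plus(1, 10) gives a result of 11 and leaves a at 2. Had the expression been read as a + (++b), the result would have been 12 and a would still be 1.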

Generally speaking, this doesn’t matter to programmers. The lexer does its thing, the parser is happy with the output, and all is right with the world. However, there is at least one case where understanding this is important to the programmer: templates. In C++, template syntax uses < and > to bracket template argument lists. For simple templates, this isn’t an issue. However, when a template argument is itself a template, the greedy lexer starts to matter.

Consider these two declarations of a vector of vectors of strings:
1) std::vector< std::vector< std::string > > foo;
2) std::vector<std::vector<std::string>> foo;

Notice the closing angle brackets, and the fact that there are two of them? Given that the lexer is greedy, these two declarations lex differently.
1) identifier, ::, identifier, <, identifier, ::, identifier, <, identifier, ::, identifier, >, >, identifier, ;
2) identifier, ::, identifier, <, identifier, ::, identifier, <, identifier, ::, identifier, >>, identifier, ;

This is where whitespace becomes semantically important in C++. If you fail to have the space between the angle brackets, the lexer will produce the right-shift operator (>>), which the parser will not interpret properly. Thus, the declaration is not strictly legal without the space between the two angle brackets.
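To make the difference concrete, here is a minimal sketch of a greedy lexer. It is a toy written for this post, not how any real compiler is implemented: the tokenize function is hypothetical, and it only knows about identifiers, the two-character tokens :: and >>, and single-character punctuation.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical toy lexer: identifiers, the two-character tokens "::" and ">>",
// and single-character punctuation. Whitespace is skipped, never emitted.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace only separates tokens
        } else if (std::isalpha(c) || c == '_') {
            const std::size_t start = i;
            // Greedy: keep consuming for as long as the identifier can grow.
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            tokens.push_back(src.substr(start, i - start));
        } else if (src.compare(i, 2, "::") == 0 || src.compare(i, 2, ">>") == 0) {
            tokens.push_back(src.substr(i, 2));  // prefer the longer token
            i += 2;
        } else {
            tokens.push_back(std::string(1, src[i]));
            ++i;
        }
    }
    return tokens;
}
```

Feeding it the declaration with the space between the brackets produces 15 tokens, with two separate > tokens before foo. Feeding it the declaration without the space produces 14 tokens, because the two closing brackets are fused into a single >> token.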

Some compilers are smart enough to handle this erroneous syntax, but they are not required to, and it is certainly not something you should rely on. To be safe, you should always put a space between the closing template argument angle brackets. However, all is not lost. The newer C++ specification (referred to as C++0x during development and published as C++11) addresses this common pitfall by requiring parsers to interpret consecutive right angle brackets as closing the template argument lists when it is reasonable to do so.
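As a rough sketch of what the revised rule allows, assuming a compiler in C++11 mode or later: the names Buffer, SmallBuffer, and make_grid below are invented for illustration. The flip side of the new rule is that a genuine right shift inside a template argument now needs parentheses, since an unparenthesized > would be taken as the end of the argument list.

```cpp
#include <string>
#include <vector>

// Hypothetical template used only to show a shift inside a template argument.
template <int N>
struct Buffer {
    static constexpr int size = N;
};

// Accepted under the revised rule: ">>" closes both template argument lists.
std::vector<std::vector<std::string>> make_grid() {
    return std::vector<std::vector<std::string>>(2, std::vector<std::string>(3, "x"));
}

// A real right shift must be parenthesized so it is not read as a closing bracket.
using SmallBuffer = Buffer<(16 >> 2)>;
```

Here make_grid() returns two rows of three strings each, and SmallBuffer::size is 4.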
