Understanding Undefined Behavior

One of the harder concepts for people to understand in C++, in my opinion, is “behavior.” The language has some very specific wording for what the various behaviors are, and I’ve seen a lot of people get them mixed up or misunderstand what they mean when they come up.

From the C++11 specification:

1.3.10 [defns.impl.defined]
implementation-defined behavior
behavior, for a well-formed program construct and correct data, that depends on the implementation and that each implementation documents

1.3.12 [defns.locale.specific]
locale-specific behavior
behavior that depends on local conventions of nationality, culture, and language that each implementation documents

1.3.24 [defns.undefined]
undefined behavior
behavior for which this International Standard imposes no requirements [ Note: Undefined behavior may be expected when this International Standard omits any explicit definition of behavior or when a program uses an erroneous construct or erroneous data. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). Many erroneous program constructs do not engender undefined behavior; they are required to be diagnosed. —end note ]

1.3.25 [defns.unspecified]
unspecified behavior
behavior, for a well-formed program construct and correct data, that depends on the implementation [ Note: The implementation is not required to document which behavior occurs. The range of possible behaviors is usually delineated by this International Standard. —end note ]

1.3.26 [defns.well.formed]
well-formed program
C++ program constructed according to the syntax rules, diagnosable semantic rules, and the One Definition Rule (3.2).

That’s the official definition of the various types of behaviors you will see in standards-compliant C++. But it may not be immediately obvious as to what these mean “in the real world” to all programmers. So let’s take a look at each one of these behaviors, how you might run into them, and what they mean in a more practical sense.

implementation-defined behavior

Implementation-defined behavior is behavior that the language specification leaves up to each compiler, with the requirement that the compiler’s documentation spell out what it does. People rely on implementation-defined behavior all the time. However, there is a “smart” way to do this, and a “going to run into troubles” way to do it.

The smart way to handle implementation-defined behaviors is to use preprocessor macros to restrict them to the specific compiler (or compilers) documented to support them. For instance, one such behavior that comes immediately to mind on Windows is the __declspec keyword for exporting and importing functions in a library. __declspec is not part of the standard C++ specification, but it is a language extension that Microsoft documents explicitly. Strictly speaking that makes it a vendor extension rather than implementation-defined behavior in the standard’s sense, but the same advice applies: the best way to handle it is to check for Visual Studio’s compiler macro (note that other compilers may also support __declspec, so you could extend the check if need be).

Some of the more common compiler macros are: _MSC_VER, __GNUC__, __GNUG__, __CYGWIN__ and __clang__. More can be located here.
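As a minimal sketch of that guard, the snippet below wraps __declspec in a macro that is only used when _MSC_VER is defined. The names MYLIB_API, MYLIB_BUILDING and mylib_add are invented for illustration, and the GCC/Clang branch uses the visibility attribute as one plausible fallback, not as the only option.

```cpp
// Hypothetical names throughout: MYLIB_API, MYLIB_BUILDING and mylib_add
// are invented for this sketch.

// Defined here because this file plays the role of the library's own source;
// client code that only consumes the library would leave it undefined.
#define MYLIB_BUILDING

#if defined(_MSC_VER)
  // Microsoft's documented extension: export when building, import otherwise.
  #if defined(MYLIB_BUILDING)
    #define MYLIB_API __declspec(dllexport)
  #else
    #define MYLIB_API __declspec(dllimport)
  #endif
#elif defined(__GNUC__) || defined(__clang__)
  // GCC and Clang spell symbol visibility with an attribute instead.
  #define MYLIB_API __attribute__((visibility("default")))
#else
  // Unknown compiler: fall back to no annotation rather than guessing.
  #define MYLIB_API
#endif

// Declaration as it would appear in the public header.
MYLIB_API int mylib_add(int a, int b);

// Definition as it would appear in the library source.
int mylib_add(int a, int b) { return a + b; }
```

The rest of the code base only ever sees MYLIB_API, so supporting another compiler means touching one block of preprocessor logic.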

So one of the ways you can run into trouble is to fail to guard your implementation-defined behaviors with a compiler macro. This becomes a problem when you decide to use a different compiler for your source base. While it may be tempting to think “I’ve always used Brand X compiler before, so why would I care?”, keep in mind that things change. At one of my previous jobs, we had a very large source base that had been using Metrowerks CodeWarrior for over a decade. Then Metrowerks discontinued the desktop IDE, and Apple switched from PEF to Mach-O and from PPC to x86, so we were left with a lot of implementation-defined behaviors that needed to be found and dealt with.

Another way you can run into troubles with implementation-defined behaviors is when you rely on the compiler vendor’s interpretation of the language specification. For instance, how the search differs between #include <> and #include "" is left up to the compiler, as is the way variable argument lists are passed to a function with an ellipsis ([expr.call]). These are not things you should really be relying on, and they are easy to forget when you have a single-compiler source base.
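Include search paths cannot be checked from code, but many other implementation-defined properties can. One option, sketched below under the assumption that the code really does need 8-bit bytes and a 32-bit int, is to write the assumption down so that a new compiler rejects the build instead of silently changing behavior. The names kCharIsSigned and byte_value are illustrative only.

```cpp
#include <climits>  // CHAR_BIT
#include <limits>   // std::numeric_limits

// State the assumptions explicitly; a compiler that disagrees fails the build.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
static_assert(sizeof(int) >= 4, "this code assumes int is at least 32 bits");

// Whether plain char is signed is implementation-defined, so ask instead of guessing.
constexpr bool kCharIsSigned = std::numeric_limits<char>::is_signed;

// Converts a char to a value in 0..255 whether or not plain char is signed.
constexpr int byte_value(char c) {
    return static_cast<unsigned char>(c);
}
```

Code that calls byte_value gets the same result on either kind of compiler, which is the point: the implementation-defined detail is confined to one place.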

unspecified behavior

Unspecified behavior is closely related to implementation-defined behavior in that the specification leaves the choice up to the compiler. The difference is that the compiler authors are not required to document which behavior occurs.

The specification does call out some constructs as unspecified, and it usually delineates the range of possible behaviors (as the note in the definition above says), but it cannot tell you which one your compiler picks. This is probably one of the most worrisome of the behaviors to me. Since your compiler does not have to document its choice, you often have to stumble upon it yourself. And usually, you don’t figure out which unspecified behaviors your application depends on until you switch compilers or compiler versions. Your best bet is to use a static analyzer tool such as Lint to help you track down areas of concern.
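A commonly cited example is the order in which function arguments are evaluated, which compilers are free to choose and need not document. The sketch below uses made-up helper names (trace, first, second) to record the order; it shows one call that depends on the compiler’s choice and one that does not.

```cpp
#include <string>

// Records the order in which first() and second() run.
std::string trace;

int first()  { trace += 'a'; return 1; }
int second() { trace += 'b'; return 2; }
int add(int x, int y) { return x + y; }

// The two arguments may be evaluated in either order,
// so trace ends up as "ab" or "ba" depending on the compiler.
int sum_unspecified_order() {
    trace.clear();
    return add(first(), second());
}

// Separate statements are sequenced one after the other,
// so trace is always "ab" here.
int sum_fixed_order() {
    trace.clear();
    int x = first();
    int y = second();
    return add(x, y);
}
```

Both functions return the same sum; only the side effects differ, which is why this kind of dependency tends to go unnoticed until the compiler changes.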

locale-specific behavior

This is not one you run into that often, but there are some locale-specific behaviors. For instance, the execution character set and execution wide character set are actually locale-specific. However, there’s nothing you can do to avoid this aside from being aware of it. Don’t assume that wchar_t is UTF-16 (it’s UTF-32 on OS X and Linux by default!). This seems to be the biggest issue anyone runs into with locale-specific behavior.

Note that to a certain degree, locale-specific behavior will also be implementation-defined behavior. The standard puts some definition out for what the behavior should be, but ultimately the compiler vendor makes the choice for you.
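For the wchar_t case, a small sketch of “being aware of it” is to ask the implementation how wide wchar_t is rather than hard-coding an encoding. The names below are hypothetical, and the sketch assumes that a narrow wchar_t holds UTF-16 and a wide one holds UTF-32, which is common but not guaranteed.

```cpp
#include <cstddef>  // std::size_t
#include <cwchar>   // WCHAR_MAX

// True when a single wchar_t can hold any Unicode code point
// (the largest code point is U+10FFFF).
constexpr bool kWcharFitsCodePoint = WCHAR_MAX >= 0x10FFFF;

// Number of wchar_t units needed for one code point, assuming UTF-32 when
// wchar_t is wide enough and UTF-16 (with surrogate pairs) when it is not.
constexpr std::size_t wchar_units_for(char32_t code_point) {
    return (kWcharFitsCodePoint || code_point <= 0xFFFF) ? 1 : 2;
}
```

On a platform with a 16-bit wchar_t the function reports two units for code points outside the Basic Multilingual Plane, and one unit everywhere else.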

undefined behavior

To a large degree, undefined behavior is the way in which C and C++ compilers are able to produce optimized applications. It is behavior which should not appear in proper source code, and so the compiler can make assumptions about how to treat it. It resembles unspecified behavior, except that the specification places no bounds at all on what may happen.

When your code ventures into an area of undefined behavior, it means the entire validity of your application is considered undefined! So the compiled program could reformat your computer, or crash, or correct your mistake, or do anything else. The reason the compiler can use this to implement optimizations isn’t what happens when you perform undefined operations; it’s what the compiler can assume you will never do in a valid application.
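Signed integer overflow is one widely cited illustration of this kind of assumption. In the sketch below (function names are made up), the first function has undefined behavior when x is INT_MAX, so an optimizer is allowed to assume that never happens and may reduce the body to “return true”. The second function asks the same question without ever overflowing.

```cpp
#include <climits>  // INT_MAX

// Undefined when x == INT_MAX, because x + 1 overflows a signed int.
// A compiler may therefore treat this as always returning true.
bool plus_one_is_greater(int x) {
    return x + 1 > x;
}

// Well-defined for every int: no arithmetic that could overflow.
bool plus_one_is_greater_checked(int x) {
    return x < INT_MAX;
}
```

Calling the first function with INT_MAX is exactly the kind of thing the compiler assumes a valid program never does.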

For instance, it is undefined behavior to dereference a NULL pointer. This allows the compiler to do some quite surprising things. For example:

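#include <cstdio>

void foo( int *ptr ) {
  int i = *ptr;  // dereference happens before the null check

  if (ptr) {     // the compiler may assume ptr is non-null by this point
    std::printf( "%d\n", i );
  }
}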

With optimizations turned on, the compiler can skip the if check entirely! Because the code has already dereferenced ptr, the compiler can assume that ptr is not null and that the if check is redundant (after all, your code is undefined once you dereference null).
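A well-defined variant simply performs the check before the dereference, so there is nothing for the compiler to assume away. This is a sketch; the name print_if_present and the bool return value are additions for illustration.

```cpp
#include <cstdio>

// Prints the pointed-to value and returns true, or returns false for a null pointer.
bool print_if_present(const int* ptr) {
    if (!ptr) {
        return false;        // nothing is dereferenced on this path
    }
    int i = *ptr;            // safe: ptr is known to be non-null here
    std::printf("%d\n", i);
    return true;
}
```

Because the dereference only happens on the path where the pointer has been tested, the check can no longer be treated as redundant.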

At the end of the day, what we all strive for is a well-defined program. That’s a program without undefined or unspecified behaviors. However, it can be difficult to get there by yourself. Fixing any compiler warnings in your code and using static analysis tools like Lint or PVS-Studio will go a long way towards ensuring your applications behave as designed.
