Threading on Windows

Threads are becoming one of the more ubiquitous concepts in programming. Chances are quite good that you’ve used a few of them before. But have you ever stopped to think about how a thread works under the hood? There are some obvious things, like allocating a stack for the thread, updating some process structures to track it, and so forth. But what about the fuzzy stuff that happens between the kernel bookkeeping of making a thread object and your thread entrypoint code?

The part that I want to focus on right now is the thread startup code, specifically the mechanics of ensuring the C runtime library works. What I’m going to describe is the Windows way of doing things, but I’d be surprised if POSIX threads differed drastically.

On Windows, there are two and a half ways to start a thread. CreateThread is an API exported from Kernel32, and it is the single entrypoint from userland for creating a thread. _beginthreadex is a CRT API that creates a thread which is safe to use with the C runtime library. These two are the primary ways to create a thread on Windows. The “half” way is the older _beginthread API in the CRT, but newer code is not supposed to use it, so I’m not going to bother delving into it.

The Microsoft documentation is quite hazy on why to use _beginthreadex instead of CreateThread for CRT code. In fact, the only clear documentation comes from a Knowledge Base article from 2005. The conundrum detailed in the article alludes to some of what happens before your entrypoint is called.

The long and short of it is that the C Runtime Library predates the concept of threads, so some CRT functions simply wouldn’t work in a multi-threaded environment. For instance, think about the errno variable. In a single-threaded application you never run into a problem. But in a multi-threaded environment, a failing call on Thread A could set errno just before Thread B reads it, and Thread B would then see an error code that has nothing to do with its own call. The only sensible thing is to make errno per-thread!
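To make “per-thread” concrete, here is a small portable sketch. It is not how the Microsoft CRT does it (there, errno is a macro that expands to a call returning a pointer to the calling thread’s copy); it uses C11 `_Thread_local` and POSIX threads instead, because the Win32 APIs in this post only run on Windows. `my_errno` and `errno_is_per_thread` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

/* Hypothetical stand-in for errno. C11 _Thread_local gives every thread
   its own copy of the variable, which is the property errno needs. */
static _Thread_local int my_errno = 0;

/* Each worker "fails" with its own code, yields so the other thread can
   run and set its copy, then reads its own copy back. */
static void *worker(void *arg)
{
    my_errno = (int)(intptr_t)arg;
    sched_yield();
    return (void *)(intptr_t)my_errno;
}

/* Returns 1 if both workers saw only their own value and the calling
   thread's copy was never touched. */
int errno_is_per_thread(void)
{
    pthread_t a, b;
    void *ra = NULL, *rb = NULL;

    if (pthread_create(&a, NULL, worker, (void *)(intptr_t)1) != 0)
        return 0;
    if (pthread_create(&b, NULL, worker, (void *)(intptr_t)2) != 0) {
        pthread_join(a, NULL);
        return 0;
    }
    pthread_join(a, &ra);
    pthread_join(b, &rb);
    return (intptr_t)ra == 1 && (intptr_t)rb == 2 && my_errno == 0;
}
```

Whatever order the two workers run in, each reads back only the value it wrote, and the main thread’s copy stays 0.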

So when you create a thread with _beginthreadex, a structure is allocated to hold all of the per-thread data required by the CRT. This structure is then associated with the new thread via a thread-local storage slot, and when the thread ends, the structure is freed. In pseudocode, it looks like this:

ptr _beginthreadex(params, callback, arg, more_params) {
  thread_data = allocate( sizeof( thread_data ) );
  init( thread_data );
  thread_data->callback = callback;
  thread_data->arg = arg;

  // pass the heap pointer itself, not the address of the local variable
  handle = CreateThread( params, our_start_function, thread_data, more_params );
  return handle;
}
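The same shape can be sketched portably. The sketch below assumes POSIX threads in place of CreateThread, and `thread_data`, `begin_thread`, and `start_trampoline` are hypothetical names, not CRT internals. It shows the two points that matter: the wrapper starts the thread on its own trampoline rather than on the user’s callback, and it hands the heap pointer itself to the new thread, so the block stays valid after the wrapper returns.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread block: the user's callback and argument, plus
   room for runtime state (one error slot here, for illustration). */
typedef struct thread_data {
    void *(*callback)(void *);
    void *arg;
    int err;
} thread_data;

/* Runs on the new thread. It receives the heap pointer itself, so the
   block is still valid even though begin_thread has already returned. */
static void *start_trampoline(void *p)
{
    thread_data *td = p;
    void *result = td->callback(td->arg);
    free(td);
    return result;
}

/* Hypothetical analogue of _beginthreadex: allocate and initialise the
   block, then start the thread on our own trampoline, not the user's. */
int begin_thread(pthread_t *out, void *(*callback)(void *), void *arg)
{
    thread_data *td = calloc(1, sizeof *td);   /* zeroes err */
    if (td == NULL)
        return -1;
    td->callback = callback;
    td->arg = arg;
    if (pthread_create(out, NULL, start_trampoline, td) != 0) {
        free(td);               /* the thread never ran, so reclaim it */
        return -1;
    }
    return 0;
}

/* Sample user callback: increments the int it is given. */
static void *add_one(void *p)
{
    *(int *)p += 1;
    return p;
}

/* Starts add_one on a new thread and returns the updated value. */
int demo(int start)
{
    pthread_t t;
    int value = start;
    if (begin_thread(&t, add_one, &value) != 0)
        return -1;
    pthread_join(t, NULL);
    return value;
}
```

Calling `demo(41)` runs `add_one` through the wrapper and, after joining, returns 42.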

The first thing to notice is that the call allocates a structure that holds all of the per-thread CRT information, and it also holds the callback and arguments the user passed in. It does eventually call CreateThread because that’s the only way to make threads on Windows, but it passes in its own callback function. So what does this look like?

ulong our_start_function( args ) {
  thread_data = args;
  set_thread_local_storage( thread_data );
  __try {  // SEH try block
    _endthreadex( thread_data->callback( thread_data->arg ) );
  } __except (filter) {
    _exit( exception_code );
  }
  // never even get here.
}
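The important ordering is that the block is published in the thread-local slot before the user’s callback runs, so any runtime function called from deep inside that callback can find its per-thread state without being handed a pointer. Here is a minimal sketch of that, again assuming POSIX threads, with `pthread_setspecific`/`pthread_getspecific` standing in for the Win32 TLS slot. `runtime_call_count` is a hypothetical stand-in for a CRT function such as rand, and the SEH block is not modelled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct thread_data {
    void *(*callback)(void *);
    void *arg;
    int calls;                  /* example per-thread runtime state */
} thread_data;

static pthread_key_t  td_key;   /* the "TLS slot" */
static pthread_once_t td_once = PTHREAD_ONCE_INIT;
static void make_key(void) { pthread_key_create(&td_key, NULL); }

/* A runtime function that needs per-thread state but takes no pointer
   to it: it finds the block through the slot instead. */
int runtime_call_count(void)
{
    thread_data *td = pthread_getspecific(td_key);
    return td ? ++td->calls : -1;   /* -1: no block registered */
}

/* Analogue of our_start_function: publish the block in the slot first,
   and only then run the user's callback. */
static void *start_trampoline(void *p)
{
    thread_data *td = p;
    pthread_setspecific(td_key, td);
    void *result = td->callback(td->arg);
    pthread_setspecific(td_key, NULL);
    free(td);
    return result;
}

/* User code: calls the runtime three times without ever seeing td. */
static void *user_callback(void *arg)
{
    (void)arg;
    runtime_call_count();
    runtime_call_count();
    return (void *)(intptr_t)runtime_call_count();
}

/* Starts user_callback through the trampoline; returns its result. */
int demo(void)
{
    pthread_t t;
    void *ret = NULL;
    pthread_once(&td_once, make_key);
    thread_data *td = calloc(1, sizeof *td);
    if (td == NULL)
        return -1;
    td->callback = user_callback;
    if (pthread_create(&t, NULL, start_trampoline, td) != 0) {
        free(td);
        return -1;
    }
    pthread_join(t, &ret);
    return (int)(intptr_t)ret;
}
```

`demo()` returns 3, while a call to `runtime_call_count()` on the main thread, which never had a block registered, returns -1.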

This is where the magic happens: the helper callback puts the per-thread data structure in the thread-local storage slot for the new thread, and only then calls the user’s callback function. The user callback’s return value is passed to _endthreadex. So what does that code look like?

void _endthreadex( ulong retcode ) {
  thread_data = get_thread_local_storage();
  if (thread_data) {
    destroy( thread_data );
    free( thread_data );
  }
  ExitThread( retcode );
}
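The cleanup half can be sketched the same way. This is a hypothetical `end_thread` that mirrors the pseudocode: fetch the block from the slot, free it if present, then leave through the OS-level exit call (`pthread_exit` here, standing in for ExitThread). POSIX threads, the names, and the `blocks_freed` counter (there only so the result can be checked) are assumptions of the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct thread_data { int scratch; } thread_data;

static pthread_key_t td_key;
static int blocks_freed = 0;    /* read only after pthread_join */

/* Hypothetical analogue of _endthreadex: free the block if one is
   registered, then exit the thread with the given code. */
static void end_thread(unsigned retcode)
{
    thread_data *td = pthread_getspecific(td_key);
    if (td != NULL) {
        pthread_setspecific(td_key, NULL);
        free(td);
        blocks_freed++;
    }
    pthread_exit((void *)(uintptr_t)retcode);   /* ExitThread analogue */
}

/* Registers a block, then ends through end_thread with its argument. */
static void *worker(void *arg)
{
    thread_data *td = calloc(1, sizeof *td);
    if (td != NULL)
        pthread_setspecific(td_key, td);
    end_thread((unsigned)(uintptr_t)arg);
    return NULL;                /* not reached: end_thread never returns */
}

/* Runs one worker and returns its exit code. */
unsigned demo(void)
{
    pthread_t t;
    void *ret = NULL;
    if (pthread_key_create(&td_key, NULL) != 0)
        return 0;
    if (pthread_create(&t, NULL, worker, (void *)(uintptr_t)7) != 0)
        return 0;
    pthread_join(t, &ret);
    pthread_key_delete(td_key);
    return (unsigned)(uintptr_t)ret;
}
```

After `demo()` returns the worker’s exit code (7), `blocks_freed` is 1, showing that the block was released before the thread exited.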

_endthreadex looks up the thread data in thread-local storage and, if it finds it, frees the structure and everything associated with it. Either way, it finishes by calling the Win32 API ExitThread.

There is also a very good article from 1999 in the Microsoft Systems Journal which describes this process as well.

The long and short of it is: if you don’t call _beginthreadex and _endthreadex when using CRT functions, you can leak small amounts of data. If you call CreateThread and then call rand (for instance), the thread-local structure won’t be found, so one will be created and added to thread-local storage for you, and things appear to still work. The key problems are these: if you call ExitThread yourself (or just run off the end of your callback function), this structure will not be freed. What’s more, if you use the CRT signal function, you will crash because there’s no SEH block to handle the signal.
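The on-demand behaviour and the resulting leak can be modelled with a get-or-create helper. In this sketch (POSIX threads again; `get_ptd`, `my_rand`, and `release_ptd` are hypothetical, and `my_rand` is a toy linear congruential generator, not the CRT’s rand), a thread that uses the runtime and simply returns leaves its block behind, while a thread whose wrapper runs the cleanup does not. `live_blocks` counts blocks allocated minus blocks freed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct thread_data { unsigned seed; } thread_data;

static pthread_key_t  td_key;
static pthread_once_t td_once = PTHREAD_ONCE_INIT;
static int live_blocks = 0;     /* allocated minus freed */

static void make_key(void) { pthread_key_create(&td_key, NULL); }

/* Get-or-create: if this thread has no block yet (it was not started
   by the wrapper), allocate one on demand and register it. */
static thread_data *get_ptd(void)
{
    pthread_once(&td_once, make_key);
    thread_data *td = pthread_getspecific(td_key);
    if (td == NULL) {
        td = calloc(1, sizeof *td);
        if (td == NULL)
            abort();
        td->seed = 1;
        pthread_setspecific(td_key, td);
        live_blocks++;
    }
    return td;
}

/* Toy per-thread PRNG standing in for rand(). */
int my_rand(void)
{
    thread_data *td = get_ptd();
    td->seed = td->seed * 1103515245u + 12345u;
    return (int)((td->seed >> 16) & 0x7fff);
}

/* Cleanup that only a CRT-aware thread exit would run. */
static void release_ptd(void)
{
    thread_data *td = pthread_getspecific(td_key);
    if (td != NULL) {
        pthread_setspecific(td_key, NULL);
        free(td);
        live_blocks--;
    }
}

/* "CreateThread" path: uses the runtime, then just returns. */
static void *raw_worker(void *unused) { (void)unused; my_rand(); return NULL; }

/* "_beginthreadex" path: same work, but the wrapper cleans up. */
static void *wrapped_worker(void *unused)
{
    (void)unused;
    my_rand();
    release_ptd();
    return NULL;
}

/* Runs one thread of the given kind; returns live_blocks afterwards. */
int run_and_count(void *(*fn)(void *))
{
    pthread_t t;
    if (pthread_create(&t, NULL, fn, NULL) != 0)
        return -1;
    pthread_join(t, NULL);
    return live_blocks;
}
```

Running a raw thread leaves `live_blocks` at 1, a wrapped thread afterwards keeps it at 1, and a second raw thread raises it to 2. The blocks from the raw threads are deliberately never freed: that is the per-thread leak being demonstrated.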

But I’ve found some of this information to be dated, too. For instance, one thing I noticed is that the CRT DLL includes a DllMain which automatically allocates and frees the thread data in response to the thread attach and detach notifications. So when you use the DLL version of the CRT, calling ExitThread still does the right thing if CRT data has been used.
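The DllMain approach has a rough POSIX counterpart: a destructor registered with `pthread_key_create`, which the system runs for each exiting thread whose slot value is non-NULL, even when the thread leaves through `pthread_exit` instead of returning. The sketch below assumes that analogy is close enough to illustrate the idea; it is not the CRT’s DllMain, and `free_ptd` and `blocks_freed` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct thread_data { int scratch; } thread_data;

static pthread_key_t td_key;
static int blocks_freed = 0;    /* read only after pthread_join */

/* Destructor: called for each exiting thread whose slot is non-NULL,
   loosely like a DLL_THREAD_DETACH handler. */
static void free_ptd(void *p)
{
    free(p);
    blocks_freed++;
}

/* Registers a block, then exits via pthread_exit (the ExitThread
   analogue) with no explicit cleanup of its own. */
static void *worker(void *unused)
{
    (void)unused;
    thread_data *td = calloc(1, sizeof *td);
    if (td != NULL)
        pthread_setspecific(td_key, td);
    pthread_exit(NULL);
}

/* Runs one worker; returns how many blocks the destructor freed. */
int freed_after_one_thread(void)
{
    pthread_t t;
    if (pthread_key_create(&td_key, free_ptd) != 0)
        return -1;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    pthread_join(t, NULL);
    pthread_key_delete(td_key);
    return blocks_freed;
}
```

Even though the worker never frees its block, `freed_after_one_thread()` returns 1, because the destructor runs during thread termination, before `pthread_join` returns.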

At the end of the day, though, this still turns out to be quite an important distinction that a lot of Win32 programmers don’t know about (including me, until recently!). For instance, if you use the static CRT library, the CRT functions will still work, since they all allocate the thread-local data on demand, but because there’s no DllMain, exiting the thread will leak that data. (Of course, statically linking the CRT brings plenty of other problems beyond simple leaks!) And no matter what, the signal function will crash unless you call _beginthreadex, because the structured exception handling filter will not be in place.

Will your code be horribly unstable and cause massive failures if you use CreateThread and call a CRT function? Not likely. All CRT functions that use thread-specific data know how to create that data if need be, and many CRT functions are already thread-safe and don’t require the thread-specific data anyway. However, it still causes per-thread leaks in some situations, so if you create a lot of threads, you will leak a lot. And if you call signal, you will get crashes. So the recommendation still stands: use _beginthreadex if your thread will be calling CRT functions, just to be on the safe side. And do not call ExitThread, because that may not allow the CRT to free its structures; call _endthreadex instead.

This entry was posted in Win32.
