Tail Calls

A tail call is a specific pattern of source code where the last instruction executed in a method is another function call. For instance:

int last( int i );
int second( int i );
 
int first( int i ) {
  int j = second( i );
  return last( j );
}

In this code, the call to last is considered a tail call because there are no further instructions to execute once the call is made.

The reason tail calls are interesting has to do with the way functions are typically called. When you make a function call, the compiler generates code to set up the environment in which the function executes. This usually involves pushing arguments onto the call stack, passing information about where to return to, and setting up a stack frame for local variables. When the function finishes executing, the stack frame is cleaned up and control is returned to the caller. But when the function ends with a tail call, the compiler can skip some of the work involved. For instance, the calling function’s stack frame is no longer needed, so its space can be reused by the called function. And when making the tail call, the place for the subroutine to return to is the caller of the function, not the function itself (in our example, we don’t need to return to first because we can return to whatever called first).

Let’s look at this in more concrete terms. Unoptimized assembly for the first function might look like this:

// Set up the stack frame for first
push  ebp
mov  ebp, esp
sub  esp, 4  // Allocates four bytes of local variable space for j
// Save the callee-saved registers on the stack (we will restore them when done)
push  ebx
push  esi
push  edi
// Get the value of the parameter i into eax
mov  eax, [ebp+4]
// Call the function second
push  eax
call  second
// The return value lives in eax, so assign that to our local variable j
mov  [ebp-4], eax
// Get the value of our local variable j into eax
mov  eax, [ebp-4]
// Call the function last
push  eax
call last
// Clean up the function and return
pop  edi
pop  esi
pop  edx
mov  esp, ebp
pop  ebp
ret 0

If we were to optimize just the tail call part of the code, our assembly would look like this:

// Set up the stack frame for first
push  ebp
mov  ebp, esp
sub  esp, 4  // Allocates four bytes of local variable space for j
// Save the callee-saved registers on the stack (we will restore them when done)
push  ebx
push  esi
push  edi
// Get the value of the parameter i into eax
mov  eax, [ebp+4]
// Call the function second
push  eax
call  second
// The return value lives in eax, so assign that to our local variable j
mov  [ebp-4], eax
// Get the value of our local variable j into eax
mov  eax, [ebp-4]
// Call the function last with tail call optimization
mov [ebp+4], eax  // Put EAX in the parameter space of the stack frame
jmp last // Unconditional jump to last

There are a few things to notice about the optimized code. For starters, when calling last, no new data is pushed onto the stack — we’re simply reusing the stack space for first since it is no longer needed. (This isn’t always possible to do and is an optimization that’s unlikely to appear in anything other than hand-tuned code.) Also, we are using an unconditional jump to call last instead of using the call instruction. The call instruction does two things; it saves the current location to the stack so when the ret instruction is used, it goes back to the caller, and then it does an unconditional jump to the address given. With a tail call, we don’t need to save the location to jump back to — the function will use the ret instruction like normal and return to the caller of first instead of to first itself. This is an optimization because it removes the push to the stack (saving space and time) and we don’t need to jump back to first just to jump back to who called first, also saving time.

At this point, you might be thinking “whoa, tail calls are amazing, I should be using them everywhere!” Alas, but they don’t provide a considerable benefit to most code because there are so few cases where they crop up organically in reality. However, tail calls are amazing enough that you may want to structure your code to take advantage of them in certain circumstances. When you use tail calls as part of recursion, they provide considerable benefit. In fact, they can change the space requirements from linear (O(n)) to constant (O(1))! When coupled with recursion, tail calls can be truly powerful.

For instance:

int recur( int i ) {
  return recur( ++i );
}
 
int recur2( int i ) {
  return recur2( i ) + 1;
}

Both of these functions do exactly the same thing, but in slightly different ways — they take the initial value, increment it and return the new value. However, only one of these crashes with optimizations turned on! recur2 will crash because a new stack frame must be allocated each time the function recurses due to modifying the returned value. So the call will eventually exhaust the stack space and crash. However, recur will not crash because it is tail call recursive, meaning that no new stack frames will be allocated to call it. So it is effectively just an infinite loop like while(1) or for(;;)!

Before we go into application of this optimization, let’s recap what the requirements are for it and see some more examples. In order for something to be eligible for tail call optimization the last instruction executed before returning to the caller must be a method call. That can sometimes take surprising shapes:

// Eligible for TCO because there is no code following
// the calls to bar or baz (it's an implied return)
void foo( int i ) {
  if (i % 2)
    bar();
  else
    baz();
}
 
// Not eligible for TCO because code must execute
// after the call to baz
int foo( int i ) {
  if (i % 2)
    bar();
  int j = baz();
  return j + i;
}
 
// Eligible for TCO because there is no code
// following the call to printf, though it is
// possible that TCO will be unavailable due to
// the varargs nature of printf
void foo( int i ) {
  ::printf( "%d: %s\n", i, bar( i ) );
}

Now that you have a better grasp on what tail calls look like, let’s talk about one particular problem that gets considerable benefit from it. When writing a compiler, there are two major ways to write the parser for the language. One is using a table-driven approach, usually by using a parser generator application such as bison or yacc. The other is by writing a recursive descent parser by hand. Both approaches have their pros and cons, but one specific con of a recursive descent parser is that they can be very expensive in terms of space requirements. It’s not uncommon to see call stack depths several dozen calls deep, depending on the language being parsed. This is a prime case where tail call recursion can be a considerable optimization. To this end, if you are writing a recursive descent parser you would do well to have the exit for your functions either return a function call, or return a call to create an AST node.

For instance, your parser should look something like this:

typedef struct Node {
  void *data;
} Node;
 
Node *ParseBar() {
  return new Node();
}
 
Node *ParseFoo() {
  if (::rand() % 2)
    return ParseBar();
  else
    return new Node();
}
 
Node *ParseTop() {
  switch (::rand()) {
  case 0:
    return ParseBar();
  case 1:
    return ParseFoo();
  }
  return nullptr;
}

All of these functions are theoretically eligible for tail call optimization because there is no code that needs to execute past the end of the function calls. But does theory and practice can be two different things, so how do they stack up with the major compilers?

I tried this sample code in both Visual Studio 2010 and gcc (4.6.2) with full optimizations turned on, and they both did reasonably well.

// Output from gcc, after being cleaned up and annotated
_Z8ParseBarv:
    // Set up stack frame
    subl    $28, %esp
    movl    $4, (%esp)
    // Call new Node
    call    _Znwj
    movl    $0, (%eax)
    // Tear down stack frame, returning eax
    addl    $28, %esp
    ret
 
_Z8ParseFoov:
    // Set up stack frame
    subl    $28, %esp
    // Call rand
    call    rand
    // Test the low bit of eax and if it's zero
    // (meaning it's even), jump to .L5
    testb   $1, %al
    jne .L5
    // It's an odd number, so create a new Node
    movl    $4, (%esp)
    call    _Znwj
    movl    $0, (%eax)
    // Tear down stack frame and return eax
    addl    $28, %esp
    ret
.L5:
    // Tear down stack frame
    addl    $28, %esp
    // TCO: jump to ParseBar
    jmp _Z8ParseBarv
 
_Z8ParseTopv:
    // Set up stack frame
    subl    $12, %esp
    // Call rand
    call    rand
    // If it's non-zero, jump to .L11
    testl   %eax, %eax
    jne .L11
    // Tear down stack frame
    addl    $12, %esp
    // TCO: jump to ParseBar
    jmp _Z8ParseBarv
.L11:
    // Test to see if it's 1, and if it is,
    // jump to .L12
    cmpl    $1, %eax
    je  .L12
    // Tear down stack frame and return nullptr
    xorl    %eax, %eax
    addl    $12, %esp
    ret
.L12:
    // Tear down stack frame
    addl    $12, %esp
    // TCO: jump to ParseFoo
    jmp _Z8ParseFoov
// Output from MSVC 2010 after being cleaned up and annotated
?ParseBar@@YGPAUNode@@XZ
    // Call operator new
    push    4
    call    ??2@YAPAXI@Z
    // Tear down stack frame
    add esp, 4
    // Check if the returned node is null, and if
    // it is, return null
    test    eax, eax
    je  SHORT $LN3@ParseBar
    // Zero-initialize the structure member and
    // return it
    mov DWORD PTR [eax], 0
    ret 0
$LN3@ParseBar:
    // Return null
    xor eax, eax
    ret 0
 
?ParseFoo@@YGPAUNode@@XZ
    // Call rand
    call    DWORD PTR __imp__rand
    // Check if the value is signed, and the low bit
    // and if it is not signed, jump to $LN7@ParseFoo
    and eax, -2147483647
    jns SHORT $LN7@ParseFoo
    // Since it is signed, check whether the low
    // bit is set
    dec eax
    or  eax, -2
    inc eax
$LN7@ParseFoo:
    // If the low bit is set, which means the number is
    // odd, jump to $LN2@ParseFoo
    je  SHORT $LN2@ParseFoo
    // TCO: jump to ParseBar
    jmp ?ParseBar@@YGPAUNode@@XZ
$LN2@ParseFoo:
    // Call operator new
    push    4
    call    ??2@YAPAXI@Z
    // Tear down stack frame
    add esp, 4
    // Check if the returned node is null, and if
    // it is, return null
    test    eax, eax
    je  SHORT $LN5@ParseFoo
    // Zero-initialize the structure member and
    // return it
    mov DWORD PTR [eax], 0
    ret 0
$LN5@ParseFoo:
    // Return null
    xor eax, eax
    ret 0
 
?ParseTop@@YGPAUNode@@XZ
    // Call rand
    call    DWORD PTR __imp__rand
    // If the value is 0, jump to $LN2@ParseTop
    sub eax, 0
    je  SHORT $LN2@ParseTop
    // Subtract one, and if the value is now 0,
    // jump to $LN1@ParseTop
    dec eax
    je  SHORT $LN1@ParseTop
    // Return 0
    xor eax, eax
    ret 0
$LN1@ParseTop:
    // TCO: jump to ParseFoo
    jmp ?ParseFoo@@YGPAUNode@@XZ
$LN2@ParseTop:
    // TCO: jump to ParseBar
    jmp ?ParseBar@@YGPAUNode@@XZ
// Output from clang, after being cleaned up and annotated
__Z8ParseBarv:
    // Set up stack frame
    pushl   %ebp
    movl    %esp, %ebp
    subl    $8, %esp
    // Call operator new
    movl    $4, (%esp)
    calll   __Znwm
    // Assume the call succeeded and
    // initialize the structure member
    movl    $0, (%eax)
    // Tear down stack frame and return
    addl    $8, %esp
    popl    %ebp
    ret
 
__Z8ParseFoov:
    // Set up stack frame
    pushl   %ebp
    movl    %esp, %ebp
    subl    $8, %esp
    // Call rand
    calll   _rand
    // If it's non-zero, jump to LBB1_2
    testb   $1, %al
    jne LBB1_2
    // Call operator new
    movl    $4, (%esp)
    calll   __Znwm
    // Assume the call succeeded and
    // initialize the structure member
    movl    $0, (%eax)
    // Tear down stack frame and return
    addl    $8, %esp
    popl    %ebp
    ret
LBB1_2:
    // TCO: jump to ParseBar
    addl    $8, %esp
    popl    %ebp
    jmp __Z8ParseBarv
 
__Z8ParseTopv:
    // Set up stack frame
    pushl   %ebp
    movl    %esp, %ebp
    subl    $8, %esp
    // Call rand
    calll   _rand
    // Test to see if it's 1, and if it is,
    // jump to LBB2_3  
    cmpl    $1, %eax
    je  LBB2_3
    // If it's non-zero, jump to LBB2_4
    testl   %eax, %eax
    jne LBB2_4
    // TCO: jump to ParseBar
    addl    $8, %esp
    popl    %ebp
    jmp __Z8ParseBarv
LBB2_3:
    // TCO: jump to ParseFoo
    addl    $8, %esp
    popl    %ebp
    jmp __Z8ParseFoov
LBB2_4:
    // Set eax to zero, tear down the stack
    // frame and return
    xorl    %eax, %eax
    addl    $8, %esp
    popl    %ebp
    ret
// Output from ICC, after being cleaned up and annotated
__Z8ParseTopv:
    // Set up stack frame
        subl      $12, %esp
        // Call rand
        call      L_rand.stub
        // If the result is not zero, jump to L_B2.5
        testl     %eax, %eax
        jne       L_B2.5
        // Call the ParseBar function
        call      L__Z8ParseBarv.stub
        // Tear down stack frame and return
        addl      $12, %esp
        ret
L_B2.5:
    // If the result is not one, jump to L_B2.8
        cmpl      $1, %eax
        jne       L_B2.8
        // Call the ParseFoo function
        call      L__Z8ParseFoov.stub
        // Tear down the stack frame and return
        addl      $12, %esp
        ret
L_B2.8:
    // Set eax to 0, tear down the stack frame
    // and return
        xorl      %eax, %eax
        addl      $12, %esp
        ret
 
__Z8ParseBarv:
    // Set up stack frame
        subl      $28, %esp
        // Call operator new
        movl      $4, (%esp)
        call      L__Znwm.stub
    // Check if the returned node is null, and if
    // it is, jump to L_B3.5
        testl     %eax, %eax
        je        L_B3.5
    // Zero-initialize the structure member and
    // return it
        movl      $0, (%eax)
        addl      $28, %esp
        ret
L_B3.5:
    // Set eax to 0, tear down the stack frame
    // and return
        xorl      %eax, %eax
        addl      $28, %esp
        ret
 
__Z8ParseFoov:
    // Set up stack frame
        subl      $28, %esp
        // Call rand
        call      L_rand.stub
    // Check if the value is signed, and the low bit
    // and if it is not signed, jump to L_B4.13
        andl      $-2147483647, %eax
        jge       L_B4.13
    // Since it is signed, check whether the low
    // bit is set
        subl      $1, %eax
        orl       $-2, %eax
        incl      %eax
L_B4.13:
    // If the low bit is set, which means the number is
    // odd, jump to L_B4.5
        testl     %eax, %eax
        je        L_B4.5
        // Call ParseBar
        call      L__Z8ParseBarv.stub
        // Tear down stack frame and return
        addl      $28, %esp
        ret
L_B4.5:
    // Call operator new
        movl      $4, (%esp)
        call      L__Znwm.stub
    // Check if the returned node is null, and if
    // it is, jump to L_B4.9
        testl     %eax, %eax
        je        L_B4.9
    // Zero-initialize the structure member and
    // return it
        movl      $0, (%eax)
        addl      $28, %esp
        ret
L_B4.9:
    // Set eax to 0, tear down the stack frame
    // and return
        xorl      %eax, %eax
        addl      $28, %esp
        ret

So both MSVC, gcc and clang all do a reasonable job of performing tail call optimizations which would yield great benefits for a recursive descent parser. Oddly enough, it turns out that ICC does not perform tail call optimizations on this code, and I’m not the only one to have problems. However, Intel does claim they support tail call optimizations as of ICC 7.07 (March 2008).

<aside>
I noticed something very strange with the call to operator new generated by all of the compilers. You’ll notice that there’s a test for returning null, and if it’s non-null, it dereferences the pointer and assigns zero. This is baffling to me as there’s nothing in the C++ specification that I could find which suggests this should happen. If operator new is of the throwing variety (which it is by default for all four compilers), then the call to operator new will throw instead of returning null, so there’s no need to check for null (which is why clang elides the check). But once the call to operator new has returned, why set the structure member to zero with optimizations turned all the way up?
</aside>

So now you know a bit more about what tail calls are, how they can be optimized, and why you may want to use them. Hopefully this was informative! It was certainly an interesting topic for me to write about.

This entry was posted in C/C++ and tagged , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *