A tail call is a specific pattern of source code where the last instruction executed in a method is another function call. For instance:
int last( int i ); int second( int i ); int first( int i ) { int j = second( i ); return last( j ); } |
In this code, the call to last is considered a tail call because there are no further instructions to execute once the call is made.
The reason tail calls are interesting has to do with the way functions are typically called. When you make a function call, the compiler generates code to set up the environment in which the function executes. This usually involves pushing arguments onto the call stack, passing information about where to return to, and setting up a stack frame for local variables. When the function finishes executing, the stack frame is cleaned up and control is returned to the caller. But when the function ends with a tail call, the compiler can skip some of the work involved. For instance, the calling function’s stack frame is no longer needed, so its space can be reused by the called function. And when making the tail call, the place for the subroutine to return to is the caller of the function, not the function itself (in our example, we don’t need to return to first because we can return to whatever called first).
Let’s look at this in more concrete terms. Unoptimized assembly for the first function might look like this:
// Set up the stack frame for first push ebp mov ebp, esp sub esp, 4 // Allocates four bytes of local variable space for j // Save the callee-saved registers on the stack (we will restore them when done) push ebx push esi push edi // Get the value of the parameter i into eax mov eax, [ebp+4] // Call the function second push eax call second // The return value lives in eax, so assign that to our local variable j mov [ebp-4], eax // Get the value of our local variable j into eax mov eax, [ebp-4] // Call the function last push eax call last // Clean up the function and return pop edi pop esi pop edx mov esp, ebp pop ebp ret 0 |
If we were to optimize just the tail call part of the code, our assembly would look like this:
// Set up the stack frame for first push ebp mov ebp, esp sub esp, 4 // Allocates four bytes of local variable space for j // Save the callee-saved registers on the stack (we will restore them when done) push ebx push esi push edi // Get the value of the parameter i into eax mov eax, [ebp+4] // Call the function second push eax call second // The return value lives in eax, so assign that to our local variable j mov [ebp-4], eax // Get the value of our local variable j into eax mov eax, [ebp-4] // Call the function last with tail call optimization mov [ebp+4], eax // Put EAX in the parameter space of the stack frame jmp last // Unconditional jump to last |
There are a few things to notice about the optimized code. For starters, when calling last, no new data is pushed onto the stack — we’re simply reusing the stack space for first since it is no longer needed. (This isn’t always possible to do and is an optimization that’s unlikely to appear in anything other than hand-tuned code.) Also, we are using an unconditional jump to call last instead of using the call instruction. The call instruction does two things; it saves the current location to the stack so when the ret instruction is used, it goes back to the caller, and then it does an unconditional jump to the address given. With a tail call, we don’t need to save the location to jump back to — the function will use the ret instruction like normal and return to the caller of first instead of to first itself. This is an optimization because it removes the push to the stack (saving space and time) and we don’t need to jump back to first just to jump back to who called first, also saving time.
At this point, you might be thinking “whoa, tail calls are amazing, I should be using them everywhere!” Alas, but they don’t provide a considerable benefit to most code because there are so few cases where they crop up organically in reality. However, tail calls are amazing enough that you may want to structure your code to take advantage of them in certain circumstances. When you use tail calls as part of recursion, they provide considerable benefit. In fact, they can change the space requirements from linear (O(n)) to constant (O(1))! When coupled with recursion, tail calls can be truly powerful.
For instance:
int recur( int i ) { return recur( ++i ); } int recur2( int i ) { return recur2( i ) + 1; } |
Both of these functions do exactly the same thing, but in slightly different ways — they take the initial value, increment it and return the new value. However, only one of these crashes with optimizations turned on! recur2 will crash because a new stack frame must be allocated each time the function recurses due to modifying the returned value. So the call will eventually exhaust the stack space and crash. However, recur will not crash because it is tail call recursive, meaning that no new stack frames will be allocated to call it. So it is effectively just an infinite loop like while(1) or for(;;)!
Before we go into application of this optimization, let’s recap what the requirements are for it and see some more examples. In order for something to be eligible for tail call optimization the last instruction executed before returning to the caller must be a method call. That can sometimes take surprising shapes:
// Eligible for TCO because there is no code following // the calls to bar or baz (it's an implied return) void foo( int i ) { if (i % 2) bar(); else baz(); } // Not eligible for TCO because code must execute // after the call to baz int foo( int i ) { if (i % 2) bar(); int j = baz(); return j + i; } // Eligible for TCO because there is no code // following the call to printf, though it is // possible that TCO will be unavailable due to // the varargs nature of printf void foo( int i ) { :: printf ( "%d: %s\n" , i, bar( i ) ); } |
Now that you have a better grasp on what tail calls look like, let’s talk about one particular problem that gets considerable benefit from it. When writing a compiler, there are two major ways to write the parser for the language. One is using a table-driven approach, usually by using a parser generator application such as bison or yacc. The other is by writing a recursive descent parser by hand. Both approaches have their pros and cons, but one specific con of a recursive descent parser is that they can be very expensive in terms of space requirements. It’s not uncommon to see call stack depths several dozen calls deep, depending on the language being parsed. This is a prime case where tail call recursion can be a considerable optimization. To this end, if you are writing a recursive descent parser you would do well to have the exit for your functions either return a function call, or return a call to create an AST node.
For instance, your parser should look something like this:
typedef struct Node { void *data; } Node; Node *ParseBar() { return new Node(); } Node *ParseFoo() { if (:: rand () % 2) return ParseBar(); else return new Node(); } Node *ParseTop() { switch (:: rand ()) { case 0: return ParseBar(); case 1: return ParseFoo(); } return nullptr ; } |
All of these functions are theoretically eligible for tail call optimization because there is no code that needs to execute past the end of the function calls. But does theory and practice can be two different things, so how do they stack up with the major compilers?
I tried this sample code in both Visual Studio 2010 and gcc (4.6.2) with full optimizations turned on, and they both did reasonably well.
// Output from gcc, after being cleaned up and annotated _Z8ParseBarv: // Set up stack frame subl $28, %esp movl $4, (%esp) // Call new Node call _Znwj movl $0, (%eax) // Tear down stack frame, returning eax addl $28, %esp ret _Z8ParseFoov: // Set up stack frame subl $28, %esp // Call rand call rand // Test the low bit of eax and if it's zero // (meaning it's even), jump to .L5 testb $1, %al jne .L5 // It's an odd number, so create a new Node movl $4, (%esp) call _Znwj movl $0, (%eax) // Tear down stack frame and return eax addl $28, %esp ret .L5: // Tear down stack frame addl $28, %esp // TCO: jump to ParseBar jmp _Z8ParseBarv _Z8ParseTopv: // Set up stack frame subl $12, %esp // Call rand call rand // If it's non-zero, jump to .L11 testl %eax, %eax jne .L11 // Tear down stack frame addl $12, %esp // TCO: jump to ParseBar jmp _Z8ParseBarv .L11: // Test to see if it's 1, and if it is, // jump to .L12 cmpl $1, %eax je .L12 // Tear down stack frame and return nullptr xorl %eax, %eax addl $12, %esp ret .L12: // Tear down stack frame addl $12, %esp // TCO: jump to ParseFoo jmp _Z8ParseFoov |
// Output from MSVC 2010 after being cleaned up and annotated ?ParseBar@@YGPAUNode@@XZ // Call operator new push 4 call ??2@YAPAXI@Z // Tear down stack frame add esp, 4 // Check if the returned node is null, and if // it is, return null test eax, eax je SHORT $LN3@ParseBar // Zero-initialize the structure member and // return it mov DWORD PTR [eax], 0 ret 0 $LN3@ParseBar: // Return null xor eax, eax ret 0 ?ParseFoo@@YGPAUNode@@XZ // Call rand call DWORD PTR __imp__rand // Check if the value is signed, and the low bit // and if it is not signed, jump to $LN7@ParseFoo and eax, -2147483647 jns SHORT $LN7@ParseFoo // Since it is signed, check whether the low // bit is set dec eax or eax, -2 inc eax $LN7@ParseFoo: // If the low bit is set, which means the number is // odd, jump to $LN2@ParseFoo je SHORT $LN2@ParseFoo // TCO: jump to ParseBar jmp ?ParseBar@@YGPAUNode@@XZ $LN2@ParseFoo: // Call operator new push 4 call ??2@YAPAXI@Z // Tear down stack frame add esp, 4 // Check if the returned node is null, and if // it is, return null test eax, eax je SHORT $LN5@ParseFoo // Zero-initialize the structure member and // return it mov DWORD PTR [eax], 0 ret 0 $LN5@ParseFoo: // Return null xor eax, eax ret 0 ?ParseTop@@YGPAUNode@@XZ // Call rand call DWORD PTR __imp__rand // If the value is 0, jump to $LN2@ParseTop sub eax, 0 je SHORT $LN2@ParseTop // Subtract one, and if the value is now 0, // jump to $LN1@ParseTop dec eax je SHORT $LN1@ParseTop // Return 0 xor eax, eax ret 0 $LN1@ParseTop: // TCO: jump to ParseFoo jmp ?ParseFoo@@YGPAUNode@@XZ $LN2@ParseTop: // TCO: jump to ParseBar jmp ?ParseBar@@YGPAUNode@@XZ |
// Output from clang, after being cleaned up and annotated __Z8ParseBarv: // Set up stack frame pushl %ebp movl %esp, %ebp subl $8, %esp // Call operator new movl $4, (%esp) calll __Znwm // Assume the call succeeded and // initialize the structure member movl $0, (%eax) // Tear down stack frame and return addl $8, %esp popl %ebp ret __Z8ParseFoov: // Set up stack frame pushl %ebp movl %esp, %ebp subl $8, %esp // Call rand calll _rand // If it's non-zero, jump to LBB1_2 testb $1, %al jne LBB1_2 // Call operator new movl $4, (%esp) calll __Znwm // Assume the call succeeded and // initialize the structure member movl $0, (%eax) // Tear down stack frame and return addl $8, %esp popl %ebp ret LBB1_2: // TCO: jump to ParseBar addl $8, %esp popl %ebp jmp __Z8ParseBarv __Z8ParseTopv: // Set up stack frame pushl %ebp movl %esp, %ebp subl $8, %esp // Call rand calll _rand // Test to see if it's 1, and if it is, // jump to LBB2_3 cmpl $1, %eax je LBB2_3 // If it's non-zero, jump to LBB2_4 testl %eax, %eax jne LBB2_4 // TCO: jump to ParseBar addl $8, %esp popl %ebp jmp __Z8ParseBarv LBB2_3: // TCO: jump to ParseFoo addl $8, %esp popl %ebp jmp __Z8ParseFoov LBB2_4: // Set eax to zero, tear down the stack // frame and return xorl %eax, %eax addl $8, %esp popl %ebp ret |
// Output from ICC, after being cleaned up and annotated __Z8ParseTopv: // Set up stack frame subl $12, %esp // Call rand call L_rand.stub // If the result is not zero, jump to L_B2.5 testl %eax, %eax jne L_B2.5 // Call the ParseBar function call L__Z8ParseBarv.stub // Tear down stack frame and return addl $12, %esp ret L_B2.5: // If the result is not one, jump to L_B2.8 cmpl $1, %eax jne L_B2.8 // Call the ParseFoo function call L__Z8ParseFoov.stub // Tear down the stack frame and return addl $12, %esp ret L_B2.8: // Set eax to 0, tear down the stack frame // and return xorl %eax, %eax addl $12, %esp ret __Z8ParseBarv: // Set up stack frame subl $28, %esp // Call operator new movl $4, (%esp) call L__Znwm.stub // Check if the returned node is null, and if // it is, jump to L_B3.5 testl %eax, %eax je L_B3.5 // Zero-initialize the structure member and // return it movl $0, (%eax) addl $28, %esp ret L_B3.5: // Set eax to 0, tear down the stack frame // and return xorl %eax, %eax addl $28, %esp ret __Z8ParseFoov: // Set up stack frame subl $28, %esp // Call rand call L_rand.stub // Check if the value is signed, and the low bit // and if it is not signed, jump to L_B4.13 andl $-2147483647, %eax jge L_B4.13 // Since it is signed, check whether the low // bit is set subl $1, %eax orl $-2, %eax incl %eax L_B4.13: // If the low bit is set, which means the number is // odd, jump to L_B4.5 testl %eax, %eax je L_B4.5 // Call ParseBar call L__Z8ParseBarv.stub // Tear down stack frame and return addl $28, %esp ret L_B4.5: // Call operator new movl $4, (%esp) call L__Znwm.stub // Check if the returned node is null, and if // it is, jump to L_B4.9 testl %eax, %eax je L_B4.9 // Zero-initialize the structure member and // return it movl $0, (%eax) addl $28, %esp ret L_B4.9: // Set eax to 0, tear down the stack frame // and return xorl %eax, %eax addl $28, %esp ret |
So both MSVC, gcc and clang all do a reasonable job of performing tail call optimizations which would yield great benefits for a recursive descent parser. Oddly enough, it turns out that ICC does not perform tail call optimizations on this code, and I’m not the only one to have problems. However, Intel does claim they support tail call optimizations as of ICC 7.07 (March 2008).
<aside>
I noticed something very strange with the call to operator new generated by all of the compilers. You’ll notice that there’s a test for returning null, and if it’s non-null, it dereferences the pointer and assigns zero. This is baffling to me as there’s nothing in the C++ specification that I could find which suggests this should happen. If operator new is of the throwing variety (which it is by default for all four compilers), then the call to operator new will throw instead of returning null, so there’s no need to check for null (which is why clang elides the check). But once the call to operator new has returned, why set the structure member to zero with optimizations turned all the way up?
</aside>
So now you know a bit more about what tail calls are, how they can be optimized, and why you may want to use them. Hopefully this was informative! It was certainly an interesting topic for me to write about.