
JIT Optimizations

In this article, we will look into JIT optimizations, with specific focus on inlining.

Introduction

The .NET Just-In-Time Compiler (JIT) is considered by many to be one of the primary performance advantages of the CLR in comparison to the JVM and other managed environments that use just-in-time-compiled byte-code. It is this advantage that makes the JIT internals so secret and the JIT optimizations so mysterious. It is also the reason why the SSCLI (Rotor) contains only a naive fast-JIT implementation, which performs only minimal optimizations while translating each IL instruction to its corresponding machine-language instruction sequence.

During my research for the .NET Performance course I wrote for Sela, I found a couple of sources regarding the known JIT optimizations. The following is a non-exhaustive list of those optimizations, followed by a detailed discussion of interface method dispatching and the in-lining techniques the JIT employs there.

Before we start, there are a couple of points worth mentioning. First of all, it's pretty obvious that in order to see JIT optimizations, you must look at the assembly code emitted by the JIT at run-time for the release build of your application. However, there is a minor caveat: if you look at the assembly code from within a Visual Studio debug session, you will not see optimized code, because JIT optimizations are disabled by default when the process is launched under the debugger (to make debugging more convenient). Therefore, in order to see JIT-optimized code, you must attach Visual Studio to an already-running process, or run the process from within CorDbg with the JitOptimizations flag on (by issuing the "mode JitOptimizations 1" command from the CorDbg command prompt). Finally, one more caveat: bear in mind that I do not work for Microsoft and do not have access to the actual sources of the JIT compiler, so nothing in the following analysis should be taken for granted.
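
For convenience, here is a minimal sketch of the attach-based workflow (this helper is my own, not part of the article's samples): the program prints its process ID and blocks, you attach Visual Studio, and any method JIT-compiled after that point is compiled with optimizations enabled, because the process was not launched under the debugger.

C#
using System;
using System.Diagnostics;

static class AttachHelper
{
    static void Main()
    {
        // Launched without a debugger, so the JIT optimizes.
        Console.WriteLine("PID: {0} - attach the debugger, then press Enter",
                          Process.GetCurrentProcess().Id);
        Console.ReadLine();

        HotPath(); // first JIT-compiled here, with optimizations intact
    }

    static void HotPath()
    {
        // ...the code whose optimized disassembly you want to inspect...
    }
}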

Another important point to mention is that most of this article is based on the x86 JIT that comes with the .NET 2.0 CLR, although we will also take a look at the x64 JIT's behavior.

Range-check elimination

When accessing an array in a loop, and the loop's termination condition relies upon the length of the array, the array bounds access check can be eliminated. Consider the following code. Do you see anything wrong with it?

C#
private static int[] _array = new int[10];
private static volatile int _i;
static void Main(string[] args)
{
    for (int i = 0; i < _array.Length; ++i)
        _i += _array[i];
}

Here is the generated 32-bit code:

ASM
00BE00A5  xor         edx,edx                          ; edx = 0 (i)

00BE00A7  mov         eax,dword ptr ds:[02491EC4h]     ; eax = numbers
00BE00AC  cmp         edx,dword ptr [eax+4]            ; edx >= Length?
00BE00AF  jae         00BE00C2                         ; exception!
00BE00B1  mov         eax,dword ptr [eax+edx*4+8]      ; eax = numbers[i]
00BE00B5  add         dword ptr ds:[0A22FC8h],eax      ; _i += eax

00BE00BB  inc         edx                              ; ++edx (i)
00BE00BC  cmp         edx,0Ah                          ; edx < 10?
00BE00BF  jl          00BE00A7

The first line is the preamble, the lines in the middle are the loop body, and the three lines at the end are the code to increment the counter and test for the loop termination condition.

The third line in the above listing (00BE00AC) performs a range check – it ensures that the EDX register, used to index inside the array, is not greater than or equal to the array length at [EAX + 4] (EAX holds the array address, which is known to hold the array length at a 4-byte offset from its start). In addition, there is a loop termination check performed at the second-to-last line of the listing (00BE00BC).

Where is the range check elimination, then? The reason it is missing here is the simple fact that the array reference itself is a static field: because the field could be reassigned while the loop runs, the JIT cannot prove that its length stays constant. Working with a static reference causes the above code to be generated (also note how at 00BE00A7 the array reference is re-fetched into the register on each iteration of the loop). This behavior can be eliminated by making a very simple modification to the above program:

C#
private static int[] _array = new int[10];
private static volatile int _i;
static void Main(string[] args)
{
    int[] localRef = _array;
    for (int i = 0; i < localRef.Length; ++i)
    {
        _i += localRef[i];
    }
}

Here is the generated 32-bit code:

ASM
009D00D2  mov         ecx,dword ptr ds:[1C21EC4h]      ; ecx = numbers
009D00D8  xor         edx,edx                          ; edx = 0 (i)
009D00DA  mov         esi,dword ptr [ecx+4]            ; esi = numbers.Length
009D00DD  test        esi,esi                          ; esi == 0?
009D00DF  jle         009D00F0 
009D00E1  mov         eax,dword ptr [ecx+edx*4+8]      ; loop proper
009D00E5  add         dword ptr ds:[922FC8h],eax       ; _i += numbers[i]
009D00EB  inc         edx                              ; ++edx (i)
009D00EC  cmp         esi,edx                          ; numbers.Length > i?
009D00EE  jg          009D00E1

Note how the range check has been eliminated, and now the loop termination condition is the only test performed in this code. Note: this discussion mirrors Greg Young's post on the subject.

It's also worth noting that it is very easy to break this optimization by using something other than Array.Length as the loop termination condition. For example, the following code (which is still safe – the index can never exceed the array bounds – although it now adds each element twice) brings back the array bounds check and ruins the optimization for us:

C#
for (int i = 0; i < localRef.Length * 2; ++i)
{
    _i += localRef[i / 2];
}

Method In-lining

A short and simple method that is frequently used can be in-lined into the calling code. Currently, the JIT is documented (perhaps "blogged about" would be more precise here) to inline methods that are less than 32 bytes of IL in length, do not contain any complex branching logic, and do not contain any exception handling-related mechanisms. See David Notario's blog for some additional information on this topic (note that it is not entirely relevant for CLR 2.0).
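
Conversely, if you want to prevent the JIT from in-lining a small method – for example, to keep its call site visible in the disassembly – you can decorate it with the MethodImpl attribute, a technique I use later in this article. A minimal sketch (my own variant of the And method used below):

C#
using System.Runtime.CompilerServices;

public class Util
{
    // The JIT will never inline this method, even though it is
    // small enough to qualify under the heuristics above.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int And(int first, int second)
    {
        return first & second;
    }
}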

Given the following code fragment, let's examine what happens when the JIT doesn't optimize and inline the method call, and what happens when it does.

C#
public class Util
{
    public static int And(int first, int second)
    {
        return first & second;
    }

    [DllImport("kernel32")]
    public static extern void DebugBreak();
}
class Program
{
    private static volatile int _i;
    static void Main(string[] args)
    {
        Util.DebugBreak();
        _i = Util.And(5, 4);
    }
}

Note the use of the kernel32.dll DebugBreak function (which internally issues a software interrupt, int 3). I use it here so that Windows will offer me the opportunity to debug the process when this method is called, so that I don't have to attach to it manually from within Visual Studio or any other debugger of choice. Finally, note that I made the _i field volatile so that the assignment to it is not optimized away.

When stepping into the disassembly with JIT optimizations disabled (for example, when the process is started from within a Visual Studio debugging session), the following code is emitted for the method call site:

ASM
00000013  mov         edx,4 
00000018  mov         ecx,5 
0000001d  call        dword ptr ds:[00A2967Ch]         ; this is Util.And
00000023  mov         esi,eax 
00000025  mov         dword ptr ds:[00A28A6Ch],esi     ; this is _i

If we keep stepping into the code at 00A2967C, we find the method itself:

ASM
00000000  push        edi  
00000001  push        esi  
00000002  mov         edi,ecx 
00000004  mov         esi,edx 
00000006  cmp         dword ptr ds:[00A28864h],0 
0000000d  je          00000014 
0000000f  call        794F1116 
00000014  mov         eax,edi 
00000016  and         eax,esi 
00000018  pop         esi  
00000019  pop         edi  
0000001a  ret

Note that there is no optimization or in-lining here: the parameters are passed to the method in the EDX and ECX registers (fastcall calling convention), and the AND instruction at offset 0x16 performs the method's actual purpose.

Now, let's look at the in-lined call, generated when I attach to the process after the debug breakpoint is issued inside it. This is what the method call site looks like, this time:

ASM
00B90075  mov         dword ptr ds:[0A22FD0h],4    ; _i = 4
00B9007F  ret

The result of And(5, 4) is 4, and this is the value that is directly written into the volatile field. Note that the optimization is so aggressive that the AND operation doesn't even occur – the result is computed directly (constant-folded) at JIT compile-time.

However, it is seemingly impossible to inline a virtual method call. This, of course, is due to the fact that the actual method to be called is unknown at compile-time and can change between method invocations. For example, consider the following code:

C#
class A
{
    public virtual void Foo() { }
}
class B : A
{
    public override void Foo() { }
}
class Program
{
    static void Method(A a)
    {
        a.Foo();
    }
    static void Main(string[] args)
    {
        for (int i = 0; i < 10; ++i)
        {
            A a = (i % 2 == 0) ? new A() : new B();
            Method(a);
        }
    }
}

When the JIT compiles the Method method, it has no means of knowing which of the two implementations should be called – A.Foo or B.Foo. Therefore, it is seemingly impossible to inline the call at the call site. (Note my repeated use of "seemingly" here – it is theoretically possible to perform a partial optimization that would introduce in-lining in selected cases, and I discuss it in depth later, when talking about interface method dispatching.)

Therefore, a virtual call must go through the actual object's method table. (If you need a refresher on this, have a brief skim through this MSDN Magazine article.) Going through the method table involves two levels of indirection: using the object header to reach the method table, and using a compile-time known offset into the method table to determine the method to call.
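
As a rough analogy (my own sketch – the CLR's actual data structures are different), the two indirections resemble looking up a delegate in a per-type table:

C#
using System;

// Hypothetical analogy only: a per-type table of method pointers.
delegate void FooImpl(Obj self);

class MethodTable
{
    public FooImpl[] Slots;                 // one slot per virtual method
}

class Obj
{
    public MethodTable MT;                  // the first field of every object
}

static class Dispatch
{
    const int FooSlot = 0;                  // compile-time known slot index

    static void CallFoo(Obj o)
    {
        MethodTable mt = o.MT;              // indirection #1: object -> method table
        FooImpl target = mt.Slots[FooSlot]; // indirection #2: table -> method pointer
        target(o);                          // indirect call
    }

    static void Main()
    {
        MethodTable mt = new MethodTable();
        mt.Slots = new FooImpl[] { delegate { Console.WriteLine("A.Foo"); } };
        Obj o = new Obj();
        o.MT = mt;
        CallFoo(o);                         // prints "A.Foo"
    }
}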

For the A/B example above, the following code will be emitted at the call site (in the Method method):

ASM
007E0076  xor         esi,esi                   ; esi = 0 (i)
007E0078  jmp         007E00AA                  ; jump to compare condition
007E007A  mov         eax,esi                   ; i % 2 == 0?
007E007C  and         eax,80000001h 
007E0081  jns         007E0088 
007E0083  dec         eax  
007E0084  or          eax,0FFFFFFFEh 
007E0087  inc         eax  
007E0088  test        eax,eax 
007E008A  je          007E0098 
007E008C  mov         ecx,2B3180h 
007E0091  call        002A201C
007E0096  jmp         007E00A2 
007E0098  mov         ecx,2B3100h 
007E009D  call        002A201C
007E00A2  mov         ecx,eax                   ; ecx = a
007E00A4  mov         eax,dword ptr [ecx]       ; eax = a's method table
007E00A6  call        dword ptr [eax+38h]       ; call through offset 0x38
007E00A9  inc         esi  
007E00AA  cmp         esi,0Ah 
007E00AD  jl          007E007A 
007E00AF  pop         esi  
007E00B0  ret

As noted above, the virtual call is dispatched in two steps: first, reaching into the object's method table (007E00A4 mov eax, dword ptr [ecx]), and then calling the method pointer at a compile-time known offset into the method table (007E00A6 call dword ptr [eax+38h]).

Note that theoretically, the sealed C# keyword (and its corresponding IL counterpart, final) is meant to minimize the inefficiency involved with virtual method dispatch by indicating that in spite of a method being virtual, it cannot be overridden anymore by any derived classes. For example, the following code does not have to involve the virtual method dispatch we have just seen:

C#
class A
{
    public virtual void Foo() { }
}
class B : A
{
    public override sealed void Foo() { }
}
class C : B
{
}
class Program
{
    static void Method(B b)
    {
        b.Foo();
    }
    static void Main(string[] args)
    {
        for (int i = 0; i < 10; ++i)
        {
            B b = (i % 2 == 0) ? new B() : new C();
            Method(b);
        }
    }
}

It is obvious that the call target of b.Foo in the Method method is statically known: it's B.Foo that is going to be called. However, the JIT chooses not to use this information to avoid the virtual method dispatch – the dispatch sequence is emitted nonetheless, as we can see from the following assembly code (this time, I've snipped the object setup).

ASM
005000A2  mov         ecx,eax 
005000A4  mov         eax,dword ptr [ecx] 
005000A6  call        dword ptr [eax+38h]

Note that using the sealed keyword on the class itself has no effect either, even though the call target can be statically known as well.
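
For reference, a sealed-class variant such as the following (my reconstruction: the C class removed, with Method and Main operating on B only) still produces the same two-step dispatch shown above:

C#
class A
{
    public virtual void Foo() { }
}
sealed class B : A    // sealed at the class level instead of the method
{
    public override void Foo() { }
}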

To fully understand what's going on in the background, it's worth your time to learn that IL has two instructions used for dispatching method calls: call and callvirt. One of the main reasons for using the callvirt instruction for instance method calls is the fact that the JIT-emitted code contains a check that ensures the instance is not null, and throws a NullReferenceException otherwise. This is why the C# compiler emits the callvirt IL instruction even when calling non-virtual instance methods. If this were not the behavior, the following code could compile and run successfully:

C#
class A
{
    public void Foo() { }   // Foo doesn't use "this"
}
class Program
{
    static void Main(string[] args)
    {
        for (int i = 0; i < 10; ++i)
        {
            A a = (i % 2 == 0) ? new A() : null;
            a.Foo();
        }
    }
}

Here's the IL that is emitted for this scenario (note the callvirt instruction at L_0013):

MSIL
.method private hidebysig static void Main(string[] args) cil managed
{
      .entrypoint
      .maxstack 2
      .locals init (
            [0] int32 num1,
            [1] CallAndCallVirt.A a1)
      L_0000: ldc.i4.0 
      L_0001: stloc.0 
      L_0002: br.s L_001c
      L_0004: ldloc.0 
      L_0005: ldc.i4.2 
      L_0006: rem 
      L_0007: brfalse.s L_000c
      L_0009: ldnull 
      L_000a: br.s L_0011
      L_000c: newobj instance void CallAndCallVirt.A::.ctor()
      L_0011: stloc.1 
      L_0012: ldloc.1 
      L_0013: callvirt instance void CallAndCallVirt.A::Foo()
      L_0018: ldloc.0 
      L_0019: ldc.i4.1 
      L_001a: add 
      L_001b: stloc.0 
      L_001c: ldloc.0 
      L_001d: ldc.i4.s 10
      L_001f: blt.s L_0004
      L_0021: ret 
}

And, here's the assembly that is emitted for this scenario:

ASM
00AD0076  xor         esi,esi 
00AD0078  jmp         00AD009F 
00AD007A  xor         edx,edx                   ; represents the "a" local variable
00AD007C  mov         eax,esi 
00AD007E  and         eax,80000001h 
00AD0083  jns         00AD008A 
00AD0085  dec         eax  
00AD0086  or          eax,0FFFFFFFEh 
00AD0089  inc         eax  
00AD008A  test        eax,eax 
00AD008C  jne         00AD009A 
00AD008E  mov         ecx,0A230E0h 
00AD0093  call        00A1201C                  ; constructor call (in the branch)
00AD0098  mov         edx,eax 
00AD009A  mov         eax,edx     
00AD009C  cmp         dword ptr [eax],eax       ; null reference check
00AD009E  inc         esi                       ; move on, the method isn't really called
00AD009F  cmp         esi,0Ah 
00AD00A2  jl          00AD007A 
00AD00A4  pop         esi  
00AD00A5  ret

So, is the JIT capable of in-lining such a method? Yes! Note how the JIT completely eliminates the call itself (the method is empty and, therefore, in-lining it means simply optimizing it away), but it can't eliminate the null reference check, because that is part of the contract of the callvirt instruction. Therefore, the instruction at 00AD009C performs a trivial null reference check by attempting to dereference EAX, which contains the value of the "a" local variable. If an access violation occurs, it is caught, and a NullReferenceException is thrown in its stead.

If the IL at L_0013 used the call instruction instead of the callvirt instruction, the emitted assembly code wouldn't have the null reference check embedded in it, and the above code could run successfully. This, however, is not the behavior the C# language designers adopted.

For the sake of completeness, it is worth noting that sometimes calling a virtual method involves the call instruction and not the callvirt instruction. This occurs, for example, in the following code snippet:

C#
class Employee
{
    public override string ToString()
    {
        return base.ToString();
    }
}

The ToString method compiled to IL:

MSIL
.method public hidebysig virtual instance string ToString() cil managed
{
      .maxstack 8
      L_0000: ldarg.0 
      L_0001: call instance string object::ToString()
      L_0006: ret 
}

If the instruction emitted in this case were the callvirt instruction, Employee.ToString would call itself recursively forever. It is explicitly our purpose here to ignore the normal virtual method dispatching mechanisms and delegate our implementation to the base class' method. Therefore, the call to Object.ToString will not be emitted with the callvirt instruction, but with the call instruction.

In essence, interface method dispatch is not very different from virtual method dispatch: a level of indirection must be added, and therefore the actual implementation to call cannot be determined statically at the call site. For example, consider the following code:

C#
interface IA
{
    void Foo();
}
class A : IA
{
    public void Foo()
    {
    }
}
class B : IA
{
    void IA.Foo()
    {
    }
}
class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Method(IA a)
    {
        a.Foo();
    }
    static void Main(string[] args)
    {
        for (int i = 0; i < 10; ++i)
        {
            IA a = (i % 2 == 0) ? (IA)new A() : (IA)new B();
            Method(a);
        }
    }
}

The trivial method dispatch technique, in this case, would be as follows:

  • Reach into the method table for the actual object passed.
  • Reach into the interface method table map for the interface used within the method table for the actual object, using the process-wide interface ID.
  • Reach into the interface method table for the method at the compile-time known offset, and execute it.

Therefore, the approximate assembly code that should be generated for the actual interface method call site (inside the Method method) is supposed to look like the following:

ASM
mov    ecx, edi                   ; ecx holds "this"
mov    eax, dword ptr [ecx]       ; eax holds method table
mov    eax, dword ptr [eax+0Ch]   ; eax holds interface method table
mov    eax, dword ptr [eax+30h]   ; eax holds method pointer
call   dword ptr [eax]

This is described in this MSDN Magazine article and other sources, but is not the case in real life. Even the debug (non-optimized) version of the interface dispatch doesn't look like this. I will examine what the code does look like, but for now, it will suffice to say that the naïve approach probably had dire performance characteristics and was therefore replaced with something different.

By the way, note that interface methods are always implicitly marked as virtual (consider the following IL for the A and B classes).

MSIL
.class private auto ansi beforefieldinit A extends object implements NaiveMethodDispatch.IA
      .method public hidebysig newslot virtual final instance void Foo() cil managed
.class private auto ansi beforefieldinit B extends object implements NaiveMethodDispatch.IA
      .method private hidebysig newslot virtual final instance void NaiveMethodDispatch.IA.Foo() cil managed
            .override NaiveMethodDispatch.IA::Foo

If the interface implementation is explicit, or if you didn't specify the virtual keyword when implementing the interface implicitly, the compiler also emits the final keyword (equivalent to the C# sealed keyword) on the method. This means that interface methods cannot be overridden unless they are explicitly marked as virtual in the base class' code; however, they are still virtual in the sense that the callvirt instruction must be used to call them, and that a method table lookup is required so that they are properly dispatched.
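
For example (a hypothetical variant of the A class from the previous listing), declaring the implicit implementation virtual makes the compiler drop the final flag, so derived classes can override it:

C#
interface IA
{
    void Foo();
}
class A : IA
{
    // Declared virtual: the compiler emits "newslot virtual" without
    // "final", so derived classes may override the implementation.
    public virtual void Foo()
    {
    }
}
class Derived : A
{
    public override void Foo()
    {
    }
}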

However, I would not even consider writing this article if everything were as simple as presented so far. The trigger for this article has actually been Ayende's post on a vaguely related topic, where he mentioned that the JIT is capable of in-lining interface method calls. A mild theoretical discussion over e-mail has resulted in some preliminary research, which I present below.

Before diving in, the conclusion of the following discussion can be summarized as follows: it is theoretically impossible to perfectly inline virtual method calls (interface method calls fall into this category as well); the JIT does not inline interface method calls; instead, it performs an optimization that avoids the naïve interface method dispatch outlined above.

Before we follow what the actual CLR 2.0 JIT does at interface method call sites, it is probably wise to consider any optimization that could theoretically be performed on such calls. The most obvious are the following two:

Flow Analysis

Flow analysis can determine that a particular concrete type is always used at an interface method call site, so the call can be dispatched directly through that type instead of using the technique described above.

C#
IA a = new A();
a.Foo();

It is fairly clear that A.Foo is being called at this call site, no matter what happens. The managed compiler's flow analysis can detect this condition and emit the appropriate byte-code, which will allow the JIT to inline the method if it complies with the other in-lining requirements (see above).

Frequency Analysis

One of the interface implementations is called considerably more frequently than the others at a given call site. This can be determined by dynamically profiling and patching the code, or by some kind of hint from the programmer. (See this and Overview of the IBM Java Just-in-Time Compiler for more information on the topic in general and its JVM implementation in particular.) In this case, the dispatch for that specific implementation can be turned into a direct dispatch or even in-lined, as follows (in pseudo-code):

if (obj->MethodTable == expectedMethodTable) {
      // Inlined code of "hot" method
}
else {
      // Normal interface method dispatch code
}

The JIT's Approach

The JIT is not documented (or "blogged about") to perform either of those optimizations. However, poking through the JIT-emitted code reveals that the case is not as simple as it appears to be.

Let's look at the call site for the following code and attempt to analyze what happens in the method dispatching process. I've given the interface and the methods meaningful names and content so that we can follow the example more easily.

C#
interface IOperation
{
    int Do(int a, int b);
}
class And : IOperation
{

    public int Do(int a, int b) { return a & b; }
}
class Or : IOperation
{
    public int Do(int a, int b) { return a | b; }
}
class Program
{
    static volatile int _i;
    static void Main()
    {
        for (int i = 0; i < 10; ++i)
        {
            IOperation op = (i % 2 == 0) ?
                (IOperation)new And() : (IOperation)new Or();
            _i = op.Do(5, 3);
        }
    }
}

In this case, we have a single call site for IOperation.Do, and it is impossible for the JIT to statically determine which implementation will be called. Therefore, direct in-lining or direct method dispatch is impossible. What is, then, the code that gets emitted in this case?

Let's first have a look at the call site itself, which is in the Main entry point. Main is compiled when it is first called, so we can look at the code immediately. The following is only the code for the actual call to IOperation.Do inside the loop.

ASM
007E00A4  push        3    
007E00A6  mov         edx,5                            ; setup parameters
007E00AB  call        dword ptr ds:[2C0010h]           ; the actual call
007E00B1  mov         dword ptr ds:[002B2FE8h],eax     ; save return value

Note that this is not the interface method dispatch pattern we were supposed to see here. Instead, we get an indirect call through the address 002C0010. Remember this address, because we will return to it throughout the following discussion. Stepping further inside, we see:

ASM
002C6012  push        eax  
002C6013  push        30000h 
002C6018  jmp         79EE9E4F

The actual implementation has not been compiled yet, and therefore we find ourselves tracing through the code of the JIT compiler. Eventually, we are redirected (through a complex series of jumps) to the actual interface method's code. Returning from there (via the ret instruction) gets us back to the main loop (where the call instruction was issued, at 007E00AB).

However, after the loop has run three times, the code that 002C0010 points to (recall that our original call site dispatches through this address) is back-patched to the following optimized version:

ASM
002D7012  cmp         dword ptr [ecx],2C3210h 
002D7018  jne         002DA011 
002D701E  jmp         007E00D0

Recall that the first field of a .NET object (at the address its reference points to) is the method table pointer, and here we have the trivial profiling optimization: if the method table is the expected one (i.e., the "common" or "hot" implementation), we perform a direct jump to its code. (Note that the target address is embedded within the instruction, because the JIT back-patched this code with complete and intimate knowledge of the method's location; no extra memory access is required to dispatch the call.) Otherwise, we must go through the usual dispatching layer, which we will see in a moment. For completeness, here is the code at 007E00D0 (the target taken when the CMP instruction finds the expected method table):

ASM
007E00D0  and         edx,dword ptr [esp+4] 
007E00D4  mov         eax,edx 
007E00D6  ret         4

This is simply the And.Do implementation. Note that the JIT-compiled code was not in-lined into the call site or in-lined into the dispatching helper. However, jumping directly to this code should cause as little overhead as possible. The remaining question is: what's at 002DA011, or, in other words, what happens if the method table is not the expected one? This time, the code is significantly more complex.

ASM
002DA011  sub         dword ptr ds:[14D3D0h],1 
002DA018  jl          002DA056
002DA01A  push        eax  
002DA01B  mov         eax,dword ptr [ecx] 
002DA01D  push        edx  
002DA01E  mov         edx,eax 
002DA020  shr         eax,0Ch 
002DA023  add         eax,edx 
002DA025  xor         eax,3984h 
002DA02A  and         eax,3FFCh 
002DA02F  mov         eax,dword ptr [eax+151A6Ch] 
002DA035  cmp         edx,dword ptr [eax] 
002DA037  jne         002DA04B 
002DA039  cmp         dword ptr [eax+4],30000h 
002DA040  jne         002DA04B 
002DA042  mov         eax,dword ptr [eax+8] 
002DA045  pop         edx  
002DA046  add         esp,4 
002DA049  jmp         eax
002DA04B  pop         edx  
002DA04C  push        30000h 
002DA051  jmp         79EED9A8 
002DA056  call        79F02065 
002DA05B  jmp         002DA01A

Let's focus on the most significant sections. First, a global counter is decremented; its initial value is 0x64 (i.e., 100), and we will see its purpose in a moment. If the resulting value is less than 0, we jump to a call to one of the JIT back-patching functions and then continue the flow. What's in that flow? The normal interface method dispatch as performed by the JIT. Note that, eventually, the JMP EAX instruction at 002DA049 gets us to the required code:

ASM
007E00F0  or          edx,dword ptr [esp+4] 
007E00F4  mov         eax,edx 
007E00F6  ret         4

This is clearly the Or.Do implementation. Right, so everything seems peachy. What is the purpose of this global variable we just saw, then? Consider the optimization we discussed earlier, with the "hot" path being in-lined (or at least, directly jumped to) if the method table of the actual object matches the method table for the "hot" implementation. It might be the case that the frequency of calls through each of the implementations changes dynamically during run-time. For example, it might be the case that for the first 500 times, the user calls And.Do, but for the next 5000 times, he will call Or.Do. This makes our optimization look a bit stupid, as we have, in fact, optimized for the least common case. To prevent this scenario, a counter is established for each call site. It is decremented every time there's a "miss" – i.e., when the method table of the actual object does not match the expected method's method table. When this counter decreases below 0, the JIT back-patches the code that 002C0010 points to again, to be the following:

ASM
0046A01A  push        eax  
0046A01B  mov         eax,dword ptr [ecx] 
0046A01D  push        edx  
0046A01E  mov         edx,eax 
0046A020  shr         eax,0Ch 
0046A023  add         eax,edx 
0046A025  xor         eax,3984h 
0046A02A  and         eax,3FFCh 
0046A02F  mov         eax,dword ptr [eax+151A74h] 
0046A035  cmp         edx,dword ptr [eax] 
0046A037  jne         0046A04B
0046A039  cmp         dword ptr [eax+4],30000h 
0046A040  jne         0046A04B
0046A042  mov         eax,dword ptr [eax+8] 
0046A045  pop         edx  
0046A046  add         esp,4 
0046A049  jmp         eax
0046A04B  pop         edx  
0046A04C  push        30000h 
0046A051  jmp         79EED9A8 
0046A056  call        79F02065 
0046A05B  jmp         0046A01A

Again, let's focus on the important parts. Without getting into much detail, the purpose of this code is to check whether the current object's type matches the last object's type (note that unlike the previous snippet, the check here is not against a literal constant address – it is calculated from the object pointer itself). If there's a match, the location to jump to is calculated, and the JMP EAX at 0046A049 executes. If there isn't a match, the JIT's back-patching code is called again, and the process repeats.

Note that this code is less efficient than the state we had before the counter dropped below 0: back then, we had a direct jump to a literal constant address, whereas now the jump address is calculated from the object pointer itself. Also note that this time there is no counter – on every miss, the type this code expects changes. Summarized in pseudo-code, this looks somewhat like the following:

start: if (obj->Type == expectedType) {
      // Jump to the expected implementation
}
else {
      expectedType = obj->Type;
      goto start;
}

This is the final behavior that we get for the entire remaining course of the program. It means that per call site, we get a one-shot counter (initialized to 100) which counts the times the "hot" implementation was missed. After the counter decays below 0, JIT back-patching is triggered, and the code is replaced with the version we just saw, which swaps in a new "hot" implementation on every miss.

Note that a stub like the one at 002C0010 is generated for each call site that dispatches an interface call. This means that the counter data and the related code form a per-call-site optimization, which can be highly valuable in certain scenarios (see the sketch below).
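
To illustrate what per-call-site means in practice, consider the following hypothetical sketch (my own code, assuming the IOperation interface from the listing above): one method funnels both implementations through a single call site, while another gives each implementation its own call site.

C#
class CallSites
{
    private static volatile int _i;

    // A single call site that alternates between And and Or:
    // its dispatch stub keeps missing the "hot" method table.
    static void OnePolymorphicSite(IOperation[] ops)
    {
        foreach (IOperation op in ops)
            _i = op.Do(5, 3);
    }

    // Two call sites, each of which only ever sees one concrete type:
    // each stub's expected-method-table check almost always hits.
    static void TwoMonomorphicSites(IOperation[] ands, IOperation[] ors)
    {
        foreach (IOperation op in ands)
            _i = op.Do(5, 3);   // call site #1: only And flows here
        foreach (IOperation op in ors)
            _i = op.Do(5, 3);   // call site #2: only Or flows here
    }
}

Whether the difference is actually measurable is exactly what the benchmark below sets out to test.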

Testing the above hypothesis is fairly easy: write a test program that performs interface method dispatching. In one mode, the program calls the first implementation in a loop and then calls the second implementation in a loop; in the other mode, it interleaves every call to the first implementation with a call to the second. Given the behavior of the final dispatching code that we just saw, it is reasonable to expect the first test case to perform better than the second. This was tested using the following program:

C#
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

interface IOperation
{
    int Do(int a, int b);
}
class And : IOperation
{
    public int Do(int a, int b) { return a & b; }
}
class Or : IOperation
{
    public int Do(int a, int b) { return a | b; }
}
class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Method(IOperation op)
    {
    _i = op.Do(5, 3);
    }
    static readonly int NUM_ITERS = 100000000;
    static readonly int HALF_ITERS = NUM_ITERS / 2;
    static volatile int _i;
    static void Main()
    {
        IOperation and = new And();
        IOperation or = new Or();
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < HALF_ITERS; ++i)
        {
            Method(and);
            Method(and);
        }
        for (int i = 0; i < HALF_ITERS; ++i)
        {
            Method(or);
            Method(or);
        }
        Console.WriteLine("Sequential: {0} ms", sw.ElapsedMilliseconds);
        sw.Reset();
        sw.Start();
        for (int i = 0; i < HALF_ITERS; ++i)
        {
            Method(and);
            Method(or);
        }
        for (int i = 0; i < HALF_ITERS; ++i)
        {
            Method(and);
            Method(or);
        }
        Console.WriteLine("Interleaved: {0} ms", sw.ElapsedMilliseconds);
    }
}

The test results on my laptop averaged 2775ms for the sequential case and 2960ms for the round-robin (interleaved) case. The difference was consistent, but it's obviously not very dramatic. Therefore, I conclude for now that the two usage patterns have little (if any) effect on the program's performance, especially if the methods are more "chunky" than just a single x86 instruction.

Other Tidbits

In the interest of completeness: the 64-bit JIT produces the same kind of code for the sealed dispatch scenario as the 32-bit one, so this seems to be a deliberate design decision. If you're curious what a 64-bit virtual method dispatch looks like, here it is:

ASM
00000642`8015047d 488b03 mov rax, qword ptr [rbx]
00000642`8015048b 488bcb mov rcx, rbx
00000642`8015048e ff5060 call qword ptr [rax+60h]

So again, we're calling through the method table even though the static and dynamic types are known in advance. (RBX holds the parameter value; RCX is set up to the same value because it has to hold this; and then the call goes through the method pointer at [RAX+60h].)

History

  • Version 1 - May 2008.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


