Brief considerations
First of all, I apologize for my imperfect English, since my native language is Portuguese.
The source code for this article is in VC++6 project format, but the article also applies to other versions, like VC++ .NET and even VC++5.
I'm assuming that you have prior knowledge of generating, using, and interpreting listing files, as well as some minimal knowledge of Assembly language. I will refrain from explaining these details here, but if you are not familiar with listing files, you can get this information by reading this very good article.
Introduction
We all know that many compilers give us several options for code optimization, including the Visual C++ and even the Visual Basic compilers. We also know that two kinds of optimization are most commonly used:
- Minimize Size
- Maximize Speed
We use them relatively frequently, but we rarely stop to think about how all this works, or when we should prefer one over the other. At first glance, we ask: how can the compiler build different object code if our source code is the same?
To answer this question, we need to examine what is in fact generated by the compiler. This can be done through the listing files.
Inside optimizations
To start, let's examine the following code:
#include <stdio.h>

char* hello = "Hello World!";

int main(void)
{
    long seip, eeip;

    __asm {
        xor  eax, eax
        call jump1          ; pushes the address of jump1 (the current EIP) onto the stack
    jump1:
        mov  eax, [esp]     ; read that address back from the stack
        mov  seip, eax      ; seip = EIP at the start of the measured code
    }

    printf("%s\n", hello);  /* the code being measured */

    __asm {
        xor  eax, eax
        call jump2          ; pushes the address of jump2 (the current EIP) onto the stack
    jump2:
        xor  eax, eax
        mov  eax, [esp]     ; read that address back from the stack
        sub  eax, 0Dh       ; subtract the 13 bytes of measurement instructions surrounding the printf
        mov  eeip, eax      ; eeip = EIP at the end of the measured code
    }

    printf("Code size: %d bytes", eeip - seip);
    return 0;
}
The purpose of this code is simple: we obtain the size, in bytes, of the machine code generated for the source code being analyzed (in this case, just a printf), through the difference between the EIP address before and after our code. Notice that MASM and, consequently, inline assembly code in Visual C++ do not allow direct access to EIP, so this access was emulated through a call instruction (which is equivalent to: push EIP + jmp LABEL).
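To see the trick in isolation, here is a minimal sketch (not part of the example project; the label name is arbitrary, and here the return address is popped straight back off the stack):

#include <stdio.h>

int main(void)
{
    long cur_eip;
    __asm {
        call here           ; pushes the address of the "here" label (the current EIP)
    here:
        pop  eax            ; read it back, restoring the stack pointer
        mov  cur_eip, eax
    }
    printf("EIP at label: 0x%08lX\n", cur_eip);
    return 0;
}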
Compiling the code with size optimization (select the OptSize configuration in the example project) and running the program, we get the following output:
$ Hello World!
$ Code size: 18 bytes
Compiling with speed optimization (select the OptSpeed configuration in the example project) and running the program, we get the following output:
$ Hello World!
$ Code size: 19 bytes
We now have 1 more byte of machine code, without having added anything to the source code!
Let's examine the listing files of both compilations to see how this "magic" happened (using the listing option Assembly, Machine Code, and Source; the listing files will have a .cod extension).
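For reference, equivalent listings can also be generated from the command line; this is only a sketch, assuming cl.exe is on the PATH and using an illustrative source file name:

rem Minimize Size (/O1), with an "Assembly, Machine Code, and Source" listing (/FAcs produces a .cod file)
cl /c /O1 /FAcs example.c

rem Maximize Speed (/O2)
cl /c /O2 /FAcs example.c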
Case A: Minimize Size
00012 ff 35 00 00 00 00 push DWORD PTR ?hello@@3PBDB
00018 68 00 00 00 00 push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@
0001d e8 00 00 00 00 call _printf
00022 59 pop ecx
00023 59 pop ecx
Case B: Maximize Speed
00013 a1 00 00 00 00 mov eax, DWORD PTR ?hello@@3PBDB
00018 50 push eax
00019 68 00 00 00 00 push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@
0001e e8 00 00 00 00 call _printf
00023 83 c4 08 add esp, 8
In Case A, the value of our variable hello (a pointer to the string) is pushed onto the stack directly from memory, consuming 6 bytes of code, while in Case B it is first copied into eax and then pushed through push eax, also consuming 6 bytes, but using 2 instructions (mov + push). We now ask: what's the difference? Take a look at the following tables:
Case A: Minimize Size
INSTRUCTION | CLOCKS* | SIZE (BYTES) |
push DWORD PTR ?hello@@3PBDB | 4 | 6 |
push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@ | 1 | 5 |
call _printf | 3 | 5 |
pop ecx | 1 | 1 |
pop ecx | 1 | 1 |
TOTALS | 10 | 18 |
Case B: Maximize Speed
INSTRUCTION | CLOCKS* | SIZE (BYTES) |
mov eax, DWORD PTR ?hello@@3PBDB | 1 | 5 |
push eax | 1 | 1 |
push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@ | 1 | 5 |
call _printf | 3 | 5 |
add esp, 8 | 1 | 3 |
TOTALS | 7 | 19 |
* Clock counts for the program running on an i486 machine
So, for the same source code, we have in Case A 18 bytes of code consuming 10 machine cycles, and in Case B 19 bytes of code consuming only 7 machine cycles.
Pay attention to the first instruction in Case A and the first 2 instructions in Case B: in Case A only one instruction is used (push), but it involves a memory operand (DWORD PTR), and this consumes more cycles than using a register as the operand. That is what Case B does, by loading the value from memory into a register (eax, in this case) and then pushing that register. Even though two instructions are used instead of one, the operation executes faster. This is, basically, the "magic" behind the optimization: when the objective is speed, give maximum priority to operations involving registers directly and avoid (when possible) operations involving memory operands.
But we can get more from this code. Take a look at the stack frame creation for both Cases A and B:
Case A: Minimize Size
INSTRUCTION | CLOCKS* | SIZE (BYTES) |
push ebp | 1 | 1 |
mov ebp, esp | 1 | 2 |
push ecx | 1 | 1 |
push ecx | 1 | 1 |
TOTALS | 4 | 5 |
Case B: Maximize Speed
INSTRUCTION | CLOCKS* | SIZE (BYTES) |
push ebp | 1 | 1 |
mov ebp, esp | 1 | 2 |
sub esp, 8 | 1 | 3 |
TOTALS | 3 | 6 |
* Clock counts for the program running on an i486 machine
Notice that the last 2 push instructions in Case A do the same thing as the sub instruction in Case B, only a little slower and in slightly less space. Since ecx is a 32-bit register (4 bytes), after the 2 push instructions the stack pointer has been displaced by 8 bytes (we have two dword local variables), which is the same effect as sub esp,8 in Case B. This, by itself, explains why our code in Case A has a pop sequence, while Case B has an add instruction to restore the stack (remember: we are using the C calling convention).
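For readers less familiar with this detail, here is a minimal sketch (not part of the example project; the function name is illustrative) of what the C calling convention implies for the caller:

#include <stdio.h>

/* __cdecl is the default for C functions: arguments are pushed right to
   left and the CALLER removes them after the call returns. That is why,
   right after "call _printf", Case A emits "pop ecx / pop ecx" and
   Case B emits "add esp, 8": both discard the same 8 bytes of arguments. */
int __cdecl report(const char* name, long value)
{
    return printf("%s = %ld\n", name, value);
}

int main(void)
{
    report("answer", 42L);  /* the compiler cleans up the 8 bytes of arguments here */
    return 0;
}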
Mixing models: Hybrid optimization
Extending the previous analysis, notice that, for the same operation, we have a code size of 6 bytes that consumes 4 cycles in Case A and only 2 cycles in Case B. Why not substitute this instruction in Case A with the instructions from Case B? The result is a hybrid optimization model, where the code is simultaneously small and fast. Applying this substitution (manually), we have:
Hybrid Optimization Model
INSTRUCTION | CLOCKS* | SIZE (BYTES) |
mov eax, DWORD PTR ?hello@@3PBDB | 1 | 5 |
push eax | 1 | 1 |
push OFFSET FLAT:??_C@_03HHKO@?$CFs?6?$AA@ | 1 | 5 |
call _printf | 3 | 5 |
pop ecx | 1 | 1 |
pop ecx | 1 | 1 |
TOTALS | 8 | 18 |
* Clock counts for the program running on an i486 machine
We keep the same 18 bytes of code as in Case A, but now consuming 8 cycles, which is close to the value of Case B! You can now use MASM to assemble this new code (a .bat file is included in the example for this operation).
However, consider this hybrid model just as an exercise to help understand this study, because in a real, extensive project this manual process becomes unviable and error-prone.
Optimization models: Choosing the best solution
Now, the questions that "stay resident" are:
- What difference does it make if my code executes in 3 cycles more or less?
- Isn't it more convenient to always choose the minimize-size model?
It's a frequent doubt, but one that can be answered by keeping in mind factors like these:
- This is just a small example, where only a simple function is executed. Now imagine a real program, with countless operations. You can then imagine, proportionally, how many machine cycles can be saved by maximizing speed (or how many bytes of code can be saved by minimizing size).
- Imagine our example in a loop with 1 million iterations. Each cycle required by the code itself is required not once, but 1 million times before the loop ends! On a modern computer this may be imperceptible, but when running the program on an old 80386, for example, this difference is relevant (see the sketch after this list).
- When operating in protected mode, many concurrent tasks are executed and, in critical and complex tasks, such as extensive mathematical operations in engineering applications, fast code may be more adequate.
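To make the arithmetic of the loop argument concrete, here is a hypothetical sketch (the loop, the iteration count, and the clock estimate are illustrative and not part of the example project):

#include <stdio.h>

char* hello = "Hello World!";

int main(void)
{
    long i;
    /* Saving the 3 cycles measured above at this call site saves about
       3,000,000 cycles over the whole loop -- roughly 0.09 s on a 33 MHz
       i486, before even counting the cost of printf itself. */
    for (i = 0; i < 1000000; i++)
        printf("%s\n", hello);
    return 0;
}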
Many other factors can be enumerated to justify the use of one model or the other, and this demonstrates that choosing the correct optimization model must be done with factors like these in mind. It's a task that is at once objective and subjective, one that is often forgotten or not considered, but that can determine the success of our projects.
Better code is made when the sun is on the other side of the world. Let the night fall, then work.