The Intel Assembly Manual

Michael Chourdakis

Jan 10, 2019

CPOL

All in one: x86, x64, Virtualization, multiple cores, along with new additions

Github link: https://github.com/WindowsNT/asm. The entire project.

Introduction

This is my full and final article about Intel assembly. It includes all the previous hardware articles (Internals, Virtualization, Multicore, DMMI) along with some new information (HIMEM.SYS, Flat mode, EMM386.EXE, Expanded Memory, DPMI information).

Reading this through will enable you to understand how operating systems work, how memory is allocated and addressed and, perhaps, how to make your own OS-level drivers and applications.

To help you understand what's happening, the GitHub project includes many aspects of the article (and I'm still adding stuff). It's a ready-to-run tool which includes a Bochs binary, VMware and VirtualBox configurations and a Visual Studio solution. The entire project is built in assembly using Flat Assembler.

Assemblers like TASM or MASM will not work, for they only support specific architectures.

Bochs is the best environment to experiment, because it includes a hardware GUI debugger (I'm proud of developing it myself) which can help you understand the internals. Debugging without Bochs is impossible, because the debuggers are either real mode only (like MSDOS Debug) and assume you will always have some sort of control (which is not the case in most debugging areas), or are able to run only in an existing environment (like Visual Studio).

If you have good C knowledge, then this will be a benefit in understanding the internals. Assembly knowledge is recommended, but you can follow the article even if you know nothing about assembly.

Generic Information

Architecture and CPU

Assembly is a language in which everything must be done manually. A single printf() call may take thousands of assembly instructions to execute. While this article does not attempt to teach you assembly, bear in mind that a lot of work is needed to achieve even the smallest result (that is actually why higher level languages were created). Assembly language is also specific to the architecture (here, we discuss Intel x86 and x64), whereas a language like C is portable.

Assembly has a small (comparatively) set of commands:

Commands that move data between various places
Commands that execute mathematical operations (simple to complex)
Commands that check conditions (like if)
Other commands (to be later discussed)

The CPU is the unit that executes assembly instructions. The way they are executed depends on the running mode of the processor, and there are 4 modes:

Real mode
Protected mode (in two versions, segmented and flat)
Long mode
Virtualization (not exactly a mode, but we will talk about it later)

The next paragraphs in this chapter discuss various elements of the assembly language in general.

Memory

Physically, the memory is one big array. If you have 4GB, you could describe it as unsigned char mem[4294967296]. However, the way it is used greatly differs depending on the processor mode and the configuration of the operating system. Therefore, you do not access it as a big array.

Stack and Functions

The stack is special memory that is set up for temporary storage. Parameters passed to a function are "pushed" to the stack and, when the function ends, they are "popped", so the stack clears. A C function's local variables also go there, which is why they vanish when the function terminates. The stack memory is, technically, nothing but normal memory used for special purposes.

This is (oversimplified for now) what approximately happens in assembly with a function:

int x(int a,int b)
{
return a + b;
}

int c = x(5,10); // result c = 15

x:
mov ax,[first stack element]
mov bx,[second stack element]
add ax,bx
ret 4

main:
push 5
push 10 ; the order is different, but let's forget about that now
call x
; ax contains the result

The variables "a" and "b" are "pushed" to temporary memory (which is now 4 bytes less if int = 16 bits). The function is called, and then it returns with the stack cleared and ax containing the return value. Note that the above is a big oversimplification of what the assembly code actually looks like, but let's pass for now.

Registers

In addition to memory, each CPU has some auxiliary places to store data, called registers. What registers are available depends on the current running mode. Some registers have special meanings, some are for generic purposes.

Interrupts

An interrupt is code that interrupts other running code. For the moment, just assume it's a function that can run while you are inside another function. There are interrupts that are automatically generated by the CPU (either by hardware or when an exception occurs), and interrupts that are "called" by software. The way they work depends on the running mode, and there can be a maximum of 256 interrupts (numbered 0 to 255).

Exceptions

An exception is an interrupt triggered by either the CPU (for example, when a divide by zero occurs in your C++ code, int 00 functions are executed), or by using the API (via the throw keyword, for example), which generates a software interrupt. In the lower level we are discussing, there is no difference between exceptions and interrupts.

Now that we have an idea of the basics, let's proceed to CPU modes.

Real Mode

Architecture

Real mode is the oldest mode. DOS runs in it. Windows 3.0 also runs in it when started with the /r switch. Everything is 16-bit. It is the weakest mode of operation, but not the simplest one. Memory is addressed through a 20-bit address bus, making it possible to access up to 1MB of memory. Available memory over this limit is useless in real mode.

Segmentation

Memory is not accessed as an array, but in segments. Each pointer is described by a 16-bit segment, which is a memory address divided by 16, and a 16-bit offset, which describes how far from the start of the segment we will go. Here are some simple examples (in hex):

0000:0000 -> memory address 0
0000:0010 -> memory address 16 (hex 10)
0001:0002 -> memory address 18. Segment 1*16 + offset 2
0010:0034 -> 0x10*16 + 0x34
0011:0024 -> 0x11*16 + 0x24, same pointer as above
FFFF:000F -> Maximum available address (0xFFFFF). Specifying an offset of 0010h or more wraps around to zero.

We can see that segments can overlap. Specifying the 0FFFFh segment and an offset of 0010h or larger results in wrapping. A segment's maximum capacity is 64KB. Although we can go up to an FFFF segment, only the lower 640KB were available for DOS applications, because the upper segments (0xA000 and above) were reserved for video memory and the BIOS.
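The segment arithmetic can be sketched in C. This is only an illustration: the function name `real_mode_linear` and the `a20_enabled` flag are my own assumptions and not part of the project. The flag models whether a 21st address line is available, which the A20 section later in the article covers.

```c
#include <stdint.h>

/* Hypothetical helper: converts a real mode segment:offset pair to a
   linear address. When a20_enabled is 0, the result is truncated to
   20 bits, which models the wrap-around to zero. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset, int a20_enabled)
{
    uint32_t linear = ((uint32_t)segment << 4) + offset;
    if (!a20_enabled)
        linear &= 0xFFFFFu;   /* keep address lines A0-A19 only */
    return linear;
}
```

With this helper, 0010:0034 and 0011:0024 give the same linear address, which is the overlap discussed above.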

All segments have read/write/execute access from anywhere (that is, any program can read/write or execute code within any segment). Any application can read from or write to any part of memory, including the part in which the OS resides. That is why a real mode OS is a single tasking OS and if one app crashes, you have to reboot.

Registers

Real mode registers are 16 bits, and they include:

Four generic purpose registers: AX, BX, CX, DX. The upper 8 bit part of them can be accessed as AH, BH, CH, DH and the lower part as AL, BL, CL, DL.
A register to hold the offset of the currently executing code: IP.
Four registers to be used as pointers: SI, DI, BP, SP. SP points to the top of the stack, that is, the most recently pushed item (it cannot be used as an index like the rest). Each time we push something to the stack, SP decreases. On POP, SP increases. These registers have no 8-bit splits.
Four registers to contain segments: CS, always holding the segment of the currently executing code, DS, ES and SS. SS holds the segment of the stack memory, DS holds the segment of the data, and ES is an auxiliary register.

So the code is always executing at CS:IP, and stack is pointed by SS:SP.

The 386 CPU adds more registers, also accessible in real mode:

32 bit extensions to the non segment registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EIP.
Two more auxiliary segment registers, GS and FS.
Control registers: CR0, CR2 and CR3 (CR1 is reserved, and CR4 was added in later processors).
6 debug registers, DR0, DR1, DR2, DR3, DR6, DR7, used for hardware breakpoints.

DS is the default data segment, unless another segment is specified or SP or BP is used:

mov ax,[100] ; gets value from DS:100
mov ax,[si] ; gets value from DS:SI
mov ax,[es:si] ; from ES:SI
; When BP or SP is used, SS is the default.

ESI, EDI, EBP and ESP can be used as pointers. If their high bits are not zero, then an exception occurs (unless you are in Unreal mode, discussed below).

String instructions that store data (movsb, stosw, etc.), with or without a REP prefix, write to ES:DI; ES is always the segment used for the destination.

COM and EXE files

A COM file is a memory map, fitting in one segment. The first 256 bytes contain the PSP, a data structure containing information, and the rest of the segment contains all code, data, and stack memory for the program. CS = DS = ES = SS. SP is set to 0xFFFE to point to the end of the segment. Execution starts from CS:IP = 0x100 (after the PSP).

An EXE file might have multiple segments, so an EXE can be more than 64KB. DS and ES initially point to the PSP. When an EXE is loaded, "relocations" are resolved. A relocation is a position within the executable that the assembler leaves as empty, to be filled with a segment value which would only be known at run time.

Interrupts

All the functions that DOS and the BIOS provide are available through real mode software interrupts. In real mode, the first 1024 bytes of RAM (starting at 0000:0000) contain a set of 256 segment:offset pointers, one for each interrupt. In 286+ this location can be changed by the LIDT instruction, which points to a 6-byte array:

Bytes 0-1 contain the full length of the IDT, maximum 1KB => 256 entries.
Bytes 2-5 contain the physical address of the first entry of the IDT, in memory.
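To make the real mode table layout concrete, here is a minimal C sketch of looking up a handler. The names `far_pointer` and `ivt_lookup` are hypothetical, and `memory` is assumed to be a copy of the first 1024 bytes of RAM. Each vector occupies 4 bytes, with the offset word stored before the segment word.

```c
#include <stdint.h>

struct far_pointer { uint16_t segment; uint16_t offset; };

/* Hypothetical lookup: memory models the first 1024 bytes of RAM. Each
   vector takes 4 bytes: the offset word first, then the segment word,
   both little-endian. */
struct far_pointer ivt_lookup(const uint8_t *memory, uint8_t vector)
{
    const uint8_t *entry = memory + (uint32_t)vector * 4u;
    struct far_pointer p;
    p.offset  = (uint16_t)(entry[0] | (entry[1] << 8));
    p.segment = (uint16_t)(entry[2] | (entry[3] << 8));
    return p;
}
```

For example, the handler for int 21h is read from bytes 0x84 to 0x87 of the table.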

Some interrupts are automatically issued by the processor when some event occurs. In real mode, the most significant are:

Interrupt 0, called on divide by zero.
Interrupt 1, called when using a debugger for single step.
Interrupt 3, called on breakpoints.
Interrupt 6, called on invalid opcode.
Interrupt 9, called on key press.

Software interrupts provide various services to real mode apps. The most important interrupts are:

0x10, BIOS display functions
0x13, BIOS disk functions
0x14, BIOS serial port functions
0x16, BIOS keyboard functions
0x17, BIOS parallel port functions
0x21, DOS functions (files, input, output, application, configuration etc)
0x2F, TSR functions
0x31, DPMI functions
0x33, Mouse functions

Using the excellent Ralf Brown Interrupt List you can learn about every interrupt in the world.

Models

Because of the segmented memory, different sets of programming models were created, which mostly resulted in incompatibilities between compilers and libraries. C pointers were described as near or far, depending on whether they included a segment or not:

The tiny model. Everything has to be included in a single segment (COM file). Pointers are near.
The small model. One segment for the code, one for the data. All pointers are near.
The medium model. One data segment, multiple code segments. Code pointers far, data pointers near.
The compact model. One code segment, multiple data segments. Code pointers near, data pointers far.
The large model. Multiple code and data segments, code and data pointers far. Single data structures still limited to 64KB.
The huge model. Multiple code and data segments, all pointers far, and single data structures can exceed 64KB.

Benefits

The only benefit in real mode is that you have DOS and BIOS functions available as software interrupts. Therefore, all techniques used by DOS extenders (which allowed applications to run in protected mode) involved temporarily switching to real mode to call DOS.

Here is a quick hello world in tiny model:

org 0x100 ; code starts at offset 100h
use16               ; use 16-bit code
mov ax,0900h
mov dx,Msg
int 21h
mov ax,4c00h
int 21h
Msg db "Hello World!$"

This very simple program calls two DOS functions. The first is function 9 (ah register) which accepts a pointer of the string to be written to the screen in DS:DX (DS already has the segment, it's a com file). The second is function 4C, which terminates the program.

Here is the same application in EXE format:

FORMAT MZ               ; DOS 16-bit EXE format
ENTRY CODE16:Main       ; Specify Entry point (i.e. the start address)
STACK STACK16:stackdata ; Specify The Stack Segment and Size
    
SEGMENT CODE16_2 USE16  ; Declare a 16-bit segment
    
    ShowMsg:
        mov ax,DATA16
        mov ds,ax            ; Load DS with our "default data segment"
        mov ax,0900h    
        mov dx,Msg    
        int 21h;            ; Call a DOS function: AX = 0900h (Show Message), 
                            ; DS:DX = address of a buffer, int 21h = show message 
    retf                    ; FAR return; we were called from 
                            ; another segment so we must pop IP and CS.
    
SEGMENT CODE16 USE16         ; Declare a 16-bit segment
    ORG 0                    ; Says that the offset of the first opcode 
                             ; of this segment must be 0.
    
    Main:
        call CODE16_2:ShowMsg ; Far call to a procedure in another segment.
                              ; CS/IP are pushed to the stack.
        mov ax,4c00h          ; Call a DOS function: AX = 4c00h (Exit), int 21h = exit
        int 21h
    
SEGMENT DATA16 USE16
    Msg db "Hello World!$"
        
SEGMENT STACK16 USE16
    dw 1024 dup(0)            ; Reserve 2048 bytes for the stack.
    stackdata:                ; Initial SP (the stack grows downwards). When the program
                              ; is initialized, SS and SP are automatically set.

How does the assembler know the actual value of the DATA16, CODE16, CODE16_2 and STACK16 segments? It doesn't. It writes placeholder values and creates entries in the EXE file (known as "relocations"), so that the loader, once it has copied the code to memory, can write the true segment values at the specified addresses. And because this relocation map lives in the EXE header, which COM files do not have, COM files cannot have multiple segments even if they sum to less than 64KB in total.
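The loader's relocation step can be sketched roughly as follows. This is a simplified model and not the real DOS loader: `reloc_entry` and `apply_relocations` are hypothetical names, and I assume that each stored word holds a segment value relative to the start of the image, to which the loader adds the segment where the program was actually loaded.

```c
#include <stddef.h>
#include <stdint.h>

/* One relocation entry: where in the loaded image a segment value lives. */
struct reloc_entry {
    uint16_t offset;   /* offset of the word to patch */
    uint16_t segment;  /* segment of the word, relative to the image start */
};

/* Hypothetical loader step: adds load_segment to every word named by
   the relocation table. image holds the program as copied to memory. */
void apply_relocations(uint8_t *image, size_t image_size,
                       const struct reloc_entry *table, size_t count,
                       uint16_t load_segment)
{
    for (size_t i = 0; i < count; i++) {
        size_t pos = ((size_t)table[i].segment << 4) + table[i].offset;
        if (pos + 1 >= image_size)
            continue;  /* ignore entries that point outside the image */
        uint16_t value = (uint16_t)(image[pos] | (image[pos + 1] << 8));
        value = (uint16_t)(value + load_segment);
        image[pos] = (uint8_t)(value & 0xFF);
        image[pos + 1] = (uint8_t)(value >> 8);
    }
}
```

In the example above, the operand of mov ax,DATA16 is one such word: it only gets its final value after this step has run.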

This program calls a function ShowMsg in another segment via a far call, which uses a DOS function (09h, INT 21h) to display text.

Problems

If multiple applications are running, one application can overwrite any other without any notification.
Up to 1MB memory only, and the upper 384K were used by BIOS, so only 640K available.
Mixing far and near pointers between applications and libraries led to incompatibilities and, usually, crashes.
If something wrong happens, the PC has to reboot.

Expanded Memory

To cope with the 640KB limitation, an additional compatible memory, called expanded memory or EMS memory, was created. This was not a processor feature, but rather a hardware extension (an ISA card) which included a driver to perform bank switching, i.e., to replace portions of the installed memory with memory from that card. It offered up to 32MB more, but it was mapped to one of the high segments (A000, B000, C000, D000, E000 or F000), which means that this extra memory could not all be available simultaneously. The expansion card came with a driver which had to be installed in config.sys and, using the LIM EMS protocol, offered its services via interrupt 67h.

Detecting EMS, by testing existence of a device called EMMXXXX0:

EMSName db 'EMMXXXX0',0
mov  dx,EMSName       ; device driver name
mov  ax,3D00h                ; open device-access/file sharing mode
int  21h
jc   NotThere
mov  bx,ax                   ; put handle in proper place
mov  ax,4407h                ; IOCTL  -  get output status
int  21h
jc   NotThere
cmp  al,0FFh
jne  NotThere
mov  ah,3Eh                  ; close device
int  21h
jmp  ItIsThere

Allocating EMS

Interrupt 0x67, AH = 0x43, BX = # of pages (1 page = 16KB)

Detect segment to be used

Interrupt 0x67, AH = 0x41

Save previous EMS map

Interrupt 0x67, AH = 0x47

Map our allocated memory

Interrupt 0x67, AH = 0x44

Restore previous EMS map

Interrupt 0x67, AH = 0x48

Release EMS

Interrupt 0x67, AH = 0x45

Various other functions are provided by int 0x67.

A20 line

The maximum address in real mode is FFFF:000F, because a larger offset results in wrapping. That is true because the 8088 CPU has only 20 address lines. However, the 286+ added a 21st line (known as the A20 line) and, when it is enabled, FFFF:0010 to FFFF:FFFF can be used without wrapping (almost 64KB more). This memory (known as the High Memory Area, HMA) is now accessible from real mode, and it can be used by HIMEM.SYS to load parts of DOS in it and therefore make more low memory available for applications.

Enabling or disabling A20 manually requires us to communicate with the keyboard controller:

WaitKBC:
   mov cx,0ffffh
   A20L:
   in al,64h
   test al,2
   loopnz A20L
ret

ChangeA20:
   call WaitKBC
   mov al,0d1h
   out 64h,al
   call WaitKBC
   mov al,0dfh ; use 0dfh to enable and 0ddh to disable.
   out 60h,al
ret

Segmented Protected Mode

Architecture

Protected mode solves the real mode problems. In particular:

Up to 16 MB (286) and up to 4GB (386+) are directly accessible.
Memory access is checked, protections and protection levels are available.
If something wrong happens, the problem can be isolated and the rest of the applications are not affected.
There is 16-bit protected mode (286+) and 32-bit protected mode (386+).

DOS never ran in protected mode. Windows 3.0 ran in 16-bit segmented protected mode when started with the /s switch. Windows 95+, Linux and the rest of the 32-bit OSes run in flat protected mode, but before examining flat mode we will immerse ourselves in the complex mechanisms that protected mode has. Flat mode greatly simplifies many complex things in normal segmented protected mode.

Protected mode introduces "rings", that is, levels of authorization. There are four rings (Ring 0, 1, 2 and 3), of which Ring 0 is the most privileged and Ring 3 is the least privileged. Code running in a less privileged ring cannot access (without the OS supervision) code in a more privileged ring.

Memory

A segment in memory is no longer fixed, nor does it have a fixed 64KB size. A protected mode segment can have any size, from 1 byte to 4GB. Each segment has its own limitations (read, write, execute access) and its own protection ring.

Registers

The same set of registers that exist in real mode are available. Also, every register can be used as an index, for example mov ax,[ebx] will work.

Global Descriptor Table

The Global Descriptor Table (GDT) is a set of entries that describes all segments for the CPU. Each entry is 8 bytes long and has the following format:

Bits	Meaning
0-15	Limit low 16 bits
16-31	Base low 16 bits
32-39	Base medium 8 bits
40	Ac
41	RW
42	DC
43	Ex
44	S
45-46	Priv
47	Pr
48-51	Limit upper 4 bits
52-53	Reserved (0)
54	Sz
55	Gr
56-63	Base upper 8 bits

The base is a 32-bit value that indicates the physical memory that this segment starts at.
The limit is a 20-bit value indicating the length of the segment, depending on the Gr bit. If the Gr bit is 1, the limit is counted in 4096-byte units, so the actual limit is the limit value * 4096.
The Ex flag is 1, to indicate a code segment, or 0, to indicate a data segment.
The DC flag has different meaning, depending on the Ex flag:
- For code segment (Ex = 1), if DC is 0 then the segment is non conforming. A non conforming segment can only be called from a segment with the same privilege level. If DC is 1 then the segment is conforming and can also be called from less privileged segments. For example, a ring 2 conforming segment can be called from a ring 3 segment; the code keeps running at the caller's privilege level.
- For data segment (Ex = 0), if DC is 0 then the data segment expands up, else it expands down. For an expanding down segment, it starts from its limit and ends to its base, with the address going the reverse way. This flag was created so a stack segment could be easily expanded, but it is not used today.
The RW flag has different meaning, depending on the Ex flag:
- For code segment (Ex = 1), if 0, then the segment is not readable. If 1, then the code segment is readable.
- For data segment (Ex = 0), if 0, segment is read only, else read-write.
  
  Note that a code segment is not writable. However, because segment base addresses can overlap, you can create a writable data segment with the same base address and limit of a code segment.
The Priv field indicates the ring of the segment (00 to 11).
The Ac bit indicates access. The CPU sets this bit when the segment is accessed, so the OS gets an idea of how frequently the segment is accessed and knows whether it can swap it out to disk or not.
The S bit must be 1 for code and data segments, and 0 for system segments (see below).
The Pr bit can be set to 1 to indicate that the segment is present in memory. If the OS caches this segment to the disk, then it sets Pr to 0. Any attempt to access the removed segment causes an exception. The OS catches this exception, and reloads the segment to memory, setting Pr to 1 again.
The Sz bit can have two values:
- 0, in which case the default for opcodes is 16-bit. The segment can still execute 32-bit commands (386+) by putting the 0x66 or 0x67 prefix to them.
- 1 (386+), in which case the default for opcodes is 32-bit. The segment can still execute 16-bit commands by putting the 0x66 or 0x67 prefix to them.
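The bit layout can be checked with a small C sketch that packs a descriptor into a 64-bit value. The function `gdt_entry` and the `GDT_*` constants are hypothetical names chosen to match the table, and this is an illustration rather than code from the project.

```c
#include <stdint.h>

/* Access byte bits (descriptor bits 40-47), named as in the table. */
enum {
    GDT_AC = 1 << 0, GDT_RW = 1 << 1, GDT_DC = 1 << 2, GDT_EX = 1 << 3,
    GDT_S  = 1 << 4, GDT_PR = 1 << 7
};
/* Flag bits stored in descriptor bits 52-55. */
enum { GDT_SZ = 1 << 2, GDT_GR = 1 << 3 };

/* Hypothetical helper: packs base, 20-bit limit, access byte and flags
   into the 8-byte descriptor layout. ring (0-3) goes to bits 45-46. */
uint64_t gdt_entry(uint32_t base, uint32_t limit, uint8_t access,
                   uint8_t ring, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);                    /* bits 0-15  */
    d |= (uint64_t)(base & 0xFFFFu) << 16;               /* bits 16-31 */
    d |= (uint64_t)((base >> 16) & 0xFFu) << 32;         /* bits 32-39 */
    d |= (uint64_t)(access | ((ring & 3u) << 5)) << 40;  /* bits 40-47 */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;         /* bits 48-51 */
    d |= (uint64_t)(flags & 0xFu) << 52;                 /* bits 52-55 */
    d |= (uint64_t)((base >> 24) & 0xFFu) << 56;         /* bits 56-63 */
    return d;
}
```

A present, readable, 32-bit ring 0 code segment covering the whole 4GB would then be gdt_entry(0, 0xFFFFF, GDT_PR | GDT_S | GDT_EX | GDT_RW, 0, GDT_SZ | GDT_GR).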

In real mode, the segment registers (CS, DS, ES, SS, FS, GS) specify a real mode segment. And you can put anything to them, no matter where it points. And you can read and write and execute from that segment. In protected mode, these registers are loaded with selectors. The selectors are indices to the GDT and have the following format:

Bits	Meaning
0-1	RPL. Requested protection level, must be equal to or lower than the segment PL.
2	0 to take the entry from GDT, 1 from the LDT (see below)
3-15	0-based index to the table.

In protected mode, you can't just put arbitrary values in the segment registers like in real mode. You must put valid values or you will get an exception. The one exception is the first entry in the GDT, which is always set to 0. The CPU does not read information from entry 0 and thus it is considered a "dummy" entry. This allows the programmer to put the 0 value in a segment register (DS, ES, FS, GS) without causing an exception.
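As a quick illustration of the selector layout, the following C helpers build and take apart a selector. The names are hypothetical; the RPL is assumed to occupy the two lowest bits, with the table bit right above it and the index in the remaining 13 bits.

```c
#include <stdint.h>

/* Hypothetical helpers for the selector layout:
   bits 0-1 RPL, bit 2 table indicator (0 = GDT, 1 = LDT), bits 3-15 index. */
uint16_t make_selector(uint16_t index, int use_ldt, uint8_t rpl)
{
    return (uint16_t)((index << 3) | (use_ldt ? 4u : 0u) | (rpl & 3u));
}

uint16_t selector_index(uint16_t selector) { return selector >> 3; }
int selector_is_ldt(uint16_t selector)     { return (selector >> 2) & 1; }
uint8_t selector_rpl(uint16_t selector)    { return selector & 3; }
```

This is why GDT selectors in ring 0 code usually look like 0x08, 0x10, 0x18: the index is simply multiplied by 8.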

The GDT is loaded to the CPU by executing the LGDT command, which points to a 6-byte array:

Bytes 0-1 contain the length of the GDT in bytes minus 1, maximum 64KB => 8192 entries.
Bytes 2-5 contain the physical address of the first entry of the GDT, in memory.

Interrupts

Each entry in the interrupt table is now 8 bytes long, having the following structure:

struc IDT_STR ofs0_15,sel,zero,flags,ofs16_31
{
 .ofs0_15 dw ofs0_15
 .sel dw sel
 .zero db zero
 .flags db flags            ; bits 0-3 gate type, bit 4 = 0, bits 5-6 DPL, bit 7 P
 .ofs16_31 dw ofs16_31
}

Each interrupt also has a protection level. The LIDT command has the same functionality as in real mode, pointing to a 6-byte array (containing the size and the physical location of the first entry).
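For readers more comfortable with C, the same 8-byte entry can be sketched as a struct. The field names follow IDT_STR; the helper `make_interrupt_gate` is hypothetical and assumes a present 32-bit interrupt gate (type 1110, listed under the system segment types later in the article).

```c
#include <stdint.h>

/* C view of the 8-byte entry; field names follow the IDT_STR structure. */
struct idt_entry {
    uint16_t ofs0_15;   /* handler offset, low 16 bits */
    uint16_t sel;       /* code segment selector of the handler */
    uint8_t  zero;      /* always 0 */
    uint8_t  flags;     /* bits 0-3 type, bit 4 = 0, bits 5-6 DPL, bit 7 P */
    uint16_t ofs16_31;  /* handler offset, high 16 bits */
};

/* Hypothetical helper: builds a present 32-bit interrupt gate (type 1110). */
struct idt_entry make_interrupt_gate(uint32_t handler, uint16_t selector, uint8_t dpl)
{
    struct idt_entry e;
    e.ofs0_15  = (uint16_t)(handler & 0xFFFFu);
    e.sel      = selector;
    e.zero     = 0;
    e.flags    = (uint8_t)(0x80u | ((dpl & 3u) << 5) | 0x0Eu);
    e.ofs16_31 = (uint16_t)(handler >> 16);
    return e;
}
```

A ring 0 gate built this way has a flags byte of 0x8E, a value you will often see hardcoded in small kernels.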

After the LIDT command is executed, real mode interrupts no longer work, so a real mode debugger is useless.

Local Descriptor Table

Local Descriptor Table (LDT) is a method for each application, on multitasking scenarios, to have a private set of segments, loaded with the LLDT assembly instruction. The LDT bit in the selector specifies if the segment loaded is from the GDT or from the LDT.

System Segments in the GDT

When the S bit in the GDT is 0, this indicates a system-related segment. In this case, GDT entries describe the following kinds of system segments:

Task Segments
Call Gates
Interrupt Gates
Trap Gates (same as interrupt gates, with the exception that when a trap occurs, interrupts are still enabled)

Bits 40-43 in a GDT entry have the following meaning:

0000 - Reserved
0001 - Available 16-bit TSS
0010 - Local Descriptor Table (LDT)
0011 - Busy 16-bit TSS
0100 - 16-bit Call Gate
0101 - Task Gate
0110 - 16-bit Interrupt Gate
0111 - 16-bit Trap Gate
1000 - Reserved
1001 - Available 32-bit TSS
1010 - Reserved
1011 - Busy 32-bit TSS
1100 - 32-bit Call Gate
1101 - Reserved
1110 - 32-bit Interrupt Gate
1111 - 32-bit Trap Gate

Call Gates

Call gates are a mechanism to switch from a low privilege code to a higher one, used for user-level code to call system-level code. You specify a 1100 type entry in the GDT with the following format:

struct CALLGATE
{
    unsigned short offs0_15;
    unsigned short selector;
    unsigned char argnum:5;  // number of arguments to copy to the stack from the current stack
    unsigned char r:3; // Reserved
    unsigned char type:5; // 01100 (S bit = 0, type = 1100)
    unsigned char dpl:2; // DPL of this gate
    unsigned char P:1; // Present bit
    unsigned short offs16_31;

};

Using CALL FAR with the selector of this call gate (the offset is ignored) will switch to the gate and execute the higher privilege level commands. If argnum specifies parameters to be copied, the system copies them to the new stack after pushing SS and ESP, and then pushes CS and EIP. Using RETF will return from the gate call.

Call gates are slow mechanisms to transit between rings in the CPU.

TSS Descriptors, Task Gates and Hardware Multitasking

Having the ability to hold Task Segments in the GDT and Local Descriptor Tables, CPUs provide the ability for task switching. The Task State Segment is where the CPU saves information about a task (the current registers). Executing a far JMP or a CALL (offsets are ignored like in call gates) with a selector pointing to a GDT TSS will "switch" to that task, restoring saved registers. The TSS descriptor is used to specify the base address and limit of the TSS from which the new CPU state is loaded. The CPU has a register named Task Register (TR) which tells which TSS will receive the old CPU state. When TR is loaded with an LTR instruction, the CPU looks at the GDT entry (specified with LTR) and loads the visible part of TR with the selector, and the hidden part with the base and limit of the GDT entry. When the CPU state is saved, the hidden part of TR is used.

In addition to the far call and jmp, a context switch can be triggered by using a Task Gate Descriptor. Unlike TSS Descriptors, task-gate descriptors can be in the GDT, LDT or IDT (so you can force a task switch when an interrupt occurs).

Entering protected mode

The steps to follow are:

Enable A20
Set the GDT
Set the IDT (if you need interrupts in protected mode)
Enter protected mode with the MSW or the CR0 register.

You use the MSW register (in 286), or, in 386+ CR0:

; 386+
mov eax,cr0
or eax,1
mov cr0,eax

; 286 
smsw ax
or al,1
lmsw ax

After that, you must execute a far jump to a protected mode code segment in order to flush any instructions that were prefetched in real mode. If this code segment is a 16-bit code segment, you must do:

db 0eah    ; Opcode for far jump
dw StartPM ; Offset to start, 16-bit
dw xx      ; A selector value in the GDT, with the Sz bit off.

If this code segment is a 32-bit code segment, you must do:

db 66h     ; Prefix for 32-bit
db 0eah    ; Opcode for far jump
dd StartPM ; Offset to start, 32-bit
dw xx      ; A selector in the GDT, with the Sz bit on.

You must also set up the stack and the other registers:

mov ax, data_selector
mov ds,ax
mov ax, stack_selector
mov ss,ax
mov esp,1000h ; assuming that the limit of the stack segment 
              ; selected by stack_selector is 1000h bytes.
sti
...

Exiting protected mode

cli
mov eax,cr0
and eax,0fffffffeh
mov cr0,eax
mov ax,data16
mov ds,ax
mov ax,stack16
mov ss,ax
mov sp,1000h ; assuming that stack16 is 1000h bytes in length
mov bx,RealMemoryInterruptTableSavedWithSidt
lidt [bx]
sti
; (Real mode debugger works here) ...

In 286, you cannot get back to real mode with LMSW, because the protected mode flag cannot be cleared once it is set. The only way back is a processor reset, which keeps the memory intact. Before forcing this reset, 286 software registers a routine to be executed after the reset with the following code:

MOV ax,40h 
MOV es,ax 
MOV di,67h 
MOV al,8fh 
OUT 70h,al 
MOV ax,ShutdownProc 
STOSW 
MOV ax,cs
STOSW 
MOV al,0ah 
OUT 71h,al 
MOV al,8dh 
OUT 70h,al

In 386+, normal exit back to the real mode can be done.

Problems

While you can access all the memory directly, there is still a lot of segmentation and slow task switching or slow movement between rings.

Flat Protected Mode

Paging

Paging is the method to redirect a memory address to another address. The requested address is called linear address and the target address is called physical address. When a linear address is the same as a physical address, we say that we are in a "see through" area.

To accomplish paging, two tables are used: the page directory and the page table.

The Page Directory is an array of 1024 32-bit entries with the following format:

P,R,U,W,D,A,N,S,G,AA,Addr

P - Page is present in memory. This flag allows the OS to swap pages out to disk, clear P, and reload them when a page fault is generated as software attempts to access the page.
R - Page is Read Write if set, else Read only. This restriction applies only to ring 3 unless the WP bit in CR0 is set.
U - If unset, only ring 0 can access this page.
W - If set, write-through is enabled.
D - If set, the page will not be cached. The CPU caches the page tables in its Translation Lookaside Buffer (TLB).
A - Set by the CPU when the page is accessed (it is not cleared automatically; the OS has to clear it).
N - Set to 0.
S - Set to 0. If Page Size Extensions (PSE) are enabled, S can be 1, in which case the page size is 4MB instead, and the pages must be 4MB aligned. This mode is introduced to avoid lots of small pages, at the expense of more memory wasted if the needed memory is somewhat larger than 4MB. Fortunately, modes can be mixed.
G - Set to 0.
Addr - The upper 20 bits (the lower 12 are ignored because it must be 4096-aligned) of the physical address of the Page Table that this Page Directory entry points to.

The Page Table is an array of 1024 32-bit entries with a similar format:

P,R,U,W,C,A,D,N,G,AA,Addr

The C bit is the same as the previous D bit
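The D bit marks dirty pages (pages that have been written to). The CPU sets it and the OS uses it.
The G flag, if set, marks the page as global: its TLB entry is not flushed when CR3 is reloaded (this requires the PGE bit in CR4).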
The Addr is the 4096-aligned physical address that this entry points to. The virtual address is calculated from the offset in the page directory and the offset in the page table.
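The address split for 4KB pages can be sketched in C as follows. The helper names are hypothetical, and `translate` is a simplified software walk that looks only at one page table (it assumes the page directory lookup has already selected that table); it is meant as an illustration, not as what the CPU does internally.

```c
#include <stdint.h>

/* Splits a 32-bit linear address for 4KB pages:
   bits 22-31 index the page directory, bits 12-21 index the page table,
   bits 0-11 are the offset inside the page. */
uint32_t pd_index(uint32_t linear)    { return linear >> 22; }
uint32_t pt_index(uint32_t linear)    { return (linear >> 12) & 0x3FFu; }
uint32_t page_offset(uint32_t linear) { return linear & 0xFFFu; }

/* Hypothetical software walk over one page table, given as an array of
   1024 entries. Returns 0 and sets *physical on success, or -1 when the
   entry is not present (the CPU would raise a page fault). */
int translate(const uint32_t *page_table, uint32_t linear, uint32_t *physical)
{
    uint32_t entry = page_table[pt_index(linear)];
    if ((entry & 1u) == 0)            /* P bit clear */
        return -1;
    *physical = (entry & 0xFFFFF000u) | page_offset(linear);
    return 0;
}
```

A "see through" mapping is simply a table where entry n holds n * 4096 plus the flag bits, so that translate returns the address it was given.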

To enable paging:

Load CR3 with the address of the first entry in the Page Directory (must be 4096-aligned).
Set CR0 bit 31. This requires protected mode, with the exception of LOADALL (see below).

Once the tables are loaded, they are cached into TLB. Reloading the CR3 will reset the cache. 486+ also has an INVLPG instruction to reset only a particular page cache, not the entire TLB.

Architecture

The segmented protected mode is very complex. Using paging, protected mode can be "flat", enabling the following:

All processes get a 4GB virtual address space. Protection is done at the paging level. All segments are 4GB, all segment selectors always point to the same segment.
Programming is way simpler since only "near" pointers are needed.
The OS can map shared libraries (residing once in physical memory) to multiple virtual destinations per application.
The application only sees memory paged to its own virtual address space, so processes are protected by hardware.

In addition, all modern OSes now use only 2 of the 4 protection rings, ring 0 for their kernel and ring 3 for all the user applications. Call gates are no longer used.

SYSENTER/SYSEXIT

To make transitions between user mode (ring 3) and kernel mode (ring 0) faster, a method other than call gates had to be implemented. SYSENTER/SYSEXIT instructions are the current way to switch from ring 3 to ring 0. You use WRMSR to set the new values for CS (MSR 0x174), ESP (0x175) and EIP (0x176). For SYSEXIT, ECX must hold the ring 3 stack pointer and EDX the ring 3 EIP. The value stored for CS must be the selector of the first of 4 consecutive GDT entries: the first is the ring 0 code, the second is the ring 0 data, the third is the ring 3 code and the fourth is the ring 3 data. This layout is fixed, so in order to use SYSENTER your GDT must contain these entries in this order.
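The fixed selector layout can be illustrated with a small C sketch. The struct and function names are hypothetical; the arithmetic assumes that GDT entries are 8 bytes apart and that the ring 3 selectors carry RPL 3 in their two lowest bits.

```c
#include <stdint.h>

/* Selectors derived from the value written to MSR 0x174 (hypothetical names). */
struct sysenter_selectors {
    uint16_t ring0_code;
    uint16_t ring0_data;
    uint16_t ring3_code;
    uint16_t ring3_data;
};

/* The four GDT entries are consecutive, so each selector is 8 bytes after
   the previous one. The ring 3 selectors carry RPL 3 in their low bits. */
struct sysenter_selectors sysenter_layout(uint16_t msr_cs)
{
    struct sysenter_selectors s;
    uint16_t base = (uint16_t)(msr_cs & 0xFFFCu);  /* strip RPL bits */
    s.ring0_code = base;
    s.ring0_data = (uint16_t)(base + 8);
    s.ring3_code = (uint16_t)((base + 16) | 3);
    s.ring3_data = (uint16_t)((base + 24) | 3);
    return s;
}
```

So if MSR 0x174 holds 0x08, SYSENTER runs with CS = 0x08 and the stack in 0x10, and SYSEXIT returns with CS = 0x1B and the stack in 0x23.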

These opcodes only support switching between ring 3 and ring 0, but they are much faster. They are used today instead of the way slower call gates.
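
The fixed selector layout can be sketched in C as follows. This is only an illustration of the arithmetic the CPU performs on the value stored in the CS MSR (0x174); the type and function names are hypothetical.

```c
#include <stdint.h>

/* The four selectors derived from the value written to MSR 0x174. */
typedef struct {
    uint16_t ring0_code;   /* loaded into CS by SYSENTER */
    uint16_t ring0_stack;  /* loaded into SS by SYSENTER */
    uint16_t ring3_code;   /* loaded into CS by SYSEXIT  */
    uint16_t ring3_stack;  /* loaded into SS by SYSEXIT  */
} SysenterSelectors;

/* Hypothetical helper: each GDT entry is 8 bytes, so consecutive
   selectors are 8 apart; the ring 3 selectors get RPL = 3. */
static SysenterSelectors sysenter_selectors(uint16_t sysenter_cs)
{
    SysenterSelectors s;
    uint16_t base = (uint16_t)(sysenter_cs & 0xFFFCu); /* strip RPL bits */
    s.ring0_code  = base;
    s.ring0_stack = (uint16_t)(base + 8);
    s.ring3_code  = (uint16_t)((base + 16) | 3);
    s.ring3_stack = (uint16_t)((base + 24) | 3);
    return s;
}
```

With a CS value of 0x08, the four selectors come out as 0x08, 0x10, 0x1B and 0x23.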

Software multitasking

Task gates are no longer used by today's operating systems. Instead, they apply software multitasking to switch between processes:

A "scheduler", driven by a timer interrupt, runs periodically.
It switches stack and EIP based on thread and process priorities.

Because a software scheduler saves only what is necessary for task switching, it is faster than the segmented mode hardware switching.

Protected Mode Facts

Unreal mode

Because protected mode cannot call DOS or BIOS interrupts, it is generally not very useful to DOS applications. However, a 'bug' in the 386+ processor turned out to be a feature called unreal mode. Unreal mode is a method to access the entire 4GB of memory from real mode. This trick is undocumented, however a large number of applications are using it. The trick is based on the fact that a segment register loaded in protected mode with a 4GB data segment (set in the GDT) keeps its "invisible part" (the cached descriptor) intact when the CPU goes back to real mode, so it still has a 4GB limit.

To use unreal mode, you must:

Enable A20.
Enter protected mode.
Load a segment register (ES or FS or GS) with a 4GB data segment.
Return to real mode.

After returning from protected mode, you can easily do:

; assuming FS has loaded a 4GB data segment from Protected Mode
mov ax,0
mov fs,ax
mov edi,1048576 ; point above 1MB
mov byte [fs:edi],0 ; Set a byte above 1MB.

286 lacks this capability because to exit protected mode, the CPU has to be reset, so all registers are destroyed (but see LOADALL below).

Huge real mode

The above unreal mode theory can be applied to CS as well, making it possible to execute code at a position over 1MB when EIP > 0xFFFF. However when calling an interrupt, the upper 16 bits of EIP are not pushed to the stack, so on return you will not return where you were. Therefore, huge real mode was not very much used.

LOADALL

At that time, a now non-existent and mostly undocumented instruction existed, LOADALL (0x0F 0x05 in 286, 0x0F 0x07 in 386). As the name implies, LOADALL was used to load all the registers (including the GDTR and IDTR) from one table in memory. In the 286 LOADALL (which was not available on the 386), this table was fixed at memory address 0x800, whereas the 386 LOADALL reads the buffer pointed to by real mode ES:EDI. Because the CPU does not check in any way whether the values loaded by LOADALL are valid, LOADALL was used by many tools at the time, including HIMEM.SYS, for various infamous actions:

To access the entire memory from real mode without entering protected mode and unreal mode.
To run real mode code with paging.
To run 32-bit code in real mode.
To run normal 16-bit code inside protected mode without VM86 (which was not there in 286). This was done by trapping each memory access (which would lead to GPF because all the segments were marked non-present) and emulating the desired result by using another LOADALL. Of course this was too slow, but it led to the creation of the VM86 mode in 386, where LOADALL eventually faded out.

LOADALL cannot switch the 286 back to real mode, but using LOADALL removes the need to enter protected mode altogether.

LOADALL 286 itself was mentioned in the manuals and was partially documented; by contrast, LOADALL 386 was heavily obscure, probably to induce the programmers to take advantage of the new VM86 mode.

HIMEM.SYS

Protected mode is complex and, without a debugger available, it is prone to lots of unsolvable crashes. To help the programmers, Microsoft created a driver that was able to manage protected mode from a normal 16-bit DOS application, allowing it to access high memory. At that time, extended memory was mostly, if not totally, used to cache data from the disk, especially from big apps. HIMEM puts the CPU in unreal mode (or it uses LOADALL in 286) and provides a simple interface to the applications that want more memory without messing with the protected mode details. By enabling the A20 line, HIMEM allowed a portion of DOS COMMAND.COM to reside in the high memory area when CONFIG.SYS had a DOS=HIGH directive.

Detect HIMEM.SYS

Interrupt 0x2F, AX = 0x4300

Return HIMEM.SYS function pointer

Interrupt 0x2F, AX = 0x4310

All the following functions are provided from the function at the returned ES:BX from the above interrupt.

Detect/Enable/Disable A20

AH = 0x7 (detect), 0x3 (enable), 0x4 (disable)

Allocate HMA

AH = 0x1

Free HMA

AH = 0x2

Allocate extended memory

AH = 0x9

Free extended memory

AH = 0xA

Copy real/protected memory from/to real/protected memory

AH = 0xB

Lock/Unlock protected mode memory

AH = 0xC (Lock), 0xD (Unlock)

HIMEM.SYS moves memory in order to defragment it. Locking memory is useful when you will access the memory directly, within protected mode. Actually, because HIMEM puts the CPU in unreal mode, you can use the very same returned pointers directly.

VM86 Mode

Many of the existing applications were real-mode at the time protected mode was introduced. Even today, many of them (mostly games) are still played under Windows. To force these applications (which think they own the machine) to cooperate, a special mode had to be created.

VM86 mode is enabled by a special flag (VM, bit 17) in the EFLAGS register. It gives the application a normal 16-bit DOS memory map of 640KB, which is forwarded via paging to the actual memory. This makes it possible to run multiple DOS applications at the same time without the risk of one application overwriting another. EMM386.EXE puts the processor in that state. The OS monitors the process step by step, making sure that it won't execute something illegal. Normally, you also want to map all your other critical structures (GDT, IDT etc.) above 1MB so they are not visible to any VM86 process.

To trigger VM86 mode, you can use PUSHFD and IRET:

mov ebp,esp
push dword  [ebp+4]        ; ss
push dword  [ebp+8]        ; esp
pushfd                     ; eflags
or dword [esp], (1 << 17)  ; set the VM flag (bit 17)
push dword [ebp+12]        ; cs
push dword  [ebp+16]       ; eip
iret

Once the VM flag is set, you can load a normal "segment" to a segment register. Interrupt calls by DOS applications are caught by the OS and emulated through it - if possible. Also, some instructions are ignored, for example, if you do a CLI, the interrupts are not actually disabled. The OS sees that you prefer to not be interrupted and acts accordingly, but interrupts are still there.

All VM86 code executes in PL 3, the lowest privilege level. Ins/Outs to ports are also captured and emulated if possible. The interesting thing about VM86 is that there are two interrupt tables, one for the real and one for the protected mode. But only protected mode interrupts are executed.

VM86 is not available in long mode, so a 64-bit OS cannot execute 16-bit DOS code natively anymore. In order to execute such code, you need an emulator such as DOSBox.

Many applications were also written to take advantage of the expanded memory, but the modern standard was the protected mode. EMM386 puts the CPU in VM86 mode and maps via paging memory over 1MB to real mode segments (over 0xA0000), so an application that would like to use expanded memory can use it via EMM386.EXE, which provides an LIM EMS int 0x67 interface. In addition, EMM386 allowed "devicehigh" and "loadhigh" commands in CONFIG.SYS, allowing applications to get loaded to these high segments if possible.

Physical Address Extensions (PAE)

PAE is the ability of x86 to use 36 address bits instead of 32. This increases the available memory from 4GB to 64GB. The 32-bit applications still see only a 4GB address space, but the OS can map (via paging) memory from the high area to the lower 4GB address space. This extension was added to x86 to cope with the (nowadays not enough) limit of 4GB, before 64-bit CPUs came to the foreground.

Enabling PAE (CR4 bit 5) means that now you have 3 paging levels: in addition to the Page Directory and the Page Table, you now have the PDPT, the Page Directory Pointer Table, which has four 64-bit entries. Each of the PDPT entries points to a Page Directory of 4KB (like in normal paging). Each entry in the new Page Directory is now 64 bits long (so there are 512 entries). Each entry in the new Page Directory points to a Page Table of 4KB (like in normal paging), and each entry in the new Page Table is now 64 bits long, so there are 512 entries. Since 512 * 512 entries of 4KB cover only 1GB, a quarter of the original mapping, four PDPT entries are supported. The first entry maps the first 1GB, the 2nd the 2nd GB, the 3rd the 3rd GB and finally, the 4th entry maps the 4th GB.

But now the "S" bit in the PDT has a different meaning: If not set, it means that the page entry is 4KB but if set, it means that this entry does not point to a PT entry, but it describes itself a 2MB page. So you can have different levels of paging traversal depending on the S bit.

There is a new flag in the Page Directory entry as well, the NX bit (Bit 63) which, if set, prevents code execution in that page.

This system allows the OS to handle memory over 4GB, but since the address space is still 4GB, each process is still limited to 4GB. The memory can be up to 64GB but a process cannot see the entire memory.

Direct Memory Access drivers however have a problem, because they don't use paged memory. If working in 32 bits, the driver has to manage the paging tables itself in order to be able to manipulate memory over 4GB, and this could mean incompatibilities with the operating system, unless a safe DMA API is exposed to the driver. For this reason, PAE quickly faded out in favor of 64-bit operating systems, in which it nevertheless remains a required part of paging.

DPMI

For DOS applications, unreal mode was not enough; eventually, fully 32-bit applications had to be supported. DPMI (DOS Protected Mode Interface) was a driver that provided a (relatively complex) interface to applications wishing to run in 32-bit protected mode. DOS extenders based on DPMI, like DOS4GW and DOS32A, were created to support applications (mostly games) that wanted to run in 32 bit while still having access to DOS interrupts. DPMI catches the interrupt call, switches to real mode, executes the interrupt and goes back to protected mode. DPMI even allows multitasking and multiple "virtual" 32-bit machines.

DOS extenders use a "Linear Executable" (LE or LX format) which contains native 32-bit code. DOS32A can load and run such an executable. Here is a FASM example of creating a LE executable with DPMI.

Detect DPMI using interrupt 2F:

Interrupt 0x2F, AX = 0x1687

Example from the DJGPP documentation (http://www.delorie.com/djgpp/doc/dpmi/api/2f1687.html):

modesw	dd	0			; far pointer to DPMI host's
					    ; mode switch entry point
	mov	ax,1687h		; get address of DPMI host's
	int	2fh		      	; mode switch entry point
	or	ax,ax			; exit if no DPMI host
	jnz	error
	mov	word ptr modesw,di	; save far pointer to host's
	mov	word ptr modesw+2,es	; mode switch entry point
	or	si,si			; check private data area size
	jz	@@1		     	; jump if no private data area

	mov	bx,si			; allocate DPMI private area
	mov	ah,48h			; allocate memory
	int	21h			    ;  transfer to DOS
	jc	error			; jump, allocation failed
	mov	es,ax			; let ES=segment of data area

@@1:	mov	ax,0			; bit 0=0 indicates 16-bit app
	call	modesw			; switch to protected mode
	jc	error			; jump if mode switch failed
					; else we're in prot. mode now

The application terminates via int 0x21 function 0x4C (as in real mode). The rest of the DPMI functions are provided through int 0x31 and include:

Real mode interrupt capturing (as function 0x25 int 0x21)
Real mode exception trapping
Call DOS interrupts either directly, or through int 0x31 function 3
Real mode callbacks
Sharing memory between DPMI clients
Paging
Setting hardware breakpoints
TSR capabilities

Many good games like The Dig were running under DPMI.

Long Mode

Architecture

Whatever methods were created to overcome the 4GB limit of the x86, they would eventually lead to full 64-bit processors. Having discussed all the protected mode complexities, we are lucky to observe that the x64 CPU architecture is way simpler. The x64 CPU has 3 operation modes:

Real mode
Protected mode (called legacy mode)
Long mode, containing two submodes:
- Compatibility mode, 32 bit. This allows a 64-bit OS to run 32-bit applications natively.
- 64-bit mode

To work in Long mode, the programmer must take into consideration the facts below:

Unlike Protected mode, which can run with or without paging, long mode runs only with PAE and paging and in flat mode. All the segments are flat, from 0 to 0xFFFFFFFFFFFFFFFF and all memory addressing is linear. DS, ES, SS are ignored. The "flat" mode is the only valid mode in long mode. No segmentation.
You can get into long mode directly from real mode, by enabling protected mode and long mode within one instruction (this can work because Control Registers are accessible from real mode).
Although in theory any 64-bit value could be used as an address, in practice we don't yet need 2^64 bytes of memory. Therefore, current implementations only implement 48-bit addressing, which requires all pointers to have bits 47-63 either all 0 or all 1. This means that you have 2 ranges of valid "canonical" addresses, one from 0 to 0x00007FFF'FFFFFFFF and one from 0xFFFF8000'00000000 through 0xFFFFFFFF'FFFFFFFF, for a total of 256TB of address space. Most OSes reserve the upper area for the kernel, and the lower area for the user space.
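
The canonical address rule can be checked with a few lines of C. This is a minimal sketch assuming the 48-bit implementation described above; `is_canonical48` is a name made up for this example.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: an address is canonical under 48-bit addressing
   when bits 63..47 (17 bits) are either all 0 or all 1. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;           /* keep bits 63..47 */
    return top == 0 || top == 0x1FFFFu;  /* all clear or all set */
}
```

The last address of the lower range (0x00007FFF'FFFFFFFF) and the first address of the upper range (0xFFFF8000'00000000) pass the check, while anything in between does not.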

To verify that long mode is supported, we must check extended CPUID features:

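mov eax, 0x80000000 
cpuid
cmp eax, 0x80000001     ; is extended function 0x80000001 available?
jb .NoLongMode
mov eax, 0x80000001
cpuid
bt edx, 29              ; LM bit: long mode supported
jnc .NoLongMode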

Registers

When running in 64-bit mode, the following 64-bit extensions are available:

RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP, RIP
8 new 64-bit registers added: R8 to R15. Their lower 32 bits are accessed as R8D - R15D, their lower 16 bits as R8W - R15W and their lower 8 bits as R8B - R15B.

These registers are only available in 64-bit mode. In all other modes, including compatibility mode, they are not available.

GDT/IDT

Bit 53 of a GDT code segment descriptor, previously reserved, is now the "L" bit. When it is 1, the Sz bit must be 0, and this indicates a 64-bit code segment (the combination L = 1 and Sz = 1 is reserved and will throw an exception if used). The limits are always 0 to 0xFFFFFFFFFFFFFFFF and the base is always 0.
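
To make the bit positions concrete, here is a small C sketch that builds an 8-byte descriptor for a 64-bit code segment. It assumes the usual descriptor layout (access byte at bits 40-47, Sz at bit 54); the macro and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define DESC_FLAG_L   (1ULL << 53)  /* 64-bit code segment          */
#define DESC_FLAG_SZ  (1ULL << 54)  /* 32-bit default operand size  */

/* Hypothetical helper: base and limit are ignored in 64-bit mode,
   so only the access byte and the L bit need to be filled in. */
static uint64_t make_code64_descriptor(uint8_t access)
{
    return ((uint64_t)access << 40) | DESC_FLAG_L;
}

/* L = 1 together with Sz = 1 is the reserved combination. */
static bool code_flags_valid(uint64_t descriptor)
{
    return !((descriptor & DESC_FLAG_L) && (descriptor & DESC_FLAG_SZ));
}
```

With an access byte of 0x9A (present, ring 0, readable code), the resulting descriptor is 0x00209A0000000000.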

If your GDT resides in the lower 4GB of memory, you need not change it after entering long mode. However, if you plan to call SGDT or LGDT while in long mode, you must now deal with the 10-byte GDTR, which holds two bytes for the length of the GDT and 8 bytes for the physical address of it.

Any selector you might load to access a 64-bit segment is ignored, and DS, ES, SS are not used at all. All the segments are flat, and everything is done via paging. However GS and FS can still be used as auxiliary registers and their values are still subject to verification from the GDT. In 32-bit Windows, FS points to the Thread Information Block; 64-bit Windows uses GS for the same purpose.

The IDT is similar to the protected mode one, the difference being that each entry is expanded to contain a 64-bit address of the interrupt handler:

struc IDT_STR 
{
 .ofs0_15 dw ofs0_15
 .sel dw sel
 .zero db zero              ; bits 0-2: IST index, rest 0
 .flags db flags            ; bit 7 P, bits 6-5 DPL, bit 4 zero, bits 3-0 gate type
 .ofs16_31 dw ofs16_31
 .ofs32_63 dd ofs32_63
 .zero2 dd 0                ; reserved
}

There is no VM86, DPMI, unreal mode or hardware task switching in long mode, while the LDT and call gates survive only in an extended 64-bit form. Missing VM86 is the reason that 64-bit OSes cannot run 16 bit software without an emulator.

Long Mode Paging

In long mode the paging system adds a new top level structure, the PML4T, which has 512 64-bit entries, each pointing to one PDPT, and the PDPT now has 512 entries as well (instead of 4 in the x86 mode). So now you can have 512 PDPTs, which means that one PT entry manages 4KB, one PDT entry manages 2MB (4KB * 512 PT entries), one PDPT entry manages 1GB (2MB * 512 PDT entries), and one PML4T entry manages 512GB (1GB * 512 PDPT entries). Since there are 512 PML4T entries, a total of 256TB (512GB * 512 PML4T entries) can be addressed.
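
The arithmetic above can be sketched in C: each level consumes 9 bits of the address, and each step up multiplies the managed size by 512. The names `LongParts`, `split_long` and `entry_span` are invented for this illustration.

```c
#include <stdint.h>

/* Indices of a 48-bit virtual address in the four long mode levels. */
typedef struct {
    unsigned pml4;    /* bits 47..39 */
    unsigned pdpt;    /* bits 38..30 */
    unsigned pd;      /* bits 29..21 */
    unsigned pt;      /* bits 20..12 */
    unsigned offset;  /* bits 11..0  */
} LongParts;

static LongParts split_long(uint64_t va)
{
    LongParts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}

/* Bytes managed by one entry: level 0 = PT, 1 = PDT, 2 = PDPT, 3 = PML4T. */
static uint64_t entry_span(unsigned level)
{
    return 4096ULL << (9 * level);
}
```

Here `entry_span(1)` gives 2MB, `entry_span(2)` gives 1GB and `entry_span(3)` gives 512GB, matching the figures in the text.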

This is another reason not to use the entire 64-bit for addressing. Using the entire thing would force us to have 6 levels of paging, where now four are needed.

Each of the "S" bits in the PDPT/PDT can be 0 to indicate that there is a lower level structure below, or 1 to indicate that the traversal ends here. If the PDPT S flag is 1 and the CPU supports it, then the page size is 1GB.

There is also an Intel specification for PML5, a new top level structure which allows 5 levels of paging on CPUs that support 57 bits of addressing (the current 48 plus 9 more for the new level).

To verify that 1GB pages are supported, we test EDX bit 26 of the extended CPUID function 0x80000001:

mov eax,80000001h
cpuid
bt edx,26
jnc .no1gbpg

Entering Long Mode

; Disable paging, assuming that we are in a see-through.
mov eax, cr0 ; Read CR0.
and eax,7FFFFFFFh ; Set PG=0
mov cr0, eax ; Write CR0.
mov eax, cr4
bts eax, 5
mov cr4, eax ; Set PAE
mov ecx, 0c0000080h ; EFER MSR number. 
rdmsr ; Read EFER.
bts eax, 8 ; Set LME=1.
wrmsr ; Write EFER.
; Enable Paging to activate Long Mode. Assuming that CR3
; is loaded with the physical address of the page table.
mov eax, cr0 ; Read CR0.
or eax,80000000h ; Set PG=1.
mov cr0, eax ; Write CR0.

Turn off paging, if enabled. To do that, you must ensure that you are running in a "see through" area.
Set PAE, by setting bit 5 of CR4.
Create the new page tables and load CR3 with them. Because CR3 is still 32-bits before entering Long mode, the page table must reside in the lower 4GB.
Enable Long mode (note, this does not enter Long mode, it just enables it).
Enable paging. Enabling paging activates and enters Long mode.

Because the rdmsr/wrmsr opcodes are also available in Real mode, you can activate Long mode from Real mode directly by setting both the PE and PG bits of CR0 simultaneously.

Entering 64-bit

Now you are in compatibility mode. Enter 64-bit mode by jumping to a 64-bit code segment:

; also db 066h if entering from a 16-bit code segment
db 0eah
dd LinearAddressOfStart64
dw code64_idx ; selector of the 64-bit code segment (name assumed here)

The initial 64-bit segment must reside in the lower 4GB because compatibility mode does not see 64-bit addresses. Note that you must use the linear address, because 64-bit segments always start from 0. Note also that if the current compatibility segment is 16-bit default, you have to use the 066h prefix.

The only thing you have to do in 64-bit mode is to reset the RSP:

mov rsp,STACK64
shl rsp,4
add rsp,stack64_end

SS, DS, ES, are not used in 64-bit mode. That is, if you want to access data in another segment, you cannot load DS with that segment's selector and access the data. You must specify the linear address of the data. Data and stack are always accessed with linear addresses. "Flat" mode is not only the default, it is the only one for 64-bit.

Once you are in 64-bit mode, the defaults for the opcodes (except for jmp/call) are still 32-bit. So a REX prefix (0x40 to 0x4F) with its W bit set (0x48 to 0x4F) is required to mark a 64-bit operand. Your assembler handles that automatically if it supports a "code64" segment.
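
For illustration, the REX prefix byte can be composed like this in C. The sketch assumes the standard REX layout (0100WRXB); `rex_prefix` is a hypothetical helper, not something an assembler exposes.

```c
#include <stdint.h>

/* Hypothetical helper: build a REX prefix byte.
   w: 64-bit operand size, r/x/b: extend the ModRM reg, SIB index
   and ModRM rm/base fields so that R8-R15 can be encoded. */
static uint8_t rex_prefix(int w, int r, int x, int b)
{
    return (uint8_t)(0x40 | (w ? 8 : 0) | (r ? 4 : 0) | (x ? 2 : 0) | (b ? 1 : 0));
}
```

A plain 64-bit operation such as mov rax,rbx only needs W, giving the prefix 0x48 in front of the 32-bit encoding 89 D8.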

In addition, a 64-bit interrupt table must now be set with a new LIDT instruction, this time taking a 10-byte operand (2 bytes for the length and 8 for the location).

Returning to Compatibility Mode

To exit 64-bit mode, it is first necessary to return to compatibility mode. Because 0eah is not a valid jump when in 64-bit mode, you have to use a RETF trick to get back to a compatibility mode segment.

push code32_idx    ; The selector of the compatibility code segment
xor rcx,rcx    

mov ecx,Back32    ; The address must be a 64-bit address,
                  ; so upper 32-bits of RCX are zero.
push rcx
retf

This gets you back to compatibility mode. 64-bit OSs keep jumping from 64-bit to compatibility mode in order to be able to run both 64-bit and 32-bit applications.

Exiting from Long Mode

You have to set up all the registers again with 32-bit selectors - back to segmentation. Also you must be in a see-through area because to exit long mode you must deactivate paging. Of course, you can switch immediately to real mode by resetting the PE bit as well.

; We are now in Compatibility mode again
mov ax,stack32_idx 
mov ss,ax 
mov esp,stack32_end 
mov ax,data32_idx 
mov ds,ax
mov es,ax
mov ax,data16_idx
mov gs,ax
mov fs,ax

; Disable Paging to get out of Long Mode
mov eax, cr0 ; Read CR0.
and eax,7fffffffh ; Set PG=0.
mov cr0, eax ; Write CR0.

; Deactivate Long Mode
mov ecx, 0c0000080h ; EFER MSR number. 
rdmsr ; Read EFER.
btr eax, 8 ; Set LME=0.
wrmsr ; Write EFER.

; Back to protected mode

Interrupt priorities

Driver developers in Windows will know the meaning of IRQL. An IRQL is built on a CPU feature that prioritizes interrupts. x86 and x64 do have the CLI instruction to disable interrupts entirely, but in a modern multithreading system something that can prioritize interrupts should exist. The Windows driver functions KeRaiseIrql and KeLowerIrql modify the CR8 register, setting the CPU interrupt priority (0 - 15, where 0 is PASSIVE_LEVEL and 2 is DISPATCH_LEVEL). When an interrupt is pending, its priority is compared to CR8. If the priority is greater, the interrupt is serviced, otherwise it is held pending until CR8 is set to a lower value. CR8 starts with 0 on CPU reset.

According to Intel's Vol 3A, section 10.8.3, the interrupt priority is given by the upper 4 bits (bits 7:4) of the interrupt vector.
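
The comparison can be sketched in C as follows. It is a simplified model of the rule described above, not of any Windows internals; both function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Priority class of an interrupt: the upper 4 bits of its vector. */
static unsigned priority_class(uint8_t vector)
{
    return vector >> 4;
}

/* Hypothetical helper: a pending interrupt is serviced only when
   its priority class is greater than the value held in CR8. */
static bool is_deliverable(uint8_t vector, unsigned cr8)
{
    return priority_class(vector) > (cr8 & 0xFu);
}
```

With CR8 set to 2, vector 0x41 (class 4) is serviced while vector 0x2F (class 2) stays pending.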

Multiple Cores

A single CPU can execute one instruction at a time. Multitasking in single processors is generally the fast switching (at the software level) between different registers/paging for each process running, and this is so fast that it appears that processes run simultaneously.

A multiple core CPU is similar to having many single CPUs that share the same memory. Everything else (registers, modes, etc.) is specific to each CPU. That means that if we have an 8 core processor, we have to execute the same procedure 8 times to put it e.g. in long mode. We can have one processor in real mode, another processor in protected mode, another processor in long mode etc.

In multiple core configurations we are concerned with three things:

How to discover multiple processors and their properties
How to communicate from one CPU to another
How to synchronize access to sensitive data

Discovery

The Advanced Programmable Interrupt Controller (APIC) is the interrupt controller built into each CPU; a set of tables, found in memory, will provide us the information we need about it. First we discover the presence of the APIC:

mov eax,1
cpuid
bt edx,9
jc ApicFound

Second, we search for the Advanced Configuration and Power Interface (ACPI) tables in memory. The first of these tables, the Root System Description Pointer (RSDP), resides somewhere in BIOS memory, between physical addresses 0xE0000 and 0xFFFFF, on a 16-byte boundary, and it has the following header:

struct RSDPDescriptor 
{
 char Signature[8];
 uint8_t Checksum;
 char OEMID[6];
 uint8_t Revision;
 uint32_t RsdtAddress;

 // The following is present if ACPI 2.0
 uint32_t Length;
 uint64_t XsdtAddress;
 uint8_t ExtendedChecksum;
 uint8_t reserved[3];
}

The above RSDP Descriptor contains the signature value which, for the first ACPI table, is 0x2052545020445352. If this signature is not found in the memory, then we don't have ACPI and therefore, there are no multiple CPU cores.

Each descriptor also has a checksum, which is verified with the following algorithm:

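IsChecksumValid:
    ; In: FS:EDI = table, ECX = length. Out: EAX = 1 if valid, 0 if not
    PUSH ECX
    PUSH EDI
    XOR EAX,EAX
    .St:
    ADD AL,[FS:EDI]   ; sum the bytes, modulo 256
    INC EDI
    DEC ECX
    JNZ .St
    TEST AL,AL        ; a valid table sums to 0
    SETZ AL
    MOVZX EAX,AL
    POP EDI
    POP ECX
    RETF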
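
For readers more comfortable with C, the same check could be sketched as below. It assumes the table has already been made accessible through a normal pointer; `acpi_checksum_ok` is a name chosen for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: an ACPI table is valid when all of its bytes,
   including the checksum byte itself, add up to 0 modulo 256. */
static int acpi_checksum_ok(const void *table, size_t length)
{
    const uint8_t *bytes = (const uint8_t *)table;
    uint8_t sum = 0;
    for (size_t i = 0; i < length; i++)
        sum = (uint8_t)(sum + bytes[i]);
    return sum == 0;
}
```

For the ACPI 1.0 part of the RSDP, the length would be the first 20 bytes of the structure; for the other tables it is the Length member of their header.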

In case we succeed in finding an ACPI 2.0 table and its ExtendedChecksum is verified, then we must use the XsdtAddress (which always points to lower 4GB) to find the other tables. If it is an ACPI 1.0 then we use the RsdtAddress.

Having found the address, we use it to locate the first ACPI table. The starting table contains pointers to all the other tables (32-bit, or 64-bit if ACPI 2.0+) after the header. This physical address is above 1MB and hence it is only accessible from protected (or unreal) mode. There are many ACPI tables but we are only interested in a few of them.

All of them have the following header:

struct ACPISDTHeader 
  {
  char Signature[4];
  uint32_t Length;
  uint8_t Revision;
  uint8_t Checksum;
  char OEMID[6];
  char OEMTableID[8];
  uint32_t OEMRevision;
  uint32_t CreatorID;
  uint32_t CreatorRevision;
  };

The first table that we will find contains the pointers to all other ACPI tables after this header. The Length member contains the length of the entire table, including the header.

To find how many processors we have, we find the "MADT" table, a table which has the signature "APIC" in its header. After the standard header, we have:

At offset 0x24, the Local APIC Address, which we will need later.
At offset 0x2C, the rest of the MADT table contains a sequence of variable length records which enumerate the interrupt devices. Each record begins with 2 header bytes, one for the type and one for the length. If the type byte is 0, then the length byte is followed by 6 bytes describing a physical CPU. The first byte is the ACPI Processor ID and the second byte is the APIC ID of this processor.

Looping over the above records reveals all the installed processors along with their ACPI and APIC IDs.
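
A minimal C sketch of that loop follows. It assumes the whole MADT has been copied into a byte buffer and that only type 0 records matter; the function name and the constant are invented for this example, and the per-CPU flags are not examined.

```c
#include <stdint.h>
#include <stddef.h>

#define MADT_RECORDS_OFFSET 0x2C  /* first variable length record */

/* Hypothetical helper: walk the MADT records and collect the APIC ID
   of every type 0 (processor) record. Returns the number of CPUs found;
   at most max_ids IDs are stored in apic_ids. */
static size_t madt_list_cpus(const uint8_t *madt, size_t madt_length,
                             uint8_t *apic_ids, size_t max_ids)
{
    size_t count = 0;
    size_t pos = MADT_RECORDS_OFFSET;

    while (pos + 2 <= madt_length) {
        uint8_t type = madt[pos];
        uint8_t len  = madt[pos + 1];
        if (len < 2 || pos + len > madt_length)
            break;                          /* malformed record: stop */
        if (type == 0 && len >= 8) {
            /* pos+2 is the ACPI Processor ID, pos+3 is the APIC ID */
            if (count < max_ids)
                apic_ids[count] = madt[pos + 3];
            count++;
        }
        pos += len;                         /* skip to the next record */
    }
    return count;
}
```

The `madt_length` argument would come from the Length member of the table header.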

Initial Startup

A CPU can communicate with another CPU by issuing an "Interprocessor Interrupt" (IPI). To prepare the APIC to manage interrupts, we have to enable the "Spurious Interrupt Vector Register", indexed at 0xF0:

; Assuming FS is loaded with a linear 4GB segment unreal mode
MOV EDI,[LocalApic]
ADD EDI,0x0F0
MOV EDX,[FS:EDI]
OR EDX,0x1FF
MOV [FS:EDI],EDX

After that, we are ready to send IPIs. An IPI (Interprocessor Interrupt) is sent by using the Interrupt Command Register of the Local APIC. This consists of two 32-bit registers, one at offset 0x300 and one at offset 0x310 (All Local APIC registers are aligned to 16 bytes):

The register at 0x310 is the one we write first, and it contains the Local APIC ID of the processor we want to send the interrupt to, in bits 24 - 27.
The register at 0x300 has the following structure:

struct R300
    {
    unsigned int VectorNumber:8; // Starting page for SIPI
    unsigned int DestinationMode:3; //  0 normal, 1 low, 2 SMI, 4 NMI, 5 Init, 6 SIPI 
    unsigned int DestinationModeType:1; // 0 for physical 1 for logical
    unsigned int DeliveryStatus:1; // 0 - message delivered
    unsigned int R1:1;
    unsigned int InitDeAssertClear:1; 
    unsigned int InitDeAssertSet:1;
    unsigned int R2:2;
    unsigned int DestinationType:2; // 0 normal, 1 send to me, 2 send to all, 3 send to all except me
    unsigned int R3:12;
    };

Writing to register 0x300 will actually send the IPI (that is why you must write to 0x310 first). Note that if DestinationType is not 0, the Destination target in the register 0x310 is ignored. Under Windows, IPIs are sent with an IRQL level 29 (x86) or 14 (x64).

As we know, the CPU starts in real mode from the 0xF000:0xFFF0 position, but this is only true for the first CPU. All other CPUs stay "asleep" until woken up, in a special state called Wait-for-SIPI. The main CPU awakes other CPUs by sending a SIPI (Startup Inter-Processor Interrupt) which contains the startup address for that CPU. Later on, there are other Inter-processor Interrupts to communicate between the CPUs.

To awake the processor, we send two special IPIs. The first is the "Init" IPI, DestinationMode 5, which resets the target CPU and leaves it in the Wait-for-SIPI state. The second IPI is the SIPI, DestinationMode 6, which starts the CPU. Because the processor starts in real mode, we have to give it a real mode memory address: VectorNumber holds the starting page, so execution begins at physical address VectorNumber * 4096. The starting address must therefore be 4096-aligned and below 1MB.
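
The values written to the two registers can be sketched in C. The sketch follows the bit layout of the R300 structure above and assumes that the SIPI's vector field carries the start page (as the struct comment indicates); all names are hypothetical and the actual memory writes to LocalApic + 0x310 and LocalApic + 0x300 are left out.

```c
#include <stdint.h>

enum { DELIVERY_NORMAL = 0, DELIVERY_NMI = 4, DELIVERY_INIT = 5, DELIVERY_SIPI = 6 };

/* Value for the register at 0x300. Bit 14 (InitDeAssertClear in the
   struct above) is set here to mark the IPI as asserted. */
static uint32_t icr_low(uint8_t vector, unsigned delivery, unsigned dest_type)
{
    return (uint32_t)vector
         | ((uint32_t)(delivery & 7u) << 8)
         | (1u << 14)
         | ((uint32_t)(dest_type & 3u) << 18);
}

/* Value for the register at 0x310: target APIC ID starting at bit 24. */
static uint32_t icr_high(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}

/* SIPI vector for a start address, or -1 if the address is not
   4096-aligned or not below 1MB. */
static int sipi_vector(uint32_t start_address)
{
    if ((start_address & 0xFFFu) != 0 || start_address >= 0x100000u)
        return -1;
    return (int)(start_address >> 12);
}
```

Starting a CPU at physical address 0x8000 would mean writing `icr_high(id)` to 0x310 and then `icr_low(0, DELIVERY_INIT, 0)` to 0x300, followed by the same pair with `icr_low(8, DELIVERY_SIPI, 0)`.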

Later Communication

Apart from INIT and SIPI, which we saw above, the local APIC can be used to send a normal interrupt, i.e., merely executing INT XX in the context of the target CPU. We have to take into consideration the following:

If the CPU is in HLT state, the interrupt awakes it, and when the interrupt returns the CPU resumes with the instruction after the HLT opcode. If there is also a CLI, then we must send a NMI interrupt (A flag in the APIC Interrupt Register) to wake the CPU.
If the CPU is in HLT state and we send again an INIT and a SIPI, the CPU starts all over again from real mode.
The interrupt must exist in the target processor. For example, in protected mode, the interrupt must have been defined in IDT.
The Local APIC is common to all CPUS (memorywise), therefore, we must lock for write access (mutex) before we can issue the interrupt.
Because the registers cannot be passed from CPU to CPU, we have to write all the registers (that will be used for the interrupt, if any) in a separated memory area.
The interrupt might fail, so, you have to rely on some inter-cpu communication (via shared memory and mutexes) to verify the delivery.
Finally, the handler of the interrupt must tell its own Local APIC that there is an "End of Interrupt". In the past, this was done for the legacy interrupt controller with out 020h,al. Now we write the value 0 to the EOI register (LocalApic + 0xB0).

Synchronization

Since the CPUS share the same memory, it is crucial to synchronize write and read accesses to critical parts of it. In Windows of course we have mutexes ready to be used, but here some extra work has to be done. We can create our own mutex variable as follows:

Initialization, put a byte to value 0xFF
Lock mutex, decrease its value
Unlock mutex, increase its value unless already 0xFF
Wait for a mutex, but not lock it: A simple loop.

; assuming edi has the address
.Loop1:        
CMP byte [edi],0xff
JZ .OutLoop1
pause 
JMP .Loop1
.OutLoop1:

Note the pause opcode (encoded as rep nop). This is a hint to the CPU that we are inside a spin loop, which improves performance because the CPU avoids the mis-speculation penalty when the loop finally exits, and it also consumes less power while spinning.

Our problem is to wait for a mutex, then grab it when it is free (similar to WaitForSingleObject()). This code is not going to work:

.Loop1: 
CMP byte [edi],0xff 
JZ .OutLoop1 
pause 
JMP .Loop1 
.OutLoop1:
.MutexIsFree:
DEC byte [edi]

The reason is that, between the JZ command (which has verified that the mutex is free) and before the DEC [edi] is executed, another CPU might grab the mutex (race condition).

Fortunately for us, the CPU provides a LOCK CMPXCHG opcode which atomically grabs the lock for us:

.Loop1:        
CMP byte [edi],0xff
JZ .OutLoop1
pause 
JMP .Loop1
.OutLoop1:
; Lock is free, can we grab it?
mov bl,0xfe
MOV AL,0xFF
LOCK CMPXCHG [EDI],bl
JNZ .Loop1 ; Write failed, someone got us
.OutLoop2: ; Lock Acquired

We use the CMPXCHG instruction which, along with the LOCK prefix, atomically tests [edi] if it is still 0xFF (the value in AL), and if yes, then it writes BL to it and sets the ZF. If another CPU has done the same meanwhile, the ZF is cleared and BL is not moved to the [edi].
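
The same protocol can be sketched with C11 atomics, which compile down to LOCK CMPXCHG on x86. This is an illustration of the idea under the conventions used above (0xFF free, 0xFE locked), not the project's code; the names are hypothetical and the spin loop leaves out the pause hint.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MUTEX_FREE   0xFFu
#define MUTEX_LOCKED 0xFEu

typedef atomic_uchar SpinMutex;

static void mutex_init(SpinMutex *m)
{
    atomic_init(m, MUTEX_FREE);
}

/* One atomic attempt: succeeds only if the byte is still MUTEX_FREE. */
static bool mutex_try_lock(SpinMutex *m)
{
    unsigned char expected = MUTEX_FREE;
    return atomic_compare_exchange_strong(m, &expected, MUTEX_LOCKED);
}

/* Wait until the mutex looks free, then try to grab it; if another
   CPU won the race, go back to waiting. */
static void mutex_lock(SpinMutex *m)
{
    for (;;) {
        while (atomic_load(m) != MUTEX_FREE) {
            /* spin; a pause instruction would go here */
        }
        if (mutex_try_lock(m))
            return;
    }
}

static void mutex_unlock(SpinMutex *m)
{
    atomic_store(m, MUTEX_FREE);
}
```

The read-only inner loop mirrors the CMP/JZ loop of the assembly version, so the expensive locked operation is attempted only when the mutex appears to be free.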

Virtualization

Virtualization, technically, is a "system" inside the system. It's a clone of the processor running inside the same processor. It is not very complex to set up and it greatly enhances computing since you are able to run another OS inside an existing OS.

Each CPU (called Host) can run one Virtual Machine (called guest) at a time. You can configure multiple guests per CPU and pause/resume each guest, much like multitasking. If you have 8 CPU cores of course, you can have 8 guests running simultaneously.

The lifecycle of VM operations is as follows:

Test if the CPU supports virtualization:

mov eax,1
cpuid
bt ecx,5
jc VMX_Supported
jmp VMX_NotSupported

Check CPU-specific revision from the IA32_VMX_BASIC register:
```
mov ecx, 0480h
rdmsr
```
This 64-bit register contains important information for our project:
- Bits 0 - 31: 32-bit VMX Revision Number
- Bits 32 - 44: Number of bytes (up to 4096) which we will need to allocate later.
Enable VMX operations
```
mov rax,cr4
bts rax,13
mov cr4,rax
```
Configure a VMXON structure. This is a 4096-aligned CPU-specific array and its size must be the number we got from the IA32_VMX_BASIC register. A VMXON structure contains:
- 4 bytes which hold the revision number (the only part we have to fill in ourselves),
- 4 bytes that are used for VMX Abort data (we will check that later).
Execute the VMXON instruction, giving it the physical address of this structure.
For each guest, configure a VMCS. A VMCS is a 4096-aligned CPU-specific array which we need to allocate for each guest, and its size must be the number we got from the IA32_VMX_BASIC register. A new VMCS is initialized with VMCLEAR and then loaded for configuration with VMPTRLD. To read or write into the loaded VMCS we use VMREAD and VMWRITE. A VMCS contains:
- 4 bytes which hold the revision number,
- 4 bytes that are used for VMX Abort data (we will check that later),
- The rest of the bytes are used by VMCS groups (we will check that later).
Configure the memory available to the guests.
Launch a guest with VMLAUNCH.
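The guest returns (exits) to the host on specific conditions.
The host uses VMRESUME to re-enter a guest after an exit. There is no separate pause instruction: a guest stays paused for as long as the host does not resume it.
When the guest terminates, the host uses VMXOFF to turn off VMX operations.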
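
The first steps of this lifecycle (feature test, CR4.VMXE, VMXON structure, VMXON) can be sketched as one routine. This is only an illustration under stated assumptions: the code runs in 64-bit mode at ring 0, memory is identity-mapped so that linear and physical addresses match, the image is loaded at a 4096-aligned address, and CR0/CR4 already satisfy the fixed-bit requirements described later. The names EnterVmx, VmxonRegion and VmxonPtr are hypothetical. The IA32_FEATURE_CONTROL (MSR 0x3A) test is an addition to the list above: firmware normally uses that register to allow or forbid VMXON.

```asm
; Hypothetical routine, 64-bit mode, ring 0. Returns CF = 0 on success.
EnterVmx:
    mov eax,1
    cpuid
    bt ecx,5                ; CPUID.1:ECX bit 5 = VMX supported
    jnc .Fail

    mov ecx,0x3A            ; IA32_FEATURE_CONTROL
    rdmsr
    and eax,5               ; bit 0 = locked, bit 2 = VMXON allowed
    cmp eax,5
    jne .Fail               ; this sketch gives up unless both bits are set

    mov rax,cr4
    bts rax,13              ; CR4.VMXE
    mov cr4,rax

    mov ecx,0x480           ; IA32_VMX_BASIC
    rdmsr                   ; EAX = revision number
    mov [VmxonRegion],eax   ; first 4 bytes of the structure = revision
    lea rax,[VmxonRegion]
    mov [VmxonPtr],rax      ; VMXON reads the physical address from memory
    vmxon [VmxonPtr]
    jbe .Fail               ; CF = 1 or ZF = 1 reports failure
    clc
    ret
.Fail:
    stc
    ret

VmxonPtr dq 0
align 4096
VmxonRegion rb 4096         ; 4096 bytes always covers the size reported by the CPU
```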

VMCS Groups

The rest of the VMCS (that is, everything after the first 8 bytes: revision + VMX Abort) is divided into 6 subgroups:

Guest State
Host State
Non root controls
VMExit controls
VMEntry controls
VMExit information

Each of the above groups contains important information. We will look at them one by one. To mark a VMCS for further reading/writing with VMREAD or VMWRITE, you would first initialize its first 4 bytes to the revision (as with the VMXON structure above), execute a VMCLEAR with its address to initialize it, and then execute a VMPTRLD with its address.
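
A sketch of this sequence follows, with a VMCLEAR placed before the VMPTRLD so that the VMCS starts from an initialized state. It assumes that VMXON has already succeeded, 64-bit mode, identity-mapped memory and an image loaded at a 4096-aligned address. PrepareVmcs, Vmcs and VmcsPtr are hypothetical names.

```asm
; Hypothetical routine. Returns CF = 0 when Vmcs is the current VMCS.
PrepareVmcs:
    mov ecx,0x480           ; IA32_VMX_BASIC
    rdmsr                   ; EAX = revision number
    mov [Vmcs],eax          ; first 4 bytes of the VMCS = revision
    lea rax,[Vmcs]
    mov [VmcsPtr],rax       ; VMCLEAR and VMPTRLD read the address from memory
    vmclear [VmcsPtr]       ; initialize the VMCS
    jbe .Fail               ; CF = 1 or ZF = 1 reports failure
    vmptrld [VmcsPtr]       ; make it the current VMCS for VMREAD/VMWRITE
    jbe .Fail
    clc
    ret
.Fail:
    stc
    ret

VmcsPtr dq 0
align 4096
Vmcs rb 4096
```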

Appendix H of the 3B Intel Manual has a list of all indices. For example, the index of the RIP of the guest is 0x681e. To write the value 0 to that field, we would use:

mov rax,0681eh
mov rbx,0
vmwrite rax,rbx

Not all features are present in all processors. We must check the VMX MSRs for available features before using them. Intel's 3B Appendix G contains all these MSRs. To read an MSR, you put its number into ECX and execute the rdmsr opcode. The 64-bit result is returned in EDX:EAX (high half in EDX, low half in EAX).

IA32_VMX_BASIC (0x480): Basic VMX information including revision, VMCS size, memory types and others.
IA32_VMX_PINBASED_CTLS (0x481): Allowed settings for pin-based VM execution controls.
IA32_VMX_PROCBASED_CTLS (0x482): Allowed settings for processor based VM execution controls.
IA32_VMX_PROCBASED_CTLS2 (0x48B): Allowed settings for secondary processor based VM execution controls.
IA32_VMX_EXIT_CTLS (0x483): Allowed settings for VM Exit controls.
IA32_VMX_ENTRY_CTLS (0x484): Allowed settings for VM Entry controls.
IA32_VMX_MISC MSR (0x485): Allowed settings for miscellaneous data, such as RDTSC options, unrestricted guest availability, activity state and others.
IA32_VMX_CR0_FIXED0 (0x486) and IA32_VMX_CR0_FIXED1 (0x487): Indicate which CR0 bits are fixed in VMX operation. A bit that is 1 in FIXED0 must be 1 in CR0, and a bit that is 0 in FIXED1 must be 0 in CR0.
IA32_VMX_CR4_FIXED0 (0x488) and IA32_VMX_CR4_FIXED1 (0x489): Same for CR4.
IA32_VMX_VMCS_ENUM (0x48A): enumerator helper for VMCS.
IA32_VMX_EPT_VPID_CAP (0x48C): provides information for capabilities regarding VPIDs and EPT.
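
For the control MSRs (0x481 to 0x484), the low 32 bits returned by rdmsr mark the control bits that must be 1, and the high 32 bits mark the control bits that may be 1. A common way to apply them is sketched below for the pin-based controls. The sketch assumes 64-bit mode and a current VMCS; the desired value of zero in EBX is only an example, and 0x4000 is the VMCS index of the pin-based controls.

```asm
    xor ebx,ebx             ; desired pin-based controls: nothing requested here
    mov ecx,0x481           ; IA32_VMX_PINBASED_CTLS
    rdmsr                   ; EAX = bits that must be 1, EDX = bits that may be 1
    or ebx,eax              ; force the mandatory bits on
    and ebx,edx             ; drop anything this CPU does not support
    mov eax,0x4000          ; VMCS index: pin-based VM-execution controls
    vmwrite rax,rbx
```

The same OR/AND pair works for the processor-based, exit and entry controls with their own MSR and VMCS index.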

The Host State

This contains the following information (in parentheses, the size in bits):

CR0,CR3,CR4,RSP,RIP (64 each)
CS,SS,DS,ES,FS,GS,TR selectors (16 each)
FS,GS,TR,GDTR,IDTR base addresses (64 each)
IA32_SYSENTER_CS (32)
IA32_SYSENTER_ESP (64)
IA32_SYSENTER_EIP (64)
*IA32_PERF_GLOBAL_CTRL (64)
*IA32_PAT (64)
*IA32_EFER (64)

The host state tells the CPU how to return to the host after the guest exits. After a successful VMLAUNCH or VMRESUME (if the instruction fails, execution simply continues with the instruction after it), the host is paused until the guest exits. When the guest exits, the host registers are reloaded with the values from this VMCS group.
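
As an illustration, part of the host state can be filled from the registers the host is currently using, followed by the launch itself. This sketch assumes 64-bit mode, a current VMCS and all other groups already configured. LaunchGuest and HostExitHandler are hypothetical labels, and the indices are the Intel-defined ones for host CR0, CR3, CR4, RSP and RIP.

```asm
; Hypothetical routine. Only returns if VMLAUNCH fails.
LaunchGuest:
    mov eax,0x6C00          ; host CR0
    mov rbx,cr0
    vmwrite rax,rbx
    mov eax,0x6C02          ; host CR3
    mov rbx,cr3
    vmwrite rax,rbx
    mov eax,0x6C04          ; host CR4
    mov rbx,cr4
    vmwrite rax,rbx
    mov eax,0x6C14          ; host RSP: the stack the exit handler will use
    mov rbx,rsp
    vmwrite rax,rbx
    mov eax,0x6C16          ; host RIP: where the host continues after an exit
    lea rbx,[HostExitHandler]
    vmwrite rax,rbx
    ; selectors, base addresses and SYSENTER fields are written the same way

    vmlaunch                ; on success the CPU is now running the guest
    ; reaching this line means that VMLAUNCH failed
    jc .Done                ; CF = 1: no usable current VMCS, no error number
    mov eax,0x4400          ; VMCS index: VM-instruction error
    vmread rbx,rax          ; RBX = error number describing the failure
.Done:
    ret
```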

The Guest State

This contains the following information (in parentheses, the size in bits):

CR0,CR3,CR4,DR7,RSP,RIP,RFLAGS (64 each)
For each of CS,SS,DS,ES,FS,GS,LDTR,TR:
- Selector (16)
- Base address (64)
- Segment limits (32)
- Access rights (32)
For GDTR and IDTR:
- Base address (64)
- Limit (32)
IA32_DEBUGCTL (64)
IA32_SYSENTER_CS (32)
IA32_SYSENTER_ESP (64)
IA32_SYSENTER_EIP (64)
IA32_PERF_GLOBAL_CTRL (64)
IA32_PAT (64)
IA32_EFER (64)
SMBASE (32)
Activity State (32) - 0 Active, 1 Inactive (HLT executed), 2 Shutdown (triple fault occurred), 3 Waiting for startup IPI (SIPI).
Interruptibility state (32) - a state that defines some features that should be blocked in the VM.
Pending debug exceptions (64) - to facilitate hardware breakpoints with DR7.
VMCS Link pointer (64) - reserved, set to 0xFFFFFFFFFFFFFFFF.
VMX Preemption timer value (32)
Page Directory pointer table entries (4x64), pointers to pages.

This group defines how the guest will start. The guest can be started in two modes:

Paged 32 bit protected mode.
Real mode (unrestricted guest), if the CPU supports it.

Starting a guest in paged protected mode does not allow the guest to switch to long mode later and does not allow modifications of the GDT. If a guest expects a real mode start but unrestricted guest is not available, then you can start it in VM86 mode.

In unrestricted guest mode, the guest starts in real mode and can modify any register allowed by the VMCS control fields. Note that you still load a protected mode style segment for CS, so real mode starts with a protected mode selector, but you can immediately load a new real mode segment with a far JMP.

The Execution Control Fields

These fields configure what is allowed to be executed in the guest and what is not. Everything not allowed causes a VMEXIT. The sections are:

Pin-Based (32b) : Interrupts
Processor-Based (2x32b)
- Primary: Single Step, TSC, HLT, INVLPG, MWAIT, CR3, CR8, DR0, I/O Bitmaps
- Secondary: EPT, Descriptor Table Change, Unrestricted Guest and others
Exception bitmap (32b): One bit for each exception. If bit is 1, the exception causes a VMExit.
I/O bitmap addresses (2x64b): Controls when IN/OUT cause VMExit.
Time Stamp Counter offset
CR0/CR4 guest/host masks
CR3 Targets
APIC Access
MSR Bitmaps

For example, you can configure it so that an exception reaches the host instead of being handled in the guest. Similarly, you might not allow GDT changes, Control Register changes etc.
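
As a small illustration, the following sketch makes the debug exception (vector 1, the one raised by the trap flag) cause a VMExit. It assumes 64-bit mode and a current VMCS; 0x4004 is the VMCS index of the exception bitmap.

```asm
    mov eax,0x4004          ; VMCS index: exception bitmap
    vmread rbx,rax          ; RBX = current bitmap
    bts ebx,1               ; bit 1 = vector 1 (#DB) now exits to the host
    vmwrite rax,rbx
```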

Exit Control Fields

These fields tell the CPU what to load and what to discard in case of a VMExit:

VMExit Controls (32b)
VMExit Controls for MSRs

Entry Control Fields

These fields tell the CPU what to inject into the guest in case of an entry:

VMEntry Controls (32b)
VMEntry Controls for MSRs
VMEntry Controls for event injection

Exit Information Field (Read only)

Basic information
- Exit Reason (32)
- Exit Qualification (64)
- Guest Linear Address (64)
- Guest Physical Address (64)
Vectored exit information
Event delivery exits
Instruction execution exits
Error field

EPT

EPT (Extended Page Tables) is a mechanism that translates guest physical addresses to host physical addresses. Its tables are organized much like the long mode paging structures.

If you start the guest in Paged Protected Mode, then EPT is not required. Using Unrestricted Guest requires us to use EPT. To see if Unrestricted Guest is supported, read the 0x48B (IA32_VMX_PROCBASED_CTLS2) MSR and check bit 7 of its high 32 bits (that is, bit 7 of EDX after rdmsr), which tells whether that control may be set to 1.
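
The check can be sketched as follows, assuming ring 0 and a CPU that already passed the VMX test. CheckUnrestrictedGuest is a hypothetical name. The routine first makes sure that secondary controls exist at all (bit 31 of the high half of MSR 0x482), because reading MSR 0x48B on a CPU without them would fault, and it also tests the EPT bit (bit 1), since Unrestricted Guest depends on EPT.

```asm
; Hypothetical routine. Returns CF = 1 if Unrestricted Guest can be enabled.
CheckUnrestrictedGuest:
    mov ecx,0x482           ; IA32_VMX_PROCBASED_CTLS
    rdmsr                   ; EDX = primary controls that may be 1
    bt edx,31               ; may "activate secondary controls" be set?
    jnc .No                 ; no: MSR 0x48B does not exist on this CPU
    mov ecx,0x48B           ; IA32_VMX_PROCBASED_CTLS2
    rdmsr                   ; EDX = secondary controls that may be 1
    bt edx,1                ; may "enable EPT" be set?
    jnc .No
    bt edx,7                ; may "unrestricted guest" be set?
    ret                     ; CF already holds the answer
.No:
    clc
    ret
```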

Manual Exits

A guest that knows that it is a guest might want to deliberately exchange information with its host. For this reason, the instruction VMCALL is provided to manually trigger an exit.
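
A sketch of both sides of such an exchange follows. The host part assumes 64-bit mode and that HostExitHandler (a hypothetical label) is the address stored in the host RIP field. The exit reason number 18 for VMCALL and the field indices are the Intel-defined ones; the handling around them is only illustrative.

```asm
; Guest side: hand control to the host.
    vmcall

; Host side. A real handler would first save the guest's general registers,
; because they are not part of the VMCS and this code overwrites some of them.
HostExitHandler:
    mov eax,0x4402          ; VMCS index: exit reason
    vmread rbx,rax
    and ebx,0xFFFF          ; the basic exit reason is in bits 0 - 15
    cmp ebx,18              ; 18 = VMCALL
    jne .Other
    mov eax,0x440C          ; VMCS index: length of the exiting instruction
    vmread rcx,rax
    mov eax,0x681E          ; VMCS index: guest RIP
    vmread rdx,rax
    add rdx,rcx             ; continue after the VMCALL, not on it
    vmwrite rax,rdx
    vmresume                ; back into the guest
.Other:
    ; other exit reasons, or a failed VMRESUME, end up here
    hlt
    jmp .Other
```

Without the RIP adjustment the guest would execute the same VMCALL again on every resume.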

DMMI

DPMI works, but a long mode driver is also needed. Therefore I have decided to create a TSR service, included in the github project. I've called it DOS Multicore Mode Interface. It is a driver which helps you develop 32 and 64 bit applications for DOS, using int 0xF0. This interrupt is accessible from real, protected and long mode. Put the function number in AH.

To check for existence, check the vector for INT 0xF0. It should not be pointing to 0 or to an IRET and, with the vector loaded into ES:BX, ES:BX+2 should point to a dword 'dmmi'.

Int 0xF0 provides the following functions to all modes (real, protected, long):

AH = 0, verify existence. Return values: AX = 0xFACE if the driver exists, DL = total CPUs, DH = virtualization support (0 none, 1 PM only, 2 Unrestricted guest). This function is accessible from real, protected and long mode.
AH = 1, begin thread. BL is the CPU index (1 to max-1). The function creates a thread, depending on AL:
- 0, begin (un)real mode thread. ES:DX = new thread seg:ofs. The thread is run with FS capable of unreal mode addressing, must use RETF to return.
- 1, begin 32 bit protected mode thread. EDX is the linear address of the thread. The thread must return with RETF.
- 2, begin 64 bit long mode thread. EDX holds the linear address of the code to start in 64-bit long mode. The thread must terminate with RET.
- 3, begin virtualized thread. BH contains the virtualization mode (1 for unrestricted guest real mode thread, and 2 for protected mode), and EDX the virtualized linear stack (or in seg:ofs format if unrestricted guest). The thread must return with RETF or VMCALL.
AH = 5, mutex functions. This function is accessible from all modes.
- AL = 0 => initialize mutex to ES:DI (real) , EDI linear (protected), RDI linear (long).
- AL = 1 => Lock mutex
- AL = 2 => Unlock mutex
- AL = 3 => Wait for mutex

AH = 4, execute real mode interrupt. This function is accessible from all modes. AL is the interrupt number, BP holds the AX value and BX,CX,DX,SI,DI are passed to the interrupt. DS and ES are loaded from the high 16 bits of ESI and EDI.
AH = 9, switch to mode. AL = 0 -> Unreal mode, returns immediately (also available from protected and long mode int 0xF0). AL = 1 -> Protected mode, ECX = linear address to start. AL = 2 -> Long Mode, ECX = linear address to start.
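
Putting the existence check and the thread function together, a real mode client could look roughly like the sketch below. It is a 16-bit DOS program fragment based only on the function descriptions above; Start, Thread1 and the .NoDmmi handling are hypothetical, and fetching the vector through DOS function 0x35 is an assumption about how ES:BX is obtained.

```asm
; Real mode (16-bit) DOS program fragment.
Start:
    mov ax,0x35F0           ; DOS function 0x35: get vector 0xF0 into ES:BX
    int 0x21
    mov ax,es
    or ax,bx
    jz .NoDmmi              ; vector is 0:0, nothing installed
    cmp dword [es:bx+2],'dmmi'
    jne .NoDmmi             ; signature missing
    xor ax,ax               ; AH = 0: verify existence
    int 0xF0
    cmp ax,0xFACE
    jne .NoDmmi
    cmp dl,2                ; DL = total CPUs
    jb .NoDmmi              ; a second CPU is needed for a thread
    push cs
    pop es
    mov dx,Thread1          ; ES:DX = thread entry point
    mov bl,1                ; CPU index
    mov ax,0x0100           ; AH = 1 (begin thread), AL = 0 (real mode)
    int 0xF0
    ; a real program would wait for the thread here before exiting
.NoDmmi:
    mov ax,0x4C00           ; exit to DOS
    int 0x21

Thread1:
    ; work for the second CPU goes here
    retf
```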

Now, if you have more than one CPU, your DOS applications/games can directly access all 2^64 bytes of memory and all your CPUs, while still being able to call DOS directly. In order to avoid calling int 0xF0 directly from assembly and to make the driver compatible with higher level languages, an INT 0x21 redirection handler is installed. If you call INT 0x21 from the main thread, INT 0x21 is executed directly. If you call INT 0x21 from a protected or long mode thread, then INT 0xF0 function AX = 0x0421 is executed automatically.
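
For comparison, this is roughly what the direct call looks like when made by hand. The sketch assumes that it runs inside a 32-bit protected mode thread started with AH = 1, AL = 1, and it prints one character through DOS function 0x02, following the register convention of function AH = 4 described above.

```asm
    mov dl,'A'              ; DL = character for DOS function 0x02
    mov bp,0x0200           ; BP carries the AX value for the interrupt: AH = 0x02
    mov ax,0x0421           ; AH = 4 (execute real mode interrupt), AL = 0x21
    int 0xF0
```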

Virtualization Debugger

Debugging protected or long mode under DOS is next to impossible. I am now trying to create a simple DEBUG enhancement, called VDEBUG, which should be able to debug any DOS app in virtualization.

This app should perform the following:

Load the debuggee (int 0x21, function 0x4B01)
Enter long mode (int 0xf0, function 0x0902)
Prepare virtualization structures (int 0xf0, function 0x0801)
Launch an unrestricted guest VM
In the VM, set the trap flag so each opcode causes a VMEXIT.
Jump to the entry point of the debuggee
When the target process calls int 0x21 function 0x4C to terminate, control returns to the instruction following the int 0x21 function 0x4B01 call. Check there if running under a virtual machine. If so, do VMCALL to exit.
Go back to real mode and exit.

At the moment, the implemented functions are:

r - (registers) - shows Control, General, Segment regs, Disassembly and bytes using UDIS86
g - (go) - runs program
t - (trace) - traces commands
h - (help) - shows help
q - (quit) - quits

Compile with VDEBUG=1 in config.asm to enable VDebug.

Multicore Debugger

Debugging protected or long mode under DOS is next to impossible (again). I am now trying to create a simple DEBUG enhancement, called MDEBUG, which should be able to debug any DOS app from another CPU core.

This app should perform the following:

Jump to another core
Load the debuggee (int 0x21, function 0x4B01)
Set the trap flag
On exception, HLT the first processor then go to the MDEBUG processor
On resume, send resume IPI to the first processor

This project is not yet created, but I hope that it will be here soon!

Switcher

True DOS multitasking with this DMMI client. This app should perform the following:

Prompt for core, executable and parameters.
Run the executable in virtualization mode within the specific processor.
On some key combination (for example Ctrl+Alt+Ins), VMCALL and pause the VM
Switch between applications on demand.

Soon to be created!

The project

The full github project includes many functions discussed in this article. It's arranged with 4 filters: 16 bit code, 32 bit code, data, DMMI client and configuration files.

The fact that you made it to this end means that you are truly decisive. Have fun and good luck!

References

http://www.fysnet.net/emsinfo.htm, EMS info
http://www.ctyme.com/rbrown.htm, Ralf Brown Interrupt List
http://bochs.sourceforge.net, Bochs
https://github.com/Himmele/My-Blog-Repository/blob/master/Operating%20Systems/Build%20Your%20Own%20OS/Protected%20Mode%20Tutorial.txt, Till Gerken PM Tutorial
https://wiki.osdev.org/Context_Switching, Task Switching
http://www.sudleyplace.com/dpmione/dpmispec1.0.pdf, DPMI specification
http://www.delorie.com/djgpp/doc/dpmi/, DJGPP DPMI examples
http://www.sudleyplace.com/swat/, 386SWAT protected mode debugger
http://dos32a.narechk.net/index_en.html, DOS32A DPMI extender
http://www.dumais.io/index.php?article=ac3267239dd3e34c061de6413203fb98, VMX Examples and Diagrams