Get Ready
Anyone who has already read my "infamous trilogy" would want to combine all of that material in one nice application. Here is such a combination, along with some new tips and techniques not discussed in the previous articles. It is implemented as a TSR that other apps can call for true multithreading in real, protected or long mode in raw DOS.
Using this code, you can create a DOS app that can:
- Use all your CPUs together
- Lock/Unlock mutexes
- Start threads in real, protected, long and virtualized mode
You need Flat Assembler (FASM) and a FreeDOS installation in a virtualization environment that can expose multiple cores. VMware works for everything except the virtualized threads. DOSBox doesn't work because it doesn't expose ACPI. Bochs works in its special SMP edition for real, protected and long mode, with virtualization. VirtualBox support is not yet complete. My GitHub project includes all these setups for your convenience.
Background
- 1024 assembly books
- 4.023 x 10^23 C++ lines written
- 1 << 62 free space in your mind. The upper bits are reserved for the kernel.
- Lots of patience and humor :)
Locking the Mutex
Yes, in Win32 you have the nice Mutex functions. But what about raw DOS?
First, a word about spin loops. When a Win32 thread calls WaitForSingleObject, the kernel checks whether the object is signaled and, if it is not, does not schedule the thread for resuming. If there is no thread to schedule, the kernel halts the CPU core with the HLT instruction until later. In our little program we own the system and there is no scheduler, so the code simply spins in a loop until the mutex is available.
Therefore, one would expect code like this:
.Loop1:
CMP [shared_var],0xFF
JZ .MutexIsFree
JMP .Loop1
.MutexIsFree:
MOV [shared_var],BL
Not so. The problem is that, when the mutex is released, another CPU might lock the variable before this code does. That is, another CPU might execute its own check after our JZ instruction but before our MOV instruction, and then both CPUs believe they own the mutex.
Therefore, we have to use some atomic operation to achieve the lock:
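CMP [shared_val],BL          ; do we already own the mutex?
JZ .OutLoop2
.Loop1:
CMP [shared_val],0xFF        ; 0xFF means the mutex is free
JZ .OutLoop1
pause                        ; spin loop hint
JMP .Loop1
.OutLoop1:
MOV AL,0xFF                  ; expected value for CMPXCHG
LOCK CMPXCHG [shared_val],BL ; if [shared_val] == AL, store BL and set ZF
JNZ .Loop1                   ; another CPU got there first, spin again
.OutLoop2: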
The magic here is simple. We use the CMPXCHG instruction which, along with the LOCK prefix, atomically tests whether shared_val is still 0xFF (the value in AL) and, if so, writes BL to it and sets the ZF. If another CPU has grabbed the mutex in the meantime, the ZF is cleared and BL is not written to shared_val. Most convenient.
The other interesting thing is the pause opcode, a hint to the CPU that we are inside a spin loop. This improves performance because the CPU avoids the memory-order mis-speculation penalty when the loop finally exits, and it also reduces power consumption while spinning.
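The release side is not shown above. Here is a minimal sketch, assuming the same conventions as the lock code (shared_val holds 0xFF while the mutex is free, BL holds the id of the calling CPU). The MutexUnlock name is hypothetical and not taken from the DMMI source:

```asm
; Assumed conventions, same as the lock code above:
;   shared_val = 0xFF while the mutex is free, otherwise the owner's id
;   BL         = id of the calling CPU (never 0xFF)
MutexUnlock:                        ; hypothetical name
        CMP [shared_val],BL         ; only the current owner may release
        JNZ .NotOwner
        MOV BYTE [shared_val],0xFF  ; mark the mutex as free again
.NotOwner:
        RET

shared_val DB 0xFF                  ; starts out free
```

A plain MOV should be enough for the release, since x86 does not reorder a store ahead of earlier loads or stores from the same CPU, so no LOCK prefix is needed on this side.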
Waking the CPUs
As we saw in the trilogy, we send the INIT and the SIPI. The CPU must start at a 4096-aligned address, so I've filled an array with NOPs and adjusted the startup address accordingly. The CPU starts in real mode.
Therefore, a "SipiStart" routine would look like this:
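SipiStart:
db 4096 dup (144)           ; 4096 NOPs (144 = 0x90), so a 4096-aligned entry point exists inside
CLI
mov di,DATA16
mov ds,di
lidt fword [ds:RealIDT]     ; load the real mode IDT
STI
call FAR CODE16:EnterUnreal ; unreal mode, needed to reach the APIC through FS
MOV EDI,[DS:LocalApic]
ADD EDI,0x0F0               ; spurious interrupt vector register
MOV EDX,[FS:EDI]
OR EDX,0x1FF                ; bit 8 = APIC software enable, bits 0-7 = spurious vector 0xFF
MOV [FS:EDI],EDX
mov di,StartSipiAddrOfs
jmp far [ds:di]             ; far jump to the startup address for this CPU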
To access the APIC, I have to enter unreal mode, so I call EnterUnreal. Note the FAR call: the segment in which EnterUnreal resides is not the same as the CS that is loaded during the SIPI. A newly awoken CPU must also set the spurious vector and enable its APIC in software, as we have seen earlier. Finally, the code jumps far to the 'startup' address for the CPU, depending on the CPU index.
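The 4096-byte alignment matters because the SIPI carries only an 8-bit vector, which the target CPU multiplies by 4096 to get its real mode start address. The sending side can be sketched as follows. This is not the code from the project: SendSipi is a hypothetical name, the offsets 0x300 and 0x310 are the standard xAPIC Interrupt Command Register, and the caller is assumed to be in unreal mode with FS flat and DS addressing LocalApic:

```asm
; In:  BL  = APIC ID of the CPU to wake
;      ESI = linear address of the 4096-aligned entry point (below 1MB)
; Clobbers EAX, EDX, EDI.
SendSipi:                           ; hypothetical name
        mov edi,[ds:LocalApic]      ; Local APIC base, as in SipiStart
        movzx edx,bl
        shl edx,24                  ; destination APIC ID, bits 24-31 of ICR high
        mov [fs:edi+0x310],edx      ; ICR high
        mov eax,esi
        shr eax,12                  ; SIPI vector = start address / 4096
        or eax,0x4600               ; delivery mode 110 = Start-up, level assert
        mov [fs:edi+0x300],eax      ; writing ICR low sends the SIPI
        ret
```

The INIT that precedes it is sent the same way with 0x4500 in ICR low, followed by a delay of roughly 10 ms before the SIPI.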
Interprocessor Interrupts
The APIC provides us with a way to send a message to another CPU. Apart from INIT and SIPI, which we saw earlier, the local APIC can be used to send a 'normal' interrupt, i.e., merely executing INT XX in the context of the target CPU. We have to take the following into consideration:
- If the CPU is in HLT state, the interrupt wakes it, and when the interrupt returns the CPU resumes with the instruction after the HLT opcode. If interrupts are also disabled with CLI, then we must send an NMI (a flag in the APIC Interrupt Command Register) to wake the CPU.
- If the CPU is in HLT state and we send an INIT and a SIPI again, the CPU starts all over again from real mode.
- The interrupt must exist in the target processor. For example, in protected mode, the interrupt must have been defined in the IDT.
- The Local APIC is mapped at the same memory address on all CPUs, therefore, we lock for write access (mutex) before we issue the interrupt.
- Because registers cannot be passed from CPU to CPU, we have to write all the registers that will be used by the interrupt (if any) to a separate memory area.
- The interrupt might fail. I don't know why, but that's what they say. So, you have to rely on some inter-CPU communication (via shared memory and mutexes) to verify the delivery. I'm doing that in my code with a simple flag.
- Finally, the handler of the interrupt must tell its own Local APIC that there is an "End of Interrupt". Remember out 020h,al in the past? Now we write the value 0 to the EOI register (LocalApic + 0xB0).
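The sending and acknowledging steps from the list above can be sketched like this. It is an illustration under assumptions, not the project's code: SendIpi and IpiHandler are hypothetical names, the APIC is assumed to be in xAPIC mode (ICR at offsets 0x300 and 0x310), both CPUs are assumed to run 16-bit code in unreal mode with FS flat, and LocalApic is assumed to live in the DATA16 segment as in SipiStart. Locking and the delivery flag are left out.

```asm
; In:  AL = interrupt vector, BL = APIC ID of the target CPU
SendIpi:                                   ; hypothetical name
        push edx
        push edi
        mov edi,[ds:LocalApic]             ; Local APIC base
        movzx edx,bl
        shl edx,24                         ; destination APIC ID, bits 24-31 of ICR high
        mov [fs:edi+0x310],edx
        movzx edx,al                       ; vector in bits 0-7, delivery mode 000 = fixed
                                           ; (OR in 0x400 for delivery mode 100 = NMI)
        mov [fs:edi+0x300],edx             ; writing ICR low sends the interrupt
.Wait:
        pause
        test dword [fs:edi+0x300],0x1000   ; bit 12 = delivery status, 1 while pending
        jnz .Wait
        pop edi
        pop edx
        ret

; Runs on the target CPU when the vector fires.
IpiHandler:                                ; hypothetical name
        push ds
        push edi
        mov di,DATA16
        mov ds,di
        ; ... do the actual work here ...
        mov edi,[ds:LocalApic]
        mov dword [fs:edi+0xB0],0          ; End of Interrupt
        pop edi
        pop ds
        iret
```

The delivery status bit only says that the local APIC has sent the message, so a shared flag set by the handler is still needed to confirm that the target actually ran it.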
CPU Real Mode
If a CPU runs in real mode, you may want to call DOS. It will work, provided that no other CPU calls DOS at the same time, which of course cannot be assumed in our simple app. Therefore, you have to use int 0xF0 function 5 to manage mutexes. The thread starts automatically in unreal mode, with the stack and FS already set up. The thread terminates with retf. If you call DOS through interrupt 0xF0 function 4, then locking is provided automatically.
This is the real mode thread from dmmic.asm:
rt1:
sti
push cs
pop ds
mov dx,m1
mov ax,0x0900
int 0x21
push cs
pop es
mov di,mut1
mov ax,0x0503
int 0xF0
retf
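The thread above calls INT 0x21 directly, which is only safe while no other CPU is inside DOS. Here is a sketch of the same thread with the print routed through int 0xF0 function 4, built only from the register conventions described in this article (AL = interrupt number, BP = AX value, DS taken from the high 16 bits of ESI). The label rt1_locked is hypothetical, and m1 is assumed to be a '$'-terminated string in the thread's own code segment:

```asm
rt1_locked:                 ; hypothetical variant of rt1
        sti
        xor esi,esi
        mov si,cs
        shl esi,16          ; high 16 bits of ESI = CS, loaded into DS for the call
        mov dx,m1           ; DS:DX = '$'-terminated string
        mov bp,0x0900       ; AX value for INT 0x21 (AH = 9, print string)
        mov ax,0x0421       ; AH = 4 execute real mode interrupt, AL = 0x21
        int 0xF0
        push cs
        pop es
        mov di,mut1         ; same mutex call as in rt1
        mov ax,0x0503
        int 0xF0
        retf
```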
CPU Protected Mode
This thread runs in 32-bit full 4GB protected mode. GS points to base-0 32-bit data. It uses int 0xF0 to call DOS, then exits:
SEGMENT T32 USE32
rt2:
mov ax,0
int 0xF0
mov ax,0x0421
mov bp,0x0900
xor esi,esi
mov si,MAIN16
shl esi,16
mov dx,m2
int 0xF0
mov ax,0x0503
linear edi,mut1,MAIN16
int 0xF0
retf
CPU Long Mode
As I said in the trilogy, long mode can be entered directly from real mode, because the instructions RDMSR and WRMSR are available. This is also implemented in two pieces. One prepares long mode by:
- Loading the GDT.
- Preparing a see-through page table for the first 1GB and mapping the Local APIC to a fixed position (1GB - 2MB) in that area, because the Local APIC is usually located at 0xFEE00000, which means it won't be visible in our 1GB see-through mapping. Alternatively, preparing a 4GB page table with 1GB pages, if your system supports 1GB pages. Most do.
- Enabling PAE, PSE, and long mode.
And one enters long mode by enabling paging, enabling interrupts with int 0xf0 accessible, then jumping to the code. Remember that long mode is flat 64 bit and CS, DS, ES, SS have no meaning. Or so they say; I still had to set SS to page64_idx in Bochs. Perhaps a Bochs bug?
SEGMENT T64 USE64
rt3:
nop
nop
nop
nop
nop
mov ax,0
int 0xF0
mov rax,0x0421
mov rbp,0x0900
xor rsi,rsi
mov si,MAIN16
shl rsi,16
mov rdx,m2
int 0xF0            ; issue the DOS call before AX is reused, as in the protected mode thread
mov ax,0x0503
linear rdi,mut1,MAIN16
int 0xF0
ret
CPU Virtualized Protected Mode
This thread runs in 32-bit full 4GB virtualized protected mode. It can still call DOS. This mode is very useful since, whatever your thread might do, it can never crash the entire PC, only exit with a VMEXIT procedure.
v1:
mov ax,0
int 0xF0
mov ax,0x0421
mov bp,0x0900
xor esi,esi
mov si,MAIN16
shl esi,16
mov dx,m2
int 0xF0
mov ax,0x0503
linear edi,mut1,MAIN16
int 0xF0
retf
The DMMI
I've called it DOS Multicore Mode Interface. It is a driver which helps you develop 32 and 64 bit applications for DOS, using int 0xF0. This interrupt is accessible from real, protected and long mode. Put the function number in AH.
To check for existence, check the vector for INT 0xF0. It should not be pointing to 0 or to an IRET, and ES:BX+2 should point to a dword 'dmmi'.
Int 0xF0 provides the following functions to all modes (real, protected, long):
- AH = 0, verify existence. Return values: AX = 0xFACE if the driver exists, DL = total CPUs. This function is accessible from real, protected and long mode.
- AH = 1, begin thread. BL is the CPU index (1 to max-1). The function creates a thread, depending on AL:
  - 0, begin (un)real mode thread. ES:DX = new thread seg:ofs. The thread runs with FS capable of unreal mode addressing and must use RETF to return.
  - 1, begin 32 bit protected mode thread. EDX is the linear address of the thread. The thread must return with RETF.
  - 2, begin 64 bit long mode thread. EDX holds the linear address of the code to start in 64-bit long mode. The thread must terminate with RET.
  - 3, begin virtualized thread. BH contains the virtualization mode (currently only mode 2 = protected mode virtualization is supported), and EDX the virtualized linear stack. The thread must return with RETF or VMCALL.
- AH = 5, mutex functions.
  - AL = 0 => initialize mutex at ES:DI (real), EDI linear (protected), RDI linear (long).
  - AL = 1 => Lock mutex
  - AL = 2 => Unlock mutex
  - AL = 3 => Wait for mutex
- AH = 4, execute real mode interrupt. AL is the interrupt number, BP holds the AX value and BX, CX, DX, SI, DI are passed to the interrupt. DS and ES are loaded from the high 16 bits of ESI and EDI.
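Putting the existence check and function 1 together, a real mode caller could look roughly like this. It is a sketch, not code from DMMIC.asm: DetectAndStart is a hypothetical name, rt1 is assumed to be a real mode thread in the caller's code segment, and reading "ES:BX+2" as a signature stored two bytes past the handler entry (with ES:BX being the vector returned by DOS function 0x35) is my interpretation of the description above.

```asm
; Out: CF clear if a thread was started on CPU 1, CF set otherwise
DetectAndStart:                       ; hypothetical name
        mov ax,0x35F0                 ; DOS: get vector of INT 0xF0 into ES:BX
        int 0x21
        mov ax,es
        or ax,bx
        jz .NoDmmi                    ; vector is 0:0, nothing installed
        cmp byte [es:bx],0xCF         ; handler is a bare IRET
        je .NoDmmi
        cmp dword [es:bx+2],'dmmi'    ; signature expected at ES:BX+2 (assumed layout)
        jne .NoDmmi
        mov ax,0                      ; AH = 0, verify existence
        int 0xF0
        cmp ax,0xFACE
        jne .NoDmmi
        cmp dl,2                      ; DL = total CPUs
        jb .NoDmmi                    ; no second CPU to run a thread on
        push cs
        pop es
        mov dx,rt1                    ; ES:DX = seg:ofs of the new thread
        mov bl,1                      ; CPU index 1
        mov ax,0x0100                 ; AH = 1 begin thread, AL = 0 (un)real mode
        int 0xF0
        clc
        ret
.NoDmmi:
        stc
        ret
```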
Now, if you have more than one CPU, your DOS game can directly access all 2^64 bytes of memory and all your CPUs, while still being able to call DOS directly. Isn't that fun?
INT 0x21 Redirection
In order to avoid calling int 0xF0 directly from assembly and to make the driver compatible with higher level languages, an INT 0x21 redirection handler is installed. If you call INT 0x21 from the main thread, INT 0x21 is executed directly. If you call INT 0x21 from a protected or long mode thread, then INT 0xF0 function AX = 0x0421 is executed automatically.
So with a bit of luck, you can use your favorite stdio functions from a C function in another thread directly!
The Code
Once you run entry.exe with /r, the library installs as a TSR and int 0xf0 is available. DMMIC.asm shows example calls.
ToDo
- Add more virtualization modes
History
- 08-1-2018: Added virtualization capabilities
- 07-1-2018: Fixed Long mode int 0xF0 call
- 06-1-2018: Updated DMMI to my new github project
- 22-5-2015: Thanks to Brendan for the synchronization tip
- 18-5-2015: Fixed multiple call bug with End of Interrupt write
- 17-5-2015: First release
I'm working in C++, PHP, Java, Windows, iOS, Android and Web (HTML/Javascript/CSS).
I have a PhD in Digital Signal Processing and Artificial Intelligence and I specialize in Pro Audio and AI applications.
My home page: https://www.turbo-play.com