In PyTorch, I use the `torch.cuda.memory.CUDAPluggableAllocator` and `cudaMallocManaged` methods to allocate memory across GPU memory, DRAM, and swap.
When I do this, my computer becomes very slow with extremely high `iowait`, and it quickly hangs after using up all the GPU memory and DRAM, even though CPU usage looks normal.
I am using *Ubuntu Server 22.04*, *Anaconda3*, and *Docker*. Linux automatically starts using swap memory once all the DRAM is used up. I want to train an AI model, predict with it, and save it without the computer lagging, even when the required memory exceeds (GPU memory + DRAM + swap memory).
##################################################
Reference:
- Using custom memory allocators for CUDA: https://pytorch.org/docs/stable/notes/cuda.html
- Introduction to swap memory: https://blogs.oracle.com/linux/post/understanding-linux-kernel-memory-statistics
What I have tried:
My goal is to train an AI model, predict with it, and save it without the computer lagging, even when the required memory exceeds (GPU memory + DRAM + swap memory).
In order to achieve my goal, I have tried the following three methods, alone and in combination:
- Force a program to use swap memory directly before DRAM runs out.
- Use PyTorch's built-in functions to accomplish my objectives.
- Employ a software control program to prevent the computer from lagging while continuing to train the AI model, predict with it, and save it.
I have tried `cgroup v2`, Docker (including the NVIDIA Docker runtime), the Linux `vm.swappiness` setting, PyTorch `fbgemm` UVM tensors, and `torch.cuda.memory.CUDAPluggableAllocator`, but could not achieve my goal.
---
The following command line is expected to implement 2 of the methods above:
- Force a program to use swap memory directly before DRAM runs out.
- Employ a software control program to prevent the computer from lagging while continuing to train the AI model, predict with it, and save it.
`cgroup v2` is used to limit DRAM use; 42949672960 bytes is 40 GiB. The command line is:
```
echo 42949672960 > /path/to/the/location/memory.high
```
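For the limit to take effect, the training process has to be a member of that cgroup. A minimal sketch of how I enroll it (assuming the cgroup directory used above and sufficient permission to write to it):
```
import os

# Hypothetical cgroup v2 directory in which memory.high was set above.
CGROUP_DIR = "/path/to/the/location"

# Enroll the current process before importing torch / allocating memory,
# so the memory.high throttle applies to the whole training run.
with open(os.path.join(CGROUP_DIR, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))
```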
---
The following segment of a command line is expected to implement 2 of the methods above:
- Force a program to use swap memory directly before DRAM runs out.
- Employ a software control program to prevent the computer from lagging while continuing to train the AI model, predict with it, and save it.
Docker is used to limit DRAM and swap-device use as well. The segment of the command line is:
```
docker run ... \
--memory=10g \
--memory-swap=3789g \
...
```
and [EDITED ON 2024-05-25]
```
docker run ... \
--device-write-bps=/path/to/device:1500mb \
--device-read-iops=/path/to/device:1500gb \
...
```
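To confirm which limits actually reach the process, I read the cgroup files from inside the container (a small sketch, assuming the container sees the cgroup v2 unified hierarchy mounted at `/sys/fs/cgroup`):
```
# Print the memory limits Docker actually applied inside the container.
for name in ("memory.max", "memory.swap.max", "memory.high"):
    try:
        with open(f"/sys/fs/cgroup/{name}") as f:
            print(name, "=", f.read().strip())
    except FileNotFoundError:
        print(name, "not found (cgroup v1 host?)")
```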
---
The following source code is expected to implement 2 of the methods above:
- Use PyTorch's built-in functions to accomplish my objectives.
- Employ a software control program to prevent the computer from lagging while continuing to train the AI model, predict with it, and save it.
The `torch.cuda.memory.CUDAPluggableAllocator` method is pointed at `alloc.so`, which is compiled from the following `alloc.cc` source code:
```
// Compile with: g++ alloc.cc -o alloc.so -I/usr/local/cuda-11.8/include -shared -fPIC
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
// Allocate CUDA managed (unified) memory, which can be oversubscribed and
// backed by host memory when GPU memory runs out.
void* my_malloc(ssize_t size, int device, cudaStream_t stream)
{
    void *ptr;
    cudaMallocManaged(&ptr, size);
    return ptr;
}

void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream)
{
    cudaFree(ptr);
}
}
```
and
```
// Compile with: g++ alloc.cc -o alloc.so -I/usr/local/cuda-11.8/include -shared -fPIC
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
// Allocate pinned, mapped host memory (zero-copy) instead of device memory.
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
    void *ptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&ptr, size, cudaHostAllocMapped);
    return ptr;
}

void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
    cudaFreeHost(ptr);
}
}
```
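In both cases the compiled `alloc.so` is plugged into PyTorch roughly as follows (a minimal sketch following the PyTorch custom-allocator docs; the `./alloc.so` path is illustrative):
```
import torch

# Register the custom allocator before the first CUDA allocation.
my_allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "./alloc.so", "my_malloc", "my_free"
)
torch.cuda.memory.change_current_allocator(my_allocator)

# From here on, CUDA tensors are allocated through my_malloc.
x = torch.zeros(1024, device="cuda")
```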
---
I also tried `cpulimit`, `prlimit`, and `nice`, but they do not work either. [EDITED ON 2024-05-26]
`cpulimit` command line:
```
cpulimit --pid $(process_pid) --limit=15 --lazy --background
```
CPU usage stays below 100%, but the process still lags and eventually gets killed.
`nice` command line:
```
nice -n 19 python /path/to/file.py
```
This command does not solve the lagging problem.
And the `prlimit` command line:
```
prlimit -m=42949672960 python3 /path/to/file.py
```
Process limit status
```
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set 42949672960 42949672960 bytes
Max processes 256496 256496 processes
Max open files 1024 1048576 files
Max locked memory 8419708928 8419708928 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 256496 256496 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
```
This command does not solve the lagging problem either.
---
I have also tried the PyTorch `fbgemm` library, but the result is similar to employing `torch.cuda.memory.CUDAPluggableAllocator`.
Are there any other possible methods to achieve my goal?
##################################################
Complement 1:
I have already set up two NVMe M.2 SSDs (PCIe 3.0) as swap devices.
As far as I know, a PCIe 3.0 x4 M.2 slot has a maximum bandwidth of around 3.9 GB/s, and I have two such SSDs.
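A rough back-of-the-envelope calculation (the working-set size below is made up) illustrates why paging through these SSDs still stalls the run:
```
# Very rough estimate. Assumes ~3.9 GB/s theoretical per PCIe 3.0 x4 SSD,
# two SSDs striped perfectly, and a hypothetical 500 GB working set that
# must be paged through swap once per pass.
per_ssd_gb_s = 3.9
swap_bw_gb_s = 2 * per_ssd_gb_s
working_set_gb = 500
print(f"one full pass over swap: ~{working_set_gb / swap_bw_gb_s:.0f} s")  # about 64 s
```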
##################################################
Reference:
- swappiness: https://phoenixnap.com/kb/swappiness
- Use RAM after GPU memory is not enough: https://stackoverflow.com/questions/27035851/use-ram-after-gpu-memory-is-not-enough
- What is the maximum read and write speed for PCIe 3.0 x4 M.2 slots?: https://pcpartpicker.com/forums/topic/391989-what-is-the-maximum-read-and-write-speed-for-pcie-30-x4-m2-slots