|
Following KarsonV's suggestion, I removed the container completely and rebuilt from compose. Also removed all the old config files and any remaining module files. Same error.
|
|
|
|
|
Here's my compose snippet:
codeproject.ai:
  container_name: codeproject.ai
  image: codeproject/ai-server
  privileged: true
  devices:
    - /dev/apex_0:/dev/apex_0
    - /dev/apex_1:/dev/apex_1
  ports:
    - 32168:32168
  volumes:
    - /mnt/docker/codeproject.ai:/etc/codeproject/ai
    - /mnt/docker/codeproject.ai/opt/codeproject/ai:/app/modules
  restart: always
Steps to troubleshoot:
docker stop codeproject.ai
docker rm codeproject.ai
sudo rm /mnt/docker/codeproject.ai/*
sudo apt update
sudo apt upgrade
sudo reboot
docker compose up -d codeproject.ai
This did not resolve the issue.
docker stop codeproject.ai
docker rm codeproject.ai
sudo rm -rf /mnt/docker/codeproject.ai/*
Change
image: codeproject/ai-server to
image: codeproject/ai-server:2.6.2
docker compose up -d codeproject.ai
Resolved. Reverting to 2.6.2 fixed the issue for me. Happy to do some further troubleshooting if @chris-maunder wants to investigate.
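In case it helps anyone following along, here's a minimal sketch of the pinned service, plus a quick check that the container really was created from the expected tag (everything else in the service stays as in the snippet above):

codeproject.ai:
  image: codeproject/ai-server:2.6.2
  # ...container_name, devices, ports, volumes and restart unchanged...

# after bringing it up, confirm which image the container was created from
docker inspect --format '{{.Config.Image}}' codeproject.ai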
|
|
|
|
|
Q1. Can this be run on Windows 7 SP1? If it can't, that's too bad. (Many others, like "LocalAI", never answer this question.)
Q2. What is the minimum requirement to use this for text generation/chat? How much memory and/or what GPU do I need?
Q3. Can I use this without internet (offline)? Does it really work without it? I'm planning to download the installer and install offline.
Q4. Can your AI process non-English languages? Can I teach it based on my own text data?
|
|
|
|
|
1. Probably not. [.NET 7 is not supported on Windows 7](https://github.com/dotnet/core/issues/7556). Having said that, recompiling everything against .NET 6 should work, but you'd also have to ensure WMIC and PowerShell were installed.
Windows 7 is simply too old for us to be supporting, but technically the code *should* be fine.
2. What is the minimum requirement to use this for text generation/chat? How much memory and/or what GPU do I need?
It all works on CPUs; it's just slow. The more RAM the better. I run 16 GB on all my machines (except RPi/Jetson) and generative AI works fine on these. Even 8 GB should be good, but I would not go lower.
3. Installation needs to happen online so it can download the libraries needed for your particular hardware/OS combo. Once it's installed, it's fully offline.
4. Can your AI process non-English languages? Can I teach it based on my own text data?
Whatever GGUF-format Llama model you can find will work. Training it yourself is a task beyond what we can help you with.
cheers
Chris Maunder
|
|
|
|
|
My primary host is running UnRaid 6.12.9 with a CPAI docker container and a Windows VM running BlueIris. I have another Linux Mint machine on the network running CPAI with no docker container. Both instances of CPAI can see each other, sort of.
If I set Blue Iris to target the remote Linux machine, everything works fine, excess requests get passed back to the Unraid machine, processed, sent back. Everything is happy.
But if I set Blue Iris to target the docker running on the same machine as itself, it still sees the other Mint machine and tries to send overflow to it, but times out with no response.
The status of the Unraid server from the Mint remote wavers from active true to active false every few seconds.
I just upgraded to CPAI 2.6.5 but had the same behaviour under 2.6.2. UDP ports are open to the docker container.
I want to keep the docker on the Unraid as the primary instance because the remote mint machine is used frequently and is prone to going down for various reasons, and blue iris is unable to automatically switch servers.
From the Unraid Server:
Current Server mesh status
UnraidServer
Hostname: 172.17.0.8
System: Docker (Linux) Tesla P4
Platform: Docker
Active: true
Forwarding Requests: true
Accepting Requests: true
Visible Servers:
mintMachine
Routes Available: (16366 processed)
vision/custom 43.7ms (avg process time), 15756 processed
vision/custom/list 0ms (avg process time), 0 processed
vision/detection 0ms (avg process time), 0 processed
vision/face 20ms (avg process time), 610 processed
vision/face/match 0ms (avg process time), 0 processed
Remote Servers in mesh: 1
mintMachine
Hostname: mintMachine
System: Linux (Linux) NVIDIA GeForce GTX 1660 Ti with Max-Q Design
Platform: Linux
Active: true
Forwarding Requests: true
Accepting Requests: true
Visible Servers:
UnraidServer
Routes Available: (3 processed)
vision/custom 3000ms (avg round trip), 2 requests forwarded
vision/custom/list 0ms (avg round trip), 0 requests forwarded
vision/detection 0ms (avg round trip), 0 requests forwarded
vision/face 0ms (avg round trip), 1 requests forwarded
vision/face/match 0ms (avg round trip), 0 requests forwarded
From the Linux Mint remote machine:
Current Server mesh status
mintMachine
Hostname: mintMachine
System: Linux (Linux) NVIDIA GeForce GTX 1660 Ti with Max-Q Design
Platform: Linux
Active: true
Forwarding Requests: true
Accepting Requests: true
Visible Servers:
UnraidServer
Routes Available: (0 processed)
vision/custom 0ms (avg process time), 0 processed
vision/custom/list 0ms (avg process time), 0 processed
vision/detection 0ms (avg process time), 0 processed
vision/face 0ms (avg process time), 0 processed
vision/face/match 0ms (avg process time), 0 processed
Remote Servers in mesh: 1
UnraidServer
Hostname: 192.168.1.101
System: Docker (Linux) Tesla P4
Platform: Docker
Active: false
Forwarding Requests: true
Accepting Requests: true
Visible Servers:
mintMachine
Routes Available: (0 processed)
vision/custom 0ms (avg round trip), 0 requests forwarded
vision/custom/list 0ms (avg round trip), 0 requests forwarded
vision/detection 0ms (avg round trip), 0 requests forwarded
vision/face 0ms (avg round trip), 0 requests forwarded
vision/face/match 0ms (avg round trip), 0 requests forwarded
|
|
|
|
|
I think I might have got this licked.
I added the IP address of the Mint remote machine to the known mesh servers in appdata/codeprojectai/data/serversettings.json on the Unraid server:
"KnownMeshHostnames": [ "192.168.1.103" ],
I already had the Unraid server's address in the appsettings.json on the Mint machine, so I'm not sure whether you need both pointing at each other, but it seems to be working.
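For anyone else hunting for where that entry lives: it sits inside the mesh section of serversettings.json, roughly like this (I'm going from memory on the surrounding section name, so treat the exact nesting as an approximation):

"MeshOptions": {
  "KnownMeshHostnames": [ "192.168.1.103" ]
}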
Hopefully this helps out someone having the same issues as me!
|
|
|
|
|
Actually I'm still having some issues.
I haven't been able to narrow down exactly when it happens, but it seems like after either end reboots, the mesh breaks and the servers start looking for each other by hostname instead of by IP address, which for whatever reason doesn't make it through the Docker network interface. To recover, you have to disable the mesh on the satellite, restart the Docker container, then re-enable the mesh on the satellite. Nothing relevant comes up in the logs to give any insight into why this is happening.
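If anyone wants to confirm the name-resolution side of this, a quick sanity check is to compare what the container can resolve against what the host can (substitute your own container and satellite names; mine are just examples):

# can the CPAI container resolve the satellite by name?
docker exec <cpai-container> getent hosts mintMachine
# compare with the host's own view of the same name
getent hosts mintMachine

If the second command resolves and the first doesn't, the mesh falling back to hostnames would explain the timeouts.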
|
|
|
|
|
I am attempting to run image codeproject/ai-server:cuda12_2 (current) under Docker on Fedora 39. The server has abundant resources with 256 GB of RAM, and as far as I know, Docker is not imposing memory limits. When I start the container, codeproject.ai starts normally and without errors. However, it crashes after 5 or 6 minutes with "out of memory" and "codeproject exited with code 139." The system log shows "systemd-coredump[1460173]: Process 1451539 (CodeProject.AI.) of user 0 dumped core.#012#012Stack trace of thread 882:#012#0 0x00007fbf944bc898 n/a (/usr/lib/x86_64-linux-gnu/libc.so.6 + 0x28898)#012#1 0x00007fafd2a00640 n/a (n/a + 0x0)#012ELF object binary architecture: AMD x86-64."
The container crashes whether or not it has been accessed, and whether or not it has claimed GPU resources. As long as it is running, it readily accepts images and performs comparisons, using about 1GB of GPU memory and around 3 GB of RAM. However, it still crashes.
I have searched and can't find anyone else with this problem, suggesting that it is something in my environment, but I can't figure out what it could be. I would appreciate any ideas.
|
|
|
|
|
Thanks very much for your report. Could you please share your System Info tab from your CodeProject.AI Server dashboard?
Thanks,
Sean Ewington
CodeProject
|
|
|
|
|
Server version: 2.6.5
System: Docker (ai-server)
Operating System: Linux (Ubuntu 22.04)
CPUs: AMD EPYC 7262 8-Core Processor (AMD)
2 CPUs x 8 cores. 16 logical processors (x64)
GPU (Primary): NVIDIA GeForce RTX 4060 (8 GiB) (NVIDIA)
Driver: 550.78, CUDA: 12.4 (up to: 12.4), Compute: 8.9, cuDNN: 8.9.6
System RAM: 252 GiB
Platform: Linux
BuildConfig: Release
Execution Env: Docker
Runtime Env: Production
Runtimes installed:
.NET runtime: 7.0.19
.NET SDK: Not found
Default Python: 3.10.12
Go: Not found
NodeJS: Not found
Rust: Not found
Video adapter info:
System GPU info:
GPU 3D Usage 2%
GPU RAM Usage 1.9 GiB
Global Environment variables:
CPAI_APPROOTPATH = <root>
CPAI_PORT = 32168
|
|
|
|
|
This one's beyond my pay grade. If you've configured Docker to have this much RAM, it should be lavishing thanks on you, not core dumping. That's just inconsiderate.
I did see mention of the hosting system core dumping when a Docker container hit its assigned RAM max, but that doesn't seem to be the case here.
I wonder if it's not an out-of-memory issue, but rather a memory access / memory corruption issue?
cheers
Chris Maunder
|
|
|
|
|
Thanks for thinking about this. I also think it's a memory access issue, but why isn't everybody using this Docker container getting it? Docker provides such a consistent environment that it's really hard to figure out why it's only my Docker container that doesn't work. The amount of system resources is probably the biggest variable not controlled by the container, but as you point out, there is no shortage. I have watched the memory consumption using "docker stats" once a second, and the memory consumption does not gradually increase over the 5-10 minute lifetime of the container as you might expect it to with a memory leak.
|
|
|
|
|
Well, it turns out it's probably a memory issue and not a memory access issue after all. Partly on a whim, and partly based on a "docker out of memory" thread unrelated to ai-server, I limited the file handles in the docker-compose file as follows:
ulimits:
  nofile:
    soft: 65536
    hard: 65536
and that appears to have resolved or at least mitigated the issue. To be clear, the number of files was unlimited prior to my change. The ai-server has been up more than 4 hours, which is 3 hours 50 minutes longer than it has ever run before. It is happily matching faces using only 3.1 GB of RAM. I have not yet tried to prove that the number of file handles increases until it consumes all of the memory, but I'm wondering if ai-server spends its free time grabbing file handles as fast as it can when they are unlimited.
It's still very curious that nobody else has reported this. Maybe it has to do with Fedora, but it seems to me that Docker running under Fedora should look the same as Docker running under any other distribution from inside the container.
I have some time to do further troubleshooting in the next few days.
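For anyone applying the same workaround, it's easy to confirm the limit actually took effect inside the container (substitute your own container name; the PID 1 assumption only holds if the server is the container's entrypoint):

# soft limit seen by processes inside the container; should now report 65536
docker exec <your-container-name> sh -c 'ulimit -n'
# rough count of file descriptors held by the main process
docker exec <your-container-name> sh -c 'ls /proc/1/fd | wc -l'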
|
|
|
|
|
I've got some answers.
Codeproject.ai-server does, in fact, continuously open new file handles, at a rate of about 120/minute on my system, up to the limit if one exists. If there is no limit, it keeps going until it consumes all system memory. The reason Fedora is different (I think) is that Fedora decided not to impose limits on Docker itself, due to the overhead of enforcing them, and suggests that limits be established on individual containers using cgroups instead. This "out of memory" error would inevitably occur on any distribution that doesn't enforce file limits on Docker by default, which may only be Fedora and Red Hat at this time.
I reduced the open file limit to 1024 on ai-server and observed it for a while. It climbs up to the limit, then bounces back down to about 440 files and starts over. It doesn't crash. The file handles that keep increasing are FIFOs.
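For anyone who wants to watch the same behaviour, this is roughly how I keep an eye on the FIFO count from the host (the pgrep pattern is an assumption about the server's process name, and lsof needs enough privilege to see the container's process):

PID=$(pgrep -f CodeProject.AI.Server | head -n 1)
watch -n 30 "lsof -p $PID 2>/dev/null | grep -c FIFO"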
This is definitely a bug that needs to be addressed.
|
|
|
|
|
We had an issue that eventually led to many file handles / watchers being created at startup. There's a check for this at startup and a warning issued, but as to it creating a bucket load more each second, that's bizarre. It would be handy to know which process is adding the handles: a module or the server itself.
Thanks,
Sean Ewington
CodeProject
|
|
|
|
|
A file handle is left open every time one of these child processes exits:
Quote: futex(0x55d234f6aba4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1671, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
write(62, "\21", 1) = 1
rt_sigreturn({mask=[]}) = 202
futex(0x55d234f6aba4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
futex(0x55d234f6aba4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1675, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
write(62, "\21", 1) = 1
rt_sigreturn({mask=[]}) = 202
futex(0x55d234f6aba4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1677, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
write(62, "\21", 1) = 1
rt_sigreturn({mask=[]}) = 202
futex(0x55d234f6aba4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1680, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
write(62, "\21", 1) = 1
rt_sigreturn({mask=[]}) = 202
futex(0x55d234f6aba4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1682, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
write(62, "\21", 1) = 1
rt_sigreturn({mask=[]}) = 202
|
|
|
|
|
Could you do me a favour please? The process id is given by si_pid (e.g. si_pid=1671). Can you run the following for a process id you've recently spotted?
- Identify the Process with PID 1671:
ps -p 1671 -o comm=
It should spit out the app name
- Identify the Parent Process:
ps -p 1671 -o ppid=
eg output will be '1234'
- Identify the Parent Process Name:
ps -p 1234 -o comm=
It should spit out the parent app name
- List Open Files for Parent Process:
lsof -p 1234
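If it's easier, the four steps above can be rolled into one small script (the script name is just illustrative; pass it a si_pid value from the strace output):

#!/bin/sh
# whospawned.sh: show a child process, its parent, and the parent's open files
# usage: ./whospawned.sh 1671
PID="$1"
echo "Process name:"
ps -p "$PID" -o comm=
PARENT=$(ps -p "$PID" -o ppid= | tr -d ' ')
echo "Parent PID: $PARENT"
echo "Parent name:"
ps -p "$PARENT" -o comm=
echo "Parent's open files:"
lsof -p "$PARENT"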
cheers
Chris Maunder
|
|
|
|
|
Chris, I already tried to figure out what was starting all those processes, but they don't last long enough. I have never seen one of the additional processes even with a ps aux. However, there may be another way to answer the question. I had already turned off all of the modules except Face Processing, so I turned that one off too. With no modules active, the FIFO file handles continue to accumulate. For what it's worth, lsof attributes all of the FIFOs to CodeProject.
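(If anyone wants to try to catch those short-lived children in the act, the only approach I can think of is to attach strace to the server and log every exec as it happens; the pgrep pattern is an assumption about the server's process name, and this needs to be run as root:

strace -f -e trace=execve -p "$(pgrep -f CodeProject.AI.Server | head -n 1)" 2>&1 | grep execve

That should print the command line of each child the moment it is spawned, even if it exits immediately.)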
|
|
|
|
|
I am more than happy to help troubleshoot in any way that I can. I suspect, though, that any server running from the same Docker image is doing the same thing. I created a completely independent RPi instance using an RPi 4 with 8 GB RAM and a newly downloaded image (Linux pi8-rpi 6.6.31+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux). I added Docker, downloaded codeproject/ai-server:rpi64, and set up the docker-compose file exactly like the example on your site. In other words, it is completely vanilla.
The open files limit (by default) is 1048576. The CodeProject.AI.Server.dll process is doing exactly the same thing it does in my Fedora environments and at about the same rate:
Quote: futex(0x55a20d65b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=237742, si_uid=0, si_status=0, si_utime=0, si_stime=2} ---
write(64, "\21", 1) = 1
rt_sigreturn({mask=[]}) = 367791007160
The number of FIFO file handles increases until it hits the limit, then drops back to around 800 (possibly lower, since I'm probably not catching the minimum).
|
|
|
|
|
It does make sense that it would be the server, but thanks for confirming that.
Do you have the Explorer and/or dashboard open when you're seeing file handles grow? If so, and if you close both, do the file handles stabilise?
My guess is it's TCP/IP connections. The question is: where?
cheers
Chris Maunder
|
|
|
|
|
Whether the dashboard is open has no effect on the generation of the file handles. In fact, neither does activity. The number of file handles grows at the same rate on a newly started server with no clients and no GUI connection. On a system with no ulimit set, a server will crash when memory runs out even if it has no interaction at all with the outside world. I can't absolutely confirm this, but the growth seems to be exponential rather than linear, or at least inconsistent, with a huge increase just before the host OS shuts it down.
I would add one more thing because I think it is related. There are several issues on this site related to ai-server becoming unresponsive. I have seen the same thing, and when it happens, the server stops spewing out new processes and FIFO file handles. It isn't an inevitable consequence of uptime, and I have not been able to figure out what, if anything, is triggering it.
|
|
|
|
|
Do you have mesh processing enabled? If so, can you disable that please?
cheers
Chris Maunder
|
|
|
|
|
I do not have mesh processing enabled. Have you checked one of your own servers for this issue? Since it happened on my isolated vanilla rpi installation, it is probably happening on all instances.
|
|
|
|
|
Is this bug going to be fixed?
|
|
|
|
|
We absolutely want to have this issue fixed, but unfortunately right now we're extremely time constrained.
General call:
Anyone else good with identifying file handle leaks in .NET apps in Linux?
cheers
Chris Maunder
|
|
|
|
|