HOAB

History of a bug

Cannot create GC thread despite plenty of memory

Written by gorki. No comments

Problem :

When launching a JVM, I get the message: "Cannot create GC thread. Out of system resources", even though the machine has:

  • Enough memory
  • Enough swap
  • Enough ulimit
  • Enough threads-max
  • Enough CPU

Even extending the PID limit did not help...

Important (see the end): Debian version = 10.11

Solution :

After hours of googling, I found several suggested solutions, but none of them worked and none matched the numbers I had:

  • number of open files < ulimit -n
  • maximum process/tasks < ulimit -u

But in a forum thread, I found something that worked: UserTasksMax.
I'm running systemd, and I have around 10805 tasks running for my user.
From https://manpages.debian.org/stretch/systemd/logind.conf.5.en.html:

UserTasksMax=

Sets the maximum number of OS tasks each user may run concurrently. This controls the TasksMax= setting of the per-user slice unit, see systemd.resource-control(5) for details. If assigned the special value "infinity", no tasks limit is applied. Defaults to 33%, which equals 10813 with the kernel's defaults on the host, but might be smaller in OS containers.

For my suspect PID (a lot of files) :

  • cat /proc/21890/status | grep Thread => 1 thread
  • ls /proc/21890/task | wc
  • confirmed by the usual command : ps -eLf | grep calrisk | wc

I have around 10805 threads running for a single JVM, which is very close to the limit.
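
The thread count above can be reproduced with a small helper. This is a minimal sketch, assuming a Linux /proc filesystem; count_threads is a hypothetical helper name, not something from the original investigation, and the systemctl line in the comments is only a suggestion for reading the per-user limit on a systemd host.

```shell
#!/bin/sh
# Count the kernel tasks (threads) of one process by listing /proc/<PID>/task.
# count_threads is a hypothetical helper name used for illustration.
count_threads() {
    # Each entry under task/ is one thread of the process.
    echo $(( $(ls "/proc/$1/task" | wc -l) ))
}

# Example on the current shell, which normally has a single thread.
echo "threads of PID $$: $(count_threads "$$")"

# To compare with the per-user task limit on a systemd host (not run here):
#   systemctl show -p TasksMax "user-$(id -u).slice"
```

Running it against the suspect JVM PID instead of `$$` gives the number to compare with the TasksMax value.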

Complete guide :

https://www.journaldufreenaute.fr/nombre-maximal-de-threads-par-processus-sous-linux/

This parameter is not present in every version of the man page, and the default could grow up to 12288 on the latest version.

To be checked!

 

 

Debian upgrade and lost network

Written by gorki. No comments

Problem :

I manage a dedicated server at OVH and I upgraded its Debian from jessie to buster. The upgrade seemed to go quite well, so I restarted the server.

The reboot failed: the server was unreachable. Fortunately, OVH rescue mode allowed me to log in.

I checked the error logs and at first got lost in RAID error messages, but it was simpler than that.

Solution :

I checked the /etc/network/interfaces file: it was OK.

I checked the log files, cleaned up, rebooted and checked again: still OK, except that the network was unreachable for named.

I finally remembered that Debian switched to systemd in recent versions, so I tried to create the systemd networking file manually: too complicated, and it did not work.

In rescue mode, you access your files through a mount point, so the usual commands such as systemctl do not work.

The solution was to chroot a shell :

  1. mkdir /mnt/md2
  2. mount /dev/md2 /mnt/md2
  3. chroot /mnt/md2 bash
  4. systemctl enable networking

And it works...

Now I have to check all the other systems to be sure that everything is working...

Beginning with:

sudo apt-get update

sudo apt-get clean

sudo apt-get autoremove

sudo apt-get update && sudo apt-get upgrade

sudo dpkg --configure -a

 

systemd and Tomcat hang on startup

Written by gorki. No comments

Problem :

I used robertdebock/ansible-role-tomcat to install a Tomcat instance using Ansible. It worked well until I deployed an application on it; then the Java process hung with 100% system CPU.

Starting Tomcat as the tomcat user, without systemd, worked correctly.

Solution :

I suspected :

  • SELinux
  • Linux limits
  • VM slow I/O

But after a while I ran strace from startup:

  • by modifying the systemd configuration
  • by modifying the catalina.sh configuration

All I got was a simple FUTEX wait...

And then I read the manual. It was as simple as:

strace -f -e trace=all -p <PID>

There is no need to trace from startup and, by default, not everything is traced...

After that it was easy: the process was recursively reading:

/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/8156...

Simply fixing the working_directory in the Ansible role made everything work.
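
To check which working directory a running process really has, the cwd link under /proc can be read directly. This is a minimal sketch, assuming a Linux /proc filesystem; proc_cwd is a hypothetical helper name, and the idea that an unexpected working directory shows up this way is an assumption drawn from the fix above.

```shell
#!/bin/sh
# Print the current working directory of a process from /proc/<PID>/cwd.
# proc_cwd is a hypothetical helper name used for illustration.
proc_cwd() {
    readlink "/proc/$1/cwd"
}

# Example on the current shell.
echo "cwd of PID $$: $(proc_cwd "$$")"
```

Calling it with the Tomcat PID before and after the fix shows whether the service starts in the intended directory.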

Issue reported here.

 

Tomcat, NIO, hanging and CLOSE_WAIT

Written by gorki. No comments

Problem :

We were testing a Spring Boot application on AWS with an ELB in front.

After a while of load-testing, the application was hanging :

  • HTTP 504 error code on the JMeter client
  • HTTP 502 if we raised the ELB timeout
  • Once logged in on the server:
    • telnet localhost 8080 was OK
    • sending GET / on this socket got no response
    • plenty of CLOSE_WAIT sockets
    • wget was also hanging (as expected)
    • the connection was established during the wget hang
    • nothing in the log
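
The CLOSE_WAIT observation can be turned into a number to follow during the load test. This is a minimal sketch: count_tcp_state is a hypothetical helper that reads `ss -tan` style output on stdin, and it assumes the state is in the first column and spelled with a hyphen (CLOSE-WAIT), as ss prints it.

```shell
#!/bin/sh
# Count sockets in a given TCP state from `ss -tan` output read on stdin.
# count_tcp_state is a hypothetical helper name used for illustration.
count_tcp_state() {
    # NR > 1 skips the header line; the first column is the socket state.
    awk -v state="$1" 'NR > 1 && $1 == state { n++ } END { print n + 0 }'
}

# Typical use on the server (requires the ss tool, not run here):
#   ss -tan | count_tcp_state CLOSE-WAIT
```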

 

Solution :

 

I initially suspected the Tomcat keep-alive timeout and connection pool, but:

  1. Spring Boot copies the connectionTimeout parameter to keepAliveTimeout
  2. new sockets were still accepted and established
  3. the CLOSE_WAIT sockets were not closed even after an hour

After running the test many times, I finally saw a classic "Too many open files" in the log. That is why I could not see any more logs during the hang.

So we changed nproc and nofile in /etc/security/limits.conf

And ta-da! Nothing changed in:

cat /proc/<PID>/limits

Thanks to blogs around the world like this one:

  • the service is started by systemd, so limits.conf does not apply to it
  • to override resource limits with systemd:
[Service]
...
LimitNOFILE=500000
LimitNPROC=500000
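
To confirm that the new limit is really applied to the running process, the soft limit can be read back from /proc. This is a minimal sketch, assuming the usual "Max open files" line layout of /proc/<PID>/limits; soft_nofile is a hypothetical helper name, and the systemctl commands in the comments are the standard reload and restart steps, with the service name left as a placeholder.

```shell
#!/bin/sh
# Print the soft "Max open files" limit of a running process.
# soft_nofile is a hypothetical helper name used for illustration.
soft_nofile() {
    # Line layout: "Max open files  <soft>  <hard>  files"
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

# Example on the current shell.
echo "soft nofile limit of PID $$: $(soft_nofile "$$")"

# After changing LimitNOFILE in the unit (not run here):
#   systemctl daemon-reload
#   systemctl restart <service>
#   soft_nofile <PID>
```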

Last but not least, the Tomcat NIO socket queue is around 10000 entries, plus other files, plus other processes... so choose your limit wisely.
