On which level?
I understand that this is a complicated environment and that a lot of it is beyond my understanding. As a user I do everything I can to make it run well, and for me a simple form of error handling would be to abort zombie jobs automatically. It does not matter whether it is a user error or an app error: if someone like me can see in the log files that it has gone wrong, it's easy to believe that the system knows it has gone wrong and should terminate the job. Once terminated, a job has an error status and it's easy to check stderr and see what has gone wrong, most likely user error :)
Your answers are very much appreciated. The sudoers issue has now been solved after I found
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6075&postid=48978 I hope it's the correct one.
I can't do anything about the cgroups v1 issue, as all modern Linux distributions use v2. It is possible to revert to v1, but that causes a lot of other issues in the system.
I don't know why BOINC has tried to pause the jobs. I don't allow any extra jobs to be downloaded, so it should not try:
<work_buf_min_days>0</work_buf_min_days>
<work_buf_additional_days>0</work_buf_additional_days>
I never pause any jobs manually; these systems run 24/7 with only LHC, no other BOINC projects.
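For reference, a minimal sketch of how those two settings would sit in a local override file. This assumes they are set through global_prefs_override.xml in the BOINC data directory (they could equally come from the web preferences); the wrapping element is the standard one for that file, and the values are the ones quoted above.

```xml
<global_preferences>
   <!-- Keep no work buffer: do not fetch jobs ahead of need -->
   <work_buf_min_days>0</work_buf_min_days>
   <work_buf_additional_days>0</work_buf_additional_days>
</global_preferences>
```

If the file is edited by hand, the client has to re-read its local preferences (or be restarted) before the change takes effect.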
These are the three jobs I had to cancel, in case you get a minute to check whether there is anything else I should fix:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419879823
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419848816
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419817476
I worked as a sysadmin and software developer 30-35 years ago and have done a lot since. I'm now retired and have this as one of my hobbies, so I'm not totally lost with computers :)