Aggregator

Theory simulation takes way too long

1 month 3 weeks ago
Another example of how waiting is part of getting a Valid Theory task when you watch the running log.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=400796623

I admit this doesn't happen that often; the last time I got one this long or longer was back at Test4Theory.

Computer ID 10451775
Run time 7 days 17 hours 36 min 8 sec
CPU time 7 days 16 hours 26 min 13 sec
Validate state Valid
Credit 6,451.85

21:22:17 CET +01:00 2024-01-25: cranky-0.1.4: [INFO] mcplots runspec: boinc pp z1j 13000 75 - pythia8 8.244 CP1-CR1 100000 66
13:20:35 CET +01:00 2024-01-31: cranky-0.1.4: [INFO] Container 'runc' finished with status code 0.

Computer ID 10816264
Run time 5 days 13 hours 41 min 0 sec
CPU time 2 days 18 hours 47 min 18 sec
Validate state Valid
Credit 6,383.10

Yes, waiting up to a maximum of 10 days for Theory tasks is possible.
I don't know what causes the difference between CPU time and run time, though.
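For illustration, converting the reported times of the second host to seconds shows it used only about half of its wall-clock time as CPU time, while the first host was close to 100%:

# Illustrative shell arithmetic based on the figures quoted above for host 10816264
run=$(( 5*86400 + 13*3600 + 41*60 + 0 ))    # run time: 5 d 13 h 41 min 0 s  = 481260 s
cpu=$(( 2*86400 + 18*3600 + 47*60 + 18 ))   # CPU time: 2 d 18 h 47 min 18 s = 240438 s
awk -v c="$cpu" -v r="$run" 'BEGIN { printf "CPU/run = %.2f\n", c/r }'    # prints 0.50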

Disk usage limit exceeded

1 month 4 weeks ago
Yes, sherpa can be that way for sure, but I do get Valid ones with them once in a while. Most of the nice long Theory tasks tend to be pythia 6 or 8, and many of those validate after 7 days; I tend to save some of those when I catch them.
I watch and check my VB tasks 99.8% of the time. (I should find out which host I have run the Atlas tasks on, since they don't get saved in our account stats here like at -dev, so I hope I can look in the BOINC files, since Win 11 can be a pain.) I just started a new batch and one is a sherpa, so if anything happens I will switch on over to the Theory page.

No such file or directory

1 month 4 weeks ago
Your computer runs ATLAS native, which requires a local CVMFS client to be installed and correctly configured.

That's what your early logs state:
[2024-01-26 22:35:03] ** It looks like CVMFS is not installed on this host.


More recent logs state this:
[2024-01-28 13:23:34] Checking for CVMFS
[2024-01-28 13:23:34] Probing /cvmfs/atlas.cern.ch... Failed!
[2024-01-28 13:23:34] Probing /cvmfs/atlas-condb.cern.ch... Failed!

Stop requesting fresh work until you have solved the CVMFS misconfiguration.
Otherwise all tasks will fail.
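Once a CVMFS client is installed, a quick local check with the standard cvmfs_config tool (shipped with the CVMFS packages) should look roughly like this:

# Probe the repositories ATLAS native needs; both should report OK
cvmfs_config probe atlas.cern.ch atlas-condb.cern.ch
# Check the overall client configuration for common mistakes
cvmfs_config chksetup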

Get help here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5594
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5595

atlas error

2 months ago
2024-01-25 18:53:43 (28384): Guest Log: [INFO] Probing /cvmfs/atlas.cern.ch... OK
2024-01-25 18:53:43 (28384): Guest Log: [INFO] Detected branch: prod
2024-01-25 18:56:39 (28384): Guest Log: [DEBUG] Failed to copy ATLASJobWrapper-prod.sh
2024-01-25 18:56:39 (28384): Guest Log: [DEBUG] VM early shutdown initiated due to previous errors.

Thank you and goodbye!

2 months 2 weeks ago
Given the sporadic nature and different characteristics of each batch, I'm hopeful that the prod jobs are released by a human for actual science. It would be good to confirm that for sure.

Buggy workunit

2 months 2 weeks ago
I have had many long ones over the years, but this one was here a couple of months ago.

Native Theory Application Setup issue

2 months 2 weeks ago
... I suggest being a bit more strict and modifying the original sudoers pattern as follows:
1. As root edit "/etc/sudoers.d/50-lhcathome_boinc_theory_native"
2. Locate the alias "LHCATHOMEBOINC_03"
3. Replace "...runc --root state..." with "...(runc|runc\.new|runc\.old) --root state..." (see the sketch after this list)
4. Save the file
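A sketch of the change described in step 3, assuming the alias is defined as a regular sudoers Cmnd_Alias (the parts abbreviated with "..." are left untouched):

Before:
  Cmnd_Alias LHCATHOMEBOINC_03 = ...runc --root state...
After (also matching the temporary runc.new/runc.old names):
  Cmnd_Alias LHCATHOMEBOINC_03 = ...(runc|runc\.new|runc\.old) --root state...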
As of today the updated script creating the sudoers file is available on the project servers (-dev and -prod).
The modified script now creates the correct command alias.

Volunteers who already modified the sudoers file do not need to run the script again.

Volunteers who run the script again will find a backup of the old sudoers file in /etc/sudoers.d beside the new file.
Feel free to leave the backup there or delete it.

Sudo will automatically ignore the backup file and use the new file as soon as it is available.
Sudo version >= 1.9.10 still remains a requirement.
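To check the installed version:

# Prints the sudo version; it should report 1.9.10 or later
sudo -V | head -n1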

Native Theory Application Setup (Linux only)

2 months 3 weeks ago
Important note!

This thread needs to be revised for Theory native >=300.08.

The method/scripts suggested for suspend/resume should not be used any more, as they require cgroups v1.
Recent Linux systems use cgroups v2 by default.

Theory native >=300.08 supports cgroups v2 by default, while cgroups v1 support is deprecated.


Meanwhile, please use the advice given in this thread to enable suspend/resume on a cgroups v2 based system:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6075&postid=48978
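To check which cgroup version a system uses, one simple test is:

# Prints "cgroup2fs" on a cgroups v2 (unified hierarchy) system and "tmpfs" on a cgroups v1 setup
stat -fc %T /sys/fs/cgroup/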

ConsoleWrap Error

2 months 3 weeks ago
You got 10 CMS tasks, and you probably started all 10 at once. VirtualBox doesn't like starting many machines in the same second.

Maybe the other 6 are still running. Hopefully they are running well.
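If launching many VirtualBox tasks at once keeps causing trouble, one common workaround is to cap how many run in parallel via an app_config.xml in the project directory. This is only a sketch; the <name> element must match the app name your client reports (e.g. in client_state.xml):

<app_config>
  <app>
    <name>CMS</name>                 <!-- placeholder; use the exact app name from your client -->
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>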

Oh, thank you. Yes, the other 6 ran correctly.
I'll keep an eye out.

CMS (vbox) tasks failing

3 months ago
OK, I've noticed two things; neither of them might be relevant, or they could be ... I don't know.

I ran out of Atlas tasks. Every CMS task that failed did so while I also had Atlas tasks (three at a time) running.
I checked the permissions on the Slots folder, and issued "chmod +wrx -R slots" to address a suspected inconsistency.
The next CMS task I allowed to start has been running for over ten hours now, so I started eleven more to push the issue, and those additional tasks have all been running for about an hour without problems.

There are also two long-running Theory Native tasks currently running, and the rest of my 24 threads are taken up with Asteroids tasks (nine), with one thread kept free for system stuff.

Multithreading/Multicore?

3 months 1 week ago
I have no problem with suspending CMS units. My desktop is usually up 24/7 and I only occasionally reboot due to some update that requires it. I assume that by doing this, your tasks are being suspended only for a short time.
As computezrmle wrote above, suspending for up to 2 hours should not be a problem anyway.

Theory Error fail to compile yoda2flat-split

3 months 1 week ago
Thanks, it is good that you reported the error.

Since it is caused by a misconfiguration at a deeper level, it can only be solved by the CERN team providing the affected scientific app.
No action is required on the BOINC side.