Test4Theory

Made a small script to keep an eye on Theory jobs with correct % done

14 hours 32 minutes ago
This might be helpful -
/var/lib/boinc/slots/2/stderr.txt
### trailing progress lines of the first run - ###
2025-06-07 19:02:25 (39258): Status Report: Job Duration: '864000.000000'
2025-06-07 19:02:25 (39258): Status Report: Elapsed Time: '90000.000000'
2025-06-07 19:02:25 (39258): Status Report: CPU Time: '88924.030000'
2025-06-07 20:41:27 (39258): Status Report: Job Duration: '864000.000000'
2025-06-07 20:41:27 (39258): Status Report: Elapsed Time: '96000.000000'
2025-06-07 20:41:27 (39258): Status Report: CPU Time: '94870.180000'
2025-06-07 20:55:39 (39258): Stopping VM.
2025-06-07 20:55:46 (39258): Successfully stopped VM.
### 5 hour pause ###
### second run start - ###
2025-06-08 02:12:31 (2771): vboxwrapper version 26210
2025-06-08 02:12:31 (2771): BOINC client version: 8.0.4
2025-06-08 02:12:34 (2771): Detected: VirtualBox VboxManage Interface (Version: 7.1.8)
### vbox/start-up posts omitted ###
### first 6 progress line of second run - ###
2025-06-08 02:12:46 (2771): Status Report: Job Duration: '864000.000000'
2025-06-08 02:12:46 (2771): Status Report: Elapsed Time: '96860.000000'
2025-06-08 02:12:46 (2771): Status Report: CPU Time: '95722.470000'
2025-06-08 03:51:48 (2771): Status Report: Job Duration: '864000.000000'
2025-06-08 03:51:48 (2771): Status Report: Elapsed Time: '102860.000000'
2025-06-08 03:51:48 (2771): Status Report: CPU Time: '101672.980000'
### last 3 progress lines - ###
2025-06-08 12:07:02 (2771): Status Report: Job Duration: '864000.000000'
2025-06-08 12:07:02 (2771): Status Report: Elapsed Time: '132860.000000'
2025-06-08 12:07:02 (2771): Status Report: CPU Time: '131240.800000'
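
For anyone who wants to pull these numbers out of stderr.txt themselves, here is a minimal Python sketch. It is not the script mentioned in the thread title. The function names are made up for illustration, and treating Elapsed Time divided by Job Duration as a percentage is only one possible reading of the Status Report lines (it shows how much of the job duration limit has been used, not necessarily how far the job really is).

```python
import re

# Matches vboxwrapper lines such as:
# 2025-06-08 12:07:02 (2771): Status Report: Elapsed Time: '132860.000000'
STATUS_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \(\d+\): "
    r"Status Report: (?P<key>[^:]+): '(?P<value>[0-9.]+)'"
)


def latest_status(lines):
    """Return the most recent value seen for each Status Report key."""
    latest = {}
    for line in lines:
        match = STATUS_RE.match(line.strip())
        if match:
            latest[match.group("key")] = float(match.group("value"))
    return latest


def percent_done(status):
    """Elapsed Time as a percentage of Job Duration (assumed meaning).

    Returns None when either value is missing or the duration is zero.
    """
    duration = status.get("Job Duration")
    elapsed = status.get("Elapsed Time")
    if not duration or elapsed is None:
        return None
    return 100.0 * elapsed / duration


def read_status(path):
    """Read a stderr.txt file (hypothetical path) and return its latest status."""
    with open(path, errors="replace") as handle:
        return latest_status(handle)
```

Calling read_status() on a slot's stderr.txt and passing the result to percent_done() gives one number per task. Later lines overwrite earlier ones, so a restart like the one in the log above does not confuse it.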

New tasks all failing?

1 month ago
Most Theory tasks take a very short time to finish (less than a minute), and the Jobs chart shows a high failure rate (~60-70%). How can I tell whether a task has duly been completed?

2025-05-02 11:47:07 (10908): Guest Log: Environment HTTP proxy: not set
2025-05-02 11:47:08 (10908): Guest Log: job: htmld=/var/www/lighttpd
2025-05-02 11:47:08 (10908): Guest Log: job: unpack exitcode=0
2025-05-02 11:48:40 (10908): Guest Log: job: run exitcode=1
2025-05-02 11:48:40 (10908): Guest Log: job: diskusage=5704
2025-05-02 11:48:40 (10908): Guest Log: job: logsize=4 k
2025-05-02 11:48:40 (10908): Guest Log: job: times=
2025-05-02 11:48:40 (10908): Guest Log: 0m0.002s 0m0.004s
2025-05-02 11:48:40 (10908): Guest Log: 0m0.424s 0m0.248s
2025-05-02 11:48:40 (10908): Guest Log: job: cpuusage=1
2025-05-02 11:48:40 (10908): Guest Log: Job Finished
2025-05-02 11:48:40 (10908): Guest Log: boinc_shutdown called with exit code 0
2025-05-02 11:48:40 (10908): Guest Log: sd_delay: 845
2025-05-02 11:48:40 (10908): Guest Log: ETA: 2025-05-02 10:02:44 UTC
2025-05-02 12:02:45 (10908): VM Completion File Detected.
2025-05-02 12:02:45 (10908): Powering off VM.
2025-05-02 12:02:45 (10908): Successfully stopped VM.
2025-05-02 12:02:45 (10908): Deregistering VM. (boinc_1114d9ba70cb8796, slot#0)
2025-05-02 12:02:46 (10908): Removing network bandwidth throttle group from VM.
2025-05-02 12:02:46 (10908): Removing VM from VirtualBox.
2025-05-02 12:02:51 (10908): called boinc_finish(0)
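
In the log above, the wrapper ends with boinc_finish(0) even though the guest reported "job: run exitcode=1", so the wrapper's exit status alone does not answer the question. A rough Python sketch for checking both values follows. The function names are hypothetical, and reading a non-zero run exitcode as "the job did not produce results" is an assumption, not something the project has confirmed here.

```python
import re

# Exit code of the job inside the VM, as printed in the Guest Log.
RUN_EXIT_RE = re.compile(r"Guest Log: job: run exitcode=(-?\d+)")
# Exit code the wrapper hands back to the BOINC client.
FINISH_RE = re.compile(r"called boinc_finish\((-?\d+)\)")


def job_outcome(log_text):
    """Extract both exit codes from a stderr log; None if a line is absent."""
    run = RUN_EXIT_RE.search(log_text)
    finish = FINISH_RE.search(log_text)
    return {
        "run_exitcode": int(run.group(1)) if run else None,
        "boinc_finish": int(finish.group(1)) if finish else None,
    }


def job_succeeded(log_text):
    """Assumption: only a run exitcode of 0 counts as a completed job."""
    return job_outcome(log_text)["run_exitcode"] == 0
```

Fed with the log above, job_outcome() returns a run exitcode of 1 and a boinc_finish value of 0, which matches the "short task, but no real result" pattern described in the post.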

ERROR: failed to run pythia8 8.313

1 month 2 weeks ago
I will note that not all pythia8 tasks are failing. Very rarely do I get to sit down and look at what the VM is actually doing. But I've found some tasks give that error and some that don't.

CP5-CR2 is generating events
default-CD is generating events
default-noRap is generating events

qcdcr0 failed (by failed I mean they didn't generate any events)
tune-A2 failed (I got 2 of them that failed)
tune-AU2 failed
tune-AU2lox failed
vincia-default failed (X2)
ropes failed

I haven't gotten any pythia6, sherpa or herwig tasks so I don't know about those.

Windows 10 Theory task stalled or ...?

1 month 3 weeks ago
2025-04-16 20:55:35 (9456): Guest Log: [INFO] Excerpt from "cvmfs_config stat": VERSION HOST PROXY
2025-04-16 20:55:35 (9456): Guest Log: [INFO] 2.7.2.0 http://s1ihep-cvmfs.openhtc.io:8080 http://192.168.1.125:3128
These lines confirm:
- that openhtc.io is used for CVMFS (good!)
- that your local proxy 192.168.1.125 is used (even better!)
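
The same check can be done by script. The Python sketch below pairs the header line of the "cvmfs_config stat" excerpt with the values on the following Guest Log line. The function name is made up for illustration, and it assumes the excerpt always has the two-line layout shown above.

```python
def parse_cvmfs_stat(lines):
    """Return e.g. {'VERSION': ..., 'HOST': ..., 'PROXY': ...} or None.

    Assumes the header line is directly followed by one line of values.
    """
    marker = "Guest Log: [INFO] "
    header = None
    for line in lines:
        _, found, payload = line.partition(marker)
        if not found:
            continue
        if 'Excerpt from "cvmfs_config stat":' in payload:
            # Column names follow the first colon of the payload.
            header = payload.split(":", 1)[1].split()
            continue
        if header is not None:
            values = payload.split()
            if len(values) == len(header):
                return dict(zip(header, values))
            header = None  # layout differs from the assumed one
    return None
```

With the two lines quoted above, the result has HOST set to the openhtc.io address and PROXY set to the local proxy at 192.168.1.125, so a check like `"openhtc.io" in result["HOST"]` reproduces the first bullet.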

New native version v300.08

2 months ago
From reading the page you mention, I don't understand whether this BUDA thingy is "only for the server side of things" or also for the BOINC client. Would we no longer need VirtualBox, with BOINC running a packaged BOINC application via Docker on the participant's machine?

New version v300.94

2 months ago
It's a bit more complex.

ATLAS is still using an outdated vboxwrapper (I don't know when they will replace it).

VirtualBox uses the profile set by the first vbox component starting up.
This can be the GUI, vboxmanage or vboxheadless.
The profile is in use until the background process vboxsvc times out, usually a couple of seconds after the components just mentioned are finished.

This means:
When BOINC starts a vbox app using an old vboxwrapper first, the wrong profile location will be used.
When BOINC starts a vbox app using a new vboxwrapper (>=26210) first, the correct profile location will be used.

This is a per user limitation.
So, this affects all vbox tasks from various projects running under the same user account.
Simple rule: first vboxwrapper wins

A temporary workaround could be:
Replace the old vboxwrapper file with the recent one but keep the old name.
In the <options> section of cc_config.xml set <dont_check_file_sizes>1</dont_check_file_sizes>.
Then reload config files.

This shouldn't be used long term as you won't get automatic app updates for all projects any more.
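
For the cc_config.xml step of the workaround, here is a small Python sketch that sets the option without hand-editing the file. The function name is hypothetical, and it assumes the usual layout with an <options> element inside the <cc_config> root. Make a backup of the file before trying anything like this.

```python
import xml.etree.ElementTree as ET


def enable_dont_check_file_sizes(xml_text):
    """Return cc_config.xml content with dont_check_file_sizes set to 1.

    Assumes <cc_config> is the root element; <options> is created if missing.
    """
    root = ET.fromstring(xml_text)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")
```

The function works on text only, so reading the file, writing the result back and reloading the config files remain manual steps. Removing the entry again once the project ships a current vboxwrapper restores the normal file checks.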

Theory task fail "finished with status code 1"

2 months 1 week ago
It seems I have the same issue on an Intel iMac:

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
2025-03-30 16:55:27 (1462): vboxwrapper version 26208
2025-03-30 16:55:27 (1462): BOINC client version: 8.0.2
2025-03-30 16:55:27 (1462): Detected: VirtualBox VboxManage Interface (Version: 7.1.6)
2025-03-30 16:55:27 (1462): Detected: Sandbox Configuration Enabled
2025-03-30 16:55:27 (1462): WARNING: Communication with VM Hypervisor failed.
2025-03-30 16:55:27 (1462): ERROR: VBoxManage list hostinfo failed
2025-03-30 16:55:27 (1462): called boinc_finish(1)

</stderr_txt>
]]>

https://lhcathome.cern.ch/lhcathome/result.php?resultid=420829321

New Version v300.80

2 months 3 weeks ago
[edit] The 10 day estimate is dropping quite fast. Let's hope it reaches realistic values before task finishes.

Most of the newest version tasks all seem to be completing in 2 hours or less, though I do have one right now that's been running for 5-1/2 hours.
The runtime estimate stopped dropping quite soon and stabilized at run time + time left = 240 hours.

New Version v300.70

2 months 3 weeks ago
the latest version seems to work now, but what surprises me is what the log shows after testing the local proxy:

[INFO] /sbin/bootstrap: line 41: nc: command not found
[INFO] 127
[INFO] Local proxy can't be contacted and and will be ignored

how come?

How long may Native-Theory-Tasks run

3 months ago

On which level?

I understand that this is a complicated environment and that a lot of things are outside my understanding. As a user I do everything I can to make it run well, and for me a simple form of error handling is to abort zombie jobs automatically. It does not matter whether it is a user error or an app error: if someone like me can see in the log files that something has gone wrong, it's easy to believe that the system also knows it has gone wrong and should terminate the job. Once terminated, a job has error status and it's easy to check the stderr and see what went wrong, most likely user error :)

Your answers are very much appreciated. The sudoers issue has now been solved; I found https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6075&postid=48978 and hope it's the correct one.

I can't do anything about the cgroups v1 issue, as all modern Linux distributions use v2. It is possible to revert to v1, but that causes a lot of other issues in the system.

I don't know why BOINC tried to pause the jobs. I don't allow any extra jobs to be downloaded, so it should not try: <work_buf_min_days>0</work_buf_min_days> and <work_buf_additional_days>0</work_buf_additional_days>. I never pause any jobs manually; these systems run 24/7 with only LHC and no other BOINC projects.

These are the three jobs I had to cancel, in case you get a minute to check whether there is anything else I should fix:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419879823
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419848816
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419817476

I worked as a sysadmin and software developer 30-35 years ago and have done a lot since. I'm now retired and have this as one of my hobbies, so I'm not totally lost with computers :)