
Would a task restart from scratch after a machine crash?

4 weeks ago
Would the machine crash have caused the Theory task to completely restart? When a PC crashes and/or BOINC is not shut down properly, VirtualBox is not able to save the VM state to disk.
After BOINC starts up again, the task either errors out or, if you're lucky/unlucky, the VM starts the job from the beginning.
The progress of the task in runRivet.log on disk is no longer updated, but the progress can still be seen with BOINC Manager's "Show Graphics".
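
If you want to check that on-disk progress yourself, here is a minimal sketch, assuming a Linux client with the default /var/lib/boinc-client data directory (path and recursive search are assumptions; adjust them to your setup). It prints the last few lines of every runRivet.log found under the slot directories:

    # Show the tail of every runRivet.log below the BOINC slot directories.
    # Assumes the default Linux data directory; run with read access to that tree.
    from pathlib import Path

    BOINC_DATA = Path("/var/lib/boinc-client")   # adjust for your installation

    for log in sorted(BOINC_DATA.glob("slots/**/runRivet.log")):
        print(f"=== {log} ===")
        with log.open(errors="replace") as fh:
            for line in fh.readlines()[-5:]:      # last few lines = current progress
                print(line.rstrip())

If the output stops changing while the VM is still up, that matches the "log no longer updated" behaviour described above.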

BTW: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=238232604 was a resend and the original client returned a valid result a bit too late. I would abort that task.

No new WUs available

1 month ago
1/10/2026 3:16:01 PM | LHC@home | No tasks are available for CMS Simulation

(After trying for an hour I finally got the one I was trying to get.) This seems to be a common problem lately. It also happens here once in a while. In most cases it then takes between 20 and 30 minutes until tasks finally come in. No idea what's going on.
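
For what it's worth, you can roughly measure how long such a dry spell lasts by filtering the client's event log for the relevant messages. A small sketch, assuming a Linux client whose daemon log is /var/lib/boinc-client/stdoutdae.txt (on Windows the Manager's event log carries the same messages):

    # Show when LHC@home reported "no tasks" and when a scheduler request
    # finally delivered work again. Path and message wording are from a stock
    # Linux BOINC client; adjust for your setup.
    from pathlib import Path

    LOG = Path("/var/lib/boinc-client/stdoutdae.txt")

    for line in LOG.read_text(errors="replace").splitlines():
        if "LHC@home" not in line:
            continue
        if ("No tasks are available" in line
                or "Scheduler request completed: got" in line):
            print(line)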

This gonna be long

1 month 1 week ago
Regarding your two linked tasks: the first one was returned too late, but still before your resend was turned in. The second task's VM may have been restarted several times, possibly starting from scratch each time.


Both tasks are still running and it's not clear whether:
- the first task, which already has a valid result, will grant me credit if I finish it?
- if I don't finish the second task within 11 days, it will get cancelled and I will lose all 11 days of running time? If a third replica gets sent out after 10 days, I don't think anyone will finish it within 24 hours (before my hard deadline), so that shouldn't be an issue.

Edit: after re-reading your reply several times I think I figured out the misunderstanding. In both WUs that I linked, I'm running the resends, not the initial tasks.

Lost in ATLAS...

1 month 3 weeks ago
CMS mostly seems to be working OK.
That's wrong.
Your CMS VMs are running empty tasks without any scientific value.
As said, this is due to an error in CERN's backend queue, which does not send out any scientific jobs.
You can't do anything about it, as it must be fixed by CERN staff after their holidays.

Indicators are:
1. short runtimes (see the sketch after this list)
2. CMS Grafana pages:
https://lhcathome.cern.ch/lhcathome/cms_job.php

https://monit-grafana.cern.ch/d/o3dI49GMz/cms-job-monitoring-12m?viewPanel=49&orgId=11&var-group_by=CMS_JobType&var-Tier=All&var-CMS_WMTool=All&var-CMS_SubmissionTool=All&var-CMS_CampaignType=All&var-Site=T3_CH_Volunteer&var-Site=T3_CH_CMSAtHome&var-Type=All&var-CMS_JobType=All&var-CMSPrimaryDataTier=All&var-adhoc=data.RecordTime%7C%3E%7Cnow-7d&var-ScheddName=All&from=now-7d&to=now
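
To check indicator 1 on your own host, you can scan the client's per-project job log for CMS tasks with very short elapsed times. A rough sketch, assuming the default Linux data directory, the usual job_log_<project>.txt file naming and the "nm <name> ... et <elapsed>" field layout BOINC writes there; adjust path, name filter and threshold for your setup:

    # Flag CMS tasks whose elapsed time looks too short to contain a real job.
    from pathlib import Path

    JOB_LOG = Path("/var/lib/boinc-client/job_log_lhcathome.cern.ch_lhcathome.txt")
    SHORT = 600.0   # seconds; the threshold is a guess, tune it for your hardware

    for line in JOB_LOG.read_text().splitlines():
        tokens = line.split()
        fields = dict(zip(tokens[1::2], tokens[2::2]))  # "key value" pairs after the timestamp
        name = fields.get("nm", "")
        elapsed = float(fields.get("et", 0))
        if name.startswith("CMS") and elapsed < SHORT:   # adjust if your CMS tasks are named differently
            print(f"{name}: only {elapsed:.0f} s elapsed - probably an empty job")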


If you want to deliver work with scientific value, switch to Theory.

New 1000-event tasks

2 months ago
Same here. More than a dozen of my tasks got cancelled while running for hours (some >50% in progress). Some got cancelled before they started running, and I'm fine with that.

In addition, I got tasks with validation errors, but those only ran for a few minutes, so that's not as bad compared to the ones that had already been running for hours when they got cancelled.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=237892957
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=237896161
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=237891554

Event count less easily monitored: eventLoopHeartBeat.txt stays stuck.

2 months 3 weeks ago
Hi!
I've been away for a while. Now I see that the file eventLoopHeartBeat.txt in the [...]/boinc-client/slots/?*/PanDA_Pilot-* directory is no longer constantly updated, so it always reports "1 event read so far". It's possible to find multiple updated eventLoopHeartBeat.txt files, one for each worker, in the [...]/boinc-client/slots/?*/PanDA_Pilot-*/athenaMP-workers-EVNTtoHITS-sim/worker_?* subdirs. However, you have to sum up the number of events to get the total...
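
Until that is fixed, summing the per-worker counts can be scripted. A small sketch, assuming the default Linux slots path and that each worker's eventLoopHeartBeat.txt carries the number of processed events as the first integer in the file; adjust the glob and the parsing to what the files on your host actually contain:

    # Sum the event counts reported by the athenaMP workers of all running ATLAS tasks.
    import re
    from pathlib import Path

    SLOTS = Path("/var/lib/boinc-client/slots")   # adjust for your installation
    PATTERN = "*/PanDA_Pilot-*/athenaMP-workers-EVNTtoHITS-sim/worker_*/eventLoopHeartBeat.txt"

    total = 0
    for hb in SLOTS.glob(PATTERN):
        m = re.search(r"\d+", hb.read_text(errors="replace"))
        if m:
            events = int(m.group())
            total += events
            print(f"{hb.parent.name}: {events} events")
    print(f"total: {total} events")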

I don't think this has been done on purpose, am I wrong?
--
Bye, Lem

Hung Theory task?

2 months 3 weeks ago
There's no 'obvious error' reported back to the project.
In cases like that, no log file from the scientific app is sent back to the project.
Hence, there is nothing to analyse and the task is either marked as 'failed' or 'lost' after the due date.

Even the log snippets you posted do not clearly explain if/why the tasks got stuck.

So, how should the project decide what caused the failure?
It could be any of the following (list may be incomplete):
- hardware
- the OS
- VirtualBox
- BOINC
- vboxwrapper
- data from CVMFS
- scientific app

From the project's perspective there's only the overall task failure rate for the computer itself.
As already mentioned, for this computer it is less than 1 %, covering all possible reasons.

Theory CPU Scheduling oddness

3 months 2 weeks ago
This is a bug in VirtualBox 7.2.4.

On a computer with AMD CPU there's no known workaround so far.
...
After more testing...
Looks like the downgrade left the 7.2.4 kernel module on the system.
It now works after a cleanup and a fresh 7.2.2 installation (package from VirtualBox).

The kvm_amd module must remain blacklisted.
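
To verify that cleanup, something like the following sketch can help (Linux only; the module names come from the post above, and the blacklist file path mentioned in the comment is just the usual example, not a fixed requirement):

    # Check that kvm_amd stays out of the kernel and which vboxdrv build modprobe would load.
    import subprocess
    from pathlib import Path

    loaded = Path("/proc/modules").read_text().splitlines()
    for mod in ("kvm_amd", "vboxdrv"):
        state = "loaded" if any(l.startswith(mod + " ") for l in loaded) else "not loaded"
        print(f"{mod}: {state}")

    # modinfo reports the version of the module file that would be loaded;
    # after the cleanup this should show the fresh 7.2.2 build, not 7.2.4.
    ver = subprocess.run(["modinfo", "-F", "version", "vboxdrv"],
                         capture_output=True, text=True).stdout.strip()
    print(f"vboxdrv module version: {ver or 'not installed'}")

    # To keep kvm_amd blacklisted, a file in /etc/modprobe.d/ containing
    # "blacklist kvm_amd" is the usual approach.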