Message boards : Number crunching : Progress going backwards?
Author | Message |
---|---|
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Here's a glitch I haven't seen before. I happened to be looking at another one of those tasks that gets stuck over 8 hours. This machine has two of them right now, both approaching 12 hours. I wouldn't mind that so much except that, as near as I can tell, the 12-hour tasks don't get much if any extra credit for the time. Suddenly the progress of one of the tasks dropped by about 10%, from around 98% to 88%. After that, it started to make rapid upward progress, in the normal jumps rather than the 0.001% jumps of a stuck task. A few minutes later, both of them cleared, so I didn't see exactly what happened. As noted before, the credit is not that important. However, the apparent bugginess of some parts of the system (for calculating the progress and the credit) casts doubt on the more important parts of the system that are supposed to be calculating significant scientific results. Why should other scientists believe that, since the known bugs don't matter much, there aren't other bugs of greater consequence? #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
What work units are having issues? It may be specific to the particular job(s). Maybe a large protein? Or particular score filters that are failing more often than usual for the particular protein. Or maybe BOINC checkpointing is not properly implemented for the particular protocol. The core molecular modeling software is developed, tested, and used by many academic institutions around the world through the Rosetta Commons. Rosetta, including its source code, is freely available to academics. I would not jump to the conclusion that the issues you mention also reflect on the science. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Well, it might be a BOINC-level problem since it affects many of the work units, even including some of the rb units that used to run in 4 hours. These days there are occasional work units under 8 hours, but rarely. When I've looked at those fast units, the credit usually seems to be appropriately reduced in accord with their shorter run times. For a while I was trying to see if there was some specific project name associated with the 12-hour units, but couldn't figure out any pattern. Some of the tasks just go into a slow progress mode with a remaining time around 10-1/2 minutes. The progress will be advancing in very small increments, usually 0.001% at a time, about every 10 or 15 seconds. The remaining time just stays constant, with an occasional 1-second flick of a smaller time. Usually it goes down by one second and then it flips right back up. The checkpoint problems are different, but continuing. Haven't noticed as many of them these days as I used to, but I've also stopped paying so much attention. There are definitely times when I find that some task has not been checkpointed for a long time, but usually they are within 5 minutes, except at the beginning, when it often takes longer for the first checkpoint. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Do you have a checkmark for the "Leave applications in memory while suspended" computing preference? If not, and BOINC Manager decides to transition to another project, the task may lose progress when suspended. I would expect such a loss would not be displayed until the task is resumed. Rosetta Moderator: Mod.Sense |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
I'm guessing you [Mod.Sense] mean "Leave non-GPU tasks in memory while suspended" on the "Disk and memory" tab. It is checked, but the task was NOT suspended when I noticed the drop in progress. I'm also hard pressed to imagine how it could have lost progress even if it had been suspended. Is the progress somehow related to time rather than being a simple metric of work completed? The remaining time is obviously related to time and I can see how suspension might confuse that one, but it's already an obviously flaky and nonlinear metric of whatever it's estimating. Current annoyance is actually the checkpointing, especially on this machine. Whenever I want to shut it down, it seems like at least one of the active tasks has a large time since the last checkpoint. Right now I have a task that is almost half finished as it approaches 5 hours, but it's been more than 1-1/2 hours since the last checkpoint. (This one is a nRoCM... task, if that's worth knowing.) Can't suspend this machine because it's a cross-booter and I need the other OS sometimes. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
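
One way a progress figure could fall from about 98% to 88%, in line with the checkpoint explanations raised in the thread, is a task resuming from its last saved checkpoint after being removed from memory. The sketch below is a toy model only: the `Task` class, its fields, and the step counts are hypothetical and are not taken from BOINC or Rosetta. It assumes progress is a plain fraction of steps completed and that a restart discards any work done since the last checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical work unit; illustrative only, not BOINC or Rosetta code."""
    total_steps: int
    checkpoint_every: int
    done: int = 0          # steps completed in memory
    checkpointed: int = 0  # steps saved at the last checkpoint

    def run(self, steps: int) -> None:
        """Advance the task, saving a checkpoint every `checkpoint_every` steps."""
        for _ in range(steps):
            if self.done >= self.total_steps:
                break
            self.done += 1
            if self.done % self.checkpoint_every == 0:
                self.checkpointed = self.done

    def restart(self) -> None:
        """Simulate losing in-memory state: resume from the last checkpoint."""
        self.done = self.checkpointed

    @property
    def progress(self) -> float:
        """Progress as a percentage of steps completed."""
        return 100.0 * self.done / self.total_steps


task = Task(total_steps=1000, checkpoint_every=440)
task.run(980)
before = task.progress   # work done just before the restart
task.restart()
after = task.progress    # work recoverable from the last checkpoint
print(f"{before:.1f}% -> {after:.1f}%")
```

In this model the size of the drop depends only on how long ago the last checkpoint was written, so a task with sparse checkpoints (like the one reported above with more than 1-1/2 hours since its last checkpoint) would lose correspondingly more.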
©2024 University of Washington
https://www.bakerlab.org