Message boards : Number crunching : Work units get stuck in some CPU loop and take days....
Author | Message |
---|---|
Dtallguy® Joined: 15 Jan 12 Posts: 3 Credit: 32,566,988 RAC: 0 |
I've had to cancel 3 work units in the last few days because they were past 24 hours of apparently wasted CPU time; the progress is not advancing and their remaining time keeps going UP!!! What is going on here? Dtallguy |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
If you see such behavior again, please select the work unit and display its properties, and see what shows there for CPU time (as compared to the "wall-clock" time that is shown in the BOINC Manager). Tasks are set up to finish based on the actual CPU time they are given. If your machine has higher-priority work going on or whatever, a WU can be active for hours and not really get any CPU. You should also display the Windows Task Manager (or top if you're on Linux) and see what the top CPU usage on the machine is. Perhaps some other application is stuck in a loop and running at a higher priority. If you were not aware of it, Rosetta@home has a Rosetta-specific preference for you to specify a preferred task runtime. This is just a target, not a guarantee. But if you set it to the 3 hour default, or higher, it generally comes in pretty close. The maximum you can set is 24 hrs. And if a task tries to run longer than that by more than 4 hrs, it will be ended. But again, this is based on actual CPU time, not run time. Also, if there is a pattern to the names of the tasks, that would be good to know. Do you have other tasks running normally? Rosetta Moderator: Mod.Sense |
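To make the CPU time versus wall-clock distinction above concrete, here is a minimal Python sketch. It is not BOINC or Rosetta code; the function names (`measure`, `sleeping`, `busy`) are made up for illustration. It times a piece of work on both clocks: a process that is mostly waiting racks up wall-clock time while its CPU time barely moves, which is what a task that is not getting the CPU would look like.

```python
import time

def measure(work):
    """Run work() and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time actually used by this process
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

def sleeping():
    # Waiting: wall-clock time passes, almost no CPU time is consumed.
    time.sleep(0.2)

def busy():
    # Computing: CPU time advances along with wall-clock time.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

if __name__ == "__main__":
    for name, work in (("sleeping", sleeping), ("busy", busy)):
        wall, cpu = measure(work)
        print(f"{name}: wall={wall:.2f}s cpu={cpu:.2f}s")
```

On an otherwise idle machine the `busy` case should show the two numbers close together, while the `sleeping` case shows wall-clock time well ahead of CPU time.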
Dtallguy® Joined: 15 Jan 12 Posts: 3 Credit: 32,566,988 RAC: 0 |
Thanks for getting back to me so quickly! I have a quad core and all 4 cores are actively working Rosetta work units. This machine is idle while I'm at work (partly why I wanted to get Rosetta running on it). There DID appear to be similar names to the tasks, which have now, unfortunately, been deleted as I have a new batch of work units. All other work units are well within the 3 hour target timeline. I'll keep a close eye on this and if I see another workunit taking abnormal time again, I'll post the actual work unit number for review. All other functions on the computer are working fine and I'm playing music and surfing with no appreciable difference. When I'm away from the PC the monitor is set to shut off after 5 minutes and no other activities are left open, so Rosetta/Boinc has full use of the resources. Thanks again! Dtallguy® |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
This sounds as if it might be the same issue that's cropped up before (see, for example, this thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5815). The workaround is to quit and restart BOINC. It would be really nice if this problem got fixed, but its irreproducible nature must make it very hard to track down. |
Dtallguy® Joined: 15 Jan 12 Posts: 3 Credit: 32,566,988 RAC: 0 |
One of my older machines (3.4G Single Core) was also showing goofy times (24.2 hours elapsed and 88.3 hours remaining, with the percentage done at 21%). The actual CPU time was just over 2 hours. I stopped all processes and restarted Boinc and, lo and behold, the values got sorted out and everything appears normal now. So yes, this appears to be an issue with the Boinc client itself. I'll continue monitoring this and update if I find anything new. |
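As a rough worked example using the figures reported above, the Python sketch below works out how little of the elapsed time was real CPU time. The 2.0 hours is an approximation of "just over 2 hours", and both helper names are hypothetical. The second helper is a plain linear extrapolation, included only for comparison; it is not a claim about how the BOINC client computes its remaining-time estimate.

```python
def cpu_fraction(cpu_hours, elapsed_hours):
    """Share of elapsed (wall-clock) time that was actual CPU time."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return cpu_hours / elapsed_hours

def naive_remaining(elapsed_hours, fraction_done):
    """Linear extrapolation: projected total time minus time already spent."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done - elapsed_hours

if __name__ == "__main__":
    # 24.2 h elapsed, roughly 2 h of CPU time (approximation), 21% done.
    print(f"CPU share of elapsed time: {cpu_fraction(2.0, 24.2):.1%}")
    print(f"Naive remaining estimate: {naive_remaining(24.2, 0.21):.1f} h")
```

With those inputs the CPU share comes out at roughly 8%, and the naive extrapolation from elapsed time gives about 91 hours, in the same ballpark as the 88.3 hours shown in the manager.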
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,181,510 RAC: 3,269 |
Dtallguy® wrote: "One of my older machines (3.4G Single Core) was also showing goofy times (24.2 hours elapsed and 88.3 hours remaining with the percentage done - 21%) The actual CPU time was just over 2 hours. I stopped all processes and restarted Boinc and lo and behold the values got sorted out and everything appears normal now. So yes, this appears to be an issue with the Boinc client itself. I'll continue monitoring this and update if I find anything new." This is a LONG STANDING quirky issue that pops up so rarely the powers that be haven't been able to fix it yet. There are almost 2.5 million of us active crunchers, across MANY projects, so it SEEMS to happen a lot, but it really is rare. IF you can MAKE it happen, please say what you did; otherwise it is one of those things that will get fixed one day. The easy 'fix' right now is to exit Boinc and then restart it, and all of a sudden it is okay again, as you noticed. |
©2024 University of Washington
https://www.bakerlab.org