Work units get stuck in some CPU loop and take days....

Message boards : Number crunching : Work units get stuck in some CPU loop and take days....

Dtallguy®

Joined: 15 Jan 12
Posts: 3
Credit: 32,566,988
RAC: 0
Message 72209 - Posted: 24 Jan 2012, 3:28:20 UTC

I've had to cancel 3 work units in the last few days because they're up past 24 hours of apparently wasted CPU time: the progress is not advancing and their remaining time keeps going UP!!! What is going on here?

Dtallguy
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72210 - Posted: 24 Jan 2012, 4:28:10 UTC

If you see such behavior again, please select the work unit and display its properties, and see what shows there for CPU time (as compared to the "wall-clock" time that is shown in the BOINC Manager). Tasks are set up to finish based on the actual CPU time they are given. If your machine has higher-priority work going on or whatever, a WU can be active for hours and not really get any CPU.
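To make the CPU time versus wall-clock distinction concrete, here is a minimal Python sketch. It is only an illustration, not anything taken from BOINC or Rosetta: the helper names `measure` and `busy` are made up for this example, and the `sleep` call simply stands in for a stretch where a task is "active" but gets no CPU.

```python
import time


def measure(work, idle_seconds):
    """Run work(), then idle, and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process only
    work()
    time.sleep(idle_seconds)           # stand-in for time spent without any CPU
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def busy():
    """A small amount of real computation."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


wall, cpu = measure(busy, 0.2)
print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

The wall-clock figure includes the idle stretch, while the CPU figure only counts time the process actually spent computing, which is the same kind of gap you would see between elapsed time and CPU time in the task properties.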

You should also display the Windows task manager (or top if you're on Linux) and see what the top CPU usage on the machine is. Perhaps some other application is stuck in a loop and running at a higher priority.

If you were not aware of it, Rosetta@home has a Rosetta-specific preference for you to specify a preferred task runtime. This is just a target, not a guarantee. But if you set it to the 3 hour default, or higher, it generally comes in pretty close. The maximum you can set is 24 hrs. And if a task tries to run longer than that by more than 4 hrs, it will be ended. But again, this is based on actual CPU time, not run time.
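As a rough sketch of that rule (one reading of the description above, not the actual Rosetta code), the check could look like this in Python. The function name `should_end_task` and both constants are hypothetical names chosen for the example, and "longer than that" is taken here to mean longer than the runtime preference.

```python
# Assumed values, taken from the description in this post.
MAX_TARGET_HOURS = 24.0       # largest runtime preference that can be set
WATCHDOG_GRACE_HOURS = 4.0    # "by more than 4 hrs"


def should_end_task(cpu_hours, target_hours):
    """Return True if a task has used more CPU time than target + grace.

    Only CPU time counts; elapsed (wall-clock) time is deliberately
    not an input to this check.
    """
    if not 0 < target_hours <= MAX_TARGET_HOURS:
        raise ValueError("target_hours must be in (0, 24]")
    return cpu_hours > target_hours + WATCHDOG_GRACE_HOURS


# A task with a 3 hour target that has only received 2 hours of CPU
# is left alone, no matter how long it has been sitting there.
print(should_end_task(cpu_hours=2.0, target_hours=3.0))
# The same task after 7.5 hours of CPU would be ended.
print(should_end_task(cpu_hours=7.5, target_hours=3.0))
```

Under this reading, a task that shows a day of elapsed time but only a couple of hours of CPU time is nowhere near the cutoff.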

Also, if there is a pattern to the names of the tasks, that would be good to know. Do you have other tasks running normally?
Rosetta Moderator: Mod.Sense
Dtallguy®

Joined: 15 Jan 12
Posts: 3
Credit: 32,566,988
RAC: 0
Message 72211 - Posted: 24 Jan 2012, 4:46:07 UTC - in response to Message 72210.  

Thanks for getting back to me so quickly! I have a quad core and all 4 cores are actively working Rosetta work units. This machine is idle while I'm at work (partly why I wanted to get Rosetta running on it).
There DID appear to be similar names to the tasks, which have now, unfortunately, been deleted as I have a new batch of work units. All other work units are well within the 3 hour target timeline. I'll keep a close eye on this and if I see another workunit taking abnormal time again, I'll post the actual work unit number for review.
All other functions on the computer are working fine and I'm playing music and surfing with no appreciable difference. When I'm away from the PC the monitor is set to shut off after 5 minutes and no other activities are left open, so Rosetta/Boinc has full use of the resources.

Thanks again!
Dtallguy®
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 72214 - Posted: 24 Jan 2012, 19:23:56 UTC

This sounds as if it might be the same issue that's cropped up before; see, for example, this thread:

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5815

The workaround is to quit and restart BOINC.

It would be really nice if this problem got fixed, but its irreproducible nature must make it very hard to track down.

Dtallguy®

Joined: 15 Jan 12
Posts: 3
Credit: 32,566,988
RAC: 0
Message 72216 - Posted: 24 Jan 2012, 23:52:34 UTC

One of my older machines (3.4G Single Core) was also showing goofy times (24.2 hours elapsed and 88.3 hours remaining, with the percentage done at 21%). The actual CPU time was just over 2 hours. I stopped all processes and restarted BOINC and lo and behold the values got sorted out and everything appears normal now. So yes, this appears to be an issue with the BOINC client itself. I'll continue monitoring this and update if I find anything new.
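For what it's worth, here is a back-of-the-envelope check on those numbers in Python. This is a naive straight-line extrapolation and NOT how the BOINC client actually estimates remaining time; the function name `naive_remaining_hours` is made up for this example, and 2.0 is used as a stand-in for "just over 2 hours".

```python
def naive_remaining_hours(hours_so_far, fraction_done):
    """Straight-line extrapolation: time so far scaled by the work left."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return hours_so_far * (1 - fraction_done) / fraction_done


# Extrapolating from the 24.2 hours of elapsed time at 21% done.
from_elapsed = naive_remaining_hours(24.2, 0.21)
# Extrapolating from roughly 2 hours of actual CPU time at 21% done.
from_cpu = naive_remaining_hours(2.0, 0.21)

print(f"from elapsed time: {from_elapsed:.1f} h remaining")
print(f"from CPU time:     {from_cpu:.1f} h remaining")
```

The elapsed-time figure comes out at roughly 91 hours, in the same ballpark as the 88.3 hours displayed, while the CPU-time figure is about 7.5 hours. That is consistent with the displayed estimate being thrown off by the inflated elapsed time, though it does not prove it.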
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,181,510
RAC: 3,269
Message 72219 - Posted: 25 Jan 2012, 11:44:02 UTC - in response to Message 72216.  

One of my older machines (3.4G Single Core) was also showing goofy times (24.2 hours elapsed and 88.3 hours remaining, with the percentage done at 21%). The actual CPU time was just over 2 hours. I stopped all processes and restarted BOINC and lo and behold the values got sorted out and everything appears normal now. So yes, this appears to be an issue with the BOINC client itself. I'll continue monitoring this and update if I find anything new.


This is a LONG STANDING quirky issue that pops up so rarely the powers that be haven't been able to fix it yet. There are almost 2.5 million of us active crunchers, across MANY projects, so it SEEMS to happen a lot, but it really is rare. IF you can MAKE it happen, please say what you did; otherwise it is one of those things that will get fixed one day. The easy 'fix' right now is to exit BOINC and then restart it, and all of a sudden it is okay again, as you noticed.



