Message boards : Number crunching : Three very long tasks running
Author | Message |
---|---|
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,760,173 RAC: 8,652 |
I just noticed 2 of my computers have long running Rosetta tasks, is this normal? They normally limit to 8 hours, sometimes 10. This machine https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3792849 has two long tasks, one at 22 hours 14 minutes (CPU time, not wall time) https://boinc.bakerlab.org/rosetta/result.php?resultid=1220450326, and one at 14 hours 50 minutes https://boinc.bakerlab.org/rosetta/result.php?resultid=1220685579. This machine https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=4360598 has one long task, at 15 hours 40 minutes https://boinc.bakerlab.org/rosetta/result.php?resultid=1220618443. Should I abort them? Have they broken or are they meant to run that long? I notice they're all rgmjp tasks. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1679 Credit: 17,780,643 RAC: 22,820 |
"I just noticed 2 of my computers have long running Rosetta tasks, is this normal?" It can be. Sometimes it can take a long time for a model to complete; that's why they have the watchdog timer, which is 10 hours. So if a Task runs for 11 hours longer than its Target CPU time, then it is probably worth aborting. But until it's at least 10.5 hours over the Target CPU time (and that is CPU time, not Runtime, which can be way, way longer, particularly if a system is busy doing other things as well, or people have "Use at most 100 % of CPU time" set to anything less than 100%) i would just let it be. Grant Darwin NT |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,760,173 RAC: 8,652 |
"I just noticed 2 of my computers have long running Rosetta tasks, is this normal?" "It can be." So as I'm on the defaults, is that target 8 hours for whole work unit, + 10 hours maximum per model that it started just before 8 hours = 19 hours? One of them is now at 23 hours 17 minutes. I'm not sure how this thing works. How many models are usually run in one work unit? I ask because it's almost always very close to 8 hours they finish at, which would indicate they run a large number of short models, or there would be a wider variance of run times. Also, can I check somehow with the running task what it's currently doing? |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1679 Credit: 17,780,643 RAC: 22,820 |
"So as I'm on the defaults, is that target 8 hours for whole work unit, + 10 hours maximum per model that it started just before 8 hours = 19 hours? One of them is now at 23 hours 17 minutes." It is Target CPU time (as Runtime) + 10 hours (but i don't know if that is CPU time or Runtime). However, as is the case with one of your systems, if a system is busy doing other things, an 8 hour Task can take 10 hours to process (i've seen systems where it can take 24 hours to do 8 hours' worth of work). The 10 hour watchdog timer i'm not so sure about: whether it is 10 hours Runtime or 10 hours CPU time (maybe Mod.Sense can fill us in?). If it were 10 hours Runtime, your 8 hour Target CPU time Tasks would end after 20 hours (because it takes 10 hours Runtime to do the 8 hours of CPU work, plus the extra 10 hours). If it is 10 hours CPU time, then i would expect it to take around 23 hours (once again, because it takes 10 hours to do 8 hours of CPU work, and roughly 12.5 hours to do the extra 10 hours of CPU work). If it's still going after 26 hours i'd say it's well and truly gone beyond its extended cut off time. As long as its progress keeps increasing towards 100%, then it's probably still doing useful work. If it's no longer increasing (and/or the Estimated time keeps growing) then it's probably not actually doing anything useful. "Also, can I check somehow with the running task what it's currently doing?" In the BOINC Manager, Advanced view, Tasks tab, select the Task in question, then on the right in the command list, select Properties. Grant Darwin NT |
Falconet Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,836 |
Are you sure it's 10 hours? I could have sworn it was 4 hours. I posted this a while back and Mod.Sense seemed to agree. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,760,173 RAC: 8,652 |
"It is Target CPU time (as Runtime) + 10 hours (but i don't know if that is CPU time or Runtime)." The case with one of my systems? What do you mean? Both the systems in question have a CPU time close to the runtime. The tasks are 24 hours vs 25.5 hours, 17 hours vs 18.5 hours, 16.5 hours vs 18 hours. "If it were 10 hours Runtime, your 8 hour Target CPU time Tasks would end after 20 hours (because it takes 10 hours Runtime to do the 8 hours of CPU work, plus the extra 10 hours). If it is 10 hours CPU time, then i would expect it to take around 23 hours (once again, because it takes 10 hours to do 8 hours of CPU work, and roughly 12.5hrs to do the extra 10 hours of CPU work)." I'm only looking at CPU time. Does the Boinc manager even show this? I'm using Boinctasks, which puts the CPU time in brackets next to the runtime. This is handy for CPU tasks to see if it's getting the whole core, it's handy for multi-core tasks to see how many cores it's actually making use of, and it's handy for GPU tasks to see if the CPU is slowing the GPU down. "If it's still going after 26 hours i'd say it's well and truly gone beyond its extended cut off time." There's too many unknowns here. I'll just watch them and if either the progress stops (it's trickling forwards at the moment, they're at 99.308%, 99.033%, and 98.989%) or the deadline is exceeded, then I'll cancel them. I've got 66 cores altogether, 3 stuck isn't the end of the world. "In the BOINC Manager, Advanced view, Tasks tab, select the Task in question, Then on the right in the command list, select Properties." Do you not know your right from your left? Or can that be swapped over? Maybe in other countries it matches the driving side! Oh, my mistake, it's your right and my left, you're the other side of the screen. And that doesn't give me any information at all. I wanted to know what model it was running, when it last changed model, etc. I've seen that in LHC, but they're running in a Linux virtual box so you can actually see the program running and putting up some information as it progresses. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1679 Credit: 17,780,643 RAC: 22,820 |
"In the BOINC Manager, Advanced view, Tasks tab, select the Task in question, Then on the right in the command list, select Properties." "Do you not know your right from your left?" Yep, but what i'm thinking and what i'm typing aren't always the same thing. "The case with one of my systems? What do you mean?" https://boinc.bakerlab.org/rosetta/result.php?resultid=1220380852 Run time: 10 hours 4 min 10 sec. CPU time: 7 hours 57 min 55 sec. Just over 10 hours to do just under 8 hours of work. Grant Darwin NT |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,760,173 RAC: 8,652 |
"Yep, but what i'm thinking and what i'm typing aren't always the same thing." My mother doesn't know which is which. She'll say it's in the cupboard on the right. If I take a long time she shouts "the other right!" I don't know if this is unusual, but I think aloud (well not really aloud, but verbally in my head). So I work out what I'm going to type, it's translated into the sounds of the words, then typed. Hence I always type the wrong their/there/they're and have to check on proofreading. "https://boinc.bakerlab.org/rosetta/result.php?resultid=1220380852" Not much difference when we're deciding when to cancel things, as I'd leave it a bit longer anyway. The reason for that gap between run time and CPU time is that I use Tthrottle to stop things overheating. No matter how big the fan, things still get too hot, and that's in Scotland! And the example you quoted is more extreme, because that's the machine I use, which means I want SILENCE! All the fans are limited to 50%, then it throttles after that. |
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
"Are you sure it's 10 hours?" It was 4 hours. Then they created some tasks in April that needed to run a very long time to complete the first decoy, using stacks of RAM, and upped the watchdog to 10 hours. Then they stopped the high RAM tasks to approach things a different way. I asked if the 10 hour setting was still appropriate and was told it was. Then they did some work on allowing more frequent checkpoints, which seemed to solve task over-runs, and now the over-runs seem to have come back in a different form. So, your guess is as good as mine. tl;dr - no-one knows |
Falconet Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,836 |
Thanks for the reply! |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,760,173 RAC: 8,652 |
"I just noticed 2 of my computers have long running Rosetta tasks, is this normal? They normally limit to 8 hours, sometimes 10." ARGH!!! F***ing Windows Update! I want to physically strangle the absolute moron at Microsoft who does this. The 1st machine above rebooted without my permission overnight for yet another bug fix to sloppy Windows 10 coding, and one of the tasks has gone back to the beginning: https://boinc.bakerlab.org/rosetta/result.php?resultid=1220685579 It's now showing 7 hours 36 minutes CPU time. Strangely the other one https://boinc.bakerlab.org/rosetta/result.php?resultid=1220450326 is still going, and is now at 1 day 9 hours 40 minutes. And so is the one on the other machine, which also rebooted, but I had to go press F1 because of Dell's moronic moaning about one of my RAM chips being suboptimal. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,760,173 RAC: 8,652 |
"I just noticed 2 of my computers have long running Rosetta tasks, is this normal? They normally limit to 8 hours, sometimes 10." While I was out swimming, the one that had restarted completed in 11.8 hours CPU time, which is quicker than it was showing before the reboot. The other two are still plodding away and slowly increasing the percentage done (99.570% and 99.364%). Curiouser and curiouser. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
All but one of my machines are set to 24 hour task times. Had one run a task for 32 hours, and it only completed 2 decoys. Guessing this is one of those monster tasks mentioned above. /edit. Here is the task in question: https://boinc.bakerlab.org/rosetta/result.php?resultid=1220531809 |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,760,173 RAC: 8,652 |
"All but one of my machines are set to 24 hour task times. Had one run a task for 32 hours, and it only completed 2 decoys. Guessing this is one of those monster tasks mentioned above." All three of mine finished in 2 days 2 hours CPU time. One of them states less than that, but it seemed to do half of it, then the other half after the Windows Update reboot [1] without acknowledging how much time it had already done, although it must have saved something, because the second half took less time than the first half. All three did only 1 decoy. So either very big decoys, or your computer was faster. I can't see which computer did the task you mentioned, as that task is no longer listed on the server. But it looks like you have a bunch of Xeons similar to my four X5650s. I do love watching 24 tasks running per machine. They're not very efficient with electricity, but they were only £7 a chip! [1] I've done this to hopefully stop it in the future: https://www.windowscentral.com/how-prevent-windows-10-rebooting-after-installing-updates |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
"All but one of my machines are set to 24 hour task times. Had one run a task for 32 hours, and it only completed 2 decoys. Guessing this is one of those monster tasks mentioned above." I can't remember which of my three Xeon 12c/24t boxes got the task. One of them (OSX) is my only 8hr WU box, the other 2 are 24hr WUs, one Win10, the other OSX. (X5690, X5675, and X5670). I think it was my 2.93 box (the X5670, OSX) that got the long unit. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
I appear to have another one. As I'm typing it hasn't finished yet but currently it stands at just over 17 hours crunch time. The machine it's crunching on is set to 8hr WUs. Either it's another long task or it's a bad WU and something's up. /edit. It's at 99.028% complete. Every so often it increases .001% but otherwise seems to have stalled. It does have the same naming convention as OP's long WU, so maybe it's the same deal. I'll leave it running and see what happens. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
Update, the above task finished after almost 18 hours, waaaaay over. 1 decoy produced. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,760,173 RAC: 8,652 |
"Update, the above task finished after almost 18 hours, waaaaay over. 1 decoy produced." All mine completed. Just leave long ones running, nobody's has failed (or gone over the 3 day deadline) yet. |
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
Sid wrote: "no-one knows" Tasks are delivered with a command-line option -boinc::cpu_run_timeout 36000 which suggests it’s 10 hours. |
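
For anyone wanting to redo the arithmetic from this thread, here is a rough Python sketch. It rests on assumptions nobody in the thread confirmed: that the value after -boinc::cpu_run_timeout is in seconds, that this option is the watchdog allowance discussed above, and that the cut-off is simply target time plus that allowance. The function names are made up for illustration and are not part of BOINC or Rosetta. The 10 hours of run time per 8 hours of CPU time comes from Grant's example.

```python
def parse_cpu_run_timeout(command_line: str) -> int:
    """Return the number that follows -boinc::cpu_run_timeout.

    Assumption: the value is a whole number of seconds.
    """
    tokens = command_line.split()
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-boinc::cpu_run_timeout":
            return int(value)
    raise ValueError("-boinc::cpu_run_timeout not found")


def expected_cutoff_hours(target_cpu_hours: float,
                          watchdog_hours: float,
                          runtime_per_cpu_hour: float,
                          watchdog_counts_cpu_time: bool) -> float:
    """Estimate the run time (wall clock) at which a task would be cut off.

    runtime_per_cpu_hour is how many hours of run time the host needs
    to deliver one hour of CPU time (1.0 means the task gets a full core).
    """
    target_runtime = target_cpu_hours * runtime_per_cpu_hour
    if watchdog_counts_cpu_time:
        # The allowance is CPU time, so it stretches like the target does.
        extra_runtime = watchdog_hours * runtime_per_cpu_hour
    else:
        # The allowance is plain run time.
        extra_runtime = watchdog_hours
    return target_runtime + extra_runtime


# Option quoted in the thread; the seconds-to-hours conversion is the assumption.
watchdog_hours = parse_cpu_run_timeout("-boinc::cpu_run_timeout 36000") / 3600

# Grant's example: about 10 hours of run time for 8 hours of CPU time.
slowdown = 10 / 8

cutoff_if_runtime = expected_cutoff_hours(8, watchdog_hours, slowdown, False)
cutoff_if_cpu_time = expected_cutoff_hours(8, watchdog_hours, slowdown, True)

print(f"watchdog allowance: {watchdog_hours} hours")
print(f"cut-off if the allowance is run time: {cutoff_if_runtime} hours")
print(f"cut-off if the allowance is CPU time: {cutoff_if_cpu_time} hours")
```

With those inputs the two estimates come out at 20 hours and 22.5 hours of run time, matching the "20 hours" and "around 23 hours" figures given earlier. On a host where CPU time and run time are nearly equal, set the slowdown to about 1.0 and both estimates collapse to roughly 18 hours.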
©2024 University of Washington
https://www.bakerlab.org