long/large work units, cpu_run_time limit and how to check 'progress'?

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 90614 - Posted: 5 Apr 2019, 0:33:44 UTC Last modified: 5 Apr 2019, 0:57:31 UTC
I kind of started crunching Rosetta@home again and received a set of Rosetta 4.07 jobs, rb_04_03_2501_2629_ab_1000_robetta_cstwt_5.0*. Judging by 'show graphics', these seem to be somewhat bigger, more complex proteins.
What surprised me is that I set a run time limit (cpu_run_time) of 4 hours, but these WUs have run beyond the 4 hours I have come to expect (and, more on this below, there are apparently no results: no decoy found in that 4-hour run). In 'show graphics', the lowest value shown for 'low energy' is a little below -200, but I'm not sure whether that is 'low enough', and it keeps bouncing up to try other conformations with higher energy.
Is the cpu_run_time limit still in effect and used anywhere? (Oops, I looked it up on the online preferences page: Target CPU run time is still 4 hours, so I'd guess it is still used.)
Another thing: is there a way I can check the 'progress'? I tried going into the slots/n/ directory and looking at stderr and stdout, but I did not find any 'decoy' (model) messages listed there, and the jobs keep running. Are stderr and stdout the correct files to check whether any decoys (models) have been found for the WU, in particular while it is running?
Limiting the continuous run time for the WU is necessary because I normally switch off the PC afterwards. I normally let the jobs run at night, as room temperatures are cooler, and they run while I sleep so that they get as much uninterrupted CPU time as possible to complete the WU.
I'll try suspending the jobs, as they have run for some 5 hours, beyond the cpu_run_time of 4 hours, and apparently no decoys have been found yet, if stderr is the correct file to check. Hopefully the data is still in the checkpoint and the WU can continue later.
Note that for long-running WUs I'm OK with them being 'suspended' and continuing from that point, say, the next night. That would in a way allow more 'difficult' WUs to complete.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 90617 - Posted: 5 Apr 2019, 9:08:13 UTC Last modified: 5 Apr 2019, 9:16:38 UTC
OK, it finally completed after 5 hours of run time (6 hours elapsed, suspended once in between), no fanfare: https://boinc.bakerlab.org/rosetta/result.php?resultid=1066394376
A single decoy in those 5 hours.
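Editor's note: one plausible reading of the overrun reported above is that the run-time target is only consulted between decoys, so a task cannot stop in the middle of a model. The thread does not confirm this; the sketch below is a toy model under that assumption, and `simulate_task` and its parameters are hypothetical names, not part of Rosetta or BOINC.

```python
HOUR = 3600  # seconds


def simulate_task(target_cpu_s, decoy_durations_s):
    """Toy model: the target is checked only before starting a new decoy.

    ASSUMPTION (not stated in the thread): a decoy that has been started
    always runs to completion, even if that pushes the task past the target.
    Returns (cpu_seconds_used, decoys_completed).
    """
    cpu_used = 0.0
    decoys_done = 0
    for duration in decoy_durations_s:
        if cpu_used >= target_cpu_s:
            break  # target reached, do not start another decoy
        cpu_used += duration
        decoys_done += 1
    return cpu_used, decoys_done


# A 4-hour target with decoys that each need 5 hours: one decoy, 5 hours used.
big_protein = simulate_task(4 * HOUR, [5 * HOUR, 5 * HOUR])

# The same target with 1-hour decoys stops right at the target.
small_protein = simulate_task(4 * HOUR, [1 * HOUR] * 10)
```

Under this assumption a work unit whose single decoy needs 5 hours finishes with one decoy after 5 hours, which would be consistent with the result linked above.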

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2190 Credit: 13,720,774 RAC: 2,131	Message 90618 - Posted: 5 Apr 2019, 9:44:37 UTC - in response to Message 90617. Last modified: 5 Apr 2019, 9:44:55 UTC
"OK, it finally completed after 5 hours of run time (6 hours elapsed, suspended once in between), no fanfare: https://boinc.bakerlab.org/rosetta/result.php?resultid=1066394376 A single decoy in those 5 hours."
Same here on my Xeon: runtime 2 hs, 6 hs of calculation, 1 decoy.
These are big proteins, I think.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 90619 - Posted: 5 Apr 2019, 10:00:17 UTC - in response to Message 90618.
I've been wary of suspending jobs, concerned that they may not checkpoint adequately and continue from that point. It is good that suspending them did not cause any visible harm and that I can continue the long jobs in a separate sitting. That may allow me to use a higher cpu_run_time so that I can crunch the bigger jobs as well, just like this batch.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 90621 - Posted: 5 Apr 2019, 15:48:47 UTC
The graphics are the simplest way to see how many decoys a given WU has produced. Also, from the properties of the WU display, you can see the current number of CPU seconds and the number at the time of the last checkpoint. If the PC is powered down, or the task is removed from memory to run another BOINC task, the work will resume at the checkpoint (if any). If no checkpoint has been taken yet, work will be restarted from the beginning.
The goal is to checkpoint every 10-15 minutes. But new protocols and large proteins often start by going longer between checkpoints. If the protocol proves itself useful, then it is generally enhanced to do more frequent checkpointing.
Rosetta Moderator: Mod.Sense
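Editor's note: the two numbers mentioned above (current CPU seconds and CPU seconds at the last checkpoint) are enough to estimate how much work a shutdown would throw away. A minimal sketch, assuming you read both values off the task properties by hand; the function names are hypothetical and `None` stands for "no checkpoint taken yet".

```python
def unsaved_cpu_seconds(current_cpu_s, checkpoint_cpu_s):
    """CPU seconds that would be redone if the task were stopped right now."""
    if checkpoint_cpu_s is None:
        # No checkpoint yet: the task restarts from the beginning.
        return float(current_cpu_s)
    return max(0.0, float(current_cpu_s) - float(checkpoint_cpu_s))


def cpu_seconds_after_restart(checkpoint_cpu_s):
    """CPU seconds the task resumes from after a power-down."""
    return 0.0 if checkpoint_cpu_s is None else float(checkpoint_cpu_s)


# Example values (made up for illustration): 5 hours in, last checkpoint at 4h45m.
at_risk = unsaved_cpu_seconds(18000, 17100)      # 900.0 seconds would be redone
never_saved = unsaved_cpu_seconds(18000, None)   # 18000.0, everything is redone
```

If the difference is small, switching the PC off costs little; if no checkpoint exists yet, the whole run so far is at risk.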

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 90622 - Posted: 5 Apr 2019, 17:30:19 UTC - in response to Message 90621. Last modified: 5 Apr 2019, 17:35:30 UTC
Thanks! :) I didn't realise the graphics show the number of models (decoys) found :)
It would indeed be good if larger proteins checkpointed at longer intervals, as the files are rather large, and I'd think more so for large molecules. The compromise, of course, is that if the task is suspended, or for that matter the PC is shut down, more work is lost between checkpoints, so it takes longer to resume the job and finish it. I think an administrative suspend on the panel would kind of trigger a checkpoint. I'm not too sure if it does, but I'd think it should.
BOINC preferences apparently have a 'checkpoint at most every' interval parameter, which I set to 2 minutes. Intervals that are too short may cause a lot of disk thrashing, and hard disks are still pretty much the norm even though SSDs are gaining popularity. Hence users should be able to influence checkpointing with that parameter as well.
But the checkpoint proved useful: I suspended that set of tasks on the panel and restarted them today, and they completed without issues. That is good, as it alleviates the concern that long-running jobs lose all their work after running for hours, and I'd be able to crunch bigger jobs that may take more than a single 'sitting' (continuous run interval).
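Editor's note: the trade-off described above (fewer disk writes versus more lost work) can be put into rough numbers. This is a back-of-the-envelope sketch, assuming checkpoints are written at a fixed interval and that a shutdown is equally likely at any moment within an interval, so the expected loss is half the interval. Real tasks checkpoint irregularly, so treat the figures as illustrative only; `checkpoint_tradeoff` is a hypothetical helper.

```python
def checkpoint_tradeoff(run_time_s, interval_s):
    """Return (number_of_checkpoint_writes, expected_lost_seconds_on_shutdown).

    ASSUMPTIONS: fixed checkpoint interval, shutdown time uniformly
    distributed within an interval.
    """
    writes = int(run_time_s // interval_s)
    expected_loss_s = interval_s / 2
    return writes, expected_loss_s


RUN = 5 * 3600  # a 5-hour work unit, as in this thread

every_2_min = checkpoint_tradeoff(RUN, 2 * 60)    # many writes, little lost
every_15_min = checkpoint_tradeoff(RUN, 15 * 60)  # few writes, more lost
every_30_min = checkpoint_tradeoff(RUN, 30 * 60)
```

With these assumptions a 2-minute interval means 150 writes over the run and about 1 minute lost on average, while a 15-minute interval means 20 writes and about 7.5 minutes lost.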

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 90623 - Posted: 5 Apr 2019, 18:05:07 UTC - in response to Message 90622.
Off-topic: it seems that large molecules/proteins have many more folding permutations and are much harder to fold appropriately. That may point to a natural cause of diseases due to protein misfolding, e.g. Alzheimer's and cancer.
Misfolded proteins cause Alzheimer's or cancer?

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 90640 - Posted: 8 Apr 2019, 18:57:26 UTC - in response to Message 90622.
"... I think an administrative suspend on the panel would kind of trigger a checkpoint. I'm not too sure if it does, but I'd think it should."
Unfortunately, it does not work that way. The task has to reach a point where it can completely store and reload itself. So it is not possible to call something in the task and command it to take a checkpoint now.
Rosetta Moderator: Mod.Sense
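Editor's note: the point above, that a task can only checkpoint where it can completely store and reload itself, can be illustrated with a toy worker. This is a sketch of the general technique (checkpointing at safe points only), not Rosetta's actual code; the file layout, `run`, and the "substep" structure are all made up for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then swap it in, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, substeps_per_step, kill_after_substeps=None):
    """Run steps; state is only storable between steps (the safe points).

    kill_after_substeps simulates a power-off in the middle of the work.
    Returns the final state, or None if the run was killed.
    """
    state = load_checkpoint(path)
    done = 0
    while state["step"] < n_steps:
        partial = 0  # lives in memory only, cannot be saved mid-step
        for _ in range(substeps_per_step):
            if kill_after_substeps is not None and done >= kill_after_substeps:
                return None  # power-off: 'partial' is lost
            partial += 1
            done += 1
        state["total"] += partial
        state["step"] += 1
        save_checkpoint(path, state)  # safe point reached
    return state


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "checkpoint.json")
        run(path, 3, 10, kill_after_substeps=15)  # killed halfway through step 2
        saved = load_checkpoint(path)             # only step 1 survived
        final = run(path, 3, 10)                  # resumes from step 1
        return saved, final
```

In the demo, the 5 substeps done after the last safe point are redone on restart. If the kill happens before the first step completes, no checkpoint file exists and the run starts again from zero, matching the behaviour described earlier in the thread.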

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2190 Credit: 13,720,774 RAC: 2,131	Message 90660 - Posted: 12 Apr 2019, 8:50:31 UTC - in response to Message 90640.
"Unfortunately, it does not work that way. The task has to reach a point where it can completely store and reload itself. So it is not possible to call something in the task and command it to take a checkpoint now."
And, after a reboot, all my "_robetta_cstwt_5.0*" tasks restarted from 0%.
5 hs of crunching lost...

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 90662 - Posted: 12 Apr 2019, 14:29:50 UTC - in response to Message 90660. Last modified: 12 Apr 2019, 14:34:01 UTC
"Unfortunately, it does not work that way. The task has to reach a point where it can completely store and reload itself. So it is not possible to call something in the task and command it to take a checkpoint now."
"And, after a reboot, all my "_robetta_cstwt_5.0*" tasks restarted from 0%. 5 hs of crunching lost..."
Next time, try doing a full, proper suspend of all the tasks before you shut down. That may make a difference. I'm not sure why, but for that batch, perhaps I was lucky: I was able to continue from that point onwards after restarting. Perhaps it isn't possible for every WU, but for a fraction of them it works.
R@h should seriously look at resumable checkpoints. Even if the checkpoint interval is 15 or even 30 minutes, that would at least place a savepoint so that if the PC is shut down for any reason, the WU can be continued. Otherwise some (many?) participants may not be able to run the long jobs.

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 90663 - Posted: 12 Apr 2019, 15:01:10 UTC - in response to Message 90662.
"R@h should seriously look at resumable checkpoints. Even if the checkpoint interval is 15 or even 30 minutes, that would at least place a savepoint so that if the PC is shut down for any reason, the WU can be continued. Otherwise some (many?) participants may not be able to run the long jobs."
I run my machines 24/7, so it would not be much of a problem for me. If they are going to produce a large number of such work units, they could set up a separate queue and allow users to select it.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2190 Credit: 13,720,774 RAC: 2,131	Message 90664 - Posted: 12 Apr 2019, 19:33:51 UTC - in response to Message 90662.
"Next time, try doing a full, proper suspend of all the tasks before you shut down. That may make a difference. I'm not sure why, but for that batch, perhaps I was lucky: I was able to continue from that point onwards after restarting."
Nope. Restarting from 0% after pause and reboot :-(

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 90665 - Posted: 12 Apr 2019, 20:19:24 UTC - in response to Message 90662.
"R@h should seriously look at resumable checkpoints. Even if the checkpoint interval is 15 or even 30 minutes, that would at least place a savepoint so that if the PC is shut down for any reason, the WU can be continued. Otherwise some (many?) participants may not be able to run the long jobs."
In early development of a new method of analysis of a large protein, it is pretty common to span long periods of time without checkpoints. If the new method proves useful and yields better models, then further development is done to improve runtime per model and checkpointing.
Rosetta Moderator: Mod.Sense

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 90671 - Posted: 14 Apr 2019, 17:13:03 UTC - in response to Message 90665.
"R@h should seriously look at resumable checkpoints. Even if the checkpoint interval is 15 or even 30 minutes, that would at least place a savepoint so that if the PC is shut down for any reason, the WU can be continued. Otherwise some (many?) participants may not be able to run the long jobs."
"In early development of a new method of analysis of a large protein, it is pretty common to span long periods of time without checkpoints. If the new method proves useful and yields better models, then further development is done to improve runtime per model and checkpointing."
Why isn't this new and possibly disruptive work done on RALPH? It seems like RALPH, not the main Rosetta@home, is the place where Rosetta experimentation takes place. The RALPH volunteers are expecting this, and it does not disrupt those who don't want to be messed up.
Other projects that perform their development work on their main site give crunchers the option to opt out of this testing.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2190 Credit: 13,720,774 RAC: 2,131	Message 90672 - Posted: 14 Apr 2019, 20:34:43 UTC - in response to Message 90671.
"Why isn't this new and possibly disruptive work done on RALPH? It seems like RALPH, not the main Rosetta@home, is the place where Rosetta experimentation takes place. The RALPH volunteers are expecting this, and it does not disrupt those who don't want to be messed up."
+1. I crunch on both Ralph and Rosetta.
When I crunch on Ralph I have no problem with crashes, errors, etc. That's normal in a beta test.
When I crunch on Rosetta I would like stability and no errors.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 90674 - Posted: 15 Apr 2019, 6:35:48 UTC - in response to Message 90672.
"Why isn't this new and possibly disruptive work done on RALPH? It seems like RALPH, not the main Rosetta@home, is the place where Rosetta experimentation takes place. The RALPH volunteers are expecting this, and it does not disrupt those who don't want to be messed up."
"+1. I crunch on both Ralph and Rosetta. When I crunch on Ralph I have no problem with crashes, errors, etc. That's normal in a beta test. When I crunch on Rosetta I would like stability and no errors."
Errors can be the results themselves. E.g. if a researcher generates lots of arbitrary models and maybe only 1 in 1,000,000 is a model (protein) that would assemble and run to completion, then 999,999 would run to a failure error and only that last 1 in 1,000,000 runs to completion, lol.
At the extreme, I'd think some proteins may be completely synthetic, i.e. not seen in nature.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2190 Credit: 13,720,774 RAC: 2,131	Message 90675 - Posted: 15 Apr 2019, 9:09:59 UTC - in response to Message 90674.
"Errors can be the results themselves. E.g. if a researcher generates lots of arbitrary models and maybe only 1 in 1,000,000 is a model (protein) that would assemble and run to completion, then 999,999 would run to a failure error and only that last 1 in 1,000,000 runs to completion."
Errors are results in test projects (like Ralph), because debugging is welcome there.
I'm thinking about technical errors like "validation error", "c++ (out of memory) error", etc., in production projects like Rosetta.