Minirosetta 3.62-3.65

Author	Message
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0	Message 79225 - Posted: 15 Dec 2015, 19:28:15 UTC - in response to Message 79214. Last modified: 15 Dec 2015, 19:31:26 UTC
> there are various/many limitations with boinc i'd guess in part due to the protocol design. For the most part it works well, then in the real world we have the extremes which fall out of the 'normal' design ranges of boinc i'd guess. i read that boinc is based on a 'one way polling' design where all network requests are initiated by the client, this limits 'push' notifications from being possible as a solution.
No need to "push" anything, the BOINC server has the ability to tell such things to the client simply on the next scheduler request. It's even possible to choose whether a work unit should be aborted even if it has already started or only if it has not started yet. So nothing "extreme", just something that was implemented a long time ago and is already in use on other projects when needed. Of course I can't say whether the ancient version of BOINC that Rosetta is using is able to send such messages, but if not, then that's just one more point on the long, long list of reasons why they need to upgrade.
ID: 79225 · Rating: 0

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 79234 - Posted: 16 Dec 2015, 23:21:35 UTC - in response to Message 79225.
>> there are various/many limitations with boinc i'd guess in part due to the protocol design. For the most part it works well, then in the real world we have the extremes which fall out of the 'normal' design ranges of boinc i'd guess. i read that boinc is based on a 'one way polling' design where all network requests are initiated by the client, this limits 'push' notifications from being possible as a solution.
> No need to "push" anything, the BOINC server has the ability to tell such things to the client simply on the next scheduler request. It's even possible to choose whether a work unit should be aborted even if it has already started or only if it has not started yet. So nothing "extreme", just something that was implemented a long time ago and is already in use on other projects when needed. Of course I can't say whether the ancient version of BOINC that Rosetta is using is able to send such messages, but if not, then that's just one more point on the long, long list of reasons why they need to upgrade.
I think there are some client-side and server-side settings available, even for our old version, that would increase the communication with our servers but would put more pressure on our database server. Other options would have to be developed within our app and server (regardless of server version). Unless I'm missing something.
ID: 79234 · Rating: 0

Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0	Message 79237 - Posted: 17 Dec 2015, 18:25:48 UTC - in response to Message 79234.
> I think there are some client-side and server-side settings available, even for our old version, that would increase the communication with our servers but would put more pressure on our database server. Other options would have to be developed within our app and server (regardless of server version). Unless I'm missing something.
There's also the possibility to abort WUs that have already been sent out to a client, on that client's next scheduler request. No extra communication. WUs like that then appear as "Aborted by server" (or project, not sure) in the task list.
ID: 79237 · Rating: 0

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 79241 - Posted: 17 Dec 2015, 21:46:44 UTC - in response to Message 79237.
>> I think there are some client-side and server-side settings available, even for our old version, that would increase the communication with our servers but would put more pressure on our database server. Other options would have to be developed within our app and server (regardless of server version). Unless I'm missing something.
> There's also the possibility to abort WUs that have already been sent out to a client, on that client's next scheduler request. No extra communication. WUs like that then appear as "Aborted by server" (or project, not sure) in the task list.
I'm not aware of an abort option other than canceling jobs.
ID: 79241 · Rating: 0

rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 8,196	Message 79242 - Posted: 18 Dec 2015, 3:40:50 UTC - in response to Message 79241.
>>> I think there are some client-side and server-side settings available, even for our old version, that would increase the communication with our servers but would put more pressure on our database server. Other options would have to be developed within our app and server (regardless of server version). Unless I'm missing something.
>> There's also the possibility to abort WUs that have already been sent out to a client, on that client's next scheduler request. No extra communication. WUs like that then appear as "Aborted by server" (or project, not sure) in the task list.
> I'm not aware of an abort option other than canceling jobs.
I think this logic is expected to be in the client.
EXIT_UNSTARTED_LATE 200 Task was aborted due to it having not started and already past the deadline.
http://boincfaq.mundayweb.com/index.php?viewCat=3&sessionID=c5a9905b2172d67bb1c1ff12eedd0b6c
ID: 79242 · Rating: 0

Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0	Message 79243 - Posted: 18 Dec 2015, 4:10:40 UTC - in response to Message 79242. Last modified: 18 Dec 2015, 4:17:40 UTC
>> I'm not aware of an abort option other than canceling jobs.
> I think this logic is expected to be in the client.
> EXIT_UNSTARTED_LATE 200 Task was aborted due to it having not started and already past the deadline.
> http://boincfaq.mundayweb.com/index.php?viewCat=3&sessionID=c5a9905b2172d67bb1c1ff12eedd0b6c
... Definitely, some sort of 'remote abort' command that can be issued from the server to all clients holding a certain job would be ideal. In the link rjs5 shared above, there's one code, EXIT_ABORTED_BY_PROJECT 202, that sounds like it could be it, but it still doesn't seem to be something that 'pulls the plug' remotely; rather, it's just a classification of how to handle WUs returned for a job that was cancelled by the server (said Work Units still run their full course from the look of it, sadly). More reading can be found here: http://boinc.berkeley.edu/dev/forum_thread.php?id=7704&sort=5
I also took a look around the BOINC dev docs and didn't see anything like this, which is sad as it sounds like a very practical and useful function. I'll continue looking though... Ideally, I would want some way to actually tell clients to stop crunching a certain task and bail out early (move on to the next useful task) rather than having their cycles spinning for something that is cancelled.
ID: 79243 · Rating: 0

rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 8,196	Message 79247 - Posted: 18 Dec 2015, 16:29:25 UTC - in response to Message 79243.
>>> I'm not aware of an abort option other than canceling jobs.
>> I think this logic is expected to be in the client.
>> EXIT_UNSTARTED_LATE 200 Task was aborted due to it having not started and already past the deadline.
>> http://boincfaq.mundayweb.com/index.php?viewCat=3&sessionID=c5a9905b2172d67bb1c1ff12eedd0b6c
> ... Definitely, some sort of 'remote abort' command that can be issued from the server to all clients holding a certain job would be ideal. In the link rjs5 shared above, there's one code, EXIT_ABORTED_BY_PROJECT 202, that sounds like it could be it, but it still doesn't seem to be something that 'pulls the plug' remotely; rather, it's just a classification of how to handle WUs returned for a job that was cancelled by the server (said Work Units still run their full course from the look of it, sadly). More reading can be found here: http://boinc.berkeley.edu/dev/forum_thread.php?id=7704&sort=5
> I also took a look around the BOINC dev docs and didn't see anything like this, which is sad as it sounds like a very practical and useful function. I'll continue looking though... Ideally, I would want some way to actually tell clients to stop crunching a certain task and bail out early (move on to the next useful task) rather than having their cycles spinning for something that is cancelled.
Just pass the DEADLINE date to the client job as a command line parameter and have the client periodically check the current date against the deadline. Once the deadline has passed, abort. It would be pretty easy to add an "if ( date > deadline ) abort;" to the code to determine whether to continue or not. I would probably place it just before the checkpointing code so the client can skip the checkpoint if the deadline has passed.
ID: 79247 · Rating: 0
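
The deadline check rjs5 describes could look roughly like the sketch below. It is only an illustration in Python, not Rosetta or BOINC code: the function name, the `checkpoint` callback and the injectable `now` clock are all hypothetical, and the deadline is assumed to arrive as a Unix timestamp.

```python
import time

def run_until_deadline(steps, deadline, checkpoint, now=time.time):
    """Run work steps and stop early once the deadline has passed.

    steps      -- iterable of zero-argument callables (hypothetical work units)
    deadline   -- Unix timestamp, assumed to be passed in on the command line
    checkpoint -- callable invoked with the number of completed steps
    Returns (completed_steps, aborted).
    """
    completed = 0
    for step in steps:
        step()
        completed += 1
        # Check the deadline just before checkpointing, so a late task
        # skips the checkpoint and bails out immediately.
        if now() > deadline:
            return completed, True
        checkpoint(completed)
    return completed, False
```

Placing the test ahead of the checkpoint call means a task that is already late never spends time writing state it will not use.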

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 79249 - Posted: 18 Dec 2015, 17:47:34 UTC - in response to Message 79247.
> Just pass the DEADLINE date to the client job as a command line parameter and have the client periodically check the current date against the deadline. Once the deadline has passed, abort. It would be pretty easy to add an "if ( date > deadline ) abort;" to the code to determine whether to continue or not. I would probably place it just before the checkpointing code so the client can skip the checkpoint if the deadline has passed.
actually i'm wondering a little if it may be fun to have a utility that works like ifttt https://ifttt.com/ or tasker http://www.androidcentral.com/tasker-review-thing-you-need-do-all-things for boinc, then one can literally create all kinds of whiz-bang scheduling & mini automation that one prefers :D lol
ID: 79249 · Rating: 0

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 79253 - Posted: 19 Dec 2015, 23:37:19 UTC
So if anyone is aware of an actual abort option that pushes that info to clients that may be holding or running a job we'd like to abort in a timely manner, please let me know. A possible option that I am aware of is using a trickle message and coding it into our application, and of course we'd also need to code up the server-side logic. But I'm not sure there's an abort option currently in BOINC that pushes that info out to clients.
ID: 79253 · Rating: 0

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 79255 - Posted: 20 Dec 2015, 3:41:52 UTC - in response to Message 79253. Last modified: 20 Dec 2015, 3:47:18 UTC
> So if anyone is aware of an actual abort option that pushes that info to clients that may be holding or running a job we'd like to abort in a timely manner, please let me know. A possible option that I am aware of is using a trickle message and coding it into our application, and of course we'd also need to code up the server-side logic. But I'm not sure there's an abort option currently in BOINC that pushes that info out to clients.
yeah, it seems trickle messages are possibly the 'only' way, and it'd seem there would need to be enhancements in both the server code and the r@h app code
http://boinc.berkeley.edu/trac/wiki/TrickleApi
e.g. when the r@h app receives a trickle message, it interprets it as an 'early end' command, wraps up the job and submits that to the server as a completed task
i'm not too sure if there could be client dependencies, e.g. certain client versions may have a different trickle feature or none at all
ID: 79255 · Rating: 0
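
As a rough illustration of the 'early end' idea, the sketch below shows an application loop that polls for a message between decoys and wraps up when it sees one. Everything here is an assumption made for the example: `poll_trickle` is a hypothetical stand-in for whatever the BOINC trickle API provides, the `early_end` marker is an invented message format, and the real application is not written in Python.

```python
def crunch_with_early_end(max_decoys, poll_trickle, make_decoy):
    """Produce decoys until done or until an 'early end' message arrives.

    poll_trickle -- hypothetical callable returning a message string or None
    make_decoy   -- hypothetical callable producing one result per index
    Returns the list of decoys finished so far, ready to be reported.
    """
    results = []
    for i in range(max_decoys):
        message = poll_trickle()
        # The "early_end" marker is an invented convention for this sketch.
        if message is not None and "early_end" in message:
            break
        results.append(make_decoy(i))
    return results
```

The point of the design is that the task still ends as a normal completed result containing whatever decoys were finished, rather than as an error.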

Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982	Message 79317 - Posted: 28 Dec 2015, 5:26:05 UTC
Bit of a weird error in this task. It validated ok and only finished because it reached 99 decoys but...
14h2ld2203_fold_and_dock_SAVE_ALL_OUT_319506_998_0
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
-> -multiple lines of the above
...
ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 894
No heartbeat from core client for 30 sec - exiting
...
ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 894
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
-> -multiple lines of the above again
ID: 79317 · Rating: 0

sinspin Joined: 30 Jan 06 Posts: 29 Credit: 6,574,585 RAC: 0	Message 79367 - Posted: 7 Jan 2016, 14:42:54 UTC
I got a lot of errors with long-running krypton WUs. It seems that all of them stop with the same error:
exceeded elapsed time limit 141525.53 (500000.00G/3.53G)
Some examples:
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463729
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463704
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463682
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463677
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463675
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463670
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463662
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463660
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463658
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463656
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463652
https://boinc.bakerlab.org/rosetta/result.php?resultid=781463647
I am not sure whether other WUs which run up to the time limit are successful or not.
ID: 79367 · Rating: 0

krypton Volunteer moderator Project developer Project scientist Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0	Message 79373 - Posted: 7 Jan 2016, 19:37:06 UTC
Sorry about that! I'm investigating.
ID: 79373 · Rating: 0

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 79374 - Posted: 7 Jan 2016, 20:20:11 UTC
I've increased our default fpops limit, so hopefully this will prevent such errors in the future, but we can't update the limit for jobs that have already been submitted. Thanks for the heads up on this error.
ID: 79374 · Rating: 0

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 79376 - Posted: 7 Jan 2016, 21:56:26 UTC - in response to Message 79243.
> ... Definitely, some sort of 'remote abort' command that can be issued from the server to all clients holding a certain job would be ideal. In the link rjs5 shared above, there's one code, EXIT_ABORTED_BY_PROJECT 202, that sounds like it could be it, but it still doesn't seem to be something that 'pulls the plug' remotely; rather, it's just a classification of how to handle WUs returned for a job that was cancelled by the server (said Work Units still run their full course from the look of it, sadly). More reading can be found here: http://boinc.berkeley.edu/dev/forum_thread.php?id=7704&sort=5
> I also took a look around the BOINC dev docs and didn't see anything like this, which is sad as it sounds like a very practical and useful function. I'll continue looking though... Ideally, I would want some way to actually tell clients to stop crunching a certain task and bail out early (move on to the next useful task) rather than having their cycles spinning for something that is cancelled.
The Abort (202) command is used all the time on World Community Grid to cancel jobs in your buffer before they start, both on Clean Energy Phase 2 (CEP2) and Mapping Cancer Markers (MCM). I have several in my logs now, and see them all the time. I am not a software developer and don't know how it is implemented, but it works something like this:
In those projects where a quorum of two is required to validate results, the two work units are sent out simultaneously to two different users under a given time limit (say 7 days). Suppose the first one comes back in a couple of days, but the second one is delayed. After 5 days or so the server sends out another copy to a "trusted computer" (or whatever it is called). I am one of them, because I leave my machines on 24/7 and have a low error rate.
So it sits in my buffer for a few hours, but before I can get to it, the second machine returns its results. Therefore, they don't need me to work on it, and they send me the "Server Abort" command for that work unit. There are plenty of people on WCG who can explain it in detail; try SekeRob first.
ID: 79376 · Rating: 0
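
The behaviour Jim1348 describes, together with Link's earlier point about choosing whether started tasks are aborted, can be sketched as a decision made when a client contacts the scheduler. This is a guess at the logic in Python, not BOINC server code: the function name, the tuple layout and the `only_if_unstarted` flag are hypothetical.

```python
def results_to_abort(held, finished_workunits, only_if_unstarted=True):
    """Pick which results a client should be told to abort.

    held               -- list of (result_name, workunit, started) tuples that
                          the client reports in its scheduler request
    finished_workunits -- set of workunits that no longer need more results
                          (for example, quorum already reached)
    only_if_unstarted  -- if True, leave tasks that are already running alone
    """
    abort = []
    for name, workunit, started in held:
        no_longer_needed = workunit in finished_workunits
        if no_longer_needed and not (only_if_unstarted and started):
            abort.append(name)
    return abort
```

Because the list is returned in the reply to a request the client was already making, no server-initiated "push" is needed; the cost is the extra lookup per held result on each request.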

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 79381 - Posted: 8 Jan 2016, 19:35:21 UTC
This sounds like a scheduler function for redundant jobs, which makes sense, but to my knowledge it is not a general abort option that we can use without modifying server and/or application code.
ID: 79381 · Rating: 0

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 79382 - Posted: 8 Jan 2016, 20:31:28 UTC
I believe it is a check for host-specific messages during the scheduler requests. Big hit to database. See the <msg_to_host/> tag: https://boinc.berkeley.edu/trac/wiki/ProjectOptions
Rosetta Moderator: Mod.Sense
ID: 79382 · Rating: 0

sinspin Joined: 30 Jan 06 Posts: 29 Credit: 6,574,585 RAC: 0	Message 79384 - Posted: 9 Jan 2016, 9:56:16 UTC - in response to Message 79374.
> I've increased our default fpops limit, so hopefully this will prevent such errors in the future, but we can't update the limit for jobs that have already been submitted. Thanks for the heads up on this error.
Thanks. What does the "500000G" in the error message mean? How can I calculate from this value the number of seconds the task can run? Is that around 39 hours? If yes, why? My task runtime is set to two days (that saves a lot of network traffic and increases the crunching efficiency).
ID: 79384 · Rating: 0
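
The numbers in the error message fit a simple division, worked through below. This reading is an assumption based on the message format: the first value looks like the task's floating-point operation bound (500000 G operations) and the second like the host's estimated speed (3.53 G operations per second), with the quotient being the elapsed-time limit in seconds.

```python
def elapsed_limit_seconds(fpops_bound, host_flops):
    """Elapsed-time limit as operation bound divided by estimated host speed."""
    return fpops_bound / host_flops

# Values read from the error message; the speed is shown rounded to 3.53G,
# so the result differs slightly from the 141525.53 s printed in the log.
limit = elapsed_limit_seconds(500000e9, 3.53e9)
hours = limit / 3600
print(f"{limit:.0f} s is about {hours:.1f} h")
```

Under that reading the limit comes out at a little over 39 hours on this host, which matches sinspin's estimate, and it scales inversely with the host's speed: a faster machine gets a shorter limit for the same bound, whatever target runtime is set in the preferences.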

Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0	Message 79385 - Posted: 9 Jan 2016, 17:53:52 UTC - in response to Message 79384.
> My task runtime is set to two days (that saves a lot of network traffic and increases the crunching efficiency).
.. Well, that might be the underlying issue. The 2 day runtime was deprecated sometime last year after some type of issue with certain protocols or job types (I forget the details at this point, but it wasn't the error you're seeing but another one). I suggest scaling back your target runtime to one of the currently available values (max is 1 day). Cheers!
ID: 79385 · Rating: 0

sinspin Joined: 30 Jan 06 Posts: 29 Credit: 6,574,585 RAC: 0	Message 79386 - Posted: 9 Jan 2016, 18:49:02 UTC
I can remember the other runtime bug. The server-side validation was failing for some kinds of WUs when the WU finished. The current bug happens only on my new (very fast) machine. My other machine runs well with all kinds of tasks (two days runtime). For the new machine I use a different Rosetta@home preferences set, which now has a 1 day runtime. We will see if that solves the problem.
ID: 79386 · Rating: 0