Minirosetta 3.62-3.65

Message boards : Number crunching : Minirosetta 3.62-3.65

Link
Joined: 4 May 07
Posts: 356
Credit: 382,349
RAC: 0
Message 79225 - Posted: 15 Dec 2015, 19:28:15 UTC - in response to Message 79214.  
Last modified: 15 Dec 2015, 19:31:26 UTC

There are many limitations in BOINC, I'd guess in part due to the protocol design. For the most part it works well, but in the real world there are extremes that fall outside BOINC's 'normal' design range. I read that BOINC is based on a 'one-way polling' design in which all network requests are initiated by the client, which rules out 'push' notifications as a solution.

There is no need to "push" anything: the BOINC server can pass such instructions to the client in its reply to the next scheduler request. It is even possible to choose whether a work unit should be aborted even if it has already started, or only if it has not started yet. So this is nothing "extreme", just something that was implemented a long time ago and is already in use on other projects when needed.

Of course, I can't say whether the ancient version of BOINC that Rosetta is using is able to send such messages, but if not, then that's just one more item on the long list of reasons why they need to upgrade.
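
The behaviour described above, where abort instructions ride along on the reply to a client-initiated scheduler request, can be sketched roughly as follows. This is a toy model and not BOINC's actual code: the Task and SchedulerReply classes and the field names abort_always and abort_if_not_started are hypothetical, chosen only to show the two abort policies mentioned in the post.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    started: bool = False
    state: str = "queued"  # "queued", "running", or "aborted_by_project"


@dataclass
class SchedulerReply:
    # Hypothetical field names, invented for this sketch.
    abort_always: set = field(default_factory=set)
    abort_if_not_started: set = field(default_factory=set)


def apply_reply(tasks, reply):
    """Apply the abort instructions carried by one scheduler reply.

    Returns the names of the tasks that were aborted.
    """
    aborted = []
    for task in tasks:
        if task.state == "aborted_by_project":
            continue
        unconditional = task.name in reply.abort_always
        conditional = task.name in reply.abort_if_not_started and not task.started
        if unconditional or conditional:
            task.state = "aborted_by_project"
            aborted.append(task.name)
    return aborted


tasks = [
    Task("wu_a", started=True, state="running"),
    Task("wu_b"),
    Task("wu_c", started=True, state="running"),
]
reply = SchedulerReply(abort_always={"wu_a"}, abort_if_not_started={"wu_b", "wu_c"})
aborted_names = apply_reply(tasks, reply)
print(aborted_names)
```

In this sketch wu_c keeps running because it has already started and is only on the conditional list. No extra connection is opened; the instructions arrive only when the client next contacts the server.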

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79234 - Posted: 16 Dec 2015, 23:21:35 UTC - in response to Message 79225.  

I think there are some client side and server side settings that are available, even for our old version, that would increase the communication with our servers but would put more pressure on our database server. Other options would have to be developed within our app and server (regardless of server version). Unless I'm missing something.

Link
Joined: 4 May 07
Posts: 356
Credit: 382,349
RAC: 0
Message 79237 - Posted: 17 Dec 2015, 18:25:48 UTC - in response to Message 79234.  

It is also possible to abort WUs that have already been sent out to a client, on that client's next scheduler request. No extra communication is needed. Such WUs then appear as "Aborted by server" (or "by project", I'm not sure) in the task list.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79241 - Posted: 17 Dec 2015, 21:46:44 UTC - in response to Message 79237.  

I'm not aware of an abort option other than canceling jobs.

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 8,196
Message 79242 - Posted: 18 Dec 2015, 3:40:50 UTC - in response to Message 79241.  

I think this logic is expected to be in the client.

EXIT_UNSTARTED_LATE 200
Task was aborted due to it having not started and already past the deadline.

http://boincfaq.mundayweb.com/index.php?viewCat=3&sessionID=c5a9905b2172d67bb1c1ff12eedd0b6c

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 79243 - Posted: 18 Dec 2015, 4:10:40 UTC - in response to Message 79242.  
Last modified: 18 Dec 2015, 4:17:40 UTC


Definitely, some sort of 'remote abort' command that the server can issue to all clients holding a certain job would be ideal. In the link rjs5 shared above, there is one code, EXIT_ABORTED_BY_PROJECT 202, that sounds like it could be it. But it still doesn't seem to be something that 'pulls the plug' remotely; rather, it looks like just a classification for handling WUs returned for a job that was cancelled by the server (sadly, those work units still run their full course, from the look of it). More reading can be found here: http://boinc.berkeley.edu/dev/forum_thread.php?id=7704&sort=5
I also took a look around the BOINC dev docs and didn't see anything like this, which is a pity, as it sounds like a very practical and useful function. I'll continue looking, though.

Ideally, we would want some way to tell clients to stop crunching a certain task and bail out early (moving on to the next useful task) rather than spending their cycles on something that has been cancelled.

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 8,196
Message 79247 - Posted: 18 Dec 2015, 16:29:25 UTC - in response to Message 79243.  


Just pass the deadline date to the client job as a command-line parameter and have the client periodically check the current date against the deadline. When the deadline has passed, abort.

It would be pretty easy to add an "if ( date > deadline ) abort;" to the code to decide whether to continue. I would probably place it just before the checkpointing code, so the client can skip checkpointing if the deadline has passed.
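
A minimal sketch of that idea, assuming the deadline reaches the job as a timestamp (for example via a command-line parameter, as suggested). The function run_until_deadline and the injected clock are hypothetical and exist only so the example can run without waiting; this is not Rosetta's code.

```python
from datetime import datetime, timedelta, timezone


def run_until_deadline(deadline, work_steps, now=lambda: datetime.now(timezone.utc)):
    """Run work steps, checking the deadline just before each checkpoint.

    Returns (status, checkpoints_written).
    """
    checkpoints = 0
    for step in work_steps:
        step()
        # Check first, so a task that is already late skips the checkpoint.
        if now() > deadline:
            return "aborted_past_deadline", checkpoints
        checkpoints += 1  # stand-in for writing a real checkpoint file
    return "finished", checkpoints


# Fake clock: every work step "takes" one hour.
start = datetime(2015, 12, 18, tzinfo=timezone.utc)
deadline = start + timedelta(hours=2)
clock = {"t": start}


def advance_one_hour():
    clock["t"] += timedelta(hours=1)


result = run_until_deadline(deadline, [advance_one_hour] * 5, now=lambda: clock["t"])
print(result)
```

With the fake clock, the third step pushes the time past the two-hour deadline, so the run stops with two checkpoints written. Note that this only covers tasks that are late; it does not help with jobs the project cancels before their deadline.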

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 79249 - Posted: 18 Dec 2015, 17:47:34 UTC - in response to Message 79247.  




Actually, I'm wondering whether it might be fun to have a utility for BOINC that works like IFTTT (https://ifttt.com/) or Tasker (http://www.androidcentral.com/tasker-review-thing-you-need-do-all-things); then one could create all kinds of whiz-bang scheduling and mini-automation to one's own taste :D

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79253 - Posted: 19 Dec 2015, 23:37:19 UTC

So if anyone is aware of an actual abort option to push that info onto clients that may be holding or running a job that we'd like to abort in a timely manner, please let me know. A possible option that I am aware of is using a trickle message and coding it into our application and of course we'd also need to code up the server side logic. But I'm not sure there's an abort option that pushes that info out to clients currently in BOINC.

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 79255 - Posted: 20 Dec 2015, 3:41:52 UTC - in response to Message 79253.  
Last modified: 20 Dec 2015, 3:47:18 UTC

Yeah, it seems trickle messages are possibly the 'only' way, and there would need to be enhancements in both the server code and the R@h app code:
http://boinc.berkeley.edu/trac/wiki/TrickleApi
E.g. when the R@h app receives a trickle message, it interprets it as an 'early end' command, wraps up the job, and submits it to the server as a completed task.

I'm not too sure whether there could be client dependencies, e.g. certain client versions might implement the trickle feature differently or not at all.
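
The app-side half of that idea might look roughly like the sketch below. It is written in Python purely for illustration; poll_message is a hypothetical stand-in for whatever trickle-down call the BOINC API provides (see the TrickleApi page linked above), and the "early_end" message text is invented.

```python
def crunch(total_decoys, poll_message):
    """Produce decoys until done or until an 'early_end' message arrives.

    poll_message is a hypothetical callable returning the next pending
    server message as a string, or None when nothing is waiting.
    """
    decoys = []
    for i in range(total_decoys):
        if poll_message() == "early_end":
            # Wrap up: report what exists so far as a completed task.
            return {"status": "completed_early", "decoys": decoys}
        decoys.append(f"decoy_{i}")
    return {"status": "completed", "decoys": decoys}


# Simulated inbox: nothing for three polls, then the early-end command.
inbox = iter([None, None, None, "early_end"])
outcome = crunch(10, lambda: next(inbox, None))
print(outcome["status"], len(outcome["decoys"]))
```

The server-side logic that decides when to send such a message, and the question of which client versions deliver it, are not covered here.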

Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 10,982
Message 79317 - Posted: 28 Dec 2015, 5:26:05 UTC

Bit of a weird error in this task. It validated OK, and only finished because it reached 99 decoys, but...

14h2ld2203_fold_and_dock_SAVE_ALL_OUT_319506_998_0

bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94

-> -multiple lines of the above

...

ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 894
No heartbeat from core client for 30 sec - exiting

...

ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 894
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94
bad torsion type for JumpAtom: 1
ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 94

-> -multiple lines of the above again


sinspin
Joined: 30 Jan 06
Posts: 29
Credit: 6,574,585
RAC: 0
Message 79367 - Posted: 7 Jan 2016, 14:42:54 UTC

krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 79373 - Posted: 7 Jan 2016, 19:37:06 UTC

Sorry about that! I'm investigating.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79374 - Posted: 7 Jan 2016, 20:20:11 UTC

I've increased our default fpops limit so hopefully this will prevent such errors in the future but we can't update the limit for the jobs that have already been submitted. Thanks for the heads up on this error.

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 79376 - Posted: 7 Jan 2016, 21:56:26 UTC - in response to Message 79243.  


The Abort (202) command is used all the time on World Community Grid to cancel jobs in your buffer before they start, both on Clean Energy Phase 2 (CEP2) and Mapping Cancer Markers (MCM). I have several in my logs now, and see them all the time. I am not a software developer and don't know how it is implemented, but it works something like this: In those projects where a quorum of two is required to validate results, the two work units are sent out simultaneously to two different users under a given time limit (say 7 days). Suppose the first one comes back in a couple of days, but the second one is delayed. After 5 days or so the server sends out another copy to a "trusted computer" (or whatever it is called). I am one of them, because I leave my machines on 24/7 and have a low error rate. So it sits in my buffer for a few hours, but before I can get to it, the second machine returns its results. Therefore, they don't need me to work on it, and they send me the "Server Abort" command for that work unit.

There are plenty of people on WCG who can explain it in detail; try SekeRob first.
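
The server-side decision described above can be sketched as follows, under the stated assumptions (a quorum of two, with a spare copy issued to a reliable host when one result is late). The Copy class, the state names, and copies_to_abort are hypothetical and only illustrate the logic; they are not taken from the BOINC or WCG server code.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    host: str
    state: str  # "returned", "in_progress_started", or "in_progress_unstarted"


def copies_to_abort(copies, quorum=2):
    """Return hosts whose unstarted copy is no longer needed.

    Once `quorum` results are back, any copy that has not started yet is
    surplus and can be cancelled on that host's next scheduler request.
    """
    returned = sum(1 for c in copies if c.state == "returned")
    if returned < quorum:
        return []
    return [c.host for c in copies if c.state == "in_progress_unstarted"]


# Day 5: the second host is late, so a spare copy sits in a trusted host's buffer.
before = [
    Copy("host_1", "returned"),
    Copy("host_2", "in_progress_started"),
    Copy("trusted_host", "in_progress_unstarted"),
]
# A few hours later the late host reports, so the quorum is met.
after = [
    Copy("host_1", "returned"),
    Copy("host_2", "returned"),
    Copy("trusted_host", "in_progress_unstarted"),
]
print(copies_to_abort(before), copies_to_abort(after))
```

In this model the spare copy is flagged only after the second result arrives, and only because it has not started yet.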

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79381 - Posted: 8 Jan 2016, 19:35:21 UTC

This sounds like a scheduler function for redundant jobs, which makes sense, but to my knowledge it is not a general abort option that we can use without modifying server and/or application code.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 79382 - Posted: 8 Jan 2016, 20:31:28 UTC

I believe it is a check for host-specific messages during the scheduler requests. Big hit to database.

See <msg_to_host/> tag
https://boinc.berkeley.edu/trac/wiki/ProjectOptions
Rosetta Moderator: Mod.Sense

sinspin
Joined: 30 Jan 06
Posts: 29
Credit: 6,574,585
RAC: 0
Message 79384 - Posted: 9 Jan 2016, 9:56:16 UTC - in response to Message 79374.  

Thanks.
What does the "500000G" in the error message mean? How can I calculate from this value the number of seconds the task can run?
Is that around 39 hours? If yes, why? My task runtime is set to two days (that saves a lot of network traffic and increases crunching efficiency).
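
As a rough illustration of the arithmetic being asked about: assuming "500000G" is a floating-point-operations bound of 500000 GFLOP, and assuming the client derives a time limit by dividing that bound by its estimate of the host's speed, the numbers work out as below. Both assumptions and the helper names are hypothetical; the thread itself does not confirm the formula.

```python
def max_runtime_seconds(fpops_bound, host_flops):
    """Time limit implied by an operations bound at a given host speed."""
    return fpops_bound / host_flops


def implied_host_flops(fpops_bound, runtime_seconds):
    """Host speed at which the bound is hit after runtime_seconds."""
    return fpops_bound / runtime_seconds


FPOPS_BOUND = 500_000e9  # assumed reading of "500000G": 5e14 operations

# Speed estimate that would make the limit land at about 39 hours.
speed = implied_host_flops(FPOPS_BOUND, 39 * 3600)
print(f"{speed / 1e9:.2f} GFLOPS")

# The same bound on a slower, hypothetical 2 GFLOPS host.
slow_limit = max_runtime_seconds(FPOPS_BOUND, 2e9)
print(f"{slow_limit / 3600:.1f} h")
```

Under these assumptions a 39-hour cutoff corresponds to a speed estimate of roughly 3.56 GFLOPS, while a 2 GFLOPS host would not reach the bound within a two-day target runtime. That would fit a limit that bites only on faster machines.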

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 79385 - Posted: 9 Jan 2016, 17:53:52 UTC - in response to Message 79384.  

My task runtime is set to two days. (that saves a lot of network traffic and increase the crunching efficiency)


Well, that might be the underlying issue: the 2-day runtime was deprecated sometime last year after some issue with certain protocols or job types (I forget the details at this point, but it wasn't the error you're seeing). I suggest scaling back your target runtime to one of the currently available values (the maximum is 1 day). Cheers!

sinspin
Joined: 30 Jan 06
Posts: 29
Credit: 6,574,585
RAC: 0
Message 79386 - Posted: 9 Jan 2016, 18:49:02 UTC

I remember the other runtime bug: server-side validation was failing for some kinds of WUs once the WU had finished.

The current bug happens only on my new (very fast) machine. My other machine runs fine with all kinds of tasks (at a two-day runtime).
For the new machine I use a different Rosetta@home preferences set, which now has a 1-day runtime. We'll see if that solves the problem.




©2024 University of Washington
https://www.bakerlab.org