Message boards : Number crunching : Minirosetta 3.62-3.65
Previous · 1 . . . 4 · 5 · 6 · 7 · 8 · Next
Author | Message |
---|---|
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
there are various/many limitations with boinc i'd guess in part due to the protocol design. For most part it works well, then in the real world we have the extremes which fall our of the 'normal' design ranges of boinc i'd guess. i read that boinc is based on a 'one way polling' design where all network requests are initiated by the client, this limits 'push' notifications from being possible as a solution. No need to "push" anything, the BOINC server has the ability to tell such things to the client simply on the next scheduler request. It's even possible to choose, if a work unit should be aborted even if it already started or only if it has not started yet. So nothing "extreme", just something that's been implemented for long time ago and already in use on other projects when needed. Of course I can't say, if the ancient version of BOINC Rosetta is using is able to send such messages, but if not, than that's just one more point on the long long list of reasons why they need to upgrade. . |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
there are various/many limitations with boinc i'd guess in part due to the protocol design. For most part it works well, then in the real world we have the extremes which fall our of the 'normal' design ranges of boinc i'd guess. i read that boinc is based on a 'one way polling' design where all network requests are initiated by the client, this limits 'push' notifications from being possible as a solution. I think there are some client side and server side settings that are available, even for our old version, that would increase the communication with our servers but would put more pressure on our database server. Other options would have to be developed within our app and server (regardless of server version). Unless I'm missing something. |
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
I think there are some client side and server side settings that are available, even for our old version, that would increase the communication with our servers but would put more pressure on our database server. Other options would have to be developed within our app and server (regardless of server version). Unless I'm missing something. There's also the possibility to abort WUs, which has already been sent out to a client on the next scheduler request of this client. No extra communication. WUs like that appear than as "Aborted by server" (or project, not sure) in the task list. . |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I think there are some client side and server side settings that are available, even for our old version, that would increase the communication with our servers but would put more pressure on our database server. Other options would have to be developed within our app and server (regardless of server version). Unless I'm missing something. I'm not aware of an abort option other than canceling jobs. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
I think there are some client side and server side settings that are available, even for our old version, that would increase the communication with our servers but would put more pressure on our database server. Other options would have to be developed within our app and server (regardless of server version). Unless I'm missing something. I think this logic is expected to be in the client. EXIT_UNSTARTED_LATE 200 Task was aborted due to it having not started and already past the deadline. http://boincfaq.mundayweb.com/index.php?viewCat=3&sessionID=c5a9905b2172d67bb1c1ff12eedd0b6c |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
... Definitely, some sort of 'remote abort' command that can be issued from the server to all clients holding a certain job would be ideal.. in the link rjs5 shared above, there's one code EXIT_ABORTED_BY_PROJECT 202 that sounds like it could be it... but it still doesn't seem to be something that 'pulls the plug' remotely, rather it's just a classification of how to handle WUs returned for a job that was cancelled by the server (said Work Units still run their full course from the look of it, sadly) - more reading found here: http://boinc.berkeley.edu/dev/forum_thread.php?id=7704&sort=5 - I also took a look around the BOINC dev docs and didn't see anything like this, which is sad as it sounds like a very practical and useful function. I'll continue looking though... Ideally, would want some way to actually tell clients to stop crunching a certain task and bail out early (move on to the next useful task) rather than having their cycles spinning for something that is cancelled. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
Just pass the DEADLINE date to the client job as a command line parameter and have the client periodically check the current date against the deadline. When the deadline is passed, abort. It would be pretty easy to add the "if ( date > deadline ) abort;" to the code to determine whether to continue or not. I would probably place it just before the checkpointing code so the client could skip the checkpointing if deadline is passed. |
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
actually i'm wondering a little if it may be fun to have a utility that works like ifttt https://ifttt.com/ or tasker http://www.androidcentral.com/tasker-review-thing-you-need-do-all-things for boinc, then one can literally create all kind of wiz bang scheduling & mini automation that one prefers :D lol |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
So if anyone is aware of an actual abort option to push that info onto clients that may be holding or running a job that we'd like to abort in a timely manner, please let me know. A possible option that I am aware of is using a trickle message and coding it into our application and of course we'd also need to code up the server side logic. But I'm not sure there's an abort option that pushes that info out to clients currently in BOINC. |
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
So if anyone is aware of an actual abort option to push that info onto clients that may be holding or running a job that we'd like to abort in a timely manner, please let me know. A possible option that I am aware of is using a trickle message and coding it into our application and of course we'd also need to code up the server side logic. But I'm not sure there's an abort option that pushes that info out to clients currently in BOINC. yeah, it seemed trickle messages is possibly an 'only' way and it'd seem there would need to be enhancements in both the server codes as well as r@h app codes http://boinc.berkeley.edu/trac/wiki/TrickleApi e.g. when r@h app receives a trickle message, interprets it for an 'early end' command and it wraps up the job and submits that to the server as a completed task i'm not too sure if there could be client dependencies, e.g. that certain client versions may have different or don't have the trickle feature |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,245,383 RAC: 9,571 |
Bit of a weird error in this task. It validated ok and only finished because it reached 99 decoys but... 14h2ld2203_fold_and_dock_SAVE_ALL_OUT_319506_998_0 bad torsion type for JumpAtom: 1 |
sinspin Send message Joined: 30 Jan 06 Posts: 29 Credit: 6,574,585 RAC: 0 |
I got a lot of errors with long running *krypton* WUs. Seems that all of them stop with the same error: exceeded elapsed time limit 141525.53 (500000.00G/3.53G) some examples: https://boinc.bakerlab.org/rosetta/result.php?resultid=781463729 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463704 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463682 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463677 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463675 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463670 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463662 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463660 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463658 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463656 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463652 https://boinc.bakerlab.org/rosetta/result.php?resultid=781463647 I am not sure if other WUs wich run up to the time limit are successful or not. |
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
Sorry about that! I'm investigating. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I've increased our default fpops limit so hopefully this will prevent such errors in the future but we can't update the limit for the jobs that have already been submitted. Thanks for the heads up on this error. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
The Abort (202) command is used all the time on World Community Grid to cancel jobs in your buffer before they start, both on Clean Energy Phase 2 (CEP2) and Mapping Cancer Markers (MCM). I have several in my logs now, and see them all the time. I am not a software developer and don't know how it is implemented, but it works something like this: In those projects where a quorum of two is required to validate results, the two work units are sent out simultaneously to two different users under a given time limit (say 7 days). Suppose the first one comes back in a couple of days, but the second one is delayed. After 5 days or so the server sends out another copy to a "trusted computer" (or whatever it is called). I am one of them, because I leave my machines on 24/7 and have a low error rate. So it sits in my buffer for a few hours, but before I can get to it, the second machine returns its results. Therefore, they don't need me to work on it, and they send me the "Server Abort" command for that work unit. There are plenty of people on WCG who can explain it in detail; try SekeRob first. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
This sounds like a scheduler function for redundant jobs which makes sense but this is not a general abort option that we can use without having to modify server and/or application code to my knowledge. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I believe it is a check for host-specific messages during the scheduler requests. Big hit to database. See <msg_to_host/> tag https://boinc.berkeley.edu/trac/wiki/ProjectOptions Rosetta Moderator: Mod.Sense |
sinspin Send message Joined: 30 Jan 06 Posts: 29 Credit: 6,574,585 RAC: 0 |
I've increased our default fpops limit so hopefully this will prevent such errors in the future but we can't update the limit for the jobs that have already been submitted. Thanks for the heads up on this error. Thanks. The "500000G" in the error message, what means that? How can i calculate the amount of Seconds, the task can run, from this value? Is that around 39Hrs? If yes, why? My task runtime is set to two days. (that saves a lot of network traffic and increase the crunching efficiency) |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
My task runtime is set to two days. (that saves a lot of network traffic and increase the crunching efficiency) .. Well there might be the underlying issue, the 2 day runtime was deprecated sometime last year after some type of issue with it (I forget the details at this point, but it wasn't the error your seeing but another one) with certain protocols or job types. I suggest scaling back your target runtime to one of the currently available values (max is 1 day). Cheers! |
sinspin Send message Joined: 30 Jan 06 Posts: 29 Credit: 6,574,585 RAC: 0 |
I can remember about the other runtime bug. The validation at server side was failing for some kind of WUs when the WU was finished. The current bug happens only on my new (very fast) Machine. My other machine runs well with all kind of tasks (two days runtime). I use for the new machine a differend Rosetta@home preferences set, wich have now 1 day runtime. Will see if that solve the problem. |
Message boards :
Number crunching :
Minirosetta 3.62-3.65
©2024 University of Washington
https://www.bakerlab.org