Problems and Technical Issues with Rosetta@home

Author	Message
Shawn Volunteer moderator Project developer Project scientist Joined: 22 Jan 10 Posts: 17 Credit: 53,741 RAC: 0
Message 70158 - Posted: 28 Apr 2011, 20:18:43 UTC
We are aware that we have had some issues with bad jobs on Rosetta@home recently. We try to ensure that these bad jobs don't slip through, but they occasionally do. When that happens, your efforts to alert us to these problems are extremely important and very much appreciated. In order to ensure that we address technical issues promptly, graduate students in the Baker lab (such as myself) will be regularly monitoring this message board for such problems. This will be in addition to the help of Mod.Sense, our vigilant forum moderator who has done a lot to ensure that these projects run as smoothly as possible. I ask that you alert us to new issues in this thread so that we can find them more easily. Thank you all once again for your commitment to Rosetta@home!

Hank Barta Joined: 6 Feb 11 Posts: 14 Credit: 3,943,460 RAC: 0
Message 70164 - Posted: 29 Apr 2011, 13:32:40 UTC - in response to Message 70158.
> I ask that you alert us to new issues in this thread so that we can find them more easily.
Thank you for helping to deal with these. This morning I have seen a number of errors with a different signature. These run for 3-5 minutes before producing an error and exiting. The characteristic error seems to be:
ERROR: ct == final_atoms
An example is https://boinc.bakerlab.org/rosetta/workunit.php?wuid=382081360
thanks, hank

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2155 Credit: 12,951,138 RAC: 5,607
Message 70165 - Posted: 29 Apr 2011, 15:43:51 UTC - in response to Message 70158.
> In order to ensure that we address technical issues promptly, graduate students in the Baker lab (such as myself) will be regularly monitoring this message board for such problems. This will be in addition to the help of Mod.Sense, our vigilant forum moderator who has done a lot to ensure that these projects run as smoothly as possible. I ask that you alert us to new issues in this thread so that we can find them more easily. Thank you all once again for your commitment to Rosetta@home!
I hope these changes involve ralph@home!!!

Shawn Volunteer moderator Project developer Project scientist Joined: 22 Jan 10 Posts: 17 Credit: 53,741 RAC: 0
Message 70170 - Posted: 29 Apr 2011, 19:03:42 UTC - in response to Message 70164.
> Thank you for helping to deal with these. This morning I have seen a number of errors with a different signature. These run for 3-5 minutes before producing an error and exiting. The characteristic error seems to be: ERROR: ct == final_atoms An example is https://boinc.bakerlab.org/rosetta/workunit.php?wuid=382081360 thanks, hank
Hey Hank, thanks for letting us know. This job has been deleted and is no longer on the queue. Apparently, this was a small test job that reported failure early, and the author marked it for deletion right away, but sometimes those jobs propagate for a while anyway. In any case, you shouldn't see this particular job anymore, but if for some reason it persists, please give us an update!

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0
Message 70185 - Posted: 30 Apr 2011, 20:51:29 UTC - in response to Message 70164.
>> I ask that you alert us to new issues in this thread so that we can find them more easily.
> Thank you for helping to deal with these. This morning I have seen a number of errors with a different signature. These run for 3-5 minutes before producing an error and exiting. The characteristic error seems to be: ERROR: ct == final_atoms An example is https://boinc.bakerlab.org/rosetta/workunit.php?wuid=382081360 thanks, hank
I thought these were tested on RALPH before being brought over to Rosetta? If that is the case, then this job should not have slipped through.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2155 Credit: 12,951,138 RAC: 5,607
Message 70215 - Posted: 2 May 2011, 10:20:31 UTC - in response to Message 70185.
> I thought these were tested on RALPH before being brought over to Rosetta? If that is the case, then this job should not have slipped through.
RALPH has had big problems since December: few WUs, no communication from the team, etc. I hope this situation changes. If you need our help to "control" the code, please give us some information, news, details, etc.

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0
Message 70224 - Posted: 2 May 2011, 17:42:25 UTC - in response to Message 70215.
>> I thought these were tested on RALPH before being brought over to Rosetta? If that is the case, then this job should not have slipped through.
> RALPH has had big problems since December: few WUs, no communication from the team, etc. I hope this situation changes. If you need our help to "control" the code, please give us some information, news, details, etc.
We're in the process of upgrading RALPH. The current server is very unstable. We do need to be far better at providing information about new projects/jobs that we test on RALPH, and I'll stress that point to the lab members. The RALPH WU flow will depend on whether or not we have new jobs to test. Many jobs have already been tested.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2155 Credit: 12,951,138 RAC: 5,607
Message 70225 - Posted: 2 May 2011, 18:34:17 UTC - in response to Message 70224.
> We're in the process of upgrading RALPH. The current server is very unstable. We do need to be far better at providing information about new projects/jobs that we test on RALPH and I'll stress that point to the lab members. The RALPH WU flow will depend on whether or not we have new jobs to test. Many jobs have already been tested.
Thanks for the information :-)

Speedy Joined: 25 Sep 05 Posts: 163 Credit: 841,187 RAC: 19
Message 70268 - Posted: 6 May 2011, 22:24:30 UTC Last modified: 6 May 2011, 22:26:50 UTC
420656625 FOLD_N_DOCK_dagk_D2symm got validate state Invalid after CPU time 2010.416; the run time was meant to be 3 hours. The corresponding work unit, number 420591203, got validate state Invalid after CPU time 3843.709 (has debug message).
I posted the above message in minirosetta 2.17 on 06/05/11.
Edit: Added clickable links.
Have a crunching good day!!

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0
Message 70279 - Posted: 8 May 2011, 7:04:57 UTC
I am getting out-of-memory error codes on these tasks. That should not be possible, as I have 3.24 GB of RAM.
FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_9746_0
FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_1528_0
FOLD_N_DOCK_dagk_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26520_9259_1
Error message:
- Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7C812AFB

Zydor Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0
Message 70294 - Posted: 9 May 2011, 17:13:00 UTC
Couple of possible problem WUs for you - they are 1 hour WUs, ran for around 25-30 mins, and failed to progress beyond 2-3% completion. Other 1hr ones had a consistent completion percentage roughly in line with time done so far, so I aborted both.
https://boinc.bakerlab.org/rosetta/result.php?resultid=421246729
https://boinc.bakerlab.org/rosetta/result.php?resultid=421246619
Regards Zy

Ray Wang Joined: 9 Mar 09 Posts: 8 Credit: 230,454 RAC: 0
Message 70295 - Posted: 9 May 2011, 18:47:34 UTC - in response to Message 70279.
> out of memory error codes on these tasks, that is not possible as I have 3.24GB of RAM. FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_9746_0 FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_1528_0 FOLD_N_DOCK_dagk_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26520_9259_1 Error message: - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7C812AFB
Hi Speedy and Greg_BE, I am Ray, a graduate student in the Baker lab. I will be taking care of the issues caused by "FOLD_N_DOCK" related jobs. As Greg_BE said, it is really not likely that these jobs could use up all 3.24 GB of RAM. Thank you all for letting us know about the problems, as well as for your contribution to Rosetta@home!!!

Zydor Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0
Message 70297 - Posted: 9 May 2011, 19:59:34 UTC Last modified: 9 May 2011, 20:00:53 UTC
Could someone take a peek at the list for my laptop? I'm not new to BOINC, but am a total newb at Rosetta, so I will at present miss the obvious until my feet are under the table. (Running a few WUs to get used to Rosetta, ready for the Pentathlon in a day or so.)
https://boinc.bakerlab.org/rosetta/results.php?hostid=1441160&offset=20
I made a post two up re slow ones, but I'm wondering if it's a bad batch. I'm running two from the same date/time batch, and they are slow as well (18-19% done at circa 2hrs45min for 1 hour WUs). The two running at present are Task IDs 421246725 and 421246743. I am starting to wonder if they are 1hr WUs at all; maybe there are longer ones in that batch. There were 1hr ones I did previously in the same batch, so it's a bit strange. Ignore the laptop preference as set at present; it was set for 1hr when that batch was downloaded.
Regards Zy

Elizabeth Joined: 24 Nov 06 Posts: 1 Credit: 6,905 RAC: 0
Message 70298 - Posted: 9 May 2011, 20:18:29 UTC - in response to Message 70294.
> Couple of possible problem WUs for you - they are 1 hour WUs, ran for around 25-30 mins, and failed to progress beyond 2-3% completion. Other 1hr ones had a consistent completion percentage roughly in line with time done so far, so I aborted both. https://boinc.bakerlab.org/rosetta/result.php?resultid=421246729 https://boinc.bakerlab.org/rosetta/result.php?resultid=421246619 Regards Zy
Hi Zy, this job is currently returning models at a reasonable rate, but we're looking into the problem. Thanks for the heads up!

Adam Gajdacs (Mr. Fusion) Joined: 26 Nov 05 Posts: 14 Credit: 3,239,827 RAC: 57
Message 70299 - Posted: 9 May 2011, 22:14:11 UTC - in response to Message 70295. Last modified: 9 May 2011, 22:17:56 UTC
> I am Ray, a graduate student in the Baker lab. I will be taking care of the issues caused by "FOLD_N_DOCK" related jobs. As Greg_BE said, this is really not likely that these jobs could run out of all those 3.24GB of RAM.
They definitely can. I just noticed that one of my two rigs started thrashing like hell. It turned out that a single one of these FOLD_N_DOCK WUs (https://boinc.bakerlab.org/rosetta/result.php?resultid=421379634) was using 1.45 GB of virtual memory on a system with only 1 GB of physical memory; it was effectively running from the disk. The other core was idle because there was no memory left to run another WU on it, but if there had been, the total would have been about 3 GB.

Zydor Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0
Message 70300 - Posted: 9 May 2011, 22:29:51 UTC
Quick note to close the loop on my posts above. I've ended up having to do a detach/attach (after aborting held WUs) on my machines. Sorry about the aborts, but I felt I had no choice. On restart, the problem has disappeared, and at present at least, all appears to be progressing normally. I have yet to complete one since the detach etc, but all three machines appear to be behaving now. No idea of the reason; strange that it hit all three machines. No hangover from other worries elsewhere as far as I know, as things have been stable in my recent travels around BOINC. Anyway .... for what it's worth, the detach etc resolved my problems, absolutely no idea why though :)
Regards Zy

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
Message 70302 - Posted: 9 May 2011, 23:17:01 UTC Last modified: 10 May 2011, 0:15:40 UTC
Zydor, your expectations of what a 1hr work unit is are not realistic for R@h. And you will only confuse yourself further by modifying the runtime preference frequently. Runtime preference is actually a runtime characteristic, so your preference at time of download is not relevant. Here are a couple of threads that discuss how the runtime works:
Discussion on increasing the default run time
Newbie Q&A: discussion on runtime under Q: I am on a dial-up connection, how can I use less modem time?
Newbie Q&A: Q: Progress Percent not advancing?
Newbie Q&A: Q: I'm familiar with SETI and BOINC already, but what should I know about Rosetta?
I'm sure you were trying to complete and return work as quickly as possible for immediate credit recognition during the Pentathlon... unfortunately, when you do that, some of the other nice things, such as accurate progress % and consistently completing within such a limited timeframe, go out the door. Each task must complete at least one model. For some tasks you will see a model every 5 minutes or so; for others, it can take several hours. So, not all tasks are going to complete within your one hour target, and that is normal and to be expected. You might want to start a thread to discuss suggested settings for Pentathlon participants.
Rosetta Moderator: Mod.Sense
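
The runtime behaviour described in Message 70302 can be sketched roughly as follows. This is a simplified illustration, not Rosetta's actual code: the function name `run_task`, the per-model durations, and the rule of stopping once elapsed time reaches the target are all assumptions made for the example. The only rules taken from the post are that a task always completes at least one model and that the runtime preference is consulted while the task runs, not at download time.

```python
def run_task(model_minutes, target_minutes):
    """Return (models_completed, elapsed_minutes) for one hypothetical task.

    model_minutes: time each successive model would take (made-up numbers).
    target_minutes: the user's runtime preference, read while the task runs.
    """
    elapsed = 0.0
    completed = 0
    for minutes in model_minutes:
        # A task must finish at least one model, so the target is only
        # checked once a model has already been completed.
        if completed >= 1 and elapsed >= target_minutes:
            break
        elapsed += minutes
        completed += 1
    return completed, elapsed


# Fast models: about 5 minutes each, with a 60-minute target.
print(run_task([5] * 100, 60))    # (12, 60.0)
# Slow models: 150 minutes each, with the same 60-minute target.
print(run_task([150] * 100, 60))  # (1, 150.0)
```

Under these assumptions, the second task runs two and a half times longer than the one hour target simply because its first model takes that long, which matches the "normal and to be expected" overruns described in the post.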

Zydor Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0
Message 70303 - Posted: 9 May 2011, 23:27:16 UTC Last modified: 9 May 2011, 23:35:47 UTC
Spoke too soon :) Another for you, from the laptop - it has had a total attach/detach and clean out, so this one started on a pristine clean default setup, no tweaks or o/c - but with some more detail this time, as I was trying to watch out for it.
Task ID 421414504 finished in normal time. Task ID 421414503 had started at exactly the same time as the one that finished, except it had only completed 20% by the time the one above finished. It also was using (and still is) 270 MB of memory. That figure has slowly risen all the time it has run; not fast, but it has steadily risen (and still rises at a rate of about 0.5 MB per minute) - no wild fluctuation (barring the odd 100 KB or so), just a steady, inexorable rise. Memory leak? Blasé phrase, but not impossible. The one that went through ok (421414504) was using 63 MB of memory when it finished. The replacement task that has started began using 43 MB of memory; too early to say if that's a bad one as well.
Good luck on the hunt .... fingers crossed you nail it tomorrow with the Pentathlon coming up.
EDIT: Just seen your post above ... it's not Pentathlon related as such; when that starts, the longer the WU the better for me - less messing around. The short ones were selected only because the option was there and I wanted to do some quick ones to check all was well before the event starts tomorrow night, my not being used to Rosetta. Point noted, I will change it to the default 3 hours for now. I can start a thread re the Pentathlon if it helps you, but I'm not knowledgeable enough yet on Rosetta to comment or set it up properly. I'll give it a whirl if you want me to ... ??
Regards Zy

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
Message 70304 - Posted: 10 May 2011, 0:36:40 UTC Last modified: 10 May 2011, 0:38:19 UTC
Your observations are consistent with what I was saying... different tasks run rather differently. They have different memory requirements, they need different amounts of CPU time to complete a model, and they are attempting different approaches to solving the problem so that the "better" approach can be revealed.
As for memory, yes, as a given model progresses, it often will gradually use more and more memory. Once the model is completed, the memory is released and, if runtime preference permits, another model is begun... and then from that new local low in memory usage it will gradually use more and more as the model progresses.
As for creating a new thread, what I was suggesting was to create a thread asking the questions about what traits you'd like to optimize or minimize for the Pentathlon and see what suggestions others may have for you.
Rosetta Moderator: Mod.Sense
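
One way to tell the rise-and-release pattern described in Message 70304 from a genuine leak is to compare the low points reached right after each model finishes, rather than the steady climb within a model. The following is a minimal sketch, assuming you have logged memory samples (in MB, at regular intervals) yourself; the function names, the sample data, and the 10 MB tolerance are hypothetical and are not part of BOINC or Rosetta.

```python
def local_minima(samples):
    """Return the values that are lower than both of their neighbours."""
    return [
        samples[i]
        for i in range(1, len(samples) - 1)
        if samples[i] < samples[i - 1] and samples[i] < samples[i + 1]
    ]


def looks_like_leak(samples, tolerance_mb=10):
    """Guess whether the post-model low points keep climbing.

    Returns False when fewer than two low points are visible, because
    a single model's steady rise says nothing either way.
    """
    lows = local_minima(samples)
    if len(lows) < 2:
        return False
    return lows[-1] - lows[0] > tolerance_mb


# Made-up samples: memory is released back to roughly the same low each time.
sawtooth = [43, 120, 200, 270, 45, 130, 210, 280, 44, 125, 205]
# Made-up samples: each new model starts from a noticeably higher low.
climbing = [43, 120, 200, 270, 90, 170, 250, 330, 140, 220, 300]

print(looks_like_leak(sawtooth))  # False
print(looks_like_leak(climbing))  # True
```

A task that is still on its first model, like the one reported in Message 70303, has no low points to compare yet, so this check deliberately returns False for it rather than guessing.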

Zydor Joined: 4 May 11 Posts: 7 Credit: 12,648 RAC: 0
Message 70305 - Posted: 10 May 2011, 0:45:33 UTC
Re Thread - Okie Doke, will do
Regards Zy