Message boards : Number crunching : minirosetta 2.14
Author | Message |
---|---|
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
I'm still getting the occasional failure with the ProteinInterfaceDesign task and its "patchdock" file after only a few seconds of processing - the example task, whose output is posted below, was created today (June 6th) Task 312328443 ERROR: Cannot open patchdock file: 1fAc_2vg9.patchdock ERROR:: Exit from: src/protocols/ProteinInterfaceDesign/read_patchdock.cc line: 101 |
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Long running job just finished - 28894 seconds of CPU, one decoy finished. Killed by watchdog. It continued to take checkpoints throughout the run. SegFault on completion. I have several other jobs across a few systems which appear to be heading down the same path. All seem to have similar task names: rs_stg0_lrlx_t"xyz"__casp8_SAVE_ALL_OUT Output follows: Task ID 344004739 Name rs_stg0_lrlx_t447__casp8_SAVE_ALL_OUT_20806_3438_0 Workunit 314064997 Created 6 Jun 2010 19:33:49 UTC Sent 6 Jun 2010 20:11:19 UTC Received 7 Jun 2010 11:25:52 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 1290176 Report deadline 16 Jun 2010 20:11:19 UTC CPU time 28896.33 stderr out <core_client_version>6.10.56</core_client_version> <![CDATA[ <stderr_txt> [2010- 6- 6 22:21:50:] :: BOINC:: Initializing ... ok. [2010- 6- 6 22:21:50:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rs_stg0_lrlx_t447__casp8.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. 
# cpu_run_time_pref: 14400 BOINC:: CPU time: 28894.3s, 14400s + 14400s[2010- 6- 7 6:24:23:] :: BOINC InternalDecoyCount: 0 ====================================================== DONE :: 1 starting structures 28894.3 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== called boinc_finish SIGSEGV: segmentation violation Stack trace (25 frames): [0x992e4a3] [0x9958378] [0xf77eb400] [0x8c1ac97] [0x8e26032] [0x8e2646e] [0x93d1812] [0x93d3094] [0x93d511e] [0x93d1195] [0x80dac5e] [0x80d8f91] [0x810386e] [0x858db3f] [0x815324a] [0x81755cf] [0x80ace21] [0x85379f7] [0x812b7aa] [0x812c94d] [0x878038b] [0x82ff325] [0x804989b] [0x99b42dc] [0x8048121] Exiting... </stderr_txt> ]]> Validate state Valid Claimed credit 179.245358151844 Granted credit 95.9047142739811 application version 2.14 |
Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Here is the output from a second task - of the same "family" as the one reported in my previous post - differences: this one ended on its own after running an hour over the preferred time, not killed by watchdog, and no SegFault (could the SegFault have been caused by watchdog killing the task?) 20572 CPU seconds - only 2 decoys. (both tasks were declared as "success" and both generated reasonable credit) Task ID 344034237 Name rs_stg0_lrlx_t436__casp8_SAVE_ALL_OUT_20802_3787_0 Workunit 314090933 Created 6 Jun 2010 23:01:21 UTC Sent 6 Jun 2010 23:13:10 UTC Received 7 Jun 2010 11:47:34 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 1290176 Report deadline 16 Jun 2010 23:13:10 UTC CPU time 20572.51 stderr out <core_client_version>6.10.56</core_client_version> <![CDATA[ <stderr_txt> [2010- 6- 7 1: 1:57:] :: BOINC:: Initializing ... ok. [2010- 6- 7 1: 1:57:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rs_stg0_lrlx_t436__casp8.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. 
# cpu_run_time_pref: 14400 ====================================================== DONE :: 2 starting structures 20572.2 cpu seconds This process generated 2 decoys from 2 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 127.612292738641 Granted credit 97.0638812927757 application version 2.14 |
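The two stderr logs above are consistent with a watchdog cutoff at the runtime preference plus a fixed grace period. The sketch below is only a reading of the log line `CPU time: 28894.3s, 14400s + 14400s`; the function name and the 14400-second grace value are assumptions drawn from that line, not from the minirosetta source.

```python
# Hypothetical helper: checks a task's CPU time against an assumed
# watchdog limit of (runtime preference + grace period), both in seconds.
def exceeds_watchdog_limit(cpu_seconds, run_time_pref=14400, grace=14400):
    return cpu_seconds > run_time_pref + grace

# CPU times reported in the two posts above.
first_task = exceeds_watchdog_limit(28894.3)   # past 28800 s, watchdog ended it
second_task = exceeds_watchdog_limit(20572.2)  # under 28800 s, ended on its own

print(first_task, second_task)
```

Under that assumption, only the first task crosses the limit, which matches the report that the second one finished without watchdog intervention.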
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
3 tasks died recently with errors. These two just say "Maximum elapsed time exceeded", with no CPU time shown and no debug output: T0561_whole_SAVE_ALL_OUT_IGNORE_THE_REST_8-17_21314_677_0 T0561_whole_SAVE_ALL_OUT_IGNORE_THE_REST_3-6_21314_594_0 This one, int2_centerfirst2b_1fAc_2qwt_ProteinInterfaceDesign_23May2010_21231_230_0, is the patchdock error. |
VO Send message Joined: 4 Nov 05 Posts: 7 Credit: 3,250,754 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=313445716 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=313445716 (Notes for Project Team) Validation errors, with no apparent cause on: rb_06_02_188_708_t000__t0571_IGNORE_THE_REST_04_05_21338 Resends all failed as well. Rosetta Moderator: Mod.Sense |
cnick6 Send message Joined: 30 May 06 Posts: 29 Credit: 12,597,623 RAC: 0 |
I have one work unit that is crashing the minirosetta214 executable in Windows and Linux: Windows TASKID (with debug info): 345248849 Linux TASKID: 345967972 Workunit: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=315213770 WU Name: rb_06_10_202_765_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21404_3249_1 |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
MiniRosetta 2.14 memory use seems extremely high. I noticed another process in the "Waiting for memory" state, something I don't believe I have seen before. Upon investigation, MiniRosetta was using 800+k. Is this intentional, or is something not being free'd? Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
adrianxw, some tasks use protocols that do require more memory. These are only sent to machines that have more than the minimum memory required. I see both of your machines are reporting 4 CPUs and 2GB of memory. That's only 512MB per CPU, but I believe the check for high-memory tasks is not sensitive to the number of CPUs. So if you happened to get several high-memory tasks at the same time, that would explain the waiting for memory message. You mentioned seeing Mini using more than 800... I assume you meant MB :) Was that just one task or were several running at the same time with that usage? Task names would be helpful. Rosetta Moderator: Mod.Sense |
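The per-CPU arithmetic in the post above can be sketched as follows. This is a rough illustration only: the helper names are hypothetical, the 800 MB figure is the usage reported earlier in the thread, and the check ignores memory used by the operating system, other projects, and BOINC's own memory preferences.

```python
# Hypothetical helpers illustrating the memory arithmetic from the post.
def memory_per_cpu_mb(total_mb, cpus):
    return total_mb / cpus

def tasks_fit(task_mb_list, total_mb):
    # True if the listed tasks fit in total memory (nothing else counted).
    return sum(task_mb_list) <= total_mb

per_cpu = memory_per_cpu_mb(2048, 4)           # 2 GB shared by 4 CPUs
one_big_task = tasks_fit([800], 2048)          # a single 800 MB task
three_big_tasks = tasks_fit([800] * 3, 2048)   # three at once need 2400 MB

print(per_cpu, one_big_task, three_big_tasks)
```

Even under these generous assumptions, three such tasks at once would not fit in 2 GB, which is the situation the moderator describes.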
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
The job is finished and gone now, so I don't know which it was. This one is running right now, and has ~500M (yes, M, that dates me a bit huh?). There are no processes waiting on here at the moment. Rosetta has quite a high work share value on both my machines so it crunches them fairly quickly, I wouldn't like to guess which wu it was that was causing the event yesterday. As I recall, it was the only Rosetta wu on the machine at that time, it was Climate Prediction that was in the "Waiting for memory" state, not Rosetta. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
I have noticed the same "waiting for memory" messages several times on my system in recent weeks and I have got one right now. For me they only pop up with Rosetta CASP9 WUs with huge protein structures. Looking at your task history, adrianxw, you were probably processing rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_1071_0. The one eating up my memory today is rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_19046_0 from the same batch as yours. There is nothing to worry about with these as they free up the memory again as soon as they are completed. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This errored after 22 sec. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=316550658 Sat 19 Jun 2010 11:20:51 EST|rosetta@home|Output file rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_15952_0_0 for task rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_15952_0 absent <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process got signal 11 </message> |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Another error. This ran for 1hr 59min; I have a four-hour run time set, with projects switching every two hours. It ran the first two hours O.K.; when it restarted, it failed. eed_4_eed_1fm4_ProteinInterfaceDesign_7Jun2010_21383_177_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=316587259 <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 14400 SIGSEGV: segmentation violation Stack trace (11 frames): [0x992e4a3] [0x9958378] [0xffffe500] [0x84bd3da] [0x882dfff] [0x812b7aa] [0x812c94d] [0x878038b] [0x8049a2a] [0x99b42dc] [0x8048121] Exiting... </stderr_txt> |
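The three numbers in the message `process exited with code 193 (0xc1, -63)` are the same byte shown three ways. A minimal sketch of the conversion, assuming the client prints the exit status as decimal, hex, and a signed 8-bit value (the function name is hypothetical):

```python
# Hypothetical helper: render an exit status as hex and as a signed byte.
def describe_exit_code(code):
    unsigned = code & 0xFF                       # keep the low 8 bits
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return hex(unsigned), signed

hex_form, signed_form = describe_exit_code(193)
print(hex_form, signed_form)
```

This only shows that the figures agree with each other; it says nothing about what caused the crash.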
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 34 |
Same again... rb_06_21_217_781_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21462_3794 ... two other projects stopped "Waiting for memory" 882M in use. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
cnick6 Send message Joined: 30 May 06 Posts: 29 Credit: 12,597,623 RAC: 0 |
Can one of the mods please look into the low-credit issues lately? See this thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5366 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Moderators do not have access to any credit information beyond what you see on the task and WU. The Project Team maintains all of the BOINC databases, etc., but they are pretty busy with CASP at the moment. I can only assure you that credit is granted based on models completed, and that hooks were placed in the code to report CPU time on a per model basis so that specific protocols or proteins that have a high variability in CPU time between models can be reviewed in more detail. Generally when credit is that dramatically low, it is the result of a long running model. In other words, if models typically take 10 minutes of CPU time, and your machine runs for an hour and has completed 6 models, and then the 7th takes 3 hours (or more and perhaps is eventually ended by the watchdog) then the credit granted is going to be on par with 70 minutes of processing rather than the 4 hours that was actually spent. This is why there is a thread for reporting long-running models. Over time, as revisions are made and new protocols become accepted for future use, changes are found which reduce the number of such outlying long-running models. But if a new protocol is not found to produce better results than prior methods, it will not be run in the future anyway, and so tracking down the 1% outliers ends up consuming resources that could be invested into developing another new protocol. Rosetta Moderator: Mod.Sense |
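The example in the post above works out as follows. This is only a sketch of that arithmetic, assuming credit scales with the number of models completed at the typical per-model CPU time; it is not the project's actual credit formula, and the function name is hypothetical.

```python
# Hypothetical helper: CPU minutes that the granted credit corresponds to,
# if every completed model is valued at the typical per-model time.
def credited_minutes(models_completed, typical_minutes_per_model):
    return models_completed * typical_minutes_per_model

# Six models at 10 minutes each, then a seventh that takes 3 hours.
model_minutes = [10] * 6 + [180]

actual_minutes = sum(model_minutes)                      # time really spent
credit_equivalent = credited_minutes(len(model_minutes), 10)

print(actual_minutes, credit_equivalent)
```

The machine spends 240 minutes but is credited as if it had spent 70, which is the gap between claimed and granted credit described in the post.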
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Three recent failures on W7 Task 347583183 ab_06_19_d000_top_broker_server_models_21455_46857_0 Task 347583182 ab_06_19_d000_top_broker_server_models_21455_46856_0 Task 347583171 ab_06_19_d000_top_broker_server_models_21455_46845_0 all failed as follows Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached ERROR: Option file open failed for: ab_06_19_d000_top_broker_server_models.flags </stderr_txt> ]]> |
billy ewell 1931 Send message Joined: 30 Mar 07 Posts: 14 Credit: 6,899,522 RAC: 0 |
Task ID 349192260: This is a ProteinDesignInterface unit that consumed 7.5 hours of cpu time on an Intel quad 2.66 with 4 gigs of memory. There were 98 starting structures, 98 attempts and 98 decoys resulting. It really irritates me to see the scoring results when a claimed credit amount of 130.12 was reduced to a granted amount of 36.82. This seems to be quite a COMMON result when processing the PDI work units. I am NOT a points chaser but a dedicated supporter of research science and its potential impact for mankind and the world. BUT I still wonder if perhaps 10% or more of my fairly high-quality computing power is going to waste. Three of my computers, an i7 930 and two 9400 2.66 quads, were purchased and run 24/7 solely in support of projects like rosetta and other BOINC research initiatives. Am I terribly wrong here or do I have a legitimate concern as a dedicated and loyal supporter of Rosetta and the current CASP? My account is 160868 I appreciate so very much the dedicated professional designers of this project and the loyal crunchers who particularly make it possible. Bill: Austin, Texas USA |
mhhall Send message Joined: 28 Mar 06 Posts: 7 Credit: 10,193,127 RAC: 5 |
Hi folks, My system is currently executing WU 317305089. BOINC is showing the following properties, which would seem to indicate the process is stuck and not checkpointing properly. CPU Time at last checkpoint: 13:18:13 CPU Time : 15:20:20 Fraction done: 98.925% Would hate to kill a job so close to completion, but I've got to wonder if this is really going to complete. |
Jochen Send message Joined: 6 Jun 06 Posts: 133 Credit: 3,847,433 RAC: 0 |
Would hate to kill a job so close to completion, Does this task still create CPU-load? If yes, leave it running; if not, try restarting the BOINC-manager (make sure the client processes are stopped as well). If it still doesn't create CPU-load after restarting the manager, you should abort it. cu Joe |
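The gap between the two CPU times quoted from the BOINC task properties can be computed directly. A minimal sketch, assuming both values are in `HH:MM:SS` form as shown in the post; the helper name is hypothetical.

```python
# Hypothetical helper: convert an HH:MM:SS string to seconds.
def hms_to_seconds(text):
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

cpu_time = hms_to_seconds("15:20:20")
last_checkpoint = hms_to_seconds("13:18:13")

# CPU seconds accumulated since the last checkpoint was written.
gap_seconds = cpu_time - last_checkpoint
print(gap_seconds)
```

That is a little over two hours of CPU time without a checkpoint. Whether that means the task is stuck depends on the protocol, so checking for CPU load as suggested above is still the first step.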
©2024 University of Washington
https://www.bakerlab.org