Message boards : Number crunching : New WUs failing
Author | Message |
---|---|
Trotador Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 1 |
These are all failing: DESIG_HYBRID_1_... ACT_1XA4_HYBRID_... |
Trotador Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 1 |
These are failing at a high rate on Linux hosts: PF12228.7_nojumps_aivan and PF12228.7_jumps_aivan. They do not respect the default computing time, and when they finish they fail with an error in the output file and a "Stream information inconsistent" message. ====================================================00 <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process got signal 11</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu @PF12228.7.nojumps.flags -in:file:boinc_wu_zip PF12228.7.nojumps.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2966551 Starting watchdog... Watchdog active. BOINC:: CPU time: 43210.5s, 14400s + 28800s[2018- 8-31 7: 4:54:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 43210.5 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 07:04:54 (10851): called boinc_finish(0) pure virtual method called terminate called without an active exception </stderr_txt> ]]> |
Trotador Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 1 |
All DESIG_HYBRID_1_... WUs keep crashing on all systems. <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255)</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_DESIG_HYBRID_1_ACT_1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_DESIG_HYBRID_1_ACT_1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1125040 Starting watchdog... Watchdog active. ERROR: Unable to open weights/patch file. None of (./)stage1.wts or (./)stage1.wts.wts or minirosetta_database/scoring/weights/stage1.wts or minirosetta_database/scoring/weights/stage1.wts.wts exist ERROR:: Exit from: src/core/scoring/ScoreFunction.cc line: 2748 BACKTRACE: [0x5a62de6] [0x4370f34] [0x4380a82] [0x439a000] [0x439bbb5] [0x3698c75] [0x373a7d2] [0x3740317] [0x378c123] [0x378d621] [0x382ba98] [0x382b5a3] [0x413771] [0x5fff8cc] [0x610b97] BOINC:: Error reading and gzipping output datafile: default.out 10:16:13 (1370): called boinc_finish(1) </stderr_txt> ]]> |
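The error above names four places where the weights file was looked for. As a rough illustration only, the same lookup order can be reproduced with a few lines of Python. This is a sketch built from the error text, not from the Rosetta source; the function names `candidate_weight_paths` and `find_weights` are hypothetical.

```python
from pathlib import Path


def candidate_weight_paths(name, database="minirosetta_database"):
    """Return the four locations listed in the error message, in order.

    Assumption: the order is taken from the error text only
    (local file, local file + ".wts", database file, database file + ".wts").
    """
    local = Path(".") / name
    in_database = Path(database) / "scoring" / "weights" / name
    paths = []
    for base in (local, in_database):
        paths.append(base)
        paths.append(Path(str(base) + ".wts"))
    return paths


def find_weights(name, database="minirosetta_database"):
    """Return the first candidate that exists as a file, or None."""
    for path in candidate_weight_paths(name, database):
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    for path in candidate_weight_paths("stage1.wts"):
        print(path, "exists" if path.is_file() else "missing")
```

Run from a slot directory of a failing task, a check like this would show whether `stage1.wts` is missing from the unpacked database or from the work unit's own input files.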
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,013,988 RAC: 6,289 |
It looks like you have 2 machines running. One running Ubuntu 18.04 (457 errors) and the other running Ubuntu 16.04 (4 errors). The latest failing Ubuntu 18.04 tasks seem to be unable to open files. The error messages seem to say the files do not exist or are short. Do you have enough free space on the disk partition that BOINC uses? The 16.04 machine seems fine. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Sometimes files go "missing" due to anti-virus as well, so another thing to check. Rosetta Moderator: Mod.Sense |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,013,988 RAC: 6,289 |
Sometimes files go "missing" due to anti-virus as well, so another thing to check. I think that is true for Windows machines, but I have not heard of the problem on Linux. On Linux machines, a more common problem is the automatic partitioning of the disk. The default partitioning does not allocate enough space to the partition where BOINC puts its directory. |
Trotador Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 1 |
Hi, thanks for your post. The Ubuntu 18.04 host is the one that has been crunching Rosetta almost full time for the past half month. It needed the "hack" to allow Rosetta 4.07 application WUs to run without crashing all units. I don't think it is related to the errors. No antivirus and enough free disk space; we know Rosetta is good at detecting insufficient disk space when downloading. So I do not think these are the issues. Going to the types of units: - I have not found anyone who has successfully crunched a DESIG_HYBRID_1_... or an ACT_1XA4_HYBRID_... WU; all my wingmen, Linux or Windows, errored as well. It looks like a WU fault to me. I'm aborting them whenever I find one. - Some of the PFxxxxx.x_(no)jumps_aivan_... units (e.g. PF12228.7_jumps_) are a problem for Linux systems, more precisely Ubuntu; on Windows hosts they seem to crunch without problems. - PF12228.7_jumps_... units, for example, do not respect the crunching time (I've tried with the default and a 4-hour duration). They apparently finish crunching OK, but when the unit closes it is declared invalid in some cases and no credit is awarded. I've seen this on many hosts from other crunchers. Example of valid unit : https://boinc.bakerlab.org/result.php?resultid=1025270853 =====================================================================0 <core_client_version>7.9.3</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu @PF12228.7.jumps.flags -in:file:boinc_wu_zip PF12228.7.jumps.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2826017 Starting watchdog... Watchdog active. BOINC:: CPU time: 43755.4s, 14400s + 28800s[2018- 8-31 12:26:34:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. 
Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 43755.4 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 12:26:34 (12361): called boinc_finish(0) </stderr_txt> ]]> Example of invalid unit : https://boinc.bakerlab.org/result.php?resultid=1025347236 =================================================== <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process got signal 11</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu @PF12228.7.jumps.flags -in:file:boinc_wu_zip PF12228.7.jumps.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2797347 Starting watchdog... Watchdog active. BOINC:: CPU time: 43599.6s, 14400s + 28800s[2018- 8-31 21:42:23:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 43599.6 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 21:42:23 (15153): called boinc_finish(0) </stderr_txt> ]]> |
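Each of the PF logs above contains a line of the form `BOINC:: CPU time: 43755.4s, 14400s + 28800s`. The 28800 s matches the `-cpu_run_time 28800` flag in the command line. Reading the 14400 s as an extra grace period before the watchdog ends the task is an assumption, not something the logs state. Under that assumption, a minimal sketch for measuring how far a task ran past the combined limit (the helper name `watchdog_overrun` is hypothetical):

```python
import re

# Matches e.g. "BOINC:: CPU time: 43755.4s, 14400s + 28800s"
CPU_LINE = re.compile(r"BOINC:: CPU time: ([\d.]+)s, (\d+)s \+ (\d+)s")


def watchdog_overrun(stderr_text):
    """Return CPU seconds used beyond the sum of the two logged limits.

    Returns None when the log has no matching line.
    Assumption: the two integers are a grace period and the requested
    run time, and their sum is the cutoff the watchdog enforces.
    """
    match = CPU_LINE.search(stderr_text)
    if match is None:
        return None
    used = float(match.group(1))
    limit = int(match.group(2)) + int(match.group(3))
    return used - limit


if __name__ == "__main__":
    line = "BOINC:: CPU time: 43755.4s, 14400s + 28800s[2018- 8-31 12:26:34:]"
    print(round(watchdog_overrun(line), 1))
```

For the three logs in this thread the CPU times are 43210.5 s, 43755.4 s and 43599.6 s against a combined 43200 s, so every one of these tasks ran to the cutoff instead of stopping near the requested 8 hours.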
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Looks to be similar to the problem I just reported for Windows 10 with PF... tasks. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,582,094 RAC: 7,913 |
The summer holidays are over. I hope the R@H team starts to review the code... and participates in the forum!! |
Trotador Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 1 |
cis_paper_simulation_1 units failing in all systems, mine and also wingmen's <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255)</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu @cis_paper_simulation_1_2_4_5_nmet.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3695217 ERROR: Illegal value for integer option -cyclic_peptide:n_methyl_positions specified: 1_2_4_5 </stderr_txt> ]]> |
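The last error says that `1_2_4_5` is not a legal value for an integer option. The flags file name (`cis_paper_simulation_1_2_4_5_nmet.flags`) suggests the underscore-joined string may have been passed where separate integers were expected, but that is a guess. A small sketch of the kind of check that rejects such a value; `parse_integer_list` is a hypothetical helper and not Rosetta's option parser:

```python
def parse_integer_list(value):
    """Parse a whitespace-separated list of non-negative integers.

    Raises ValueError for anything else, including "1_2_4_5".
    Note: plain int("1_2_4_5") would succeed in Python (underscores are
    allowed as digit separators), so the tokens are checked explicitly.
    """
    tokens = value.split()
    bad = [t for t in tokens if not (t.isascii() and t.isdigit())]
    if not tokens or bad:
        raise ValueError(f"Illegal value for integer option: {value!r}")
    return [int(t) for t in tokens]


if __name__ == "__main__":
    print(parse_integer_list("1 2 4 5"))
    try:
        parse_integer_list("1_2_4_5")
    except ValueError as err:
        print(err)
```

If the intent really was positions 1, 2, 4 and 5, the fix would be on the server side in how the flags file is generated, which is consistent with every host failing within seconds.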
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,582,094 RAC: 7,913 |
cis_paper_simulation_1 units failing in all systems, mine and also wingmen's Same here on my Win10 host, but with two different errors on cis_paper: 1) ERROR: Cannot open file "native.pdb" 2) (0x1) - exit code 1 (0x1)</message> And also an error on Design_Hybrid: ERROR: Unable to open weights/patch file. None of (./)stage1.wts or (./)stage1.wts.wts or minirosetta_databasescoring/weights/stage1.wts or minirosetta_databasescoring/weights/stage1.wts.wts exist |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,582,094 RAC: 7,913 |
cis_paper_simulation_1 units failing in all systems, mine and also wingmen's Again, all cis_paper tasks fail. Do the admins read the forum? It's frustrating. Why not test them on Ralph? |
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
cis_paper_simulation_1 units failing in all systems, mine and also wingmen's Sorry I didn't notice these posts before. I reported the same in the Rosetta 4.0x pinned thread the other day. Fortunately they fail within seconds, but that's still a mass of wasted downloads. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,582,094 RAC: 7,913 |
Sorry I didn't notice these posts before. I reported the same in the Rosetta 4.0x pinned thread the other day. Oh, well, it's not a problem. I think the admins have not read either thread :-( |
Sid Celery Joined: 11 Feb 08 Posts: 2122 Credit: 41,184,189 RAC: 10,001 |
Things will quieten down for a while as it seems like most PF & cis tasks have cleared from my buffers now and pretty much every job is completing successfully again, for the loss of about 500 on my RAC... |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,582,094 RAC: 7,913 |
for the loss of about 500 on my RAC... It's not only a question of RAC. It's a question of "respect" for volunteers. For example, see the "glibc problem": for months, users with a recent Ubuntu distro have had to change some parameters because the version of glibc in Rosetta is old. Why not fix the problem? They should want to attract new volunteers, not push them away. |
BelgianEnthousiast Joined: 25 May 15 Posts: 5 Credit: 1,023,045 RAC: 0 |
Does anyone observe faulty WUs on the Windows 10 platform? Since April 1st, I have crunched 366 WUs in total: 176 WUs up until September 9th without any errors, and 190 WUs since September 9th, gradually racking up 33 failed WUs (as of today, Oct 6th). I'm not sure how to find out which ones failed. Can anyone help? I'd like to dig a little deeper. Apart from that, any comments as to why so many WUs are suddenly failing? I'm running 2 cores for Rosetta and 5 cores for LHC (5-core WUs). I do not observe any issues on LHC, so I'm pretty sure it's not my rig that's having issues. Many thanks for your advice! BE. |
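To put the numbers in the post above side by side, the failure rate before and after September 9th can be worked out directly. This is only the arithmetic on the figures quoted in the post; the function name `failure_rate` is a hypothetical helper.

```python
def failure_rate(failed, total):
    """Fraction of work units that failed."""
    return failed / total


# Figures quoted in the post: 176 WUs with no errors, then 190 WUs with 33 failures.
before = failure_rate(0, 176)
after = failure_rate(33, 190)

print(f"total WUs: {176 + 190}")
print(f"before Sept 9: {before:.1%}")
print(f"since Sept 9:  {after:.1%}")
```

The two periods add up to the 366 WUs mentioned, and the rate goes from 0% to roughly 17%, which is a large enough jump to be worth checking task by task.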
LarryMajor Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0 |
Apart from that, any comments as to why all of a sudden so many WU's fail ? Log on to your Rosetta account and click the "view" link next to TASKS. On the next screen, you can view tasks by completion state. Most of your errors occurred on Oct 3 because the WUs did not complete before the deadline. A few things can cause this: resetting the project, not processing jobs for a period of time, and other reasons. You probably want to keep an eye on the deadlines of jobs that you have queued up, in case your account settings need adjusting, but that doesn't appear to be the case. When you look at the reason for the errors, you can often tell whether there was just something wrong with the WU (you had some of these) or a local problem. Hope this helps. |
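The advice about watching deadlines can be sketched as a simple check: a queued task is at risk when the time it still needs, plus a safety margin, runs past its deadline. The task names, deadlines and run-time estimates below are made up for illustration, and `at_risk` is a hypothetical helper; BOINC itself does not expose this function.

```python
from datetime import datetime, timedelta


def at_risk(tasks, now, margin=timedelta(hours=12)):
    """Return names of tasks unlikely to finish before their deadline.

    tasks: iterable of (name, deadline, estimated_remaining) tuples,
    where deadline is a datetime and estimated_remaining a timedelta.
    """
    risky = []
    for name, deadline, remaining in tasks:
        if now + remaining + margin > deadline:
            risky.append(name)
    return risky


if __name__ == "__main__":
    now = datetime(2018, 10, 1, 12, 0)
    queue = [
        # Hypothetical task names and estimates.
        ("task_a", datetime(2018, 10, 3, 12, 0), timedelta(hours=8)),
        ("task_b", datetime(2018, 10, 2, 0, 0), timedelta(hours=8)),
    ]
    print(at_risk(queue, now))
```

In this made-up queue only `task_b` is flagged: 8 hours of work plus the 12-hour margin ends after its deadline, while `task_a` has two full days.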
©2024 University of Washington
https://www.bakerlab.org