"Rosetta v4.12 i686-pc-linux-gnu" : fixed 20 h CPU time, fixed 20 credits

xii5ku

Joined: 29 Nov 16
Posts: 22
Credit: 13,889,918
RAC: 4,440
Message 93751 - Posted: 7 Apr 2020, 19:02:49 UTC
Last modified: 7 Apr 2020, 19:13:17 UTC

Hi,

I downloaded tasks on April 6, 04:49 UTC, onto six Linux x86-64 computers. They received a mixture of "Rosetta v4.12 x86_64-pc-linux-gnu" and "Rosetta v4.12 i686-pc-linux-gnu" tasks.

The x86_64 tasks behave as expected:

  • Actual CPU time is near the configured target CPU time, or less.
  • I can change target CPU time after the fact for tasks which are ready to run, by changing it at the account page and triggering a project update.
  • I get credit for results which, at first glance, makes sense, and shows the typical variability WRT credits/hour.



But the i686 jobs on these computers are broken:


  • Actual CPU time is always slightly over 20 hours, while I had configured target times of 16 hours or 8 hours.
  • Credit is fixed to 20.00 credits for each valid result.


What's up with that?

(I am sure I have seen other people complaining here and elsewhere about poor credit of series of tasks, perhaps even particularly about these exact 20.00 credits/result. Sorry for not going on a hunt for links...)

Edit,
I downloaded the tasks in multiple requests per computer. Any request received tasks of either the former or of the latter sort. All in all I have >800 valid results from these downloads by now (there are dual-processor servers among these computers), but I haven't counted how many of them are x86-64 and how many i686. However, all i686 tasks on all computers which got these are showing this behavior, while all x86-64 tasks on the very same computers behave well.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93763 - Posted: 7 Apr 2020, 20:04:52 UTC - in response to Message 93751.  

Your machines are hidden, so no one can examine your machines or WUs to help you understand what is going on.

There are still some cases where the watchdog is finding long-running models and kicking in to end tasks. This extended runtime without reaching the end of a model results in poor credit. Have you seen any similar issues with the new application version running on Ralph?
Rosetta Moderator: Mod.Sense
xii5ku

Joined: 29 Nov 16
Posts: 22
Credit: 13,889,918
RAC: 4,440
Message 93776 - Posted: 7 Apr 2020, 21:11:56 UTC
Last modified: 7 Apr 2020, 21:13:32 UTC

Example of a good task (v4.12 x86-64):
1140903423

Example of a bad task from the same host (v4.12 i686):
1140917005

This was at a time when I had 16 h target CPU time configured.

stderr of the bad task:
<core_client_version>7.8.3</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.12_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--covid_spike_design_boinc_v1.xml @flags_jhr_cv -in:file:silent 3qt0mq9p_Junior_HalfRoid_vs_COVID-19_design1.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 3qt0mq9p_Junior_HalfRoid_vs_COVID-19_design1.zip @3qt0mq9p_Junior_HalfRoid_vs_COVID-19_design1.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 72161s, 14400s + 57600s[2020- 4- 7  5:12: 9:] :: BOINC 
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE ::     1 starting structures    72161 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
05:12:09 (86554): called boinc_finish(0)

</stderr_txt>
]]>


From spot checks, all other tasks of this faulty type, also on other computers of mine, show the same pattern, i.e. "...default.out.gz: could not open file." and "This process generated 1 decoys from 1 attempts".
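
The spot check described above can be sketched as a small script. This is only an illustration: the two symptom strings are copied from the stderr quoted above, while the function name `classify_stderr` and the rule that both symptoms together mark a task as faulty are assumptions made for this example, not anything provided by Rosetta or BOINC.

```python
import re

# Symptom strings copied from the stderr output quoted above.
MISSING_OUTPUT = "default.out.gz: could not open file"
DECOY_LINE = re.compile(
    r"This process generated\s+(\d+) decoys from\s+(\d+) attempts"
)


def classify_stderr(stderr_text):
    """Summarize whether a task's stderr shows the faulty i686 pattern."""
    match = DECOY_LINE.search(stderr_text)
    decoys = int(match.group(1)) if match else None
    attempts = int(match.group(2)) if match else None
    missing = MISSING_OUTPUT in stderr_text
    return {
        "missing_output": missing,
        "decoys": decoys,
        "attempts": attempts,
        # Hypothetical rule of thumb: both symptoms together mark a bad task.
        "looks_faulty": missing and decoys == 1 and attempts == 1,
    }


sample = """\
WARNING! cannot get file size for default.out.gz: could not open file.
DONE ::     1 starting structures    72161 cpu seconds
This process generated      1 decoys from       1 attempts
"""
print(classify_stderr(sample))
```

Pasting the stderr of each suspect result into `classify_stderr` would replace the manual spot check; a task that produced many decoys would come back with `looks_faulty` set to `False`.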

I am not yet registered at Ralph. Maybe I'll try it tomorrow.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93781 - Posted: 7 Apr 2020, 21:27:55 UTC

Yep, the one you called a bad task was ended by the watchdog. As you say, the watchdog kicks in 4 hours after the runtime preference. I'm hopeful that the new version coming soon will reduce the number of these.
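
The 4-hour allowance can be checked against the "BOINC:: CPU time" line in the stderr quoted earlier in the thread. The sketch below assumes that the three numbers on that line are, in order, the CPU time used, the watchdog allowance, and the target runtime. That reading is inferred from the 16 h target and the 4-hour allowance mentioned in this thread, not from any documentation, and the function name is made up for the example.

```python
import re

# Matches e.g. "BOINC:: CPU time: 72161s, 14400s + 57600s[...]"
WATCHDOG_LINE = re.compile(r"BOINC:: CPU time: (\d+)s, (\d+)s \+ (\d+)s")


def parse_watchdog_line(line):
    """Extract CPU time and the assumed watchdog limit from a stderr line."""
    match = WATCHDOG_LINE.search(line)
    if match is None:
        return None
    # Assumed order: CPU time used, watchdog allowance, target runtime.
    cpu_s, allowance_s, target_s = (int(g) for g in match.groups())
    limit_s = allowance_s + target_s
    return {
        "cpu_hours": cpu_s / 3600,
        "limit_hours": limit_s / 3600,
        "over_limit": cpu_s > limit_s,
    }


line = "BOINC:: CPU time: 72161s, 14400s + 57600s[2020- 4- 7  5:12: 9:] :: BOINC"
info = parse_watchdog_line(line)
print(info)
```

Under that reading, 14400 s + 57600 s is exactly 20 hours, and the task's 72161 s of CPU time is just past it, which matches the "slightly over 20 hours" reported in the first post.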
Rosetta Moderator: Mod.Sense
xii5ku

Joined: 29 Nov 16
Posts: 22
Credit: 13,889,918
RAC: 4,440
Message 93783 - Posted: 7 Apr 2020, 21:34:44 UTC - in response to Message 93751.  
Last modified: 7 Apr 2020, 21:43:36 UTC

I wrote:
All in all I have >800 valid results from these downloads by now (there are dual-processor servers among these computers), but I haven't counted how many of them are x86-64 and how many i686. However, all i686 tasks on all computers which got these are showing this behavior, while all x86-64 tasks on the very same computers behave well.

I counted now:
circa 700 v4.12 x86-64 results on 5 of 6 active computers, all of these tasks good
exactly 110 v4.12 i686 results on 3 of the 6 computers, all of these tasks bad in the same way

I had more i686 tasks in the queue, cancelled them.

(That is, 1 computer got only i686 tasks until the point when I went checking, 2 computers got both kinds, the other 3 computers only x86-64 tasks. The 1 unlucky computer will begin working on x86-64 tasks now like the others.)

That is, in my observation, it is a systematic 100% repeatable fault of the i686 build, which does not occur with the x86-64 build.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93787 - Posted: 7 Apr 2020, 22:39:48 UTC
Last modified: 7 Apr 2020, 22:40:36 UTC

Thank you for the analysis and summary. That is very helpful in pinpointing problem areas. I have reported the issue of i686 Linux WUs never completing first model, causing watchdog end, to the Project Team.
Rosetta Moderator: Mod.Sense
biodoc

Joined: 19 Feb 06
Posts: 14
Credit: 30,717,792
RAC: 0
Message 93890 - Posted: 8 Apr 2020, 18:26:28 UTC

@xii5ku, thank you for tracking this problem down. My caches are full of the linux i686 tasks. I would think it would be a good idea to stop the server from sending this work until the bug in the app is fixed.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93898 - Posted: 8 Apr 2020, 19:10:24 UTC
Last modified: 14 Apr 2020, 12:59:19 UTC

To any that have found this thread because they are having Linux i686 issues, please join Ralph (project url is: http://ralph.bakerlab.org/) with your machine. This will help with testing when changes are made to address this, and confirm they are working.

No promises on when a new version will be available there. It may take some time.

If you would like to try creating a simple cc_config.xml file, you can get BOINC to use the x86_64 version instead of the i686 version that is having trouble. There is an example here.
Rosetta Moderator: Mod.Sense
xii5ku

Joined: 29 Nov 16
Posts: 22
Credit: 13,889,918
RAC: 4,440
Message 94039 - Posted: 10 Apr 2020, 7:50:04 UTC
Last modified: 10 Apr 2020, 8:30:14 UTC

Last night I received a bunch of tasks from Ralph to 4 of the same 6 computers.
I had the default target CPU time configured at Ralph, which is 1 hour.

I have 257 valid results, of 257 tasks received:

  • All 169 Rosetta v4.15 x86_64-pc-linux-gnu tasks finished after 1 hour and generated on the order of 20...40 decoys, according to spot checks.
  • All 88 Rosetta v4.15 i686-pc-linux-gnu tasks finished after 5 (= 1 + 4) hours and generated (3x) 9, 8, (3x) 7, 6, (3x) 5, (8x) 4, (5x) 3, (9x) 2, (55x) 1 decoys.

So there is slight progress from v4.12 to v4.15 on my hosts, but not a breakthrough yet.
I put a report at Ralph's forum.
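
For reference, the i686 decoy counts listed above can be tallied with a few lines of Python. The pairs are transcribed from the list in this post; the variable names are only for illustration.

```python
# (number of tasks, decoys per task), transcribed from the i686 list above.
distribution = [
    (3, 9), (1, 8), (3, 7), (1, 6), (3, 5),
    (8, 4), (5, 3), (9, 2), (55, 1),
]

tasks = sum(n for n, _ in distribution)
decoys = sum(n * d for n, d in distribution)
mean_per_task = decoys / tasks

runtime_hours = 5  # 1 h target + 4 h until the watchdog ends the task
decoys_per_hour = mean_per_task / runtime_hours
single_decoy_share = 55 / tasks

print(f"tasks: {tasks}, decoys: {decoys}")
print(f"mean decoys per task: {mean_per_task:.2f}")
print(f"mean decoys per CPU hour: {decoys_per_hour:.2f}")
print(f"share of tasks with a single decoy: {single_decoy_share:.1%}")
```

The tally comes to 88 tasks, matching the count above, with 197 decoys in total. That is a little over 2 decoys per 5-hour i686 task, against roughly 20 to 40 decoys per 1-hour x86_64 task.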

magiceye04

Joined: 11 May 11
Posts: 11
Credit: 1,702,178
RAC: 0
Message 94372 - Posted: 13 Apr 2020, 20:26:53 UTC - in response to Message 93898.  
Last modified: 13 Apr 2020, 20:28:44 UTC

To any that have found this thread because they are having Linux i686 issues, please join Ralph (project url is: http://ralph.bakerlab.org/) with your machine. This will help with testing when changes are made to address this, and confirm they are working.

No promises on when a new version will be available there. It may take some time.

If you would like to try creating a simple cc_config.xml file, you can get BOINC to use the x86_64 version instead of the i686 version that is having trouble. There is an example here.


I have some broken WUs on my PC, new version 4.15
*i686*
Why are INTEL686 WUs sent to AMD-PCs?
The x86 run perfect, please keep away these i686 WUs from non-intel-PCs.

https://boinc.bakerlab.org/rosetta/result.php?resultid=1148091774
magiceye04

Joined: 11 May 11
Posts: 11
Credit: 1,702,178
RAC: 0
Message 94373 - Posted: 13 Apr 2020, 20:32:08 UTC - in response to Message 93898.  



If you would like to try creating a simple cc_config.xml file, you can get BOINC to use the x86_64 version instead of the i686 version that is having trouble. There is an example here.

What exactly from this example is needed?

<no_alt_platform>1</no_alt_platform> ?

I have an existing config file and only want to add the relevant line.
Millenium

Joined: 20 Sep 05
Posts: 68
Credit: 184,283
RAC: 0
Message 94375 - Posted: 13 Apr 2020, 20:34:29 UTC - in response to Message 94372.  


I have some broken WUs on my PC, new version 4.15
*i686*
Why are INTEL686 WUs sent to AMD-PCs?
The x86 run perfect, please keep away these i686 WUs from non-intel-PCs.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1148091774


Lol, they are called "intel686" exactly because they are x86, not because they are for intel cpu. And they aren't 64bit.
Laurent

Joined: 15 Mar 20
Posts: 14
Credit: 88,800
RAC: 0
Message 94376 - Posted: 13 Apr 2020, 20:39:41 UTC - in response to Message 94372.  


I have some broken WUs on my PC, new version 4.15
*i686*
Why are INTEL686 WUs sent to AMD-PCs?
The x86 run perfect, please keep away these i686 WUs from non-intel-PCs.


All AMD CPUs starting with the Athlon K7 implement i686. That's ~2000, or 20 years ago. How old are your computers?

It is called i686 because intel created the instruction set. AMD can run it (and Windows) because they bought the right to use the platform from intel.

The SSSE problem popping up is related to 64-bit AMDs not implementing a part of Intel's stuff.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94390 - Posted: 13 Apr 2020, 22:12:53 UTC - in response to Message 94373.  

Yes, so you could omit the section called <log_flags>.

This is only for Linux hosts. And only until a new version resolves the i686 issue where all tasks are never finishing their first model, and are ended by the watchdog.

You still need the rest of the shell, so it can be reduced to this:
<cc_config>
  <options>
    <no_alt_platform>1</no_alt_platform>
  </options>
</cc_config>
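
For an existing cc_config.xml that already has other settings, the same line can be merged in rather than pasted by hand. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `ensure_no_alt_platform` and the sample file contents are made up for illustration, and the script is not part of BOINC.

```python
import xml.etree.ElementTree as ET


def ensure_no_alt_platform(xml_text):
    """Return cc_config XML with <no_alt_platform>1</no_alt_platform> set."""
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        raise ValueError("root element must be <cc_config>")
    # Reuse the existing <options> section, or create one if it is missing.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("no_alt_platform")
    if flag is None:
        flag = ET.SubElement(options, "no_alt_platform")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")


# Hypothetical existing config, for illustration only.
existing = """<cc_config>
  <log_flags>
    <task>1</task>
  </log_flags>
  <options>
    <max_file_xfers>16</max_file_xfers>
  </options>
</cc_config>"""
print(ensure_no_alt_platform(existing))
```

Existing sections such as `<log_flags>` and other entries in `<options>` are left in place; only the one tag is added or set to 1.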

Rosetta Moderator: Mod.Sense
magiceye04

Joined: 11 May 11
Posts: 11
Credit: 1,702,178
RAC: 0
Message 94391 - Posted: 13 Apr 2020, 22:13:51 UTC

OK - then i686=32bit and x64 = 64bit?

But the needed solution is still: run only the 64bit WUs on AMD CPU.

Why is the 20 years old i686 code still in use if the problem is known?
Laurent

Joined: 15 Mar 20
Posts: 14
Credit: 88,800
RAC: 0
Message 94412 - Posted: 13 Apr 2020, 23:05:04 UTC - in response to Message 94391.  

OK - then i686=32bit and x64 = 64bit?

But the needed solution is still: run only the 64bit WUs on AMD CPU.

Why is the 20 years old i686 code still in use if the problem is known?


All current Intel and AMD CPUs for PCs (Windows) are based on the x86 architecture, introduced in 1978. The i686 is the 6th generation and works just fine.

x86-64, also called AMD64, is the 64-bit extension of x86. It was invented by AMD and cross-licensed to Intel. All current PC CPUs, even the ones from Intel, include that extension as well as all the previous extensions (i286, i386, i486, ... up to roughly generation 12, depending on how you count the generations).

There is no problem with i686, there is a problem with Rosetta. You are barking up the wrong tree. Don't blame Intel or AMD.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94438 - Posted: 14 Apr 2020, 12:50:43 UTC

Can a more experienced Linux user help here? Tasks not starting. active_task_state: UNINITIALIZED.
Rosetta Moderator: Mod.Sense
JohnDK
Joined: 6 Apr 20
Posts: 33
Credit: 2,390,240
RAC: 0
Message 94559 - Posted: 15 Apr 2020, 17:44:01 UTC

I've put <no_alt_platform>1</no_alt_platform> in the cc_config.xml file, but still got an i686 WU. I did choose "Read config files" but did not restart BOINC; is that necessary?

https://boinc.bakerlab.org/rosetta/results.php?hostid=4063805
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94569 - Posted: 15 Apr 2020, 19:03:04 UTC - in response to Message 94559.  

Is the no_alt_platform tag within the options tag, within the cc_config tag?
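
That nesting can also be checked mechanically. A rough sketch with Python's standard library follows; the function name `no_alt_platform_enabled` is hypothetical and the check is not a BOINC tool.

```python
import xml.etree.ElementTree as ET


def no_alt_platform_enabled(xml_text):
    """True if <no_alt_platform>1</no_alt_platform> sits in cc_config/options."""
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        return False
    # The path is relative to the <cc_config> root element.
    flag = root.find("options/no_alt_platform")
    return flag is not None and (flag.text or "").strip() == "1"


good = (
    "<cc_config><options>"
    "<no_alt_platform>1</no_alt_platform>"
    "</options></cc_config>"
)
misplaced = (
    "<cc_config><log_flags>"
    "<no_alt_platform>1</no_alt_platform>"
    "</log_flags></cc_config>"
)
print(no_alt_platform_enabled(good))
print(no_alt_platform_enabled(misplaced))
```

A check like this only confirms the structure of the file. It says nothing about whether the running client has re-read it.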
Rosetta Moderator: Mod.Sense
JohnDK
Joined: 6 Apr 20
Posts: 33
Credit: 2,390,240
RAC: 0
Message 94570 - Posted: 15 Apr 2020, 19:21:49 UTC

This is my cc_config

<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
  </log_flags>
  <options>
    <max_file_xfers>16</max_file_xfers>
    <max_file_xfers_per_project>16</max_file_xfers_per_project>
    <save_stats_days>365</save_stats_days>
    <use_all_gpus>1</use_all_gpus>
    <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
    <no_priority_change>0</no_priority_change>
    <allow_multiple_clients>0</allow_multiple_clients>
    <dont_contact_ref_site>1</dont_contact_ref_site>
    <http_transfer_timeout>100</http_transfer_timeout>
    <max_tasks_reported>256</max_tasks_reported>
    <allow_remote_gui_rpc>0</allow_remote_gui_rpc>
    <no_alt_platform>1</no_alt_platform>
  </options>
</cc_config>



©2024 University of Washington
https://www.bakerlab.org