R@H Scientists/Coders: An analysis of the Rosetta binaries...

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Message 78449 - Posted: 15 Jul 2015, 18:53:06 UTC

I added the Windows (SSE2) and linux (64bit SSE4) builds to Ralph@home and the linux app has a 28% failure rate.

The Windows app looks good at a 97% success rate. The current 32bit linux app also has a 97% success rate which is expected.

I'll have to look into the 64bit linux app failures and see what's actually going on. For the linux build I used the sse4 optimization option, a more current GCC, and a 64bit build which gave around a 13% improvement.

David E K
Message 78450 - Posted: 15 Jul 2015, 21:31:56 UTC

Looks like the errors are likely coming from older CPUs that don't support SSE4.

rjs5
Message 78451 - Posted: 15 Jul 2015, 23:53:24 UTC - in response to Message 78449.  

I added the Windows (SSE2) and linux (64bit SSE4) builds to Ralph@home and the linux app has a 28% failure rate.

The Windows app looks good at a 97% success rate. The current 32bit linux app also has a 97% success rate which is expected.

I'll have to look into the 64bit linux app failures and see what's actually going on. For the linux build I used the sse4 optimization option, a more current GCC, and a 64bit build which gave around a 13% improvement.



It is unlikely to be an SSE4 issue. Since the failing CPU is an AMD CPU, it is more likely an SSE3 or SSSE3 problem.

[VENETO] boboviz
Message 78454 - Posted: 16 Jul 2015, 9:14:08 UTC - in response to Message 78449.  
Last modified: 16 Jul 2015, 9:15:17 UTC

I added the Windows (SSE2) and linux (64bit SSE4) builds to Ralph@home and the linux app has a 28% failure rate.
For the linux build I used the sse4 optimization option, a more current GCC, and a 64bit build which gave around a 13% improvement


Some points:
1) Please add a brief description of each app to the Applications page, like this.
2) What is the Win32 improvement?
3) Most of the computational power comes from Win64 hosts, so I'm waiting for that build. :-)

David E K
Message 78462 - Posted: 17 Jul 2015, 17:47:24 UTC - in response to Message 78454.  

I added the Windows (SSE2) and linux (64bit SSE4) builds to Ralph@home and the linux app has a 28% failure rate.
For the linux build I used the sse4 optimization option, a more current GCC, and a 64bit build which gave around a 13% improvement


Some points:
1) Please add a brief description of each app to the Applications page, like this.
2) What is the Win32 improvement?
3) Most of the computational power comes from Win64 hosts, so I'm waiting for that build. :-)


Will do.

The Win32 build improved only minimally (less than 2%) with SSE2.

Timo
Message 78463 - Posted: 18 Jul 2015, 0:35:11 UTC

What is the baseline or 'normal' failure rate on the different platforms?

rjs5
Message 78467 - Posted: 18 Jul 2015, 16:47:23 UTC - in response to Message 78463.  
Last modified: 18 Jul 2015, 16:48:15 UTC

What is the baseline or 'normal' failure rate on the different platforms?


The failure rate should be 0%.


There was a 28% "failure" rate because 28% of the jobs went to AMD machines which aborted immediately. SSE4 is not supported by the AMD CPU. There are other inconsistencies in the instruction sets.

It is pretty easy to get to SSE2 because both AMD and Intel support SSE2 instructions similarly.

Once you go beyond SSE2, optimization becomes a little more tricky. That is why the BOINC environment provides the application with information that tells it what features the CPU supports. It is displayed in the BOINC manager Event Log when BOINC starts up.



A number of things affect the measured performance, including:

1. Windows kicks the app into turbo mode aggressively, and newer CPUs (Sandy Bridge and later) throttle their frequency back when they heat up, to prevent damage.

The old and new versions may therefore not be running at the same frequency on the same machine. This is not likely to be affecting David's results.


2. 8 32-bit registers versus 16 64-bit registers.
The Win 64 version uses the same 8 XMM registers in SCALAR mode (one compute at a time) that the 32-bit version does. The 64-bit performance comes from having 8 more registers to store temporary results and not having to STORE and LOAD them to/from memory.


After getting to SSE2, the next performance barrier would likely be to figure out why the application is running in SCALAR mode rather than VECTOR mode. Fixing the code so the compiler generates vector code would double the performance during the times when these computations are currently happening. Without vector code or a look at the algorithms, there is little more that David can do beyond SSE2. The Intel ICC "-ax" dispatcher will automatically build "fat" binaries with code optimized for CPU features, but I suspect the compiler might still generate scalar code.

[VENETO] boboviz
Message 78468 - Posted: 18 Jul 2015, 22:33:24 UTC - in response to Message 78467.  


There was a 28% "failure" rate because 28% of the jobs went to AMD machines which aborted immediately. SSE4 is not supported by the AMD CPU.


?? AMD has supported SSE4.2 since 2011.

dcdc
Message 78469 - Posted: 18 Jul 2015, 22:40:48 UTC

I'm trying to follow - were the failures on an AMD Phenom machine?

And is it worth sending some more out for a bigger sample of which machines they fail on?

rjs5
Message 78470 - Posted: 18 Jul 2015, 23:49:16 UTC - in response to Message 78468.  


There was a 28% "failure" rate because 28% of the jobs went to AMD machines which aborted immediately. SSE4 is not supported by the AMD CPU.


?? AMD has supported SSE4.2 since 2011.


boboviz ....
You are absolutely correct, and the error is mine. Thanks for pointing it out.
Sorry.

dcdc ....
The only errors that were reported publicly were ones on an AMD Phenom running 64-bit Linux. I ran about 80 tasks successfully on my machines so I made the leap-of-guess that anyone with an older AMD CPU would get tasks, they would quickly error-out and then the machine would ask for more. David E K would have to break down the exact machine types.




Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
SIGILL: illegal instruction
Stack trace (18 frames):
[0x3c04cbd]
[0x409f10]
[0x35b0473]
[0x376a461]
[0x37343a8]
[0x373498c]

...

[0x1fc556b]
[0x202cc6d]
[0x202bbc0]
[0x40de78]
[0x3cbb84b]
[0x4003e9]

Exiting...

As soon as the application has unpacked and begins to run via BOINC it fails

I am using AMD Phenom CPUs running 64 Bit Fedora Linux.

Dr. Merkwürdigliebe
Message 78471 - Posted: 19 Jul 2015, 10:53:59 UTC - in response to Message 78468.  
Last modified: 19 Jul 2015, 10:55:15 UTC

?? AMD has supported SSE4.2 since 2011.


Applies to Phenom:

Gotta admit, I'm a bit confused here. Wikipedia says the supported instruction sets are:

MMX, Extended 3DNow!, SSE, SSE2, SSE3, SSE4a, AMD64, Cool'n'Quiet, NX bit, AMD-V

These instructions (SSE4a) are not available in Intel processors, and SSE4a is not the same as SSE4.

Quote from the German Wikipedia:

Trotz des ähnlichen Namens hat SSE4a nichts mit Intels Befehlssatzerweiterung SSE4 zu tun. Die einzige Gemeinsamkeit besteht lediglich darin, dass beide auf SSE3 aufbauen.

means

Despite the similar name, SSE4a has nothing to do with Intel's SSE4 instruction set extension. The only thing they have in common is that both build on SSE3.
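
To make the naming trap concrete, here is a minimal Python sketch. The flag spellings (`sse4a`, `sse4_1`, `sse4_2`) follow what Linux and BOINC report in their CPU feature lists; the two sample flag strings and both helper names are made up for illustration and are not copied from any real host.

```python
def supports_sse4(flags: str) -> bool:
    """True only if both SSE4 subsets appear as exact, whole tokens."""
    tokens = set(flags.split())
    return {"sse4_1", "sse4_2"} <= tokens


def naive_check(flags: str) -> bool:
    """Buggy substring test: 'sse4' also matches AMD's unrelated 'sse4a'."""
    return "sse4" in flags


# Illustrative, abbreviated flag strings (assumed, not real host output).
K10_LIKE = "fpu mmx sse sse2 pni sse4a popcnt"
NEHALEM_LIKE = "fpu mmx sse sse2 pni ssse3 sse4_1 sse4_2 popcnt"

if __name__ == "__main__":
    for label, flags in (("K10-like", K10_LIKE), ("Nehalem-like", NEHALEM_LIKE)):
        print(label, "exact:", supports_sse4(flags), "naive:", naive_check(flags))
```

The exact-token test rejects the K10-like list, while the substring test accepts it, which is the kind of confusion the similar names invite.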

dcdc
Message 78472 - Posted: 19 Jul 2015, 11:22:16 UTC

From CPU world (http://www.cpu-world.com/Glossary/S/SSE4.html):

SSE4 instruction set extension consists of 54 instructions that improve performance of media data manipulation and text processing. The first 47 instructions from the SSE4, called SSE4.1, were introduced in Intel Penryn core on January 7th, 2008. Support for remaining 7 instructions, or SSE4.2, was included into Nehalem core.

Support for SSE4 instructions was added in to all AMD microprocessors based on Bulldozer micro-architecture. AMD Bobcat, K10 and earlier CPUs do not support this extension.


So any Intel Core ix chip (or its Pentium/Celeron derivatives) supports full SSE4.2, and earlier chips from the 45nm shrink (so my E8400 but not my Q6600) support SSE4.1, which covers the majority of the instructions.

From Wiki (https://en.wikipedia.org/wiki/SSE4):
The SSE4a instruction group was introduced in AMD's Barcelona microarchitecture. These instructions are not available in Intel processors.

dcdc
Message 78473 - Posted: 19 Jul 2015, 11:50:05 UTC - in response to Message 78349.  

I'll also look into a VS upgrade.


According to this source, MS will release VS2015 during this summer... :-)

According to the link above, it's out tomorrow.

The Release candidate is available here: https://www.visualstudio.com/en-us/downloads/visual-studio-2015-downloads-vs.aspx

And it says: Free for open source projects, academic research, training, education and small professional teams.

Would that allow AVX to be tested?


Dr. Merkwürdigliebe
Message 78474 - Posted: 19 Jul 2015, 12:25:49 UTC - in response to Message 78472.  


Support for SSE4 instructions was added in to all AMD microprocessors based on Bulldozer micro-architecture. AMD Bobcat, K10 and earlier CPUs do not support this extension.


Phenom = K10 = no SSE4x

Problem solved?

dcdc
Message 78475 - Posted: 19 Jul 2015, 13:29:49 UTC - in response to Message 78474.  


Support for SSE4 instructions was added in to all AMD microprocessors based on Bulldozer micro-architecture. AMD Bobcat, K10 and earlier CPUs do not support this extension.


Phenom = K10 = no SSE4x

Problem solved?


I would guess you're right.

The K10 chips do have SSE4a which might cause confusion?

Is it down to the compiler to point Windows/Linux to the correct parts of the binary to use, given the CPU?

Dr. Merkwürdigliebe
Message 78476 - Posted: 19 Jul 2015, 14:43:26 UTC - in response to Message 78475.  

Is it down to the compiler to point Windows/Linux to the correct parts of the binary to use, given the CPU?


I searched the Internet for an answer to that but only found information in regard to the Intel compiler.

I figure it will be similar with gcc.

Intel Compiler Vectorization


The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag, which takes the same options as the -x flag (i.e., AVX, ..., SSE2). This flag will generate run-time checks to determine the level of vectorization support on the processor and will then choose the optimal execution path for that processor. It will also generate a baseline execution path that is taken if the -ax level of vectorization specified is not supported. The baseline can be defined with the -x flag, with -xSSE2 recommended. Multiple -ax flags can be specified to create several options. For example, compile with -axAVX -axSSE4.2 -xSSE2. In this case, when run on an AMD Opteron processor, the baseline SSE2 execution path will be taken. When run on an Intel Westmere processor, the SSE4.2 execution path will be taken. When run on an Intel Sandy Bridge processor, the AVX execution path will be taken.


Not sure though...
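
The run-time check described in the quote can be pictured roughly as follows. This is a Python sketch of the idea only, not what the Intel compiler actually emits; the path names, the feature tokens and `pick_path` are hypothetical.

```python
# Candidate code paths, most demanding first. Each entry names the
# feature tokens that path requires (names are illustrative).
PATHS = [
    ("avx", {"avx"}),
    ("sse4.2", {"sse4_2"}),
]
BASELINE = "sse2"  # taken when none of the optimized paths is supported


def pick_path(cpu_features):
    """Return the first path whose required features are all present."""
    features = set(cpu_features)
    for name, required in PATHS:
        if required <= features:
            return name
    return BASELINE


if __name__ == "__main__":
    print(pick_path({"sse2", "pni", "sse4a"}))             # K10-like host
    print(pick_path({"sse2", "sse4_1", "sse4_2"}))         # Westmere-like host
    print(pick_path({"sse2", "sse4_1", "sse4_2", "avx"}))  # Sandy Bridge-like host
```

The point of the baseline is that a host lacking every optimized path still gets code it can run, instead of hitting an illegal instruction.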

rjs5
Message 78477 - Posted: 19 Jul 2015, 17:50:47 UTC - in response to Message 78476.  

Is it down to the compiler to point Windows/Linux to the correct parts of the binary to use, given the CPU?


I searched the Internet for an answer to that but only found information in regard to the Intel compiler.

I figure it will be similar with gcc.

Intel Compiler Vectorization


The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag, which takes the same options as the -x flag (i.e., AVX, ..., SSE2). This flag will generate run-time checks to determine the level of vectorization support on the processor and will then choose the optimal execution path for that processor. It will also generate a baseline execution path that is taken if the -ax level of vectorization specified is not supported. The baseline can be defined with the -x flag, with -xSSE2 recommended. Multiple -ax flags can be specified to create several options. For example, compile with -axAVX -axSSE4.2 -xSSE2. In this case, when run on an AMD Opteron processor, the baseline SSE2 execution path will be taken. When run on an Intel Westmere processor, the SSE4.2 execution path will be taken. When run on an Intel Sandy Bridge processor, the AVX execution path will be taken.


Not sure though...



I think that gcc 4.8 and later versions try to do something similar, but it is not clear to me what they do.
https://gcc.gnu.org/wiki/FunctionMultiVersioning
... and then a link from the above page ....
https://gcc.gnu.org/wiki/FunctionSpecificOpt

Getting to SSE2 is pretty easy and does not cause too much damage because SSE2 has been in CPUs for a decade. Going beyond SSE2 is tricky. The best "next step" beyond SSE2 is to make sure the source code will generate vector code. If you cannot arrange the source code to generate a vector binary, then it makes little sense to turn on higher options (AVX ....). You are still only going to crunch 1 value at a time. Vector code will allow you to crunch multiple values in parallel. The performance of all sections of code that convert to vector code will be multiplied by up to the vector size. That is why you try to use the DATA TYPE size that does the job and no larger. If you can get your answer using a FLOAT or INTEGER, then using a DOUBLE or a LONG cuts your performance in half. You crunch a lot of 0's to generate 0's. 8-)
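
The data-type point is simple arithmetic: values per vector instruction = register width / element width. A small Python sketch, assuming 128-bit SSE registers, 256-bit AVX registers, 32-bit floats and 64-bit doubles (the table and function names are just for illustration):

```python
REGISTER_BITS = {"sse": 128, "avx": 256}
ELEMENT_BITS = {"float": 32, "double": 64}


def lanes(isa, dtype):
    """How many values of this type fit in one vector register."""
    return REGISTER_BITS[isa] // ELEMENT_BITS[dtype]


if __name__ == "__main__":
    for isa in REGISTER_BITS:
        for dtype in ELEMENT_BITS:
            print(f"{isa}: {lanes(isa, dtype)} x {dtype} per instruction")
```

The lane count is only an upper bound on the speedup for the vectorized sections; scalar code gets one value per instruction regardless of register width.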


Dr. Merkwürdigliebe
Message 78478 - Posted: 19 Jul 2015, 19:01:24 UTC

Looks like the developers need to get their hands on a Linux version of the Intel compiler

or accept SSE3 as the baseline

or screw all AMD K10 users.

dcdc
Message 78479 - Posted: 19 Jul 2015, 19:04:14 UTC

VS2015 says: Build for iOS, Android, Windows devices, Windows Server or Linux

rjs5
Message 78482 - Posted: 20 Jul 2015, 0:43:20 UTC - in response to Message 78479.  

VS2015 says: Build for iOS, Android, Windows devices, Windows Server or Linux


The Intel compiler plugs right into MS Visual Studio, and I think that is really the way they would prefer you to use it. I don't know if it will install properly into VS2015, but there is an academic program that U. of Wash. developers likely qualify for. Intel was one of the early supporters of BOINC.
Link ....
https://software.intel.com/en-us/qualify-for-free-software

VS2015 itself will not solve the problem of generating the best code for "a particular" CPU.

Rosetta, however, knows what the requesting system looks like and which binary it would like to ship to the user for best/reasonable results.

Look in your BOINC data directory for a file named "sched_request_boinc.bakerlab.org_rosetta.xml".
The p_features section describes what capabilities your system CPU has. Rosetta can ask for an SSE4 application to be sent to my Haswell machine because it supports SSE4.

<p_features>fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx tm2 dca pbe fsgsbase bmi1 smep bmi2</p_features>

Rosetta must already parse this file to determine the PLATFORM (Windows or Linux) and what kind of jobs I want to accept.

If they can handle TWO platforms, then this platform dispatcher could be expanded to include additional granularity of FEATURES supported.
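
A rough Python sketch of what such a feature-aware dispatcher could look like. Everything here is an assumption for illustration: the sample XML is a trimmed-down stand-in for a scheduler request (only the `p_features` element is taken from the file quoted above), and the variant names and the `choose_variant` helper are hypothetical, not Rosetta's or BOINC's actual code.

```python
import xml.etree.ElementTree as ET

# Trimmed, assumed stand-in for a scheduler request file.
SAMPLE = """<scheduler_request>
  <platform_name>x86_64-pc-linux-gnu</platform_name>
  <host_info>
    <p_features>fpu mmx sse sse2 pni ssse3 sse4_1 sse4_2 avx</p_features>
  </host_info>
</scheduler_request>"""

# Hypothetical app variants per build, most demanding first.
VARIANTS = [
    ("sse4", {"sse4_1", "sse4_2"}),
    ("sse2", {"sse2"}),
]


def read_request(xml_text):
    """Return (platform, feature set) found anywhere in the request."""
    root = ET.fromstring(xml_text)
    platform = next(root.iter("platform_name"), None)
    feats = next(root.iter("p_features"), None)
    name = platform.text.strip() if platform is not None and platform.text else ""
    tokens = set(feats.text.split()) if feats is not None and feats.text else set()
    return name, tokens


def choose_variant(platform, features):
    """Pick the most demanding variant the host can actually run."""
    for name, required in VARIANTS:
        if required <= features:
            return f"{platform}__{name}"
    return f"{platform}__generic"


if __name__ == "__main__":
    print(choose_variant(*read_request(SAMPLE)))
```

With a list like this, a K10 host that reports `sse4a` but not `sse4_1`/`sse4_2` would fall through to the SSE2 variant rather than receive a binary it cannot execute.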




©2024 University of Washington
https://www.bakerlab.org