Message boards : Number crunching : R@H Scientists/Coders: An analysis of the Rosetta binaries...
Author | Message |
---|---|
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
This thread is to bring exposure to the findings made by user rjs5 in the thread "Rosetta@home using AVX / AVX2 ?": The executing code seems to be compiled for an i386 and uses the 387 floating-point 8-register stack model. The code (on my machine) spends about 5% of the time waiting for the "fmul st0,st1" ("====" below) to complete.
I believe R@H uses the Rosetta Commons code, therefore I do not know precisely who really codes the Rosetta Software, but this should at least be looked at by someone working for R@H. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,573,506 RAC: 7,165 |
"I believe R@H uses the Rosetta Commons code, therefore I do not know precisely who really codes the Rosetta Software, but this should at least be looked at by someone working for R@H." This is the documentation of Rosetta Commons. I don't know if the code is the same as R@H's. |
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
To summarize, rjs5 used some software (Intel VTune sampling profiler) to examine the binaries of the minirosetta core and discovered that they are being compiled using a very outdated version of GCC, and in short, simply updating the compiler would introduce some optimizations and resolve some known bottlenecks that show up in any program built with the older version of GCC. |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,012,385 RAC: 6,215 |
"To summarize, rjs5 used some software (Intel VTune sampling profiler) to examine the binaries of the minirosetta core and discovered that they are being compiled using a very outdated version of GCC, and in short, simply updating the compiler would introduce some optimizations and resolve some known bottlenecks that show up in any program built with the older version of GCC." The tools needed on Linux are available to all Linux users. Just start up a bunch of R@H tasks, use "perf" to monitor all the system CPUs for your time period, and use "perf" to display the results. I used "objdump" to disassemble the binary and find the "perf" program counter address in the objdump output. If you have SOURCE, objdump will add the source code to the dump. The equally good stuff on Windows seems to be mostly retail stuff. r |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,573,506 RAC: 7,165 |
It's a pity there are no admins in this thread... |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I just rebuilt the windows version with SSE2 and see a minor improvement (1.6%) from a really quick test. I'll push it out to ralph soon. Thanks for the helpful input and suggestions for optimizations etc. |
Dirk Broer Joined: 16 Nov 05 Posts: 22 Credit: 3,338,993 RAC: 1,539 |
"I just rebuilt the windows version with SSE2 and see a minor improvement (1.6%) from a really quick test. I'll push it out to ralph soon. Thanks for the helpful input and suggestions for optimizations etc." As SSE2 has been around since the Pentium 4 (2001), can we expect new versions with SSE3 (2004), SSSE3 (2006), SSE4 (2006), AES (2008), AVX (2008), F16C (AMD: 2009/Intel: 2001), and/or FMA instructions (2011-2013) at Ralph soon too? |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
For the immediate future, I can test whatever optimizations are possible given the version of visual studio we currently have which is 2010. |
Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
"For the immediate future, I can test whatever optimizations are possible given the version of visual studio we currently have which is 2010." Looks like auto-vectorization is only supported by newer versions of Visual Studio, e.g. from 2012 onwards. That means no AVX2. I think SSE2 is the best we get. There's no way other than updating your compiler infrastructure. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,573,506 RAC: 7,165 |
"I just rebuilt the windows version with SSE2 and see a minor improvement (1.6%) from a really quick test. I'll push it out to ralph soon." Great!! |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I'll also look into a VS upgrade. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,573,506 RAC: 7,165 |
"I'll also look into a VS upgrade." According to this source, MS will release VS2015 during this summer... :-) |
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
rjs5 did his testing on Linux, and I believe he said that the version of GCC used for compiling the Linux binaries was also incredibly outdated (for some reason I want to say he mentioned something like it being 8+ versions behind) and that upgrading GCC on Linux would also yield some easy performance improvement without any change to the code base. |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
I'm not too sure if the apps can be made available as 'additional binaries' i.e. we can have a 'lowest common denominator' made available to the general cohort. There could also be specific binaries optimised for the newer chips, which for that matter may not even run on chips a generation earlier. Those binaries would probably not be downloaded automatically, but those who are keen could optionally install them by following some instructions. ------- On another note, I found that the Rosetta Commons code is apparently available at 'no charge' only under an 'academic license', while a commercial license costs some 40k per site. This would probably limit the feasibility of, say, a member of the public building '3rd party binaries' that could be used with Rosetta@home: https://c4c.uwc4c.com/express_license_technologies/rosetta Just my 2 cents. |
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
"I'm not too sure if the apps can be made available as 'additional binaries' i.e. we can have a 'lowest common denominator' made available to the general cohort." Actually, many modern compilers can build multiple code paths into a single binary and select the right one at run time, so this type of fall-back does not need all that complexity on the BOINC side. My comment was simply that not only should SSE2 be enabled in VS2010, but also that the Linux versions should be recompiled with an updated version of GCC rather than the very old version they appear to be built with. Cheers! |
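As an illustration of the single-binary fall-back discussed in this post, here is a minimal sketch in C++. It assumes GCC or Clang on x86; the names `dot_baseline`, `dot_avx2`, and `pick_dot` are hypothetical and are not taken from the Rosetta sources. The `target("avx2")` attribute lets one function use AVX2 while the rest of the build stays at the lowest common denominator, and `__builtin_cpu_supports` chooses a path at run time.

```cpp
#include <cassert>
#include <cstddef>

// Baseline path: built with whatever flags the whole program uses
// (for example plain SSE2 on x86-64).
static double dot_baseline(const double* a, const double* b, std::size_t n) {
    assert(n == 0 || (a && b));
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
// Same source code, but the compiler may use AVX2 instructions
// inside this one function only.
__attribute__((target("avx2")))
static double dot_avx2(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
#endif

using DotFn = double (*)(const double*, const double*, std::size_t);

// Pick the fastest path the running CPU supports; fall back to baseline.
DotFn pick_dot() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return dot_avx2;
#endif
    return dot_baseline;
}
```

A caller would fetch the function pointer once, e.g. `DotFn dot = pick_dot();`, and then call `dot(a, b, n)` in the hot loop. Whether this gains anything depends on how well the loop vectorizes, so it would need measuring on real work units.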
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I built a 64bit linux version with gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) with the "-msse4.2" option. Suggestions for the best optimization options would be greatly appreciated since this is all a bit of voodoo to me keeping in mind general compatibility without having to have specific boinc builds other than 32 and 64bit versions. Also keep in mind that the Rosetta code will likely not gain much from vectorization optimizations but any gain is good if it's just a matter of updating compiler options. Thanks for all your input! |
Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
I'm not too sure about the compiler flags either, as aggressive optimization can also break things. I still recommend a newer version of gcc; the one you have is ancient. You should probably check out the "-mtune" option. -O2 should be the maximum. See: Flags for gcc. For fun and profit (if you own a Haswell CPU and a newer compiler) you could try compiling the source code with the -march=native flag and compare the results with a non-AVX2 version. |
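When comparing builds made with different flags (for example `-msse4.2` versus `-march=native`), it can help to confirm which instruction-set extensions the compiler was actually allowed to use. A minimal sketch in C++, assuming GCC or Clang, which predefine macros such as `__SSE2__` and `__AVX2__`; the function name `enabled_isa_extensions` is hypothetical.

```cpp
#include <cassert>
#include <string>

// Returns a space-separated list of the x86 extensions that were enabled
// at compile time for this translation unit, or "baseline" if none of the
// checked macros is defined.
std::string enabled_isa_extensions() {
    std::string out;
#ifdef __SSE2__
    out += "sse2 ";
#endif
#ifdef __SSE4_2__
    out += "sse4.2 ";
#endif
#ifdef __AVX__
    out += "avx ";
#endif
#ifdef __AVX2__
    out += "avx2 ";
#endif
#ifdef __FMA__
    out += "fma ";
#endif
    if (out.empty()) out = "baseline ";
    assert(!out.empty());
    out.pop_back();  // drop the trailing space
    return out;
}
```

Printing this string from two builds of the same source shows at a glance whether a flag change really altered the target, before spending time on benchmark runs.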
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
It would not just be for fun; these optimizations, which our scientists and developers aren't that familiar with, can have great benefits if the speed-up is significant. I'll give that a try and see how things improve. The Linux 64-bit build with SSE4.2 and gcc 4.4.7 does seem to show a more significant improvement than our Windows SSE2 version: around 12% on my quick test. I need to do more thorough tests, though, particularly for the Windows builds, but judging from this Linux improvement, it may be worthwhile to upgrade to VS 2015 when it comes out. |
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
"Suggestions for the best optimization options would be greatly appreciated since this is all a bit of voodoo to me keeping in mind general compatibility without having to have specific boinc builds other than 32 and 64bit versions." If you can, I'd really suggest reaching out to user rjs5 via a private message. He seems to have a stronger technical understanding of the various compiler options than most of us talking heads on these forums, and he seemed very willing to help. |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,012,385 RAC: 6,215 |
"Suggestions for the best optimization options would be greatly appreciated since this is all a bit of voodoo to me keeping in mind general compatibility without having to have specific boinc builds other than 32 and 64bit versions." You (Timo) pinged me with a message through the board but I infrequently stop to pick them up. I think running the BETA program through RALPH is dumb. They could/should simply define a NEW "Beta OPT IN" project OPTION on this Rosetta board and build upon their current contributors. Including me and my Haswell machine. Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder. There will be a couple of barriers to get to real performance improvement. I think that getting optimized versions for both Windows and Linux is a challenge that most people overlook. I have downloaded a number of other project sources to poke around but I have never really built a version because I did not want to mess up my running projects. For this one, I would work with David E. K. to see if I could help. You have to be careful about determining how much improvement you "got". Windows is VERY AGGRESSIVE about using TURBO mode and when you start optimizing Windows code, the CPU will heat up faster when you start switching more transistors and drop out of TURBO mode earlier. Your code gets faster but it overheats your system. Linux systems are FAR LESS AGGRESSIVE about using TURBO and the performance benefits of improving code are "more visible" to the person with the stop watch. I would watch the Windows CPU temperature and frequency with one of the many TOOLS available. I tend to use SPECCY which has been good to me. https://www.piriform.com/speccy I loaded XSensors on my Linux system to monitor CPU temperature. If you want to watch the CPU temperature go NUTS, run prime95 while watching the temp and frequency. 
Even PrimeGrid apps stress my liquid-cooled machine: Intel Core i7-5930K (Haswell-E/EP, 6 cores, 12 threads, Socket 2011 LGA, 22nm, 3.50GHz). The stages of performance improvement:
1. SSE2: The first will be to migrate to 64-bit floating point from the old x87 80-bit floating point. x87 80-bit was supported by Intel but not by any of their RISC competitors during the "RISC vs CISC" wars. x87 80-bit registers are rounded to 64 bits when stored to memory, so depending on the code, the 80-bit FP values can be cut down to 64-bit values at various points in the computation, leading to error variation creeping into calculations at different rates. If they are able to get satisfactory results with the SSE2 options which TURN OFF the x87, then all other options are open.
2. VECTOR INSTRUCTIONS: The second level of optimization will then be to make sure their algorithms are written so the compiler can VECTORIZE them. The SSE2 instructions operate on 128-bit XMM registers that can do 2 64-bit FP operations, 4 32-bit operations, or 8 16-bit operations in a similar number of clocks. If they do 64-bit FP operations in a loop where 32-bit operations are OK, then they are losing 50% performance while executing the code. The performance loss percentage gets bigger as the size of the vector register increases. There are a number of things that can be done in the program to "encourage" the compiler to make the decision to vectorize the code automatically, and the compilers are getting better. The developer can also use intrinsic statements to force the compiler to use vector instructions. Intel has an Intrinsics Guide online at https://software.intel.com/sites/landingpage/IntrinsicsGuide/ There are also hand-optimized libraries supported by Intel and open source groups (with the help of Intel) that developers can include in their code. 
Intel MKL and IPP libraries are, I think, available to educational institutions for distribution.
3. VECTOR SIZE: SSE2 operates on 128-bit XMM registers. AVX extends floating-point vectors to the 256-bit YMM registers, and AVX2 added 256-bit INTEGER vector instructions. AVX-512 (informally "AVX3"; Skylake Xeon and Xeon Phi) will operate on 512-bit vector registers. When using the VECTOR instructions, the compiler will choose either the SCALAR operations (x87-like, one at a time) OR the PARALLEL or PACKED operations that do multiple operations in parallel. The goal of the developer is to code the algorithm to use the PARALLEL or PACKED operations.
Parallel / Scalar:
ADDPS / ADDSS - adds operands
SUBPS / SUBSS - subtracts operands
MULPS / MULSS - multiplies operands
DIVPS / DIVSS - divides operands
You want to write your code so it uses PACKED or PARALLEL operations. Scalar code will give you a few percent performance improvement. PARALLEL will give you MULTIPLE times performance improvement. |
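The packed versus scalar distinction described in this post can be sketched with SSE intrinsics. This is an illustration only, assuming an x86 compiler that provides `<xmmintrin.h>`; `lanes`, `add_scalar`, and `add_packed` are hypothetical names, and a plain loop is used as a fallback on other targets.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define USE_SSE 1
#endif

// How many elements fit in one vector register, e.g. 128 / 32 = 4 floats.
constexpr int lanes(int register_bits, int element_bits) {
    return register_bits / element_bits;
}

// Scalar path: one float addition per instruction (ADDSS).
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    for (std::size_t i = 0; i < n; ++i) {
#ifdef USE_SSE
        __m128 x = _mm_load_ss(a + i);
        __m128 y = _mm_load_ss(b + i);
        _mm_store_ss(out + i, _mm_add_ss(x, y));
#else
        out[i] = a[i] + b[i];
#endif
    }
}

// Packed path: four float additions per instruction (ADDPS),
// followed by a plain loop for any leftover elements.
void add_packed(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
#ifdef USE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(a + i);  // unaligned loads keep the sketch simple
        __m128 y = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(x, y));
    }
#endif
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

Both functions produce the same results; the packed version simply issues roughly a quarter as many add instructions for long arrays. In practice a compiler may auto-vectorize a plain loop into the packed form on its own, so intrinsics like these are mainly useful where it does not.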
©2024 University of Washington
https://www.bakerlab.org