GPU computing

Mark · Joined: 10 Nov 13 · Posts: 40 · Credit: 397,847 · RAC: 0
Message 80082 - Posted: 14 May 2016, 17:10:05 UTC - Last modified: 14 May 2016, 17:23:58 UTC

I've just been looking at the performance of the new GTX 1080, and for DOUBLE precision calculations it does 4 TFLOPS! For comparison, a relatively high-performance chip like an overclocked 5820K will do maybe 350 GFLOPS. So we are talking about an order of magnitude difference. In addition, the Tesla HPC version will probably be double that at 8 TFLOPS. (Edit: Looks like it is actually 5.3 TFLOPS.) The Volta version of the GTX 1080 (the next generation, due in about 18 months' time) is rumoured to be 7 TFLOPS FP64 in the consumer version.

There is no way that conventional processors can keep up with that level of calculation. How big does the gap between serial CPU and parallel GPU have to be before the project leaders decide they cannot afford NOT to invest in recoding for parallel processing? Because in two years' time, HPC GPUs will be around 35 times faster than CPUs.

How much will it cost to rewrite the code, $100-150K maybe? Isn't that worth paying for such a huge step up? With that kind of performance increase, you can make the calculations more accurate. You no longer have to use approximations like LJ potentials; you can calculate the energy accurately and get a better answer in less time than now. What's not to like?

As with so many projects, it seems everyone is comfortable with what they are doing now. Revolution has been forsaken for evolution. Understandable, but is it the best way to do things? Be bold and take the leap!

dcdc · Joined: 3 Nov 05 · Posts: 1836 · Credit: 124,981,563 · RAC: 265
Message 80083 - Posted: 14 May 2016, 23:20:17 UTC - Last modified: 14 May 2016, 23:26:24 UTC

It's been said a thousand times: the people who have looked have all said it's not viable. If you think you can get Rosetta to run faster on a GPU than on a CPU then offer your services - they've shown that they're willing to work with serious and capable people.

GPUs are great for some stuff, but most programs run faster on a CPU regardless of theatrical numbers for perfect, hugely parallel workloads. Brute force through more, more efficient and faster cores is the current best option unless rjs5 works some magic. Hopefully Zen will deliver a cheap way to 16 fast threads.

D
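
The point about peak figures only applying to perfectly parallel workloads can be made concrete with Amdahl's law. The sketch below is illustrative only: the 4 TFLOPS and 350 GFLOPS values are the peak numbers quoted in the opening post, taken at face value, and the function name `amdahl_speedup` and any parallel fraction passed to it are assumptions, not measurements of Rosetta.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction p of the runtime is
// accelerated by a factor s and the remaining (1 - p) stays serial.
// p and s are inputs chosen by the caller; nothing here is measured.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

// Ratio of the peak figures quoted in the opening post
// (4 TFLOPS GPU vs 350 GFLOPS CPU), roughly 11.4.
double quoted_peak_ratio() {
    return 4000.0 / 350.0;
}
```

Under these assumptions, `amdahl_speedup(0.50, quoted_peak_ratio())` comes out at roughly 1.8 and `amdahl_speedup(0.90, quoted_peak_ratio())` at roughly 5.6, so the headline ratio is only approached when nearly all of the runtime can be moved to the GPU.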

Timo · Joined: 9 Jan 12 · Posts: 185 · Credit: 45,662,635 · RAC: 0
Message 80084 - Posted: 14 May 2016, 23:31:05 UTC

We basically have someone starting a thread like this one every 3-4 months. Rosetta@Home's protocols simply don't lend themselves well to GPU architecture.

If you have a GPU floating around and want to do protein-related research with it, POEM@Home and Folding@Home both do protein folding simulations on the GPU. However, each of them tackles a fundamentally different problem from Rosetta's, so their simulations can be done using common Molecular Dynamics libraries that are very much GPU friendly.

[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 2188 · Credit: 13,720,774 · RAC: 3,167
Message 80088 - Posted: 16 May 2016, 9:19:45 UTC - in response to Message 80084.

> We basically have someone starting a thread like this one every 3-4 months.

That's true, but... The latest "public" test of Rosetta on GPU, if I'm not wrong, was 4-5 years ago. Some things have changed in both hardware and software. :-)

dcdc · Joined: 3 Nov 05 · Posts: 1836 · Credit: 124,981,563 · RAC: 265
Message 80089 - Posted: 16 May 2016, 9:54:00 UTC - in response to Message 80088.

> > We basically have someone starting a thread like this one every 3-4 months.
> That's true, but... The latest "public" test of Rosetta on GPU, if I'm not wrong, was 4-5 years ago. Some things have changed in both hardware and software. :-)

The problem is that the people who post saying it must be done are generally not capable of doing it, or of knowing whether it is do-able, or of doing the cost-benefit analysis as to whether it would be worthwhile even if it were viable.

I think one of the main things that is often overlooked is that a priority for Rosetta is the continual development of the capabilities of the software, rather than optimising it at any one point in time.

[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 2188 · Credit: 13,720,774 · RAC: 3,167
Message 80090 - Posted: 16 May 2016, 11:59:15 UTC - in response to Message 80089. Last modified: 16 May 2016, 12:00:22 UTC

> The problem is that the people who post saying it must be done are generally not capable of doing it, or of knowing whether it is do-able, or of doing the cost-benefit analysis as to whether it would be worthwhile even if it were viable.

I agree with you. I know the problems of GPU coding, and I think it is important to have "do-better" code for good science. But I also understand people who want "do-faster" code to produce more results (with GPU or with CPU optimizations). Is it possible to have "do-better-faster" code? :-P

Mod.Sense · Volunteer moderator · Joined: 22 Aug 06 · Posts: 4018 · Credit: 0 · RAC: 0
Message 80091 - Posted: 16 May 2016, 21:42:08 UTC

I believe rjs5 is doing a general review, looking for the kind of parallel operations that GPUs do so well, as he looks for CPU optimizations.

Rosetta Moderator: Mod.Sense

Mark · Joined: 10 Nov 13 · Posts: 40 · Credit: 397,847 · RAC: 0
Message 80092 - Posted: 17 May 2016, 13:26:35 UTC

I get all the points made here. I'm not saying it's easy. The original question is at what point does the difference between CPU and GPU performance become so great that the conversion project becomes worthwhile.

The Rosetta code got upgraded in a big project (to C++, I think) a while back. I am talking about a similar effort. Yes, I realise it will take time and ultimately money. Anyone have a feel for how much it would cost? I plucked a figure out of the air a bit; was it reasonable?

dcdc · Joined: 3 Nov 05 · Posts: 1836 · Credit: 124,981,563 · RAC: 265
Message 80093 - Posted: 17 May 2016, 18:58:05 UTC - in response to Message 80092.

> I get all the points made here. I'm not saying it's easy. The original question is at what point does the difference between CPU and GPU performance become so great that the conversion project becomes worthwhile.

I'm not sure that it ever does, unless either the compiler does the work to make the code run on the GPU, or there are parts of the code that are largely static so that the ongoing code development isn't hindered/complicated.

> The Rosetta code got upgraded in a big project (to C++, I think) a while back. I am talking about a similar effort. Yes, I realise it will take time and ultimately money. Anyone have a feel for how much it would cost? I plucked a figure out of the air a bit; was it reasonable?

I think that's right - I think they moved from Fortran(?). I think rjs5 is the best qualified/positioned to answer that one (although putting a cost to it might not be possible).

D

[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 2188 · Credit: 13,720,774 · RAC: 3,167
Message 80095 - Posted: 18 May 2016, 10:13:03 UTC - in response to Message 80092. Last modified: 18 May 2016, 10:14:00 UTC

> The original question is at what point does the difference between CPU and GPU performance become so great that the conversion project becomes worthwhile.

With the upcoming generation of GPUs (Nvidia and AMD are presenting new families on new 16/14 nm manufacturing processes) the gap between CPU and GPU will be incredible, but...

- Not all software can run on GPUs.
- Rosetta is now in C++ but, if I'm not wrong, still has some parts/libraries/etc. in Fortran (or Fortran-like).
- The Rosetta team is focused on the scientific part of the code, not on optimization/GPU/whatsoever. The only person working on the code is, in fact, a volunteer.
- They are using old compilers, server OS, BOINC server, etc. and do not seem very interested in updating.

I think one solution may be a graduate/PhD/etc. in computer science who works in the Rosetta team ONLY on optimization of the code. Another solution may be to open-source the Rosetta code. Only the admins can decide.

rjs5 · Joined: 22 Nov 10 · Posts: 274 · Credit: 23,730,845 · RAC: 0
Message 80097 - Posted: 18 May 2016, 15:06:21 UTC - in response to Message 80095. Last modified: 18 May 2016, 15:07:35 UTC

> > The original question is at what point does the difference between CPU and GPU performance become so great that the conversion project becomes worthwhile.
> With the upcoming generation of GPUs (Nvidia and AMD are presenting new families on new 16/14 nm manufacturing processes) the gap between CPU and GPU will be incredible, but...
> 1- Not all software can run on GPUs.
> 2- Rosetta is now in C++ but, if I'm not wrong, still has some parts/libraries/etc. in Fortran (or Fortran-like).
> 3- The Rosetta team is focused on the scientific part of the code, not on optimization/GPU/whatsoever. The only person working on the code is, in fact, a volunteer.
> 4- They are using old compilers, server OS, BOINC server, etc. and do not seem very interested in updating.
> I think one solution may be a graduate/PhD/etc. in computer science who works in the Rosetta team ONLY on optimization of the code. Another solution may be to open-source the Rosetta code. Only the admins can decide.

1- GPUs are essentially very wide, heterogeneous "AVX" registers. You ship a vector of data to the GPU, it crunches MANY elements (hundreds/thousands) at once, and then you retrieve the results. The overhead of the transfers has to be small compared to the benefit.

2- It appears to me that Rosetta (or major chunks of it) started out in Fortran and was then converted to C++. I am not a C++ programmer, but it appears the programmer or the converter tool went slightly overboard on the templates and made some fundamental mistakes in the data structure design. Rosetta is based on an XYZ vector data element. All Rosetta XYZ operations are performed on X, then Y and then Z. IF they changed the XYZ to an XYZW 4-element structure, the compilers could be encouraged to perform operations on the XY pair, then the ZW pair ... for a 50% improvement with SSE. AVX2 could perform operations on the combined XYZW elements for a 75% speedup ON THOSE SECTIONS OF CODE. This is what I am looking at.

I speak "C" with an "Assembler accent" and I am looking at C++ through "very thick glasses" using a C-to-C++ translator. I do OK with C but C++ is new. Crunching an XYZW vector coordinate is attainable without much Rosetta modification. I am not familiar enough to guess where the next step in parallelism might be ... but the XYZW conversion would be a first step in either case anyway.

3- I can testify that this is accurate. With 800k users and 1.7 million hosts, they cannot afford to speed Rosetta up too much or they will melt the internet transferring the 279 MB database files. 8-) I am also looking at the impact of, and issues with, partitioning Rosetta protocols.

4- The Rosetta BOINC server software could use updating. They are building Rosetta now with newer versions of software that are doing a pretty good job. Their hardware is pretty dated. I suspect that the hardware is groaning under the weight of its success ... assuming their equipment/server descriptions are current (other than the typos GB vs. GHz). The only thing that would cost more than a few $k to upgrade would be the 48 x 600GB disk drives on the GPFS SAN fileserver. It really depends on the system loading patterns and where the bottlenecks are. You could probably make a noticeable difference in performance with $5 of equipment judiciously applied.

Rosetta@home Hardware

Web servers: boinc, srv1, srv2, srv3, srv4, srv5
- 4-core Dell R210s
- mirrored 146GB 15K RPM system disks
- 8 GB RAM

Database server
- Dell 2950, dual-quad 2GB Xeon E5405 w/ 6MB cache
- 32 GB RAM
- mirrored root raid, running 2.6.18-92.1.22.el5 (64 bit)
- the database files are written to a hardware mirror of two 15K 300GB disks
- dual GB NIC

GPFS SAN fileserver
- two IBM x3650
- one IBM (1)DS3512/(3)EXP3512 disk controllers
- 48 600GB 15,000 RPM SAS disks
- clustered fileserver running IBM's GPFS filesystem
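
The XYZW idea in point 2 can be sketched roughly as follows. This is a hypothetical illustration, not Rosetta's actual vector class: `Vec4`, `add`, and `dot` are made-up names, and the fourth lane is assumed to be padding held at zero so that one coordinate fills a 32-byte block that a 256-bit register could load in one go. Whether a given compiler really vectorizes these loops depends on its flags and target.

```cpp
#include <cstddef>

// Hypothetical padded coordinate type (an assumption for illustration).
// v[0..2] hold x, y, z; v[3] is the unused "w" lane, kept at 0.0.
struct alignas(32) Vec4 {
    double v[4];
};

// The layout is what matters: four doubles, 32-byte aligned.
static_assert(sizeof(Vec4) == 32, "four doubles, no extra padding");
static_assert(alignof(Vec4) == 32, "aligned for 256-bit loads");

// Element-wise add over all four lanes. A fixed-length loop over
// aligned data is a pattern compilers can typically auto-vectorize.
inline Vec4 add(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) {
        r.v[i] = a.v[i] + b.v[i];
    }
    return r;
}

// Dot product. Because w is zero in both operands it does not change
// the result. The final sum is a reduction, which a compiler may keep
// scalar unless it is allowed to reorder floating-point additions.
inline double dot(const Vec4& a, const Vec4& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < 4; ++i) {
        s += a.v[i] * b.v[i];
    }
    return s;
}
```

The design choice here is to trade 8 bytes of padding per coordinate for a layout in which every element-wise operation touches exactly one aligned 4-lane block, instead of three separate scalars.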

Chilean · Joined: 16 Oct 05 · Posts: 711 · Credit: 26,694,507 · RAC: 0
Message 80105 - Posted: 19 May 2016, 15:19:28 UTC - in response to Message 80097.

I can't fathom the computing knowledge you need for something like Rosetta. Or anything useful for that matter... I just got into learning Python (I figured an EE should know a good bit of programming) and I'm struggling like mad. MATLAB is the only language I'm proficient at, but it's so user friendly it doesn't count IMO.

[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 2188 · Credit: 13,720,774 · RAC: 3,167
Message 80106 - Posted: 19 May 2016, 16:13:37 UTC - in response to Message 80105. Last modified: 19 May 2016, 16:14:16 UTC

> I can't fathom the computing knowledge you need for something like Rosetta. Or anything useful for that matter... I just got into learning Python (I figured an EE should know a good bit of programming) and I'm struggling like mad. MATLAB is the only language I'm proficient at, but it's so user friendly it doesn't count IMO.

If I remember correctly, the public test of Rosy on GPU was with an old version of pycl. This is the post one developer wrote about this test. It's a pity that the PDFs are no longer available.

[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 2188 · Credit: 13,720,774 · RAC: 3,167
Message 80172 - Posted: 13 Jun 2016, 7:00:16 UTC - in response to Message 80082.

> I've just been looking at the performance of the new GTX 1080, and for DOUBLE precision calculations it does 4 TFLOPS!

I'm curious to see the upcoming RX 480. Over 5 TFLOPS SP for $199 and 150 W.

Link · Joined: 4 May 07 · Posts: 356 · Credit: 382,349 · RAC: 0
Message 80173 - Posted: 13 Jun 2016, 17:46:32 UTC - in response to Message 80097.

> 3- I can testify that this is accurate. With 800k users and 1.7 million hosts, they cannot afford to speed Rosetta up too much or they will melt the internet transferring the 279 MB database files. 8-)

Every host downloads this file just once, when a new version is released. More efficient application code won't change anything here. I guess most WUs still finish before reaching the maximum number of decoys allowed, so even the amount of other files downloaded should not change much (except for some fast hosts with long target runtimes); possibly the result files will become larger.

[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 2188 · Credit: 13,720,774 · RAC: 3,167
Message 80434 - Posted: 26 Jul 2016, 8:27:00 UTC

The old problem of Rosy on GPU is the GPU memory. Problem solved!!! AMD Radeon Pro SSG

[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 2188 · Credit: 13,720,774 · RAC: 3,167
Message 80705 - Posted: 5 Oct 2016, 19:03:16 UTC

Poem@Home is closing. Two things:
- A lot of GPU power will be free
- The source code of the OpenCL protein folding app will be released to the public

Dr. Merkwürdigliebe · Joined: 5 Dec 10 · Posts: 81 · Credit: 2,657,273 · RAC: 0
Message 80706 - Posted: 5 Oct 2016, 19:28:45 UTC

First things first:
1) folding@home for YOUR GPU and probably soon for your CPU, too (AVX support) through GROMACS.
2) but AVXx support first for rosetta. Is there any progress worth speaking of?

It's interesting: in the past they (poem@home) started to recruit external personnel in order to optimize their app. Mr. Tankovich obviously did a great job: they no longer depend on the donations of pesky contributors like us.

In the long run (John Maynard Keynes: "In the long run we are all dead...") I predict this solution for rosetta, too. There is really no way that scientists have to wait hours or even days for their computation results. Personally, I hope BOINC will die because it's a kludge.

So unless there is another thing like Charity Engine coming along the way, I think that instead of having to rely on a legion of desktop-grade CPUs, it would be wiser to spend all the resources on a central cluster of AVXx-powered servers under central control.

David E K · Volunteer moderator · Project administrator · Project developer · Project scientist · Joined: 1 Jul 05 · Posts: 1480 · Credit: 4,334,829 · RAC: 0
Message 80707 - Posted: 5 Oct 2016, 21:15:11 UTC - in response to Message 80706. Last modified: 5 Oct 2016, 21:28:23 UTC

> So unless there is another thing like Charity Engine coming along the way, I think that instead of having to rely on a legion of desktop-grade CPUs, it would be wiser to spend all the resources on a central cluster of AVXx-powered servers under central control.

R@h is currently running on just a handful of machines, as it has for over 10 years. It's not a significant resource burden for the amount of volunteer computing and scientific progress. The lab also has access to central clusters and supercomputing resources granted to scientists.

-- forgot to mention R@h also runs the automated structure prediction jobs from the public server, Robetta.

[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 2188 · Credit: 13,720,774 · RAC: 3,167
Message 80708 - Posted: 5 Oct 2016, 21:35:49 UTC - in response to Message 80706.

> There is really no way that scientists have to wait hours or even days for their computation results. Personally, I hope BOINC will die because it's a kludge.

I hope not. If it is a "kludge", why are you here?

> I think that instead of having to rely on a legion of desktop-grade CPUs, it would be wiser to spend all the resources on a central cluster of AVXx-powered servers under central control.

It depends on how much money and how many resources you have, particularly if you have only CPU code.