Message boards : Number crunching : GPU Potential
Author | Message |
---|---|
SuperSluether Send message Joined: 7 Jul 14 Posts: 10 Credit: 1,357,990 RAC: 0 |
From my standpoint, this project has potential for a GPU app. BOINC regularly polls this project trying to find tasks for my Nvidia card, even though I don't see any GPU apps. From what I know of the way the data is analyzed and processed, I think a GPU would be able to process tasks much faster than a CPU. There's plenty of RAM to hold the process, and the multiple cores would allow more data to be sent through. How hard would it be to implement a GPU version of Rosetta@home? |
Mark Send message Joined: 10 Nov 13 Posts: 40 Credit: 397,847 RAC: 0 |
From my standpoint, this project has potential with a GPU app. BOINC regularly polls this project trying to find tasks for my Nvidia, even though I don't see any GPU apps. From what I know, the way data is analyzed and processed, I think a GPU would be able to process tasks much faster than a CPU. There's plenty of RAM to hold the process, and the multiple cores would allow for more data to be sent through. How hard would it be to implement a GPU version of Rosetta@home? This is a question that gets asked regularly. The answer is that it would be very hard, and it is not at the top of the priorities. |
Jesse Viviano Send message Joined: 14 Jan 10 Posts: 42 Credit: 2,700,472 RAC: 0 |
From my standpoint, this project has potential with a GPU app. BOINC regularly polls this project trying to find tasks for my Nvidia, even though I don't see any GPU apps. From what I know, the way data is analyzed and processed, I think a GPU would be able to process tasks much faster than a CPU. There's plenty of RAM to hold the process, and the multiple cores would allow for more data to be sent through. How hard would it be to implement a GPU version of Rosetta@home? The way Rosetta@home folds proteins is extremely serial: each step of creating a decoy (a guess at what the folded protein would be shaped like) feeds on the previous step. The only steps that do not feed on each other are the creation of a new decoy after the previous decoy is finished. There is not much to parallelize in this program, so a GPU would fail at this job. |
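The serial dependency described above can be sketched with a toy Metropolis Monte Carlo loop (a simplified stand-in, not Rosetta's actual algorithm or score function): every trial move perturbs the state produced by the previous step, so the steps inside one decoy cannot be spread across GPU threads; only whole decoys are independent.

```python
import math
import random

def fold_decoy(n_steps, seed, temperature=1.0):
    """Toy Metropolis Monte Carlo trajectory (illustration only, not
    Rosetta's real scoring): each step perturbs the state produced by
    the previous step, so the loop body cannot run in parallel."""
    rng = random.Random(seed)
    state = rng.uniform(-1.0, 1.0)          # stand-in for a conformation
    energy = state * state                  # stand-in for a score function
    for _ in range(n_steps):
        trial = state + rng.gauss(0, 0.1)   # move depends on current state
        trial_energy = trial * trial
        delta = trial_energy - energy
        # Metropolis criterion: accept downhill moves, sometimes uphill ones
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            state, energy = trial, trial_energy
    return energy

# Decoys are the only independent unit of work: these three calls could
# run in parallel, but the 1000 steps inside each call cannot.
energies = [fold_decoy(1000, seed) for seed in (1, 2, 3)]
```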
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
There is not much to parallelize in this program, so a GPU would fail in this job. Rosetta admins know the power of the GPU (see, for example, the Poem@home project), but you are right, it's VERY difficult to implement. This is the reason I hope for the AVX/AVX2 extensions, which are probably easier to implement and would give an immediate gain (look at the first 100 CPUs in Statistics). |
alex Send message Joined: 21 Dec 14 Posts: 8 Credit: 2,669,706 RAC: 40 |
Please do not forget the FMA3/FMA4 capabilities of AMD CPUs. Crunch3r's FMA4 implementation at Asteroids@Home is extremely efficient! |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
Please do not forget the FMA3/FMA4 capabilities of AMD CPUs. Crunch3r's FMA4 implementation at Asteroids@Home is extremely efficient! But: the incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3, almost at the same time. |
Jesse Viviano Send message Joined: 14 Jan 10 Posts: 42 Credit: 2,700,472 RAC: 0 |
Please do not forget the FMA3/FMA4 capabilities of AMD CPUs. Crunch3r's FMA4 implementation at Asteroids@Home is extremely efficient! AMD's latest CPUs can support both FMA3 and FMA4. |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
In other news... Red Hat Engineer Improves Math Performance of Glibc - this might mean some increased performance is just a compiler upgrade away (no idea who does the actual builds of the Rosetta core or which compilers are currently in use - interesting optimizations to be had anyway) |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
Rosetta Admins know the power of gpu (see, for example, Poem@home project), but you are right, it's VERY difficult to implement. Gerasim@home, for example, doesn't use CUDA or OpenCL, but C++ AMP. They went, in a few days, from less than 30 GFlops to over 300 GFlops... |
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Rosetta Admins know the power of gpu (see, for example, Poem@home project), but you are right, it's VERY difficult to implement. I'm not too sure whether those who are 'blessed' with the *very expensive* Intel MKL could relink Rosetta against it (and that would only matter if Rosetta happens to use the specific calls and algorithms that MKL optimizes). I'd also think perhaps only those who own a recent CPU (e.g. Ivy Bridge or Haswell) would see the benefits. There's also ACML from AMD, which is presumably biased to benefit AMD platforms more, but I'd guess doing so could be deemed too 'biased', as all the rest who don't benefit from that hardware would complain they are 'left in the cold' :o :p lol
I'd guess that while the benefits are there, there are simply many varied platforms to support (this is true even for CUDA/OpenCL), and not all platforms (i.e. GPUs) support the features necessary for accelerated computation. CUDA/OpenCL hardware is known for drastically reduced *double precision* floating-point performance versus single precision (some cards run double precision at as little as 1/8 of their single-precision rate), and the GPUs that handle *accelerated double precision* well are probably *very expensive*. Let alone the fact that it may require significant rework (using totally different methods) just to get the performance gains. http://www.cs.virginia.edu/~mwb7w/cuda_support/double.html "On the GTX 280 & 260, while a multiprocessor has eight single-precision floating point ALUs (one per core), it has only one double-precision ALU (shared by the eight cores). Thus, for applications whose execution time is dominated by floating point computations, switching from single-precision to double-precision will increase runtime by a factor of approximately eight."
I'd also guess that for *complex problems* with a lot of *iterative dependencies* (e.g. where the next iteration depends on the results of a prior iteration, and those results cannot be predicted), there is a significant limit: it could be impossible to parallelize, or the parallelized results may lead to wrong answers. http://en.wikipedia.org/wiki/Amdahl%27s_law |
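The Amdahl's law limit mentioned above is easy to put numbers on. A minimal sketch (the processor count and parallel fractions below are hypothetical examples, not measurements of Rosetta):

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: the overall speedup when only a fraction of a
    program's work can be spread across n processors; the serial
    remainder always runs at its original speed."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even with thousands of GPU cores, a program that is only 50%
# parallelizable can never run more than 2x faster overall.
for p in (0.5, 0.9, 0.99):
    print(p, round(amdahl_speedup(p, 2048), 1))
```

This is why a mostly serial decoy trajectory caps the benefit of a GPU no matter how many cores it has.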
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
This is a somewhat 'techie/geeky' post - ignore it if you find it that way :o lol
Apparently, at AMD/Nvidia etc. the GPU designs have pushed the envelope of 'old school' technology. The higher-end AMD/Nvidia GPU is effectively today's *vector CPU* (along the notions of the earlier vector supercomputers - Cray etc.). This is most apparent from the vector ALU instructions in this AMD technical document, which I'd think would be used by technologies such as OpenCL/CUDA: http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf Table 6.2 ("Vector ALU Instruction Set") lists instructions such as V_ADD_F64, V_MUL_F64, V_SQRT_{F32,F64}, V_RSQ_{F32,F64}, V_MIN_F64, V_MAX_F64, V_MUL_{LO,HI}_{I32,U32}, V_LSHL_B64, V_ASHR_I64, and lots more.
That's significantly more elaborate than what used to be understood as a 'GPU' (graphics processing unit). In effect, the GPU packs the punch of parallel (SIMD - single instruction, multiple data) computation into what is traditionally done on a CPU, and today's GPUs apparently pack dozens to even hundreds of such vector ALU cores per chip. To leverage today's AMD/Nvidia platforms, in particular the 'GPU' vector ALU instructions/technologies, it is apparently necessary to rewrite/redesign programs to use these technologies (e.g. OpenCL/CUDA). This is unfortunate, as much of today's software uses 'traditional' Intel x86-style instructions and much less is designed for vector computation.
The other thing is that GPUs often run at lower frequencies (e.g. 1 GHz) compared to today's CPUs at, say, 3-4 GHz. Thus, on many of the 'benchmark' web sites that pit Intel against AMD CPUs, the apparent lack of prowess of AMD CPUs may simply be that the benchmark programs compare x86 instruction prowess, which probably puts AMD at a 'disadvantage', as AMD's designs seem more optimized towards vector computation. If vectorized OpenCL computation were compared against CPU tasks, with the program optimized for each platform (GPU vs CPU) and intended to produce the same results, I'd guess the GPU vs CPU benchmarks (say, in GFlops) would be much closer, and would likely exceed CPU prowess for the *high-end* GPUs. Unfortunately, *high-end* is needed as a qualifier here, as a lot of 'lower-end' GPUs use software emulation for double-precision vector computation, which cripples performance to, say, 1/8 of what is possible with single-precision floats. And the 'high-end' GPUs are likely the *expensive* GPUs today. |
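The clock-speed vs lane-count trade-off above can be put into a back-of-envelope peak-throughput model. All numbers here are illustrative assumptions (a hypothetical 4-core CPU with 8-wide AVX, and a GPU with 2816 lanes at 1 GHz), not measured figures:

```python
def peak_gflops(cores, lanes, ghz, flops_per_lane_per_cycle=2):
    """Back-of-envelope peak throughput in GFLOPS:
    cores x SIMD lanes x clock (GHz) x FLOPs per lane per cycle
    (2 assumes one fused multiply-add issued per cycle)."""
    return cores * lanes * ghz * flops_per_lane_per_cycle

# Hypothetical example: a lower-clocked GPU with thousands of lanes
# still dwarfs a higher-clocked CPU on peak single-precision numbers.
cpu_sp = peak_gflops(4, 8, 3.5)      # 4 cores, 8-wide SIMD, 3.5 GHz
gpu_sp = peak_gflops(1, 2816, 1.0)   # 2816 lanes at 1.0 GHz
gpu_dp = gpu_sp / 8                  # 1/8-rate double precision, as above
```

Of course these are peak numbers; Amdahl's law and memory bandwidth decide how much of that peak a real workload sees.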
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Just for curiosity's sake, today's home desktop teraflop (trying hard to be petaflop) vector supercomputer? :o lol http://en.wikipedia.org/wiki/Radeon_HD_9000_Series Radeon R9 295X2 (Apr 8, 2014): 2x "Vesuvius" GCN 1.1 GPUs on 28 nm, 2x 2816:176:64 shader config at 1018 MHz, 2x 4096 MB GDDR5 at 1250 MHz (5000 effective), 2x 320 GB/s memory bandwidth, PCIe 3.0 x16. http://www.pcworld.com/article/2140581/amd-radeon-r9-290x2-review-if-you-have-the-cash-amd-has-the-compute-power.html |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
i'm not too sure if those who're 'blessed' with *very expensive* Intel MKL could relink rosetta against Intel MKL (and if and only if rosetta happens to use specific calls and algorithms that happens to be optimized by MKL) and i'd also think it may perhaps only specifically benefit those who own a recent cpu e.g. ivy bridge or haswell) may see the benefits. there's also the ACML from amd which presumably biased to benefit amd platforms better That's for sure. But, for example, the SSEx extensions are cross-platform between AMD and Intel (like AVX), and some projects use them (Seti, Poem, Einstein, etc.). http://boinc.berkeley.edu/trac/wiki/AppPlan But only the admins can answer our questions. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
Now Gpugrid also has its OpenCL client (besides Poem@home). |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
The new (provisional) version of OpenCL (2.1) supports C++ and, thanks to the new SPIR, many other languages. Khronos OpenCL |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
I think Rosetta might be able to harness GPUs once this whole "unified memory" thing comes out from nVidia and AMD. AFAIK, the main problem with running Rosetta on a GPU is RAM. Just imagine: a single WU uses about 0.5 GB. But with unified memory, the GPU should be able to access system RAM and the CPU should be able to access VRAM directly. |
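A rough sketch of the arithmetic behind this RAM concern (the 0.5 GB per task figure is the one quoted above; the VRAM and system RAM sizes are hypothetical examples):

```python
# Memory-footprint arithmetic: how many ~0.5 GB tasks fit in typical
# GPU VRAM vs system RAM. All sizes except the per-task figure are
# hypothetical examples for illustration.
task_gb = 0.5            # per-WU figure quoted in the post above
vram_gb = 3.0            # e.g. a 4 GB card minus display/driver overhead
system_ram_gb = 16.0     # a typical desktop

tasks_in_vram = int(vram_gb // task_gb)
tasks_in_system_ram = int(system_ram_gb // task_gb)
```

With unified memory, the pool a GPU task can draw from grows from the VRAM budget to the (much larger) system RAM budget, which is the point being made here.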
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
AFAIK, the main problem with running Rosetta on a GPU is RAM. Just imagine, a single WU uses about 0.5 GB. But with unified memory, the GPU should be able to access system RAM and CPU should be able to access VRAM directly. The AMD Carrizo APU (the first HSA-compliant CPU) may be a first solution. "Unified memory" is now, I think, a beta technology, but in the future... |
©2024 University of Washington
https://www.bakerlab.org