Message boards : Number crunching : Will there be a 64-bit client in the near future?
Author | Message |
---|---|
Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
The folks over at folding@home / GROMACS seem to put a little more effort in it...
14:37:02:WU00:FS00:FahCore 0xa7 started
14:37:03:WU00:FS00:0xa7:*********************** Log Started 2016-11-11T14:37:02Z ***********************
14:37:03:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
14:37:03:WU00:FS00:0xa7: Type: 0xa7
14:37:03:WU00:FS00:0xa7: Core: Gromacs
14:37:03:WU00:FS00:0xa7: Website: http://folding.stanford.edu/
14:37:03:WU00:FS00:0xa7: Copyright: (c) 2009-2016 Stanford University
14:37:03:WU00:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
14:37:03:WU00:FS00:0xa7: Args: -dir 00 -suffix 01 -version 704 -lifeline 3376 -checkpoint 15 -np 8
14:37:03:WU00:FS00:0xa7: Config: <none>
14:37:03:WU00:FS00:0xa7:************************************ Build *************************************
14:37:03:WU00:FS00:0xa7: Version: 0.0.11
14:37:03:WU00:FS00:0xa7: Date: Sep 20 2016
14:37:03:WU00:FS00:0xa7: Time: 06:40:11
14:37:03:WU00:FS00:0xa7: Repository: Git
14:37:03:WU00:FS00:0xa7: Revision: 957bd90e68d95ddcf1594dc15ff6c64cc4555146
14:37:03:WU00:FS00:0xa7: Branch: master
14:37:03:WU00:FS00:0xa7: Compiler: GNU 4.8.5
14:37:03:WU00:FS00:0xa7: Options: -std=gnu++98 -O3 -funroll-loops -ffast-math -mfpmath=sse
14:37:03:WU00:FS00:0xa7: -fno-unsafe-math-optimizations -msse2
14:37:03:WU00:FS00:0xa7: Platform: linux2 4.6.0-1-amd64
14:37:03:WU00:FS00:0xa7: Bits: 64
14:37:03:WU00:FS00:0xa7: Mode: Release
14:37:03:WU00:FS00:0xa7: SIMD: avx_256 |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Agreed. As observed earlier in the case of the Whetstone benchmark, 32-bit code actually turned out *faster* than 64-bit code. That could imply, for instance, that those running the 32-bit BOINC client actually get more BOINC points (credits), since BOINC awards points based on the Whetstone benchmark. IMHO that is a very *crude* approach, but I'd guess the point is to be 'comparable' across different BOINC projects.
32-bit code and its data possibly use considerably *less memory* than 64-bit. That makes it more likely that the code and data fit completely in the CPU cache. This can make a world of difference if, say, the Whetstone benchmark and all its data run entirely within the CPU cache and never hit DRAM. My thought is that 64-bit code uses so much more memory that the CPU cache is thrashed, making it necessary to move data to and from memory for all the computations in the Whetstone benchmark.
It also seems impossible or infeasible for the Whetstone benchmark to use parallel features such as SSE/AVX/AVX2, as most of its operations are memory based and, in addition, the code and formulas seem deliberately organised to thwart parallelism.
Apparently, for x86 or x86_64 CPUs, all the magic of floating point performance is primarily in the *FPU*, and for that matter 32-bit or 64-bit code does not matter, as the *FPU* itself determines *all the floating point calculation performance*. Intel CPUs, especially the recent ones, apparently have a *very fast, high performance FPU* that accounts for their fast double precision floating point performance. It is apparently comparable to, or even outperforms, the lower and mid range consumer GPUs and delivers better double precision performance per watt than those GPUs.
For the Whetstone benchmark, the (Intel) CPU seems able to handle double precision floating point calculations so rapidly, taking hardly any clock cycles, that it spends most of its time moving data between memory and CPU registers / cache. I'd guess the memory movements could take many more clock cycles (say 10?) than the double precision floating point computation itself (say 1 clock cycle). That seems to be reflected in how the Whetstone benchmark gets more GFLOPS in 32-bit than in 64-bit. |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Here is a very interesting set of slides on *AVX/AVX2* from CERN, the HPC (high performance computing) people who deal with *physics*: "Haswell Conundrum: AVX or not AVX?" (2014) https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf Conclusions
They are in BOINC too and you can run their simulations: http://atlasathome.cern.ch/
That *special scenario* is apparently things like the Linpack benchmark, which depends heavily on the subroutine DGEMM (double precision general matrix multiplication), e.g. multiplying very *large* *square matrices*, say 10,000 x 10,000: https://www.pugetsystems.com/labs/hpc/Haswell-Floating-Point-Performance-493/
Once the math falls outside this DGEMM use case of multiplying very big square matrices, all that vector / parallel CPU hardware, and even the extreme speed GPU (*petaflops*) hardware, is simply *useless*. E.g. if you are trying to solve 2x2 matrices a billion times and each iteration depends on the result of the previous one, it would be just as slow as doing it in plain loops with no SSE, AVX or AVX2, lol.
In short, everything from SSE/AVX to those super high end vectorized extreme performance GPUs is only good if the whole world is simply DGEMM. Too bad DGEMM covers only very few true real-world scenarios, lol. |
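To make the "2x2 matrices a billion times" point concrete, here is a minimal Python sketch. The function names (`step`, `dependent_chain`, `independent_batch`) are made up for illustration and come from no benchmark or library. The first loop has a loop-carried dependency, so each iteration has to wait for the previous result; the second applies the same matrix to unrelated inputs, which is the kind of work a vector unit could in principle spread across lanes.

```python
def step(a, x):
    """Multiply the 2x2 matrix a by the 2-vector x."""
    return (a[0][0] * x[0] + a[0][1] * x[1],
            a[1][0] * x[0] + a[1][1] * x[1])


def dependent_chain(a, x0, n):
    """Apply a to x0 n times; iteration k needs the result of iteration k-1."""
    x = x0
    for _ in range(n):
        x = step(a, x)  # cannot start before the previous step has finished
    return x


def independent_batch(a, xs):
    """Apply a once to each input; no result depends on another one."""
    return [step(a, x) for x in xs]


fib = ((1, 1), (1, 0))  # hypothetical example matrix (Fibonacci recurrence)
print(dependent_chain(fib, (1, 0), 10))
print(independent_batch(fib, [(1, 0), (0, 1)]))
```

Pure Python will not vectorise either loop, of course. The sketch only shows the shape of the data dependency, not a timing result.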
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
Ten years ago a volunteer asked for 64-bit and ran one experiment with Rosetta. Now, in 2017, I think it's time to start testing a 64-bit app (and, maybe, optimizations). |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
IMHO microprocessors have reached the point of zero marginal improvement at 64 bits, based on the concept of Amdahl's law https://en.wikipedia.org/wiki/Amdahl's_law : a microprocessor is only as fast as its slowest bottleneck that cannot be made faster.
Intel (and possibly AMD) x86 microprocessors, with all their advances, have very fast FPUs (even if the core code runs at 32 bits, the x87 FPU works at 80 bits all the time, perhaps even when SSE/AVX/AVXn is used) https://en.wikipedia.org/wiki/X87
The hardware optimizations (e.g. superscalar instruction-level parallelism) have perhaps reached the point where it may take a single clock cycle to do some simple double precision floating point computations, and it can go no faster than that. The rest of the chain of instructions and data processing is simply memory limited in its speed. For some things you either vectorise and process in parallel or, if that cannot be done because the next step depends on the output of the previous step, this is as fast as it will ever get.
(The early 64-bit AMD64 processors worked faster in 64-bit than in 32-bit because the recent hardware optimizations (e.g. Skylake, Kaby Lake), which automatically fetch data such as 64-bit doubles in hardware and completely bypass the old 32-bit logic, had not been designed or invented back then. Today's hardware prefetches instructions, uses cache efficiently and performs speculative execution, overcoming the 32-bit vs 64-bit distinction.)
And with transistors reaching 14 nm (10 nm next), it's probably impossible to make those transistors any smaller (the quantum uncertainty principle comes into play and it may no longer be possible to keep a transistor working as a transistor at smaller sizes): the end of Moore's law. This is an oversimplification, but it brings across the point that zero marginal improvement has been reached.
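For reference, Amdahl's law can be written as speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time that benefits from an improvement and s is how much faster that fraction becomes. A small sketch follows; the function name `amdahl_speedup` and the sample numbers are only illustrative assumptions, not measurements from any CPU.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run time is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Assumed example: only 10% of the time is floating point math, 90% is memory.
# Even a near-infinitely fast FPU then gives barely more than 1.11x overall.
print(amdahl_speedup(0.10, 2))
print(amdahl_speedup(0.10, 1e9))
```

The ceiling is 1 / (1 - p), which is why a memory-bound workload gains so little from faster arithmetic alone.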
Hardware and software optimizations and transistor sizes have been pushed to limits where there can be no further improvement, regardless of whether it is 32 or 64 bits or more. I think this is true even for ARM microprocessors, where 64-bit parts show only a marginal (or even no) improvement over 32-bit ARM parts: http://www.roylongbottom.org.uk/linpack%20results.htm#anchorAndroid
Raspberry Pi 2 (gcc 4.8, CPU ARM V7A, 1000 MHz, Linux 3.18.5): DP 169 MFLOPS, SP 176 MFLOPS
Raspberry Pi 3 (gcc 4.8, CPU ARM v8-A53, 1200 MHz, Linux 4.1.19): DP 180 MFLOPS, SP 194 MFLOPS
If we take the 32-bit Raspberry Pi 2 and overclock it to 1.2 GHz, then, assuming performance scales linearly with clock, 169 * 1.2 = 202.8 MFLOPS. That implies 32-bit ARM could actually exceed the 64-bit ARM's 180 MFLOPS in the double precision Linpack benchmark, clock for clock. |
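The clock-for-clock comparison above can be checked in a few lines. The figures are the ones quoted from the Linpack results page; the linear scaling with clock speed is the same assumption the post makes, and the helper names (`mflops_per_mhz`, `scaled_mflops`) are made up for this sketch.

```python
# Linpack figures as quoted in the post (MFLOPS, gcc 4.8).
RESULTS = {
    "Raspberry Pi 2 (ARM V7A)": {"mhz": 1000, "dp_mflops": 169, "sp_mflops": 176},
    "Raspberry Pi 3 (ARM v8-A53)": {"mhz": 1200, "dp_mflops": 180, "sp_mflops": 194},
}


def mflops_per_mhz(entry, key="dp_mflops"):
    """Throughput normalised by clock speed."""
    return entry[key] / entry["mhz"]


def scaled_mflops(entry, target_mhz, key="dp_mflops"):
    """Projected MFLOPS at target_mhz, ASSUMING linear scaling with clock."""
    return entry[key] * target_mhz / entry["mhz"]


pi2 = RESULTS["Raspberry Pi 2 (ARM V7A)"]
pi3 = RESULTS["Raspberry Pi 3 (ARM v8-A53)"]
print(f"Pi 2 DP per MHz: {mflops_per_mhz(pi2):.3f}")
print(f"Pi 3 DP per MHz: {mflops_per_mhz(pi3):.3f}")
print(f"Pi 2 projected at 1200 MHz: {scaled_mflops(pi2, 1200):.1f} MFLOPS")
```

Real overclocking rarely scales perfectly, since memory speed does not rise with the core clock, so the projection is best read as an upper bound.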
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
The main advantage of going 64-bit then has more to do with memory than with processing speed: it can easily address memory beyond the 4 GB boundary. The downside is that 64-bit apps consume more memory for the same app. |
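Both halves of that trade-off are simple arithmetic. The sketch below uses a hypothetical linked node with two pointers and an 8-byte payload, and ignores alignment padding and allocator overhead, so the numbers only illustrate the trend. `struct.calcsize("P")` is the standard-library way to ask what pointer size the running Python build uses.

```python
import struct

GIB = 2 ** 30


def addressable_bytes(pointer_bits):
    """Bytes reachable with a flat address of the given width."""
    return 2 ** pointer_bits


def node_bytes(pointer_bytes, n_pointers, payload_bytes):
    """Size of a hypothetical node: n pointers plus payload, no padding assumed."""
    return pointer_bytes * n_pointers + payload_bytes


print(addressable_bytes(32) // GIB, "GiB addressable with 32-bit pointers")
print(node_bytes(4, 2, 8), "bytes per node with 4-byte pointers")
print(node_bytes(8, 2, 8), "bytes per node with 8-byte pointers")
print(struct.calcsize("P") * 8, "bit pointers in this interpreter")
```

Pointer-heavy data grows the most when moving to 64-bit, while arrays of doubles stay the same size, so how much extra memory an app needs depends on its data structures.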
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
"the main advantage for going 64bits is then more to do with memory than with processing speeds, it can easily address > 4GB boundaries. the downside is that 64bits apps consume more memory for the same app"
This is my idea too. I'm thinking of RAM, to crunch bigger simulations (with, for example, the option to select a "big app" in the user's profile). As for performance, it's the eternal wait for SSEx, AVX, FMA... :-) |
©2024 University of Washington
https://www.bakerlab.org