Will there be a 64-bit client in the near future?

Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 80837 - Posted: 11 Nov 2016, 15:46:49 UTC

The folks over at folding@home / GROMACS seem to put a little more effort into it...

14:37:02:WU00:FS00:FahCore 0xa7 started
14:37:03:WU00:FS00:0xa7:*********************** Log Started 2016-11-11T14:37:02Z ***********************
14:37:03:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
14:37:03:WU00:FS00:0xa7:       Type: 0xa7
14:37:03:WU00:FS00:0xa7:       Core: Gromacs
14:37:03:WU00:FS00:0xa7:    Website: http://folding.stanford.edu/
14:37:03:WU00:FS00:0xa7:  Copyright: (c) 2009-2016 Stanford University
14:37:03:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
14:37:03:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 704 -lifeline 3376 -checkpoint 15 -np 8
14:37:03:WU00:FS00:0xa7:     Config: <none>
14:37:03:WU00:FS00:0xa7:************************************ Build *************************************
14:37:03:WU00:FS00:0xa7:    Version: 0.0.11
14:37:03:WU00:FS00:0xa7:       Date: Sep 20 2016
14:37:03:WU00:FS00:0xa7:       Time: 06:40:11
14:37:03:WU00:FS00:0xa7: Repository: Git
14:37:03:WU00:FS00:0xa7:   Revision: 957bd90e68d95ddcf1594dc15ff6c64cc4555146
14:37:03:WU00:FS00:0xa7:     Branch: master
14:37:03:WU00:FS00:0xa7:   Compiler: GNU 4.8.5
14:37:03:WU00:FS00:0xa7:    Options: -std=gnu++98 -O3 -funroll-loops -ffast-math -mfpmath=sse
14:37:03:WU00:FS00:0xa7:             -fno-unsafe-math-optimizations -msse2
14:37:03:WU00:FS00:0xa7:   Platform: linux2 4.6.0-1-amd64
14:37:03:WU00:FS00:0xa7:       Bits: 64
14:37:03:WU00:FS00:0xa7:       Mode: Release
14:37:03:WU00:FS00:0xa7:       SIMD: avx_256
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80838 - Posted: 12 Nov 2016, 5:02:17 UTC - in response to Message 80836.  
Last modified: 12 Nov 2016, 5:17:35 UTC


There is little if any code that would currently benefit from 64-bit integers. I originally thought that the larger number of registers available in 64-bit mode would help, but the increased code and data size of 64-bit code did more damage to the caches than the register SPILLS/FILLS necessary in 32-bit mode (caused by fewer registers). That is what I measured when I actually recompiled the code as 32-bit AND 64-bit.


Rosetta spends a large chunk of its time computing "relationships" between 2 points in 3-dimensions (using floating point math).
Rosetta makes an X-dimension 64-bit floating point calculation.
Rosetta makes a Y-dimension 64-bit floating point calculation.
Rosetta makes a Z-dimension 64-bit floating point calculation.

You can change the TYPE DEFINITION of that "point" description to just add a 4th "dummy" dimension that will allow the compiler to do a SIMD vector load of all 4 dimensions, perform the operation on all 4 dimensions, and then do a SIMD vector store. The compiler will change the 3 sequential SCALAR operations on 3 dimensions into a SINGLE PARALLEL operation on 4 dimensions.

If you add the 4th dimension in the TYPEDEF, you do not need to make ANY other source code changes for the compilers to automatically generate the low level code to perform the parallel LOAD-OPERATION-STORE. VERY low hanging fruit.

The Rosetta developers said they were already "familiar" with this technique when I pointed it out last year. It WOULD be their first, easy step to take if "low" performance was a problem for them.

32-bit integer versus 64-bit integer code really makes no difference unless Rosetta code undergoes major changes.
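
A rough sketch of the layout side of the padded-point idea quoted above, using Python's ctypes purely for illustration. Rosetta itself is not written this way, and the names Point3, Point4 and pad are made up here. It only shows that a 3-double point does not fill a 256-bit register while a 4-double point does; it does not prove that any particular compiler will vectorize the code.

```python
import ctypes

class Point3(ctypes.Structure):
    # Hypothetical 3-D point: three 64-bit doubles (x, y, z).
    _fields_ = [("x", ctypes.c_double),
                ("y", ctypes.c_double),
                ("z", ctypes.c_double)]

class Point4(ctypes.Structure):
    # Same point plus an unused "dummy" 4th dimension as padding.
    _fields_ = [("x", ctypes.c_double),
                ("y", ctypes.c_double),
                ("z", ctypes.c_double),
                ("pad", ctypes.c_double)]

AVX_REGISTER_BYTES = 32  # one 256-bit AVX register holds 4 doubles

def fills_whole_registers(struct_type):
    """True if one element occupies a whole number of 256-bit registers."""
    return ctypes.sizeof(struct_type) % AVX_REGISTER_BYTES == 0

print(ctypes.sizeof(Point3), fills_whole_registers(Point3))
print(ctypes.sizeof(Point4), fills_whole_registers(Point4))
```

The trade-off is 8 extra bytes per point, which is the same cache pressure concern raised elsewhere in this thread.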


Agreed, and as observed earlier with the Whetstone benchmark, 32-bit code actually turned out *faster* than 64-bit code. That could imply, for instance, that those running a 32-bit BOINC client actually get more BOINC points (credits), since BOINC awards points based on the Whetstone benchmark. imho that is a very *crude* approach, but I'd guess the point is to be 'comparable' across different BOINC projects.

32-bit code and its data possibly use considerably *less memory* than 64-bit code. That makes it more likely that the code and data running in 32 bits fit completely in the CPU cache. This can make a world of difference if, say, the Whetstone benchmark and all its data run completely within the CPU cache, never hitting DRAM.

My thought is that the 64-bit code used so much more memory that the CPU cache is thrashed, making it necessary to move data to and from memory for all the computations in the Whetstone benchmark.
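
To put rough numbers on the footprint argument, here is a small sketch. The node layout (two pointers plus one double) and the 32 KiB cache size are assumptions picked for illustration, not measurements of Whetstone or Rosetta.

```python
import struct

# Hypothetical linked node: two pointers (prev, next) plus one double payload.
# "<" disables alignment padding so only the field sizes are counted.
NODE_32 = struct.calcsize("<IId")  # 4-byte pointers, as in a 32-bit build
NODE_64 = struct.calcsize("<QQd")  # 8-byte pointers, as in a 64-bit build

L1_CACHE_BYTES = 32 * 1024  # assumed L1 data cache size

def nodes_in_cache(node_bytes, cache_bytes=L1_CACHE_BYTES):
    """How many whole nodes fit in a cache of the given size."""
    return cache_bytes // node_bytes

print(NODE_32, nodes_in_cache(NODE_32))
print(NODE_64, nodes_in_cache(NODE_64))
```

Note that a plain array of doubles is the same size in both modes; only pointers (and types such as size_t) grow, so how much this matters depends on how pointer-heavy the data structures are.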

It also seems impossible or infeasible for the Whetstone benchmark to use parallel features such as SSE/AVX/AVX2, as most of its operations are memory based and, in addition, the code and formulas seem deliberately organised to thwart parallelism. Apparently, for x86 or x86_64 CPUs, all the magic of floating point performance is primarily in the *FPU*, and for that matter it makes no difference whether the code is 32-bit or 64-bit, as the *FPU* itself determines *all the floating point calculation performance*. Intel CPUs, especially the recent ones, apparently have a *very fast, high performance FPU* that accounts for their fast double precision floating point performance. It is comparable to, or even outperforms, the lower and mid range consumer GPUs, and delivers better double precision performance per watt than those GPUs.

For the Whetstone benchmark, the (Intel) CPU seems able to handle double precision floating point calculations so rapidly, taking hardly any clock cycles, that the CPU spends most of its time moving data between memory and the CPU registers / cache. I'd guess the memory movements could take many more clock cycles (say 10?) than the double precision floating point computation itself (say 1 clock cycle). That seems to be reflected in how the Whetstone benchmark gets more GFLOPS in 32 bits vs 64 bits.
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80839 - Posted: 12 Nov 2016, 17:16:47 UTC
Last modified: 12 Nov 2016, 17:39:28 UTC

Here is a very interesting set of slides on *AVX/AVX2* from CERN, the HPC (high performance computing) people who deal with *physics*:

Haswell Conundrum: AVX or not AVX? (2014)
https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf
Conclusions

– Free lunch is over
  » In 2 years the computational power of Intel workstations has increased by 30% max (including core count and freq-boost)
  » For servers even less
– Power management affects individual components:
  » Achieving maximal throughput requires to make choices among features to activate
– Memory wall is higher than ever
  » HSW improves on instruction caching though..
– Wide SIMD vectors are effective only for highly specialized code
– Little support for this new brave world in generic high level languages and libraries

Summary

– Haswell is a great new Architecture:
  » Not because of AVX
– Long SIMD vectors are worth only for intensive vectorized code
  » Are not GPUs then a better option?
– Power Management cannot be ignored while assessing computational efficiency
– On modern architecture, extrapolation based on synthetic benchmarks is mission impossible


They are in BOINC too, and you can run their simulations:
http://atlasathome.cern.ch/


That *special scenario* (the "highly specialized code" in the slides) is apparently things like the Linpack benchmark, which depends heavily on the DGEMM subroutine (double precision general matrix multiplication), e.g. multiplying very *large* *square matrices*, say 10,000 x 10,000:

https://www.pugetsystems.com/labs/hpc/Haswell-Floating-Point-Performance-493/

Once the math falls outside this DGEMM use case of multiplying very big square matrices, all that vector / parallel CPU hardware, and even those extreme speed GPUs (*petaflops*), are simply *useless*. E.g. if you are trying to solve 2x2 matrices a billion times and the result of each iteration depends on the previous one, it would be just as slow as doing it in plain loops with no SSE, AVX or AVX2 lol
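
A toy sketch of that distinction, in plain Python with made-up function names. In the first loop every step needs the previous result, so the steps cannot be spread across SIMD lanes; in the second, each product stands alone and the order does not matter.

```python
def matvec2(a, x):
    """Multiply a 2x2 matrix (tuple of rows) by a 2-vector."""
    return (a[0][0] * x[0] + a[0][1] * x[1],
            a[1][0] * x[0] + a[1][1] * x[1])

def dependent_chain(a, x0, steps):
    # Step k consumes the output of step k-1, so the steps must run in order.
    x = x0
    for _ in range(steps):
        x = matvec2(a, x)
    return x

def independent_batch(a, xs):
    # Each product is independent of the others; any order gives the same results.
    return [matvec2(a, x) for x in xs]

A = ((1, 1), (1, 0))
print(dependent_chain(A, (1, 0), 10))

xs = [(1, 0), (0, 1), (2, 3)]
forward = independent_batch(A, xs)
backward = independent_batch(A, xs[::-1])[::-1]
print(forward == backward)
```

Only the second pattern is something a vectorizing compiler or a GPU can spread out; the first is limited by the latency of a single step.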

In short, everything from SSE/AVX to all those super high end, vectorized, extreme performance GPUs is only good if the whole world is simply DGEMM. Too bad DGEMM covers just a few of the true real world scenarios lol
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,582,094
RAC: 7,913
Message 81074 - Posted: 22 Jan 2017, 20:02:14 UTC
Last modified: 22 Jan 2017, 20:02:43 UTC

10 years ago a volunteer asked for 64-bit and ran one experiment with Rosetta.
Now, in 2017, I think it's time to start testing a 64-bit app (and, maybe, optimizations).
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 81084 - Posted: 24 Jan 2017, 17:56:43 UTC
Last modified: 24 Jan 2017, 18:30:39 UTC

imho microprocessors have reached the point of zero marginal improvement at 64 bits, based on the concept of Amdahl's law
https://en.wikipedia.org/wiki/Amdahl's_law
A microprocessor is only as fast as its slowest bottleneck that cannot be made faster. Intel (and possibly AMD) x86* microprocessors, with all their advances, have very fast FPUs (even if the core code runs at 32 bits, the x87 FPU computes in 80-bit extended precision internally, while SSE/AVX arithmetic uses 32-bit or 64-bit IEEE formats)
https://en.wikipedia.org/wiki/X87
The hardware optimizations (e.g. superscalar instruction level parallelism) have perhaps reached a point where it takes maybe a single clock cycle to do some simple double precision floating point computations, and it can go no faster than that. All the rest of the chain of instructions and data processing is simply memory limited. For some things you either vectorise and process in parallel or, if that cannot be done because the next step depends on the output of the previous step, this is as fast as it will ever get.
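
For reference, Amdahl's law itself as a small sketch. The fractions plugged in below are arbitrary examples, not measurements of any real workload.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is made s times faster.

    Amdahl's law: speedup = 1 / ((1 - p) + p / s)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)

# If only half the runtime is arithmetic, making it 4x faster gives 1.6x overall,
# and even an enormous arithmetic speedup stays below 2x.
print(amdahl_speedup(0.5, 4))
print(amdahl_speedup(0.5, 1e9))
```

So if most of the time goes to waiting on memory, a faster or wider FPU changes the total very little.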

(The early 64-bit AMD64 processors ran 64-bit code faster than 32-bit code because the recent hardware optimizations (e.g. Skylake, Kaby Lake) that automatically fetch data, say 64-bit doubles, in hardware, completely bypassing the old 32-bit logic, had not been designed or invented back then. Today's hardware prefetches instructions, uses cache efficiently and performs speculative execution, overcoming all the 32-bit vs 64-bit distinction.)

And with transistors reaching 14nm (10nm next), it's probably impossible to make those transistors much smaller (the quantum uncertainty principle comes into play, and it may no longer be possible to keep a transistor working as a transistor at any smaller size), which means the end of Moore's law.

This is an oversimplification, but it brings across the point that zero marginal improvement has been reached. Hardware and software optimizations and transistor sizes have been pushed to limits where there can be no further improvements, regardless of whether it is 32 or 64 bits or more.

I think this is true even for ARM microprocessors, where 64-bit processors show only a marginal (or even no) improvement over 32-bit ARM processors:
http://www.roylongbottom.org.uk/linpack%20results.htm#anchorAndroid
 Raspberry Pi 2
 gcc 4.8                       DP        SP
 CPU          MHz     Linux    MFLOPS    MFLOPS
 ARM V7A     1000     3.18.5    169       176

 Raspberry Pi 3                                
 gcc 4.8                       DP        SP
 CPU          MHz     Linux    MFLOPS    MFLOPS
 ARM v8-A53  1200     4.1.19    180       194  


If we take the 32-bit Raspberry Pi 2 and overclock it to 1.2 GHz,
then, assuming performance scales linearly with clock speed, 169 * 1.2 = 202.8 MFLOPS.
That implies the 32-bit ARM could actually exceed the 64-bit ARM's 180 MFLOPS in the double precision Linpack benchmark, clock for clock.
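
The same arithmetic as a short sketch, using only the DP figures from the table above. Linear scaling with clock speed is an assumption; memory speed and other differences between the boards are ignored.

```python
# Double precision Linpack figures copied from the table above.
pi2 = {"name": "Raspberry Pi 2", "mhz": 1000, "dp_mflops": 169}
pi3 = {"name": "Raspberry Pi 3", "mhz": 1200, "dp_mflops": 180}

def mflops_per_mhz(board):
    """Clock-for-clock throughput."""
    return board["dp_mflops"] / board["mhz"]

def scaled_mflops(board, target_mhz):
    """Projected MFLOPS at another clock, assuming linear scaling."""
    return board["dp_mflops"] * target_mhz / board["mhz"]

for board in (pi2, pi3):
    print(board["name"], mflops_per_mhz(board))

print(scaled_mflops(pi2, pi3["mhz"]), "vs", pi3["dp_mflops"])
```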
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 81086 - Posted: 24 Jan 2017, 21:09:12 UTC

The main advantage of going 64-bit then has more to do with memory than with processing speed: it can easily address memory beyond the 4GB boundary. The downside is that 64-bit apps consume more memory for the same app.
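
Where the 4GB figure comes from, as a quick sketch. These are theoretical limits from the pointer width alone; real operating systems and CPUs expose less (current x86-64 chips, for instance, implement fewer than 64 address bits).

```python
import struct

def addressable_gib(pointer_bits):
    """Theoretical address space in GiB for a given pointer width."""
    return 2 ** pointer_bits / 2 ** 30

print(addressable_gib(32))
print(addressable_gib(64))

# Pointer width of the Python interpreter running this script.
print(struct.calcsize("P") * 8, "bit pointers here")
```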
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,582,094
RAC: 7,913
Message 81087 - Posted: 25 Jan 2017, 10:17:02 UTC - in response to Message 81086.  

The main advantage of going 64-bit then has more to do with memory than with processing speed: it can easily address memory beyond the 4GB boundary. The downside is that 64-bit apps consume more memory for the same app.


This is my idea. I'm thinking of RAM, to crunch bigger simulations (with, for example, the possibility to select "big app" in the user's profile).
For performance, the eternal wait for SSEx, AVX, FMA.. :-)

©2024 University of Washington
https://www.bakerlab.org