300+ TeraFLOPS sustained!

Author	Message
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,662,635 RAC: 0	Message 79760 - Posted: 17 Mar 2016, 13:47:32 UTC Last modified: 17 Mar 2016, 13:48:26 UTC
Looks like a big boost in CE participation has pushed Rosetta@Home well over the 300 TeraFLOP mark. Wondering if this has anyone at Baker lab thinking up any new experiments to run that may be more viable now than in the past, or whether this little boost is still orders of magnitude away from being a game changer just yet? I know things aren't that simplistic, and real progress likely comes from evolution of the algorithms behind the models, but I'm sure there are thresholds where new things become possible. Maybe it's not at 320 TeraFLOP/s though; maybe it's at 300 ExaFLOP/s. Still interesting to ponder. Progress for the win!

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2189 Credit: 13,720,774 RAC: 2,598	Message 79764 - Posted: 20 Mar 2016, 21:30:54 UTC - in response to Message 79760. Last modified: 20 Mar 2016, 21:31:42 UTC
"Looks like a big boost in CE participation has pushed Rosetta@Home well over the 300 TeraFLOP mark. Wondering if this has anyone at Baker lab thinking up any new experiments to run that may be more viable now than in the past, or whether this little boost is still orders of magnitude away from being a game changer just yet?"
After all the discussions about GPU/CPU optimization, etc., I think they are not so interested in additional computational power.

ssoxcub@yahoo.com Joined: 8 Jan 12 Posts: 17 Credit: 503,947 RAC: 0	Message 79766 - Posted: 21 Mar 2016, 6:59:44 UTC
I think they should constantly improve the code as Folding@home does. From personal experience, an Nvidia 760 gets about 80,000, while after they improved the AMD code, an R9 390 pulls down 300,000 points a day. Not sure if they could ever use an AMD processor because of its math deficits. But another thought is, you can get an older CPU that would hold its own, say 8 years old, which is an extremely long time, but an 8-year-old GPU would be outclassed by 100x or even 1000x.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 79832 - Posted: 2 Apr 2016, 12:30:30 UTC
It would be a fun eye-popper if Rosetta@home reached the petaFLOPS benchmark. Let's keep it up :)

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 79833 - Posted: 2 Apr 2016, 14:55:58 UTC - in response to Message 79766.
"But another thought is, you can get an older CPU that would hold its own, say 8 years old, which is an extremely long time, but an 8-year-old GPU would be outclassed by 100x or even 1000x."
Nvidia keeps improving CUDA, and supposedly making it easier to use. Maybe by the time Volta comes out, it would be worthwhile for the Baker Lab to hire some smart grad student to look into it.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2189 Credit: 13,720,774 RAC: 2,598	Message 79834 - Posted: 2 Apr 2016, 17:17:47 UTC - in response to Message 79833.
"Nvidia keeps improving CUDA, and supposedly making it easier to use. Maybe by the time Volta comes out, it would be worthwhile for the Baker Lab to hire some smart grad student to look into it."
On the other side of the moon, Khronos Group, AMD, Altera, Intel and others keep improving OpenCL and supposedly making it easier to use. Maybe by the time Vega comes out, it would be worthwhile for the Baker Lab to hire some smart grad student to look into it. :-)

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 79836 - Posted: 2 Apr 2016, 20:10:01 UTC - in response to Message 79834. Last modified: 2 Apr 2016, 20:12:25 UTC
"On the other side of the moon, Khronos Group, AMD, Altera, Intel and others keep improving OpenCL and supposedly making it easier to use. Maybe by the time Vega comes out, it would be worthwhile for the Baker Lab to hire some smart grad student to look into it."
I am happy to go either way, assuming AMD is still in business.

Emigdio Lopez Laburu Joined: 25 Feb 06 Posts: 61 Credit: 40,240,061 RAC: 0	Message 79839 - Posted: 4 Apr 2016, 13:38:42 UTC - in response to Message 79764.
"Looks like a big boost in CE participation has pushed Rosetta@Home well over the 300 TeraFLOP mark. Wondering if this has anyone at Baker lab thinking up any new experiments to run that may be more viable now than in the past, or whether this little boost is still orders of magnitude away from being a game changer just yet?"
"After all the discussions about GPU/CPU optimization, etc., I think they are not so interested in additional computational power."
Why do you say that?

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2189 Credit: 13,720,774 RAC: 2,598	Message 79841 - Posted: 4 Apr 2016, 18:04:00 UTC - in response to Message 79839. Last modified: 4 Apr 2016, 18:04:54 UTC
"After all the discussions about GPU/CPU optimization, etc., I think they are not so interested in additional computational power."
"Why do you say that?"
Despite some very interesting preliminary tests, it seems that they have abandoned the optimization effort. Please read the discussions here and on Ralph's forum (here, for example):
- Only one admin participates (Dekim).
- This admin does not work very hard on optimizations (he has other things to do).
- He says that optimizations are not so important; "precision" of simulation is more important than speed.
- The optimization is left to one volunteer (rjs5), who works on the code when he has free time.
So, I'm not so optimistic.

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 79842 - Posted: 4 Apr 2016, 20:51:17 UTC - in response to Message 79841.
"After all the discussions about GPU/CPU optimization, etc., I think they are not so interested in additional computational power."
"Why do you say that?"
"Despite some very interesting preliminary tests, it seems that they have abandoned the optimization effort. Please read the discussions here and on Ralph's forum (here, for example):
- Only one admin participates (Dekim).
- This admin does not work very hard on optimizations (he has other things to do).
- He says that optimizations are not so important; 'precision' of simulation is more important than speed.
- The optimization is left to one volunteer (rjs5), who works on the code when he has free time.
So, I'm not so optimistic."
Be more optimistic ... and as patient as you can. 8-) Not all is bad.
I have thought about posting a status update several times, but I thought that it might be more appropriate for those on the project (dekim) to disclose plans/status. He can delete this message if I am off base ... since I did not ask.
There is another lab student working on incorporating my findings into their production environment. They are busy, but I have been feeding them measurements and configuration files.
To summarize, I built 50+ binaries with selected option combinations and expected (as I had said before) about a 20% improvement. I generally measured a 20% to 40% improvement, and dekim said they had confirmed those numbers internally. I also said that going faster than 20% would require the compiler to auto-vectorize the code. The original source code, I think, was written in Fortran and translated to C++. Ugh!
Dekim indicated that they have built and deployed a test binary based on my suggestions on Ralph. I don't know which one he is talking about, but v3.73 was released at about the right time. He also indicated they have introduced an optimized binary into their local production clusters ... whatever that is. They are seeing more than a 2x-4x improvement on one of their design protocols executing on that cluster. I will be interested in learning why the impact is so dramatic.
They are being careful, because this involves changing compilers and options. They are also in the middle of a big change ... notice the size of the database increased from 180 MB to 270 MB ... 8-)
I have set my Rosetta preferences to run 24-hour jobs so I can see when (if) a better binary is introduced. The easiest way to detect a changed binary is to run with the longer CPU target times and watch for tasks that stick out by finishing before the target 86,400 seconds of CPU time.

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 79843 - Posted: 4 Apr 2016, 21:19:05 UTC - in response to Message 79842.
"I have set my Rosetta preferences to run 24-hour jobs so I can see when (if) a better binary is introduced. The easiest way to detect a changed binary is to run with the longer CPU target times and watch for tasks that stick out by finishing before the target 86,400 seconds of CPU time."
I always run 24 hours on six cores of my i7-4790 (Win7 64-bit), and have seen several short work units since 2 April, when I started working on 3.73 (24 hour tasks). You have done something very right, it seems.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 79844 - Posted: 4 Apr 2016, 21:51:18 UTC
@rjs5, thanks for the update. Wanted to point out that 24hr work units, running more efficiently, will simply produce more models in as close to 24hrs as they can. So, you won't notice them completing 20-40% sooner. Each time a new model is begun, a check is made to estimate whether it will complete before the runtime preference set in the user's settings. I believe the estimate is just based on the time taken to complete prior models on the same task. So if model 23 completes after 23.5hrs of CPU, then the task is ended and returned. If model 23 completes after 22.5hrs, then a 24th model begins.
Rosetta Moderator: Mod.Sense
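
The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Rosetta's actual code: the function names are hypothetical, and the estimate is assumed to be the mean CPU time of the models already completed on the task, which is what Mod.Sense guesses.

```python
def should_start_next_model(cpu_used, models_done, target):
    """Return True if one more model is expected to finish before the target.

    Assumption: the estimate for the next model is the mean CPU time of the
    models already completed on this task. Any time unit works as long as
    cpu_used and target use the same one.
    """
    if models_done == 0:
        return True  # always attempt at least one model
    avg_per_model = cpu_used / models_done
    return cpu_used + avg_per_model <= target


def run_task(time_per_model, target):
    """Simulate a task whose models all take the same (hypothetical) time."""
    used, done = 0, 0
    while should_start_next_model(used, done, target):
        used += time_per_model
        done += 1
    return done, used


# The two cases from the post, in hours, with a 24-hour runtime preference.
print(should_start_next_model(23.5, 23, 24.0))  # False: task ends and is returned
print(should_start_next_model(22.5, 23, 24.0))  # True: a 24th model begins

# A faster binary yields more models, not a shorter task (times in minutes).
print(run_task(60, 1440))  # (24, 1440)
print(run_task(48, 1440))  # (30, 1440)
```

In the simulated case both binaries use the full 1,440 minutes; the faster one simply returns 30 models instead of 24, which is why run time alone does not reveal a speedup for this type of task.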

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 79845 - Posted: 4 Apr 2016, 23:11:59 UTC - in response to Message 79844. Last modified: 4 Apr 2016, 23:14:34 UTC
"Thanks for the update. Wanted to point out that 24hr work units, running more efficiently, will simply produce more models in as close to 24hrs as they can."
I think I see what you are saying. You put as many apples of various sizes in the box without overflowing. However, I have seen several tasks that run under 10,000 seconds on the above (and three other) machines in only two days. I think that is very rare, and after checking it is only on the 3.73 tasks. Also, if they are that short, you would think there would be plenty of room to fit another model in. So it seems that something is making the run times shorter than before, and preventing another model from being run. Maybe there is a limit on the total number of models?

Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0	Message 79847 - Posted: 5 Apr 2016, 10:27:36 UTC - in response to Message 79845.
"Thanks for the update. Wanted to point out that 24hr work units, running more efficiently, will simply produce more models in as close to 24hrs as they can."
"I think I see what you are saying. You put as many apples of various sizes in the box without overflowing. However, I have seen several tasks that run under 10,000 seconds on the above (and three other) machines in only two days. I think that is very rare, and after checking it is only on the 3.73 tasks. Also, if they are that short, you would think there would be plenty of room to fit another model in. So it seems that something is making the run times shorter than before, and preventing another model from being run. Maybe there is a limit on the total number of models?"
It depends on the type of task. I'll just copy and paste what I wrote earlier, and perhaps Mod.Sense or DEK can correct and/or add detail as necessary:
If memory serves, the 99 model limit was enacted when some tasks created output files too large to be uploaded. The limit only applies to a particular type of task. Others use the preferred CPU time plus 4 method to determine when to end things. When a model is completed, the task calculates whether it has time left to complete another model. If the answer is no, then the task wraps things up despite there appearing (to the cruncher) to be hours left. If the answer is yes, the task will begin another model. All models aren't equal, however, even within the same task, so some will take longer than predicted. To ensure that otherwise good models aren't cut short just before completing (and to increase the odds that the task will complete at least one model), the task will continue past the preferred CPU time. At some point, though, you gotta cut your losses, and so at preferred CPU time plus 4 hours the watchdog cuts bait and the task goes home. (I'm curious about the average overtime; my totally uninformed guess is that it's less than an hour.)
There are other types of tasks in which filters are employed to cut off models early. If the model passes the filter, it will continue working on that one task to the end. This results in dramatically disparate counts, with one task generating hundreds of models while another task from the same batch generates only one, two, five, etc.
Recently on Ralph a filter was used to remove models, resulting in a file transfer error upon upload. The stderr out listed 13 models from 2 attempts, but since the models had been erased, the file meant to contain them didn't exist. I'm guessing, based on DEK's post, which I may well have misinterpreted, that the server, possibly as part of a validation check, automatically gives the file transfer error (client error, compute error) when this particular file isn't part of the upload.
All these different strategies result, from the cruncher's point of view, in varied behavior which we struggle to interpret. Is it a problem with my computer or a problem with Rosetta? Is it a problem at all? BOINC is complicated enough for the computer savvy, much more so for the majority of crunchers who just want to maximize their participation in Rosetta and end up massively tangled up in the BOINC settings. The variety of legitimate behaviors exhibited by Rosetta tasks trips up the volunteers trying to help them become untangled. From the researcher's point of view everything may look fine, working as expected, and any issues a lone cruncher is having are most likely due to their particular setup. And it probably is, but the lack of information leaves the volunteers flailing.
I have long wished for a reference, a database of tasks, in which the tasks are divided into broad categories of strategies employed (as above, with some info on how they "look" to the crunchers) and what, in the most basic way, is being asked (how does this particular protein fold, how do these two proteins interact, can we create a new protein to do x, etc.).
Best, Snags
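
The stopping rules Snags recalls can be put side by side in a small sketch. Everything here is an assumption for illustration: the function, its parameters and the ordering of the checks are hypothetical, and only the numbers (a 99 model cap for some task types, a watchdog at preferred CPU time plus 4 hours) come from the post.

```python
MODEL_CAP = 99                 # limit recalled in the post; only some task types
WATCHDOG_GRACE_S = 4 * 3600    # "preferred CPU time plus 4" hours, in seconds


def stop_reason(cpu_s, models_done, preferred_s, capped_task, mid_model):
    """Return why a task would stop now, or None if it keeps running.

    Hypothetical sketch of the rules described in the post, not Rosetta code.
    """
    # The watchdog applies even while a model is still being computed.
    if cpu_s >= preferred_s + WATCHDOG_GRACE_S:
        return "watchdog"
    # A model in progress is allowed to run past the preferred CPU time.
    if mid_model:
        return None
    # Between models: the cap applies only to the capped task type.
    if capped_task and models_done >= MODEL_CAP:
        return "model cap"
    # Between models: is there time left for one more average-sized model?
    if models_done > 0 and cpu_s + cpu_s / models_done > preferred_s:
        return "no time for another model"
    return None


DAY_S = 86_400
print(stop_reason(9_231, 99, DAY_S, True, False))     # model cap
print(stop_reason(84_600, 23, DAY_S, False, False))   # no time for another model
print(stop_reason(90_000, 20, DAY_S, False, True))    # None
print(stop_reason(100_800, 20, DAY_S, False, True))   # watchdog
```

Under these assumptions, a capped task with fast models could stop after only a few thousand seconds of a 24-hour preference, which would look to the cruncher like a task that finished far too early.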

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 79848 - Posted: 5 Apr 2016, 13:09:18 UTC - in response to Message 79845.
"Thanks for the update. Wanted to point out that 24hr work units, running more efficiently, will simply produce more models in as close to 24hrs as they can."
"I think I see what you are saying. You put as many apples of various sizes in the box without overflowing. However, I have seen several tasks that run under 10,000 seconds on the above (and three other) machines in only two days. I think that is very rare, and after checking it is only on the 3.73 tasks. Also, if they are that short, you would think there would be plenty of room to fit another model in. So it seems that something is making the run times shorter than before, and preventing another model from being run. Maybe there is a limit on the total number of models?"
The 3.73 jobs hit my machines at about 10am on 3/31. My Skylake 6700K Win10 machine is only taking 86,400 seconds on tasks that run multiple structures. It looks like there are several ways of running jobs, which is possibly the source of the confusion.
ALL "24 hour" jobs that finished early.
CPU time (sec) -- Task ID
9,231 -- 806473699
9,474 -- 806473717
10,616 -- 802461224
10,736 -- 806473700
11,629 -- 802461073
19,048 -- 802461280
19,727 -- 802461293
25,353 -- 802461165
28,458 -- 802461288
31,028 -- 806739396
31,109 -- 806739333
31,152 -- 806739395
32,629 -- 806739285
32,775 -- 806739281
32,788 -- 806739332
32,897 -- 806739284
74,202 -- 802461278
86,645 -- 802461299 <<< multiple structures tj_3_15_dimer_X_ZC16v1_DHR54_l3_h22_l3_v11_0_v1b_fragments_abinitio_SAVE_ALL_OUT_339362_541_0
86,825 -- 802461222 <<< multiple structures tj_3_15_dimer_X_ZC16v1_DHR54_l3_h22_l3_v11_0_v1
My Haswell Extreme Win10 machine did not get any of the "tj" jobs and no job took 24 hours.
CPU time (sec) -- Task ID
26,147 -- 806166140
26,527 -- 806783950
27,408 -- 806166136
28,310 -- 806166173
28,716 -- 806166175
28,779 -- 806166153
28,946 -- 806166142
29,498 -- 806166158
29,656 -- 806166144
29,826 -- 806166155
30,031 -- 806166182
30,319 -- 806166135
31,056 -- 806166156
31,949 -- 806166137
32,495 -- 806166183
33,339 -- 806166141
33,441 -- 806166181
33,823 -- 806166154
34,171 -- 806166174
35,218 -- 806166143
37,461 -- 806166157
39,680 -- 806784007
41,060 -- 806784004
41,733 -- 806784024
42,119 -- 806784008
42,459 -- 806783991
42,714 -- 806784020
43,282 -- 806784012
45,064 -- 806783066
46,650 -- 806783178
46,709 -- 806783780
47,901 -- 806783261
48,229 -- 806783240
48,300 -- 806783221
48,708 -- 806783141
49,017 -- 806783220
49,049 -- 806783231
51,220 -- 806783233
51,305 -- 806783258
54,612 -- 806784017
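
As a rough way to read the Skylake list in the post above, the CPU times can be compared against the 86,400 second target. This is a minimal sketch: the CPU times are copied from the post, while the 5% tolerance and the function name are assumptions chosen for illustration.

```python
TARGET_S = 86_400  # 24-hour CPU time preference, in seconds

# Skylake CPU times (seconds) as listed in the post above.
skylake_cpu_s = [
    9231, 9474, 10616, 10736, 11629, 19048, 19727, 25353, 28458, 31028,
    31109, 31152, 32629, 32775, 32788, 32897, 74202, 86645, 86825,
]


def finished_early(cpu_s, target_s=TARGET_S, tolerance=0.05):
    """True if the task ended more than `tolerance` below the target.

    The 5% tolerance is an arbitrary assumption, not a BOINC or Rosetta value.
    """
    return cpu_s < target_s * (1 - tolerance)


early = [t for t in skylake_cpu_s if finished_early(t)]
print(f"{len(early)} of {len(skylake_cpu_s)} tasks finished early")
# -> 17 of 19 tasks finished early
print(f"shortest used {min(early) / TARGET_S:.0%} of the target")
# -> shortest used 11% of the target
```

Only the two tasks marked "multiple structures" ran to the full target. As Mod.Sense and Snags point out, early completion by itself can also come from task type rather than from a faster binary, so this count is a starting point rather than proof of a speedup.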

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 79849 - Posted: 5 Apr 2016, 14:55:31 UTC - in response to Message 79848.
"My Skylake 6700K Win10 machine is only taking 86,400 seconds on tasks that run multiple structures. It looks like there are several ways of running jobs, which is possibly the source of the confusion."
Can you reach a conclusion yet? Is it clear that there are gains, or is that yet to be sorted out?

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 79850 - Posted: 5 Apr 2016, 16:12:13 UTC - in response to Message 79849.
"My Skylake 6700K Win10 machine is only taking 86,400 seconds on tasks that run multiple structures. It looks like there are several ways of running jobs, which is possibly the source of the confusion."
"Can you reach a conclusion yet? Is it clear that there are gains, or is that yet to be sorted out?"
Looks good. There are gains ... it's the "how much" that is harder to determine. Performance is always a "work in progress". That is why you have to be careful in "optimizing" something. Everyone who follows assumes the optimizations still work. Rosetta is a moving target, and the run time statistics are very difficult to extract on this side of the server. The data ages out too quickly and the task name/information is buried another level deep. In very round numbers, I think there is generally 20%-50% in compiler and option "low hanging fruit". Timing of the deployed binary is out of my control and vision. Properly written code will see a 2x-4x improvement.
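
The percentages and the "2x-4x" figures in this thread are easier to compare on one scale. The sketch below assumes, purely for illustration, that a percentage improvement means a reduction in run time; the posts do not say whether run time or throughput is meant, and the function names are hypothetical.

```python
def speedup_from_runtime_reduction(reduction):
    """Speedup factor when run time drops by `reduction` (0.20 = 20% less time)."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def runtime_reduction_from_speedup(speedup):
    """Fraction of run time saved at a given speedup factor (2.0 = twice as fast)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


for r in (0.20, 0.50):
    print(f"{r:.0%} less run time -> {speedup_from_runtime_reduction(r):.2f}x")
# -> 20% less run time -> 1.25x
# -> 50% less run time -> 2.00x

for s in (2.0, 4.0):
    print(f"{s:.0f}x speedup -> {runtime_reduction_from_speedup(s):.0%} less run time")
# -> 2x speedup -> 50% less run time
# -> 4x speedup -> 75% less run time
```

Read this way, the 20%-50% "low hanging fruit" tops out at about a 2x speedup, which is where the 2x-4x range quoted for properly written code begins. If the percentages instead mean more work per unit time, 20%-50% corresponds to 1.2x-1.5x.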

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2189 Credit: 13,720,774 RAC: 2,598	Message 79857 - Posted: 8 Apr 2016, 14:59:06 UTC - in response to Message 79842.
"Be more optimistic ... and as patient as you can. 8-)"
I've been here since 2005, so I'm patient :-)
"To summarize, I built 50+ binaries with selected option combinations and expected (as I had said before) about a 20% improvement. I generally measured a 20% to 40% improvement, and dekim said they had confirmed those numbers internally."
Not bad!
"The original source code, I think, was written in Fortran and translated to C++. Ugh!"
Yep, I think there are still some traces of Fortran.
"Dekim indicated that they have built and deployed a test binary based on my suggestions on Ralph. I don't know which one he is talking about, but v3.73 was released at about the right time. He also indicated they have introduced an optimized binary into their local production clusters ... whatever that is. They are seeing more than a 2x-4x improvement on one of their design protocols executing on that cluster. I will be interested in learning why the impact is so dramatic."
Only Dekim can answer....

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 79858 - Posted: 8 Apr 2016, 15:30:34 UTC
While we are on the subject, I am presently on Win7 64-bit. But I could go to Linux Mint 18 when it comes out. Is there an advantage?

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 79860 - Posted: 8 Apr 2016, 16:39:42 UTC - in response to Message 79858.
"While we are on the subject, I am presently on Win7 64-bit. But I could go to Linux Mint 18 when it comes out. Is there an advantage?"
8-} hit POST instead of PREVIEW ...
If you are curious, you might install a VM and then install Mint on it. You can compare the performance of the 32-bit Windows binary with the 64-bit Linux version. The last time I tried this, the VM was about 10% faster.