Message boards : Number crunching : Output versus work unit size
Author | Message |
---|---|
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I have found an interesting inverse correlation between the BOINC credit granted per work unit (presumably representing work output) and the memory size of the work unit. That is, the larger the "Peak working set size", etc., the lower the credit. These tests were done on two identical i7-3770 machines, one on Ubuntu 16.04 and the other on Windows 7 64-bit. At first I thought that Windows used less memory and was therefore more consistent, but that is not so clear and may just be the samples I got; both show the same trend of output versus size. The machines are not overclocked and have plenty (32 GB) of main memory, so there is no limitation there.

Here are a few representative examples (all ran for 24 hours):

Ubuntu 16.04
Workunit 892250257: Peak working set size 473.90 MB, Peak swap size 540.06 MB, Credit 817.66
Workunit 892475300: Peak working set size 669.00 MB, Peak swap size 816.87 MB, Credit 139.50

Windows 7 64-bit
Workunit 891930096: Peak working set size 262.60 MB, Peak swap size 246.60 MB, Credit 739.62
Workunit 891962122: Peak working set size 699.84 MB, Peak swap size 680.24 MB, Credit 254.88

I presume this is related to how well the work unit fits into the CPU cache. I don't know of anything we can do about it. But I will also be trying the tests on Haswell machines, which have the same cache size as the Ivy Bridge machines but use it a little more efficiently; it might make a difference. If anyone wants to try it on the later Intel CPUs, that would be useful too. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
But I will also be trying the tests on Haswell machines, which have the same cache size as the Ivy Bridge machines, but use it a little more efficiently. It might make a difference.

On my i7-4790 (Ubuntu 16.04), the results are initially a mixed bag.
https://boinc.bakerlab.org/results.php?hostid=3347860&offset=0&show_names=0&state=4&appid=
That is, the first six tasks all show low output, even though some of them used relatively little memory (below 500 MB, and some even below 400 MB). That is apparently because I was also running LHC on that machine, which uses VirtualBox. Who knows how that uses the cache, except that it appears to interfere with Rosetta in a major way.

Since task 990966032 (work unit 892956769), I have ended all LHC work and am running only Rosetta on seven cores, with the other core reserved to support a GTX 970 on Folding. The results are improved, at least initially; we will see how far it goes. But I suspect one lesson will be that Rosetta is best run by itself. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Since Task 990966032 (work unit 892956769), I have ended all LHC work, and am running only Rosetta on seven cores, with the other core being reserved to support a GTX 970 on Folding. The results are improved at least initially; we will see how far it goes. But I suspect one lesson will be that Rosetta is best run by itself.

i7-4790 (Ubuntu 16.04): I have finished all the Rosettas that I am going to do on the i7-4790, and they mostly work as expected. It clearly helps to run them by themselves, and not with LHC.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3347860&offset=0&show_names=0&state=4&appid=
But there are three with low output. The first is easily explained by its large memory size, presumably impacting the cache:
Workunit 893211407: Credit 169.09, Peak working set size 695.79 MB, Peak swap size 837.02 MB
The second has low output even with low memory usage; there must be something else different about it:
Workunit 893268012: Credit 182.42, Peak working set size 409.10 MB, Peak swap size 470.95 MB
The last one seems to be on the border of the critical memory size, which appears to be around 500 MB, above which the output drops:
Workunit 893460799: Credit 153.69, Peak working set size 499.68 MB, Peak swap size 566.25 MB

i7-4771: Another Haswell machine is running Rosetta on only six cores under Win7 64-bit. It is doing consistently well, indicating that running on fewer than the maximum number of cores may help avoid the cache problem, though there are only six samples at present. I will keep it going longer term, so we will see.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3118747&offset=0&show_names=0&state=4&appid=

i7-3770: Finally, I still have an Ivy Bridge running Rosetta on Win7 64-bit, but it is also running CPDN. The Rosetta output is mid-range between low and high, indicating that CPDN does not take as much cache as LHC does, leaving more for Rosetta.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3381276&offset=0&show_names=0&state=4&appid=
I will be ending the CPDN work on that machine in a few days, so the Rosetta output will hopefully go higher.

Lessons learned: The main lesson is that Rosetta is very picky about the environment in which it runs, and it does not necessarily get along well with other projects (or vice versa). I am not sure the Rosetta developers take that into account when developing the apps. They may just run Rosetta by itself, and if it works, that is good enough. But that is not how it is used in the real world, and it would be better if they considered the impact of other projects running alongside it. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Finally (wonder of wonders), my Ryzen 1700 now works very well on Rosetta after eliminating other projects (e.g., WCG) and running only Rosetta.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3390530&offset=0&show_names=0&state=4&appid=
Previously, I had gotten a high error rate and very inconsistent output.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=88341#88341
Now, after some startup glitches, I get only an occasional error, and the output is about the same as on my Haswell/Ivy Bridge machines. The only reduced output seems to come from work units with unusually large memory sizes, the same as I would get on the Intel machines. So I think Ryzen is now perfectly viable.

But the hitch is that you have to run only Rosetta. That makes it less than ideal for mixed projects, or even for backup projects, since you would be running Rosetta alongside something else. Maybe they can get this behavior fixed, or it may just be inherent in the science they are doing that it needs that much cache and is not good at sharing it. I can devote one machine to Rosetta, but not everyone can do that. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
Finally (wonder of wonders), my Ryzen 1700 now works very well on Rosetta by eliminating other projects (e.g., WCG) and running only Rosetta.

I have not seen the 4.0 source code, but the 3.78 source code was "unusual" and the product of many developers and many years of changes. The code is a collection of models that the developer can choose from on their command line. When they develop a new model, they just glue it into the existing structure. They use C++ and lots of parameters, so you get functions passing values to other functions just to do a simple operation.

The WUs are parceled out and run on your machine for the time slice that you have selected. The WU loops on its tests until your time limit is reached, and then it stops when done. The credits are "how much work" they think you did in that time. The job is parceled out until the accumulation of completed slices finishes it. They have some "algorithm" for "normalizing" that work and making sure the credit allocations are "fair". The credits can vary by the model and even by the particular slice of the WU you get. That makes it VERY tough to figure out what the performance is. An approximation over the long term is probably the best that can be done.

Much of their computation is 3-dimensional point-to-point math, like finding the length of a vector. If they simply "padded" their vectors to a 4th dimension, the operation that now sequentially does a load, operation, and store 3 times could be accomplished by one load, operation, and store of an AVX or SSE2 register. The Rosetta developers did not believe my numbers and did not have time. David E.K. has done some work verifying my numbers, and his results were very promising.

All the developers are working on new models to paste into Rosetta, and I don't think there is anyone managing the whole structure. I just looked, and they have 13 million queued jobs, 33k ready to send, and 400k in compute. IMO, they are not interested in any effort to optimize their "working" system. However, doubling the performance of the application CAN be viewed as reducing the power that is needed to compute.

I have seen your Rosetta results and the sensitivity to other running jobs. I would not hold my breath waiting for Rosetta to reply. The moderators of the forums are pretty good and try to help. |
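To make the padding idea concrete, here is a rough sketch of my own (illustrative only, not Rosetta code; the function names are made up, and it assumes an AVX-capable CPU and something like -mavx at compile time):

    #include <immintrin.h>
    #include <cmath>

    // Scalar 3-D distance: separate loads, subtracts and multiplies for x, y and z.
    double dist3_scalar(const double a[3], const double b[3]) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return std::sqrt(dx*dx + dy*dy + dz*dz);
    }

    // Same distance with each point padded to {x, y, z, 0}: one 256-bit load,
    // one subtract and one multiply per point instead of three of each.
    double dist3_padded(const double a[4], const double b[4]) {
        __m256d va = _mm256_loadu_pd(a);
        __m256d vb = _mm256_loadu_pd(b);
        __m256d d  = _mm256_sub_pd(va, vb);
        __m256d sq = _mm256_mul_pd(d, d);
        double s[4];
        _mm256_storeu_pd(s, sq);
        return std::sqrt(s[0] + s[1] + s[2] + s[3]);   // the padded lane is 0
    }

With single-precision coordinates the same trick fits in one SSE2 register; either way the win is fewer instructions per point pair and better use of the memory pipeline.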
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I have seen your Rosetta results and sensitivity to other running jobs. I would not hold my breath waiting for Rosetta to reply. The moderators of the forums are pretty good and try to help.

Yes, I am "doing it myself" for that very reason. You are right to imply (if indirectly) that the BOINC numbers may all be a figment of the imagination anyway; who knows how much work my machines are really doing? Thanks a lot for your input, especially to Rosetta. At some point the little light will go on, and they will accept it. They will have to eventually in order to make progress. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Also, for consistently high output, it helps to leave a core or two free. I am running on six cores of my i7-4771 (Windows 7 64-bit), and have seen no drops yet.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3118747&offset=0&show_names=0&state=4&appid=
This is presumably because the cache is shared among all the cores, so leaving some free increases the cache available for the active cores. Since the drops are quite large, typically a factor of four (700 points down to 165, for example), this strategy should increase the total output. |
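As a rough back-of-the-envelope illustration (assuming the i7-4771's 8 MB of shared L3 cache): with all 8 threads loaded, each task averages about 1 MB of cache, while with only 6 tasks running each gets roughly 1.3 MB, so more of each work unit's hot data can stay resident.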
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, R@h is memory intensive. Any memory-intensive application is potentially going to be labelled as not playing well with others; that is just how memory contention works in a system. So I don't see a specific problem with your scenario, but I wanted to assure you that the developers do look at memory usage and attempt to improve the algorithms to dial back the use of memory where possible.

I also wanted to point out that you said in prior posts that R@h doesn't play well with others, which always sounds like a skirmish for resources, and people often invent logic that says it is the application being aggressive, when in fact such things are controlled by the operating system. Your last post essentially boils down to saying that R@h doesn't play well with itself either, so at least there is no bias in what is being impacted. As you say, cache contention is going to crop up with any memory-intensive application, and the larger the cache, the faster any memory-intensive application will run.

One approach to optimizing the work on a machine is to get a mixture of work with lower memory requirements. I often suggest people attach to World Community Grid. Their projects have humanitarian and medical implications, and typically have much lower memory requirements. You can define your preferred mixture of work using the "resource share" for each project. For example, for a mix of 70% R@h and 30% WCG, you could set up R@h with a resource share of 700 and WCG with a resource share of 300. On an 8-core system, that would typically result in at least two WCG tasks running alongside 6 R@h tasks. This mix is often enough to make full use of the cores that you just suggested leaving idle.
Rosetta Moderator: Mod.Sense |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Thanks for the input. That is the usual suggestion on WCG also: run a mix of projects. But that by itself doesn't seem to cure the problem with Rosetta. It appears that you need to limit the cores too. I think that should be more widely known; I doubt that most crunchers realize it. |
PappaLitto Send message Joined: 14 Nov 17 Posts: 17 Credit: 28,141,852 RAC: 1,402 |
I am still having an issue with my R7 1700. I run only one CPU project at a time, and I restrict my CPU usage in BOINC to 85%, which equates to about 90% actual CPU usage. I am getting a whopping 10% error rate. Does anyone have any suggestions on how to fix this? |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,625,551 RAC: 6,845 |
Much of their computation is 3-dimensional point to point math, like finding the length of a vector. If they simply "padded" their vector to a 4th dimension, the operation that sequentially load, operation, store 3 times could be accomplished by a load, operation, store of an AVX or SSE2 register. The Rosetta developers did not believe my numbers and did not have time. David E.K. has done some work verifying my numbers and his results were very promising.

The same old story.... |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,625,551 RAC: 6,845 |
I would not hold my breath waiting for Rosetta to reply. The moderators of the forums are pretty good and try to help.

I appreciate the moderators, but I would prefer that the developers/debuggers read the forum. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Even when leaving two cores free on my i7-4771, I am occasionally seeing some big drops in output.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3118747&offset=0&show_names=0&state=4&appid=
This is the biggest memory usage I have seen:
Peak working set size 1,034.36 MB, Peak swap size 1,012.63 MB, Credit 180.07
https://boinc.bakerlab.org/result.php?resultid=993252510
I don't know if this is a recent phenomenon, or whether it has always been the case. But I may have to free up another core if necessary.

PappaLitto: It looks like allowing more cache by restricting the cores only improves the output; it does not fix the "AMD problem", whatever that is. I did not really run my Ryzen 1700 long enough to see the error rate. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,625,551 RAC: 6,845 |
It looks like allowing more cache by restricting the cores only improves the output; it does not fix the "AMD problem", whatever that is. I did not really run my Ryzen 1700 long enough to see the error rate.

I'm considering replacing my "old" AMD FX this autumn with a new Ryzen 2600/2700, to be used mainly for Ralph/Rosetta. But I'm a little uncertain... |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
Yes, R@h is memory intensive. Any memory-intensive application is potentially going to be labelled as not playing well with others; that is just how memory contention works in a system. So I don't see a specific problem with your scenario, but I wanted to assure you that the developers do look at memory usage and attempt to improve the algorithms to dial back the use of memory where possible. I also wanted to point out that you said in prior posts that R@h doesn't play well with others, which always sounds like a skirmish for resources, and people often invent logic that says it is the application being aggressive, when in fact such things are controlled by the operating system. Your last post essentially boils down to saying that R@h doesn't play well with itself either, so at least there is no bias in what is being impacted. As you say, cache contention is going to crop up with any memory-intensive application, and the larger the cache, the faster any memory-intensive application will run.

I guess you are addressing me. I really don't know what the developers pay attention to; I just draw my conclusions from empirical observations. IMO, PrimeGrid is probably the project with the biggest optimization problems; they have over-tuned the code. R@H has done some simple things, but they have overlooked issues that are typically not understood by developers. What they have done is fine with me.

Their design decisions determine the power cost, network traffic, disk sizes needed, etc. of the machines. IMO, they could make some changes to use those resources more efficiently. Including all the models in one binary is a design decision, and it puts extra pressure on the TLB and the network. Basing a design on a library of small functions (Boost) causes a whole page of code to be read into memory so the program can execute one function; loading the rest of that page is overhead, takes memory, and puts pressure on the TLB. Compiling the code with options like "-O3 -funroll-loops -finline-functions" unwinds the loops (making the code footprint larger), and inlining puts a copy of the code in multiple places, which takes up multiple locations in memory and cache.

If a cruncher gets WUs that all use the same model, the machine will use memory most efficiently and will run faster and get more credits. If a cruncher gets WUs needing 8 different models on an 8-CPU machine, the machine will run slower because the WUs do not share CODE or DATA as effectively; the cruncher is penalized for R@H's less efficient use of the caches. As WUs complete and drain, the kind of WU the machine receives next will affect the WUs already running. A WU in the first case gives more credit than in the second case, just because of the R@H interaction. If most of the WUs use just one model, the impact is small; if there is a lot of variation, the impact will be larger.

Again, I think what R@H is doing is fine, and I have zero problems with their decisions. |
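To illustrate the code-footprint effect (a made-up fragment, not actual Rosetta code, and the file name is hypothetical):

    // A tiny helper like this, once inlined by -finline-functions, gets a private
    // copy stamped into every call site.
    inline double sq_dist(const double* a, const double* b) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        return dx*dx + dy*dy + dz*dz;
    }

    // Built with something like:
    //   g++ -O3 -funroll-loops -finline-functions scorefxn.cc
    // each unrolled iteration of a loop over atom pairs carries its own inlined
    // copy of sq_dist, so the hot code footprint (and instruction-cache/TLB
    // pressure) grows even though the arithmetic is unchanged.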
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,245,383 RAC: 9,571 |
I am still having an issue with my R7 1700. I run only one CPU project at a time, and I restrict my CPU usage in BOINC to 85%, which equates to about 90% actual CPU usage. I am getting a whopping 10% error rate. Does anyone have any suggestions on how to fix this?

I can't recall the details now, but I seem to remember that setting this to 85% means it runs at 100% for 85% of the time and at 0% for 15% of the time, and that this caused problems with Rosetta tasks in the past. I notice BOINC has two settings nowadays, so that may have been resolved, but I don't know. I wouldn't use anything other than 100%, for safety reasons, unless someone can explicitly say that the issues seen before (whatever they were) have been resolved. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
I am still having an issue with my R7 1700. I run only one CPU project at a time, and I restrict my CPU usage in BOINC to 85%, which equates to about 90% actual CPU usage. I am getting a whopping 10% error rate. Does anyone have any suggestions on how to fix this?

IMO, limit the number of CPUs and allow them to run 100% of the time; I have not had good luck messing with the %-of-time option. Also select the memory option to NOT leave tasks in memory while suspended. My 7920X, with 12 cores and 24 CPUs, is set to run 21 tasks. All the Linux machines are running all cores, but I have started going liquid-cooled on my systems. I have had pretty good luck building my own with a company called Portatech. |
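For anyone who wants to cap the number of simultaneous Rosetta tasks without touching the percentage settings, one way (a sketch only; the value 6 is just an example, and the file goes in the BOINC data directory under projects/boinc.bakerlab.org_rosetta/) is an app_config.xml like this:

    <app_config>
       <!-- run at most 6 Rosetta tasks at once, leaving the remaining cores free -->
       <project_max_concurrent>6</project_max_concurrent>
    </app_config>

After saving it, have the client reread the config files (or restart BOINC). The "Use at most N% of the CPUs" computing preference does much the same thing, but it applies to all projects on the machine at once.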
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Also select the memory option to NOT leave tasks in memory while suspended.

That is a good idea. I am having strange results here, however.

On my i7-4771 (Win7 64-bit), where I am running Rosetta (only) on three full cores, with the other core used only for desktop work, I am now getting only low credits.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3118747
But on my i7-3770 on Ubuntu 16.04, where I am running Rosetta (only) on 7 virtual cores (with the other one reserved for a GPU for Folding), I am getting high credits.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3285911
I will investigate more, but there are strange things going on. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Whether it is a good idea or not depends on how frequently tasks are getting preempted. I recommend people set it so it DOES leave them "in memory", and they get swapped out if the system gets busy with other work. By NOT keeping tasks in memory, you are expressing a willingness to throw away partially completed work, i.e. willingness to lose credit in favor of more quickly getting out of the way when other demands arrive on the machine.
Rosetta Moderator: Mod.Sense |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Whether it is a good idea or not depends on how frequently tasks are getting preempted. I recommend people set it so it DOES leave them "in memory", and they get swapped out if the system gets busy with other work. By NOT keeping tasks in memory, you are expressing a willingness to throw away partially completed work, i.e. willingness to lose credit in favor of more quickly getting out of the way when other demands arrive on the machine.

True, and I normally have it enabled. But at the moment I am running ONLY Rosetta, so there is nothing to preempt. (In fact, it should not matter now whether LAIM, "leave applications in memory", is enabled or not, since no tasks are suspended.) Maybe I can someday find another project (e.g., WCG) that plays well with Rosetta. In that case, I will have to do a little experimenting to see whether LAIM is worthwhile or not. |