Message boards : Number crunching : low GPU utilization with recent Gerard CXCL12?
Dear fellow crunchers,
ID: 42593
I'm experiencing lower GPU usage (~87% on a Core i3-4130, GTX 980 Ti, WinXP x64) with the GERARD_A2AR batch.
ID: 42594
Around 82-84% GPU usage on a GTX 660 Ti with an AMD FX-8350 (Win 7 64-bit).
ID: 42596
Seeing the same thing here ETA. Only 60% usage on 2 GTX 970s running on a Xeon 2683 V3 and Win7 64 bit. Is running 2 at a time a possibility to boost the GPU load?
ID: 42599
Thanks, guys!
ID: 42601
The GERARD_A2AR WUs are definitely slower than the other GERARD units. On my new Windows 10 machine, they finish in about 7 to 8 hours versus about 6 hours or less for the other GERARD units. On my old Windows XP machine, they finish in about 12 hours versus the same 6 hours or less average for the other GERARD units.
ID: 42604
> The GERARD_A2AR WUs are definitely slower than the other GERARD units.

I would say these need more DP operations (done by the CPU) than the others.

> On my old windows xp machine, they finish in about 12 hours versus the same 6 hours or less average for the other GERARD units.

The performance of the GTX 980 Ti in your host is significantly degraded by the lack of CPU power. While your host (Athlon64 X2 Dual Core 5000+ + GTX980Ti + WinXPx86) processes a GERARD_A2AR_luf6806_b in 42,449 s, my Core2Duo E8500 + GTX980 + WinXPx64 processes a GERARD_A2AR_luf6632_b in just 33,928 s, and my i3-4160 + GTX980Ti + WinXPx64 processes a GERARD_A2AR_luf6632_b in 22,385 s. This degradation affects other workunits as well, but less significantly. I would suggest you stop all CPU crunching on this host to make the GPU crunch faster. Since it has 4 GB of DDR2 memory, it's probably in dual-channel mode already, but it's worth checking with the CPU-Z utility.
ID: 42605
Oh damn it, and this happens right after I switched to a non-overclockable Skylake! It would be cool if they'd use SSE2/AVX2 for those CPU calculations... but considering that not even Einstein@Home, which is known for good optimizations, has switched to AVX1 yet, I don't expect such a move from GPU-Grid. Recompilations targeting different CPUs require significantly more work and validation on the project side, and they're not exactly starved for more crunching power right now.
ID: 42609
> The GERARD_A2AR WUs are definitely slower than the other GERARD units.
> I would say these need more DP operations (done by the CPU) than others.

I am not doing any CPU crunching on this host, and the memory is in dual-channel mode; I checked it with the CPU-Z utility. This is just an old machine that is still crunching.
ID: 42611
> Oh damn it, and this happens right after I switched to a non-overclockable Skylake!

Actually, your CPU has enough power to drive a GTX 970; in your host the WDDM overhead has the biggest impact on GPU performance (since you are using Windows 10). AMD Athlon 64 X2 5000+ (89W, rev. F3) vs Intel Core i3-6100: https://cpubenchmark.net/compare.php?cmp[]=83&cmp[]=2617
ID: 42612
> I am not doing any CPU crunching on this host and the memory is in dual channel mode. I checked it with the CPU-Z utility.

Looking at this page, I didn't think there was that much difference between the AMD Athlon 64 X2 5000+ and the Intel Core 2 Duo E8500. However, the PassMark score seems to be more accurate: https://cpubenchmark.net/compare.php?cmp[]=83&cmp[]=5 The other difference between our hosts is that your motherboard has only PCIe 1.x (I think), while my DQ45CB has PCIe 2.0. To achieve optimal performance from a GTX 980 Ti, the CPU should have an integrated PCIe controller (there's not much difference between 2.0 and 3.0), and the OS should not have WDDM.
ID: 42613
Thanks, guys! I did a complete removal (including running Driver Sweeper) and reinstall of the drivers. That raised the GPU usage to 68%. I then started running 2 tasks per card and the usage went to 92-94%. It's been 24 hours and I haven't had any of the tasks (6) error out so far. The only issue is that every task gets stuck downloading with the same HTTP error; sometimes it takes an hour to get all the files for a task to run. Very annoying.
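For anyone else wanting to try two tasks per card: this is normally done with an app_config.xml in the project folder (projects/www.gpugrid.net under the BOINC data directory). A minimal sketch; the app name "acemdlong" here is an assumption, so check the <name> tags in your client_state.xml for the real app names:

```xml
<!-- app_config.xml: run two GPUGRID tasks on each GPU -->
<app_config>
  <app>
    <name>acemdlong</name>  <!-- assumed app name; verify in client_state.xml -->
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>  <!-- each task claims half a GPU, so two run per card -->
      <cpu_usage>1.0</cpu_usage>  <!-- budget a full CPU core per task -->
    </gpu_versions>
  </app>
</app_config>
```

After saving it, use Options -> Read config files in BOINC Manager (or restart the client) for it to take effect.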
ID: 42614
I have had a similar experience with the GERARD_A2AR batches in the last few days.
ID: 42634
I fired up two GTX 750 Tis just to see what was going on. I am currently running:
ID: 42635
... though the A2ARs do take more time. That's what I forgot to mention in my post above.
ID: 42637
I am seeing 89% GPU utilization on a GTX 970 running a GERARD_CXCL12_TRIM_HEP_DIM2-0.

boinc@joe:~$ nvidia-smi
Sat Jan 16 10:49:11 2016
+------------------------------------------------------+
| NVIDIA-SMI 355.11 Driver Version: 355.11 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 970 Off | 0000:01:00.0 Off | N/A |
| 27% 71C P2 145W / 201W | 455MiB / 4094MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 31973 C ...projects/www.gpugrid.net/acemd.848-65.bin 439MiB |
+-----------------------------------------------------------------------------+
I can't easily compare to earlier work since it has all scrolled out of the database during the downtime. This is in a box ( https://www.gpugrid.net/show_host_detail.php?hostid=257647 ) with the slowest, lowest-power processor I could find, and it never leaves its lowest speed, lowest power mode ( Average Processor Power_0(Watt)=7.4779 ). Processor utilization in the underclocked mode is around 20%. So I call the "you need a better CPU" talk FUD.

top - 10:55:47 up 54 days, 19:32, 1 user, load average: 0.18, 0.24, 0.26
Tasks: 104 total, 1 running, 103 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.8%sy, 9.1%ni, 89.9%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 8152388k total, 1697288k used, 6455100k free, 166632k buffers
Swap: 16678908k total, 0k used, 16678908k free, 1127352k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31973 boinc 30 10 28.4g 227m 105m S 21 2.9 2:17.59 acemd.848-65.bi
29645 boinc 20 0 81960 1732 892 S 0 0.0 0:00.17 sshd
29646 boinc 20 0 26312 7628 1772 S 0 0.1 0:00.54 bash
31943 boinc 20 0 124m 7216 3564 S 0 0.1 0:00.77 boinc
32041 boinc 20 0 17336 1436 1096 R 0 0.0 0:00.03 top
ID: 42639
> I am seeing 89% GPU utilization on a GTX 970 running a GERARD_CXCL12_TRIM_HEP_DIM2-0.

I see 95-97% GPU usage on Gerard tasks except for the A2AR.

> This is in a box ( https://www.gpugrid.net/show_host_detail.php?hostid=257647 ) with the slowest, low power processor I can find and it is never leaving its lowest speed, lowest power mode ( Average Processor Power_0(Watt)=7.4779 ). Processor utilization in the underclocked mode is around 20%.

That's because you don't use the swan_sync environment variable to reserve a full CPU core and maximize GPU usage (which is recommended for a GTX 980 Ti, especially for workunits like the GERARD_A2AR).

> So I call the "you need a better CPU talk" FUD.

Your system is optimized for low power, not for performance. Your CPU has an integrated PCIe controller (just as I suggested), and it is 6 years newer than the one the "better CPU talk" was about.
ID: 42641
Hello everyone,
ID: 42643
> That's because you don't use the swan_sync environmental variable to reserve a full CPU core, to maximize the GPU usage. (which is recommended for a GTX 980 Ti, especially for workunits like the GERARD_A2AR)

I tried swan_sync a few years ago and it made no difference. I just tried it again, setting both swan_sync and SWAN_SYNC since there is some confusion about which to use, and it made no difference in either CPU usage or GPU usage.

> Your system is optimized for low power, not for performance. Your CPU has integrated PCIe controller (just as I suggested), and 6 years younger than the one which the "better CPU talk" regards.

I seem to have skipped over your discussion of older processors. The OP's processor came out Q3'15; the Celeron I used came out Q4'11. His is a bit newer than mine.

I built this series of systems with the goal of optimizing GPU performance; reduced power consumption is a freebie. Every test build that ran even a single CPU work unit, no matter how many cores were available, resulted in reduced GPU performance. The Celeron with 0 CPU projects outperformed an i7 with one core reserved in every test. Reserving a core improved GPU performance; reserving the entire processor improved it a little more. I could just as easily have used an i7 as the Celeron, and the result would likely be the same: the processor is sleeping most of the time. A more modern processor has even more sleep states, so it would likely use even less power idling. But why idle a 300 dollar processor when I can do the same job idling a 30 dollar processor? There may be some value in debating whether limiting BOINC to 75% to reserve 2 threads is the same as reserving a core on an i7, or whether it is necessary to disable hyperthreading to guarantee an idle core. Likewise, there may be value in debating whether a more modern processor with a faster clock speed will do a better job when it actually runs than the old Celeron. But I believe that gets pretty far down into the weeds.

I would get out and push if I thought it would make it go any faster, so no, I did not optimize for power. I optimized for performance. And this GTX 970 did better all alone on the Celeron than it did fed by an i7 that had 75% of the CPU assigned to CPU tasks. The point I tried to make, and apparently failed, is that the processor does not appear to be offloading any floating point work from the GPU as was suggested, and thus a faster processor is unlikely to boost GPU utilization. At least, even if it is, any offload is easily performed by a 1.6 GHz Celeron in its sleep.
ID: 42644
> I tried swan_sync a few years ago and it made no difference. I just tried it again, setting both swan_sync and SWAN_SYNC since there is some confusion which to use, and it made no difference in either CPU usage or GPU usage.

When SWAN_SYNC is in effect, the CPU time is close to the run time. If the CPU usage of the ACEMD app is lower than a full core (or thread), then SWAN_SYNC is being ignored for some environmental reason.

> I built this series of systems with the goal of optimizing GPU performance. [...] Reserving a core improved GPU performance. Reserving the entire processor improved it a little more.

That aligns with my experience. Jeremy Zimmerman did extensive tests on the number of threads vs. performance almost 2 years ago, and published his results. (You have to open the link twice to get directly to his post.)

> I could just as easily have used an i7 as the celeron. [...] There may be some value in debating whether limiting boinc to 75% to reserve 2 threads is the same as reserving a core on an i7, or whether it is necessary to disable hyperthreading to guarantee an idle core.

I agree.

> Likewise there may be value in debating whether a more modern processor with a faster clock speed will do a better job when it actually runs than the old celeron. But I believe that gets pretty far down into the weeds.

If both have integrated PCIe controllers, the difference will be minimal (at least it won't be worth the higher price of the better CPU). But this difference varies between workunit batches, so it would be higher for GERARD_A2ARs.

> I would get out and push if I thought it would make it go any faster, so no, I did not optimize for power. I optimized for performance.

Now I understand your performance optimization, but there's one more thing you can do to push your GPU harder: a working SWAN_SYNC environment variable. There's advice on the forum about how to do it on Linux. Jeremy Zimmerman tested the effect of SWAN_SYNC as well, and published his results.

> And this GTX970 did better all alone on the celeron than it did fed by an i7 that had 75% of the cpu assigned to CPU tasks.

The faster the GPU, the more advisable it is not to do any CPU tasks on that host, and to choose a cheaper CPU with higher clocks (and fewer cores/threads). I've built my latest PC on these principles, and it can actually do any workunit faster than any other host on this project. The only way to build a faster host would be to use a GTX TITAN X, but I consider that GPU not worth buying because of its price/performance ratio.

> The point I tried to make, and apparently failed, is that the processor does not appear to be offloading any floating point work from the GPU as was suggested and thus a faster processor is unlikely to boost GPU utilization.

If the CPU is fairly state-of-the-art (i.e. it has an integrated PCIe controller), there won't be much difference. If it's not, the difference can be as much as 2x.
ID: 42649
Has anyone tested the SWAN_SYNC variable versus using an app_config file in the project folder to assign 1 CPU core per task?
ID: 42651
> When the SWAN_SYNC is in effect, the CPU time is near to the run time. If the CPU usage of the ACEMD app is lower than a full core (or thread), then the SWAN_SYNC is ignored for some environmental reasons.

I am pretty sure SWAN_SYNC was getting to ACEMD. It was in the environment of the acemd process when I looked in /proc/1879/environ while 1879 was the PID for acemd. My Google-fu is failing me, but my memory, which is probably worse than my Google-fu, recalls that Linux ignores SWAN_SYNC, and that Linux with or without the ignored SWAN_SYNC was as fast as Windows XP with SWAN_SYNC, which was faster than Windows XP without SWAN_SYNC, which was faster than Windows <anything newer>. But don't trust my memory... I don't. Re: nanoprobe - see the links in Retvari's post.
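For what it's worth, that /proc check can be scripted. A Linux-only sketch that confirms whether a variable actually reaches a child process's environment, using a throwaway sh process as a stand-in for acemd:

```shell
# A child process inherits the variable; its /proc/<pid>/environ entries
# are NUL-separated, so translate NULs to newlines before grepping.
env SWAN_SYNC=1 sh -c 'tr "\0" "\n" < /proc/self/environ' | grep '^SWAN_SYNC='

# For a live acemd process, substitute its real PID, e.g.:
#   tr "\0" "\n" < /proc/$(pgrep -f acemd | head -n1)/environ | grep -i swan
```

If the variable made it through, the first command prints SWAN_SYNC=1; no output means the process never received it.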
ID: 42653
> Has anyone tested the SWAN_SYNC variable verses using an app_config file in the project folder to assign 1 CPU core per task?

Two completely different purposes. SWAN_SYNC would make the ACEMD application use extra CPU cycles whether or not a core was free; it would simply overcommit the CPU and cause a lot of thrashing if the CPU was filled with other tasks. app_config (or simply reducing the number of cores BOINC is allowed to schedule) would make space available on the CPU, but do nothing at all to encourage ACEMD to use it. If you normally run your CPUs full to the brim, you should probably use both.
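A sketch of using both together on Linux, per the advice above. The app name "acemdlong" and the paths are assumptions (check client_state.xml and your BOINC data directory), and SWAN_SYNC must be exported in the environment that starts the BOINC client, since acemd inherits it from there:

```shell
# 1) Reserve a full CPU core per task via app_config.xml.
#    Save it as projects/www.gpugrid.net/app_config.xml in the BOINC data dir;
#    "acemdlong" is an assumed app name -- verify it in client_state.xml.
cat > app_config.xml <<'EOF'
<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
EOF

# 2) Make SWAN_SYNC visible to the client and everything it spawns,
#    e.g. in the script that starts the client:
#      export SWAN_SYNC=1
#      ./boinc --daemon

grep -c '<cpu_usage>' app_config.xml   # sanity check: prints 1
```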
ID: 42675