About the GERARD_A2AR batch

Message boards : Number crunching : About the GERARD_A2AR batch

Author	Message
Retvari Zoltan (Joined: 20 Jan 09; Posts: 2343; Credit: 16,201,255,749; RAC: 0)	Message 42683 - Posted: 25 Jan 2016 | 9:18:45 UTC Last modified: 25 Jan 2016 | 9:24:26 UTC
	I accidentally ran a test with my i7-4790K + GTX 980 Ti + WinXP x64 host. The CPU's PCIe 3.0 link was running at only x4, so its bandwidth was about that of PCIe 1.0 x16. I observed the following:
	1. GPU usage and temperature were slightly lower than normal (so it ran unnoticed for a couple of days).
	2. The GPU and memory clocks were normal (1404 MHz and 3505 MHz).
	3. The workunits' runtime went up by 123% (yes, more than doubled: 6h5m -> 14h14m).
	These results confirm that an old motherboard and CPU should not be paired with high-end GPUs, to avoid this kind of frustration with similar workunits in the future. (Also, the performance decrease caused by WDDM is higher than usual on these workunits.)
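	As a quick check of the figures in item 3, here is a minimal Python sketch (the helper names are made up for illustration) that converts the two quoted runtimes to minutes and computes the relative slowdown. Taken at face value, 6h5m -> 14h14m works out closer to +134% than +123%, so one of the quoted numbers is probably rounded or comes from a different pair of workunits.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an h/m runtime to whole minutes."""
    return hours * 60 + minutes


def percent_increase(before: float, after: float) -> float:
    """Relative increase of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


normal = to_minutes(6, 5)    # runtime with the full-width PCIe link
slow = to_minutes(14, 14)    # runtime with the link stuck at x4
increase = percent_increase(normal, slow)

print(f"{normal} min -> {slow} min: +{increase:.0f}% ({slow / normal:.2f}x)")
# prints: 365 min -> 854 min: +134% (2.34x)
```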

Dayle Diamond (Joined: 5 Dec 12; Posts: 84; Credit: 1,663,883,415; RAC: 0)	Message 42687 - Posted: 25 Jan 2016 | 23:13:43 UTC - in response to Message 42683.
	I had a few WUs take much longer than average a few days ago, with no system changes. I got the regular amount of cobblestones, so I assume the extra length was unintentional. It's back to normal now. Would you mind offering a few more details? How old was the motherboard running the 4790K? In my estimation, the 4790K isn't a very old CPU; what sort of motherboard were you plugging it into?

Bedrich Hajek (Joined: 28 Mar 09; Posts: 485; Credit: 10,917,948,466; RAC: 15,767,919)	Message 42688 - Posted: 25 Jan 2016 | 23:44:02 UTC - in response to Message 42683.
	Retvari Zoltan wrote: "I accidentally ran a test with my i7-4790K + GTX 980 Ti + WinXP x64 host. The CPU's PCIe 3.0 link was running at only x4, so its bandwidth was about that of PCIe 1.0 x16. I observed the following: 1. GPU usage and temperature were slightly lower than normal (so it ran unnoticed for a couple of days). 2. The GPU and memory clocks were normal (1404 MHz and 3505 MHz). 3. The workunits' runtime went up by 123% (yes, more than doubled: 6h5m -> 14h14m)."
	Compared with my AMD Athlon(tm) 64 X2 Dual Core Processor 5000+ with a GTX 980 Ti and Windows XP x32, the results are as follows:
	1. GPU usage is under 70% and the temperature is about 50C, compared with about 95% usage at about 60C on other GERARD WUs.
	2. Device clock: 1190MHz. Memory clock: 3505MHz.
	3. Work unit run time is over 12 hours, compared with about 6 hours (plus or minus) for the other GERARD WUs.
	Even with several of these slow WUs in my average, my average completion time is currently in 2nd place:
	Rank	User name	Average (h)	Total crunched (WU)	GPU description of fastest setup
	1	BurningToad	6.23333	12	NVIDIA GeForce GTX 980 Ti (4095MB) driver: 355.98
	2	Bedrich Hajek	6.37091	55	NVIDIA GeForce GTX 980 Ti (4095MB) driver: 355.82
	3	Xeaon	6.77059	17	NVIDIA GeForce GTX 980 Ti (4095MB) driver: 361.43
	4	Streetlight	6.95152	33	[2] NVIDIA GeForce GTX TITAN X (4095MB) driver: 358.50
	5	syntech	6.98824	17	NVIDIA GeForce GTX 980 Ti (4095MB) driver: 358.91
	6	Retvari Zoltan	7.02595	185	NVIDIA GeForce GTX 980 Ti (4095MB) driver: 358.50
	7	Gamekiller	7.03636	11	NVIDIA GeForce GTX 980 Ti (4095MB) driver: 361.43
	8	whizbang	7.81600	25	[2] NVIDIA GeForce GTX 980 (4095MB) driver: 361.43
	9	Kagura Kagami@jisaku	7.81818	11	NVIDIA GeForce GTX 980 Ti (4095MB) driver: 361.43
	10	Andree Jacobson	7.89167	12	NVIDIA GeForce GTX 980 (4095MB)
	Retvari Zoltan wrote: "These results confirm that an old motherboard and CPU should not be paired with high-end GPUs, to avoid this kind of frustration with similar workunits in the future. (Also, the performance decrease caused by WDDM is higher than usual on these workunits.)"
	Yes and no. Yes, it is frustrating. But why shouldn't high-end cards be used in older computers? I am getting run times on the other GERARD WUs that are comparable to those of my new Windows 10 computer.
	This brings up another question about future WUs: should they be more or less dependent on CPUs? Even fast computers pay a time penalty with the A2AR WUs, though it is much smaller than on the older CPUs. If we want more efficient (faster) crunching, the WUs should be made less CPU dependent, where possible.
	As for the WDDM penalty, which happens (correct me if I am wrong) when the GPU has to access the CPU after each step: is it possible to have this access only every other step, or possibly every third? This should reduce the lag. I am not sure I have the right wording for this, but I hope you can understand what I am saying.

Retvari Zoltan (Joined: 20 Jan 09; Posts: 2343; Credit: 16,201,255,749; RAC: 0)	Message 42689 - Posted: 26 Jan 2016 | 0:17:23 UTC - in response to Message 42687. Last modified: 26 Jan 2016 | 0:17:50 UTC
	Dayle Diamond wrote: "How old was the motherboard running the 4790K? In my estimation, the 4790K isn't a very old CPU; what sort of motherboard were you plugging it into?"
	It's in a Gigabyte GA-Z87X-OC motherboard. I don't know the exact age of this board, but the Z87 chipset is almost 3 years old. That is not too old for a GTX 980 Ti. But when its PCIe bus was limited to x4, it behaved like a really old motherboard with PCIe 1.0 x16.
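	The equivalence between PCIe 3.0 x4 and PCIe 1.0 x16 can be sanity-checked from the published per-lane signalling rates and line encodings. The following is only a rough sketch: the function and table names are invented for illustration, and the figures are theoretical one-direction maxima that ignore packet and protocol overhead, so real transfers will be slower.

```python
# Published PCIe signalling rate (GT/s per lane) and line-encoding efficiency.
# Gen 1 and 2 use 8b/10b encoding, gen 3 uses 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
}


def link_bandwidth_gb_s(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, ignoring protocol overhead."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_s * efficiency * lanes / 8  # 8 bits per byte


for gen, lanes in [(1, 16), (3, 4), (3, 16)]:
    print(f"PCIe {gen}.0 x{lanes}: {link_bandwidth_gb_s(gen, lanes):.2f} GB/s")
# prints:
# PCIe 1.0 x16: 4.00 GB/s
# PCIe 3.0 x4: 3.94 GB/s
# PCIe 3.0 x16: 15.75 GB/s
```

	On these numbers an x4 link at PCIe 3.0 has roughly a quarter of the bandwidth of the full x16 link, which is consistent with it behaving like a PCIe 1.0 x16 slot.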

Retvari Zoltan (Joined: 20 Jan 09; Posts: 2343; Credit: 16,201,255,749; RAC: 0)	Message 42690 - Posted: 26 Jan 2016 | 1:41:40 UTC - in response to Message 42688.
	Bedrich Hajek wrote: "Even with several of these slow WUs in my average, my average completion time is currently in 2nd place."
	The present mix of workunits is easier on the CPU, but previously there were large batches (for example the NOELIA tasks) that were like the GERARD_A2AR batch is now.
	Retvari Zoltan wrote: "These results confirm that an old motherboard and CPU should not be paired with high-end GPUs, to avoid this kind of frustration with similar workunits in the future. (Also, the performance decrease caused by WDDM is higher than usual on these workunits.)"
	Bedrich Hajek wrote: "Yes and no. Yes, it is frustrating. But why shouldn't high-end cards be used in older computers?"
	This is merely a forewarning, to avoid the frustration that a large CPU-demanding batch could cause.
	Bedrich Hajek wrote: "I am getting run times on the other GERARD WUs that are comparable to those of my new Windows 10 computer."
	True. But this thread is about the GERARD_A2AR batch, whose runtimes are ~70% longer on your older Athlon/WinXP host than on your i7-5820K/Win10 host. To put it in an even worse perspective: your older host's GERARD_A2AR runtimes are ~100% longer than those of my WinXP/i3-4130 host.
	Bedrich Hajek wrote: "This brings up another question about future WUs: should they be more or less dependent on CPUs?"
	Now that's the million-dollar question.
	Bedrich Hajek wrote: "Even fast computers pay a time penalty with the A2AR WUs, though it is much smaller than on the older CPUs. If we want more efficient (faster) crunching, the WUs should be made less CPU dependent, where possible."
	I think that's impossible from a computing point of view. (We would then have to use double-precision-enabled GPUs, which are very, very expensive.)

disturber (Joined: 11 Jan 15; Posts: 11; Credit: 62,705,704; RAC: 0)	Message 42709 - Posted: 28 Jan 2016 | 20:35:46 UTC - in response to Message 42690.
	Retvari Zoltan wrote: "(We would then have to use double-precision-enabled GPUs, which are very, very expensive.)"
	If you can tolerate a 1/4 double-precision floating-point (DFP) rate, which is significantly better than the 1/24 or 1/32 on crippled cards, then the only reasonable choice is the AMD 7970 or 280X. It is the only one that produces high output in Milkyway, which is specifically programmed for double-precision floating point. These cards go for less than $200 on the used market.
	My computer recently downloaded a work unit of a type that normally takes 9-10 hours on my GTX 970, and to my astonishment BOINC tells me that it will finish in 1d 03:16:51. It is the new chalcone229x2-GERARD_CXCL12_DCKCHALK. Is anyone else seeing this? It's not a slow machine: PCIe 3.0 is running at x8 and the CPU is an i7-3770K overclocked to 4.4GHz. Is the credit commensurate with the long compute time? Not trying to hijack this thread.
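	To put the double-precision ratios mentioned above side by side, here is a small illustrative Python sketch. The 4.0 TFLOPS single-precision figure is a hypothetical placeholder, not the specification of any card in this thread, and the function name is made up; only the 1/4, 1/24 and 1/32 ratios come from the post.

```python
from fractions import Fraction


def dp_throughput(sp_tflops: float, dp_ratio: Fraction) -> float:
    """Double-precision throughput implied by an SP figure and a DP:SP ratio."""
    return sp_tflops * float(dp_ratio)


SP_TFLOPS = 4.0  # hypothetical single-precision throughput, for illustration only

for ratio in (Fraction(1, 4), Fraction(1, 24), Fraction(1, 32)):
    print(f"DP at {ratio} rate: {dp_throughput(SP_TFLOPS, ratio):.3f} TFLOPS")
# prints:
# DP at 1/4 rate: 1.000 TFLOPS
# DP at 1/24 rate: 0.167 TFLOPS
# DP at 1/32 rate: 0.125 TFLOPS
```

	For the same single-precision speed, a 1/4-rate card would deliver six to eight times the double-precision throughput of a 1/24- or 1/32-rate card.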

biodoc (Joined: 26 Aug 08; Posts: 183; Credit: 10,085,929,375; RAC: 6,193,153)	Message 42710 - Posted: 28 Jan 2016 | 21:29:58 UTC
	If the Work Units are truly CPU limited on a 4790K, then you should see an increase in performance if you disable hyperthreading in the bios. The app would then have access to one full core rather than a single logical core.

Retvari Zoltan (Joined: 20 Jan 09; Posts: 2343; Credit: 16,201,255,749; RAC: 0)	Message 42711 - Posted: 28 Jan 2016 | 23:53:44 UTC - in response to Message 42710.
	biodoc wrote: "If the Work Units are truly CPU limited on a 4790K, then you should see an increase in performance if you disable hyperthreading in the bios. The app would then have access to one full core rather than a single logical core."
	It's not CPU limited. It was PCIe bandwidth limited when the CPU's PCIe bus was running at x4 speed by accident.

Retvari Zoltan (Joined: 20 Jan 09; Posts: 2343; Credit: 16,201,255,749; RAC: 0)	Message 42712 - Posted: 29 Jan 2016 | 0:12:33 UTC Last modified: 29 Jan 2016 | 0:14:49 UTC
	Actually, there's a complete thread for a similar batch in the news topic, started by Gerard himself :) It contains an explanation for the higher CPU usage of this batch:
	Gerard wrote: "I forgot to note that due to the nature of these simulations, some small forces have to be added externally and, unfortunately, these have to be calculated on the CPU instead of the GPU. Therefore you may notice some amount of CPU usage, which in my case never surpassed 10%."

kingcarcas (Joined: 27 Oct 09; Posts: 18; Credit: 378,626,631; RAC: 0)	Message 42714 - Posted: 29 Jan 2016 | 8:53:22 UTC Last modified: 29 Jan 2016 | 8:54:49 UTC
	Good stuff. This is the sort of thing LinusTechTips over on YouTube does once in a while. How does it fare at x8?

fractal (Joined: 16 Aug 08; Posts: 87; Credit: 1,248,879,715; RAC: 0)	Message 42717 - Posted: 29 Jan 2016 | 18:18:35 UTC - in response to Message 42709. Last modified: 29 Jan 2016 | 18:19:12 UTC
	disturber wrote: "My computer recently downloaded a work unit of a type that normally takes 9-10 hours on my GTX 970, and to my astonishment BOINC tells me that it will finish in 1d 03:16:51. It is the new chalcone229x2-GERARD_CXCL12_DCKCHALK. Is anyone else seeing this?"
	I got a couple of those earlier this week. They started off with an estimate of a day and a half but finished in 10 hours. How long did yours take?
	The other question raised here is one I have been pondering lately: how does PCIe bandwidth affect performance? The general "hive consensus" so far has been that projects do not need more than what PCIe 1.0 x16 provides. This was extensively tested by the bitcoin community. It is highly dependent on the project and workload, so different projects will have different requirements. My current generation of builds assumes that a PCIe 2.0 x8 slot can keep a GPU happy. This thread is starting to make me wonder whether that is true.
	The cost of a system with 32 PCIe lanes and enough power to run two modern GPUs exceeds the cost of two basic systems with 16 PCIe lanes each and a modest PSU by a significant amount. A hundred-dollar bundle with a thirty-dollar PSU will easily handle any single GPU. A system with 32 PCIe lanes on two x16 slots is easily a two-hundred-dollar motherboard, plus memory and CPU, plus a hundred-dollar PSU. Add to that the difficulty of cooling a system with dual 300-watt GPUs.
	So it used to be that we had to shop for motherboards that ran a single slot at x16 and dropped to x8/x8 when a second card was added, rather than ones that ran a single slot at x16 and dropped to x16/x4 with a second card. Have we entered an era where x8/x8 isn't fast enough any more?

Retvari Zoltan (Joined: 20 Jan 09; Posts: 2343; Credit: 16,201,255,749; RAC: 0)	Message 42719 - Posted: 29 Jan 2016 | 20:43:38 UTC - in response to Message 42717.
	fractal wrote: "Have we entered an era where x8/x8 isn't fast enough any more?"
	I wouldn't say that. WU batches come and go, and some of them (for example the one this thread is about) are more CPU/PCIe bandwidth dependent. As a performance enthusiast I don't like to make compromises, so I wouldn't build a dual (or multi) GPU host for GPUGrid (though I have some). There's no point in spending the extra bucks on a more capable (s2011) MB and CPU (and cooling) if you have the space (and the will) to build two (or more) hosts and you don't want to see your host on the hosts' overall and RAC toplists.

Bedrich Hajek (Joined: 28 Mar 09; Posts: 485; Credit: 10,917,948,466; RAC: 15,767,919)	Message 42727 - Posted: 31 Jan 2016 | 11:48:40 UTC - in response to Message 42690.
	Bedrich Hajek wrote: "This brings up another question about future WUs: should they be more or less dependent on CPUs?"
	Retvari Zoltan wrote: "Now that's the million-dollar question."
	If WUs do become more CPU dependent, then having 2 or more CPUs feeding 1 GPU should offset this lag, unless this increases the PCIe bus traffic dramatically. This should also reduce the WDDM lag. Can this be done? If yes, how?

skgiven (Volunteer moderator, Volunteer tester; Joined: 23 Apr 09; Posts: 3968; Credit: 1,995,359,260; RAC: 0)	Message 42758 - Posted: 5 Feb 2016 | 18:12:16 UTC - in response to Message 42727.
	The science will determine the reliance on the CPU. That said, this is a GPU project that tries to do most of the work on the GPU, and it is designed to utilize gaming GPUs.
	The problem identified in the OP wasn't with the CPU; it was (mostly) a PCIe x4 bottleneck. That said, the PCIe controller is on the CPU, and the CPU does have to do some work. So there isn't an inherent need for another CPU (socket 2 on a dual-CPU board), and if there were, the PCIe controllers would need to be separated across both CPUs.
	Perhaps the only way for WUs to avoid the WDDM overhead in recent MS operating systems would be to use on-GPU CPUs. If this ever does become a reality, then the co-processor would get a dedicated 'main' processor (a developmental flip). For now there is still XP, Server 2003/2003 R2 and Linux.
