Message boards : Graphics cards (GPUs) : Ampere 10496 & 8704 & 5888 fp32 cores!

eXaPower
Message 55227 - Posted: 1 Sep 2020 | 23:03:41 UTC

Incredible amount of compute power compared to Turing!
Roughly 36 TFLOPS of single-precision (FP32) compute for the 3090, 30 TFLOPS for the 3080 and 20 TFLOPS for the 3070.

New arch details forthcoming. Plenty of websites with current info.

New Samsung(?) 8nm lithography with 28 billion transistors (3090); die size currently unknown.
The previous 12nm process was TSMC "FFN".
Memory clocks are faster than Turing.
PCIe 4.0. Ampere boost should be around 2GHz, as the previous PCIe 3.0 Pascal and Turing cards routinely managed.

Will GPUGRID be ready for this newest generation?

I'll be an early adopter. Looking at a couple of 3080s, since the 3090 offers 1800 more cores for a $700 premium.
I'd purchase a new PCIe 4.0 motherboard and a 12+ core CPU with the savings.
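
For reference, headline TFLOPS figures like the ones above follow from a simple formula: FP32 units x 2 operations per clock (one fused multiply-add counts as two) x clock speed. A minimal sketch; the boost clocks below are assumed reference values, not numbers from this thread, so treat the output as approximate.

```python
# Peak FP32 throughput = FP32 units * 2 ops per clock (one FMA) * clock.
# The boost clocks are ASSUMED reference values, not taken from this thread.
ADVERTISED_FP32_UNITS = {"RTX 3090": 10496, "RTX 3080": 8704, "RTX 3070": 5888}
ASSUMED_BOOST_GHZ = {"RTX 3090": 1.70, "RTX 3080": 1.71, "RTX 3070": 1.73}


def peak_fp32_tflops(fp32_units: int, clock_ghz: float) -> float:
    """Theoretical peak in TFLOPS; real workloads rarely reach it."""
    return fp32_units * 2 * clock_ghz / 1000.0


for card, units in ADVERTISED_FP32_UNITS.items():
    tflops = peak_fp32_tflops(units, ASSUMED_BOOST_GHZ[card])
    print(f"{card}: {tflops:.1f} TFLOPS peak FP32")
```

This is a theoretical ceiling only; it says nothing about how much of it a given compute app can actually use.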

Keith Myers
Message 55229 - Posted: 2 Sep 2020 | 1:15:32 UTC - in response to Message 55227.

I didn't see any publication of the actual memory clocks. Just the bandwidths. Will wait and see what the actual specs and test results are once the actual cards are in the hands of testers. Just because the cards will be great pixel pushers for ray-tracing games doesn't mean they will produce the commensurate compute improvements.

And to fully utilize the new architectural differences between Ampere and Turing/Pascal means we will need new applications.

eXaPower
Message 55230 - Posted: 2 Sep 2020 | 1:45:33 UTC - in response to Message 55229.
Last modified: 2 Sep 2020 | 1:49:26 UTC

Is INT32 performance the same as FP32, i.e. 1:1 as on Turing?

On gaming: the most demanding 4K ray-tracing title is currently the third Metro game.

Certainly others look crisp too. Another game, demo or 8K video will showcase Ampere's chops. I have 4K now, and 8K monitors are right around the corner for mainstream purchase. 2020 bleeding edge.
I was surprised at the core counts: 10k and 8k, and over 5k on the 3070. Never mind Ampere's tensor or FP64 performance.

rod4x4
Message 55231 - Posted: 2 Sep 2020 | 2:04:00 UTC - in response to Message 55229.


> And to fully utilize the new architectural differences between Ampere and Turing/Pascal means we will need new applications.

The idea of going to "Wrapper" with ACEMD3 was to enable easier development of apps for new CUDA / Architectural releases.
I'm interested to see how easy this path will be. The holdup may be Nvidia and how fast they release the next CUDA Toolkit.

Keith Myers
Message 55232 - Posted: 2 Sep 2020 | 6:29:59 UTC - in response to Message 55231.

The trick will be to use all the new hardware in the best parallelization of the current and future searches.

If you are a game designer, you have new SDKs for gaming. But is there going to be a compute SDK right after release?

Best scenario would be yes, and even better would be some new automatic profilers for compute loads. That way you could just input the current source code and the profiler would spit out new code optimized for the new hardware resources in Ampere. Then look at the generated code and iterate another revision that is better and faster. Rinse and repeat.

eXaPower
Message 55237 - Posted: 2 Sep 2020 | 11:22:40 UTC

What concerns me: will Turing's quality control issues crop up on Ampere?
Out of 6 Turing GPUs I purchased, 4 died: a Gigabyte 2070 in a day, a Zotac 2060 after 5 months, an EVGA 2080 after 2 months and another EVGA 2080 after 2 years.

All had a 3 year warranty. I sold all the warranty replacements since I didn't want any more Turings due to their high death rate.
I had a Pascal 1080 GPU die after 27 months. My EVGA 1070 is still holding on after 4 years of running 24/7; it would have been retired if the Turings had lasted. And none of my Maxwells quit; they just retired.

Retvari Zoltan
Message 55238 - Posted: 2 Sep 2020 | 12:40:29 UTC - in response to Message 55227.
Last modified: 2 Sep 2020 | 12:48:43 UTC

I expect that the number of CUDA cores usable for crunching will be only half of that stated in the title of this thread.
This is similar to the GF116 architecture, where only 2/3 of the cores could be used for crunching (due to the 4/6 ratio of dispatch units to CUDA cores).


I think the relative computing performance compared to the RTX 2080Ti will be the following:

card          advertised cores   usable cores   performance
RTX 2080Ti    4352               4352           100.0%
RTX 3090      10496              5248           120.6%
RTX 3080      8704               4352           100.0%
RTX 3070      5888               2944            67.6%

Perhaps a bit (say 10%) more, taking other factors into consideration.
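
The percentages in this estimate can be reproduced with a few lines of arithmetic. A minimal sketch, assuming (as the estimate does) that only half of the advertised Ampere cores are usable and that performance scales linearly with core count; clock speed differences are ignored.

```python
# ASSUMPTION from the estimate above: half of the advertised Ampere cores
# are usable for crunching, and performance scales linearly with cores.
BASELINE_CORES = 4352  # RTX 2080 Ti
ADVERTISED = {"RTX 3090": 10496, "RTX 3080": 8704, "RTX 3070": 5888}


def relative_performance(advertised_cores: int, usable_fraction: float = 0.5) -> float:
    """Estimated performance as a percentage of the RTX 2080 Ti."""
    usable = advertised_cores * usable_fraction
    return 100.0 * usable / BASELINE_CORES


for card, cores in ADVERTISED.items():
    print(f"{card}: {cores // 2} usable cores, {relative_performance(cores):.1f}%")
```

Passing `usable_fraction=1.0` gives the other extreme, where every advertised core counts.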

Ian&Steve C.
Message 55239 - Posted: 2 Sep 2020 | 15:23:03 UTC - in response to Message 55238.

I think the CUDA core counts might be a bit of marketing magic. They claimed in the article that the Ampere cores can do 2 operations per clock, which effectively doubles the work done on a single physical core. We’ll see how it shakes out for compute work.

I’ll probably grab a 3070 when they are released to compare to my 2080ti and existing 2070s.

I’m very interested in the performance per watt of the new cards. Don’t forget that while performance is looking great, all the new cards are seeing a healthy increase in power consumption as well. The 2080ti was a 250W card, the 2080 was a 215W card, the 2070 was a 175W card. Now these new cards are 350W/320W/220W for 3090/3080/3070 respectively.

If the 3070 performs comparably to a 2080ti, that’s about 12% less power for the same work, which is believable.

For CUDA compute work, the 2070 was as fast or a little faster than a 1080ti at much less power draw (250W vs 175W). So I could see the same thing happening again for 3070 vs 2080ti.
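
To make the efficiency comparison concrete, here is a small sketch using the board power figures quoted above. It assumes the two cards being compared deliver equal compute performance, which is a guess at this point rather than a measurement.

```python
# Board power figures as quoted in the post; equal compute performance
# between the compared cards is an ASSUMPTION, not a measurement.
def power_saving_pct(old_watts: float, new_watts: float) -> float:
    """Percent less power drawn for the same work."""
    return 100.0 * (old_watts - new_watts) / old_watts


def perf_per_watt_gain_pct(old_watts: float, new_watts: float) -> float:
    """Percent more work per watt, assuming equal performance."""
    return 100.0 * (old_watts / new_watts - 1.0)


for old, new, label in ((250, 220, "2080ti -> 3070"), (250, 175, "1080ti -> 2070")):
    print(f"{label}: {power_saving_pct(old, new):.1f}% less power, "
          f"{perf_per_watt_gain_pct(old, new):.1f}% more work per watt")
```

Note the two numbers differ: 12% less power at equal performance works out to a bit under 14% more work per watt.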

Keith Myers
Message 55242 - Posted: 2 Sep 2020 | 16:08:47 UTC - in response to Message 55238.

> I expect that the number of CUDA cores usable for crunching will be only half of that stated in the title of this thread.

I think you will be correct also. I saw the published CUDA core counts and thought marketing nonsense. Unless they fundamentally changed the architecture design, I think they just doubled the physical core counts by the new 2 operands per cycle PAM memory operations.

Ian&Steve C.
Message 55243 - Posted: 3 Sep 2020 | 19:47:50 UTC - in response to Message 55238.

Did some more reading and I think I got a better grasp on the whole CUDA core count issue and relative performance.

What’s being left to the footnotes is that comment that Jensen made in the announcement, “two instructions per clock”. What he didn’t say was how the SM partitioning worked to allow this to happen, and that the doubling affects FP32 calculations only. They have a new datapath design where each SM partition can handle either 32x FP32 (which is double Turing) OR 16x FP32 and 16x INT32.

For graphics loads, it’s mostly FP32 so this works to their advantage, but for INT32 workloads, you won’t have this doubling effect, and the performance will be closer to Turing, with the normal generational efficiency boost.

A full size Ampere GPU core (GA100) is 128SMs with 64 CUDA cores per SM for a highest possible CUDA core count of 8192. The 3090, 3080, 3070 are not full size Ampere cores. Given that the marketing number of the 3090 is “10496” cores, we can surmise that it’s really 5248 CUDA cores with 82 SMs. This is still more SMs and cores than the 2080ti has (68 SMs/ 4352 Cores). However, the 3070 with its “5888” cores really has 2944 cores.

So it will depend on your workload. If you’re only living in the bubble of gaming, then yeah enjoy that 2x boost. But for other non-FP32 loads (like more purely computational loads), you’re going to likely only see normal generational improvements with Ampere over Turing.
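
A quick sanity check of the per-SM arithmetic behind these numbers, as a sketch. The SM and core counts are the ones quoted in this post; the helper name is just for illustration.

```python
# SM counts and core counts as quoted in the post above.
CARDS = {
    "RTX 2080 Ti": {"sms": 68, "fp32_units": 4352},
    "RTX 3090 (advertised)": {"sms": 82, "fp32_units": 10496},
    "RTX 3090 (halved)": {"sms": 82, "fp32_units": 5248},
}


def units_per_sm(fp32_units: int, sms: int) -> int:
    """FP32 units per SM; raises if the counts do not divide evenly."""
    per_sm, remainder = divmod(fp32_units, sms)
    if remainder:
        raise ValueError("core count is not a multiple of the SM count")
    return per_sm


for name, card in CARDS.items():
    print(f"{name}: {units_per_sm(card['fp32_units'], card['sms'])} FP32 units per SM")
```

The advertised figure works out to 128 FP32 units per SM and the halved figure to 64, the same per-SM count as Turing.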

rod4x4
Message 55246 - Posted: 3 Sep 2020 | 23:28:58 UTC - in response to Message 55243.

> Did some more reading and I think I got a better grasp on the whole CUDA core count issue and relative performance.
>
> What’s being left to the footnotes is that comment that Jensen made in the announcement, “two instructions per clock”. What he didn’t say was how the SM partitioning worked to allow this to happen, and that the doubling affects FP32 calculations only. They have a new datapath design where each SM partition can handle either 32x FP32 (which is double Turing) OR 16x FP32 and 16x INT32.
>
> For graphics loads, it’s mostly FP32 so this works to their advantage, but for INT32 workloads, you won’t have this doubling effect, and the performance will be closer to Turing, with the normal generational efficiency boost.
>
> A full size Ampere GPU core (GA100) is 128SMs with 64 CUDA cores per SM for a highest possible CUDA core count of 8192. The 3090, 3080, 3070 are not full size Ampere cores. Given that the marketing number of the 3090 is “10496” cores, we can surmise that it’s really 5248 CUDA cores with 82 SMs. This is still more SMs and cores than the 2080ti has (68 SMs/ 4352 Cores). However, the 3070 with its “5888” cores really has 2944 cores.
>
> So it will depend on your workload. If you’re only living in the bubble of gaming, then yeah enjoy that 2x boost. But for other non-FP32 loads (like more purely computational loads), you’re going to likely only see normal generational improvements with Ampere over Turing.


Thanks for the analysis.
Will have to wait and see how they assign SMs to each RTX model to gauge the performance increase for each model.
Nvidia will not give their technology away, so I suspect you are right in saying it will only be a generational performance increase for us.

kain
Message 55279 - Posted: 11 Sep 2020 | 18:50:56 UTC

Well, I'm optimistic :)

https://wccftech.com/nvidia-geforce-rtx-3080-flagship-2x-faster-than-rtx-2080-in-opencl-cuda-benchmarks/

Pop Piasa
Message 55281 - Posted: 12 Sep 2020 | 2:27:45 UTC

Hopefully, the Amperes will be showing up on the Passmark GPU Direct Compute ratings soon.

https://www.videocardbenchmark.net/directCompute.html

eXaPower
Message 55292 - Posted: 14 Sep 2020 | 21:55:21 UTC

https://www.tomshardware.com/features/nvidia-ampere-architecture-deep-dive

Real FP32 cores confirmed. 64 FP32 per SM, with 32 cores for FP32 only and the remaining 32 cores being concurrent INT32 or FP32. See the Ampere slides. Compute benchmarks released (tomorrow?) will show how well the new FP32 design performs.

Ampere INT32 performance is 50-66% of FP32, depending on code efficiency.

Consumer Ampere (GA102/GA104) double precision (FP64) is now at a 1/64 ratio of FP32 (2 FP64 units per SM). Turing has 1/32 (2 FP64 units per SM).

eXaPower
Message 55302 - Posted: 16 Sep 2020 | 21:12:25 UTC

https://www.tomshardware.com/news/nvidia-geforce-rtx-3080-review

Tom's will reveal their compute benchmarks soon and show detailed power profiles.

Definitely an upgrade if you own a 2080ti.

Are you purchasing an Ampere? If so, which? For me the 3080 and 3070 look good.
The RTX 3090 is the ultra halo card, with an awful per-core cost compared to the 3080.

Note: the RTX 3080 Founders Edition's real power consumption is similar to an overclocked non-Founders 2080ti, demanding 330 watts.

Curious what the non-Founders RTX 3090 pulls - my guess is ~400W overclocked.

Edit: noticed I made mistake in my previous post - meant to write Ampere has 128 fp32 cores per sm.

Keith Myers
Message 55304 - Posted: 17 Sep 2020 | 20:25:25 UTC - in response to Message 55302.

So far all the cards are power limited. Almost no overclocking potential. Maybe 2-3%. Doubt the 3090 cards will be any different.

eXaPower
Message 55351 - Posted: 25 Sep 2020 | 12:57:08 UTC - in response to Message 55304.

> So far all the cards are power limited. Almost no overclocking potential. Maybe 2-3%. Doubt the 3090 cards will be any different.

You are right: even with enough power the OC wall is 2GHz - many forums are reporting Ampere 3080s crashing at 2GHz.
The Ampere release is eerily similar to Turing.

Remember when early model 2080 and 2080ti cards were glitching out at stock clocks? Ampere's improved Founders cooling didn't help. It might be memory related, due to the new technology, and/or quality control with the massive dies. Die density on the 12nm Turing TU102 is 25M transistors per mm². The 8nm GA102 has 45M per mm². 7nm and 5nm are denser still.

Finding a 3080 is another story, with limited availability. It is so bad that Nvidia released a statement about availability.

Published RTX 3090 reviews show overclocked power consumption at 450W with a 480W power limit, and 360W at out-of-the-box clocks. A big power increase for little performance gain. Two 2080ti cards had three 8-pin power connectors: the MSI Lightning and the Galax. Now all 3090s and most 3080s have three 8-pin connectors, or the new 12-pin on Founders cards.

Keith Myers
Message 55354 - Posted: 25 Sep 2020 | 17:21:18 UTC
Last modified: 25 Sep 2020 | 17:22:33 UTC

Until the compute apps get recompiled for CUDA 11.1 and the new PTX library, none are going to show the potential from using the dormant extra FP32 pipeline in the architecture.

eXaPower
Message 55366 - Posted: 27 Sep 2020 | 15:25:33 UTC

https://www.techpowerup.com/272591/rtx-3080-crash-to-desktop-problems-likely-connected-to-aib-designed-capacitor-choice


rod4x4
Message 55371 - Posted: 29 Sep 2020 | 5:56:11 UTC - in response to Message 55366.
Last modified: 29 Sep 2020 | 6:01:38 UTC

> https://www.techpowerup.com/272591/rtx-3080-crash-to-desktop-problems-likely-connected-to-aib-designed-capacitor-choice


Nvidia has released driver 456.55 to fix the issue. (driver appears to lower the boost to prevent the crashes during games)
https://videocardz.com/newz/nvidia-geforce-rtx-3080-owners-report-fewer-crashes-after-updating-drivers

ASUS and MSI have modified their designs to fix the crashes by implementing a different capacitor configuration.
https://videocardz.com/newz/asus-also-caught-modifying-geforce-rtx-3080-tuf-and-rog-strix-pcb-designs
and
https://videocardz.com/newz/msi-quietly-changes-geforce-rtx-3080-gaming-x-trio-design-amid-stability-concerns

Hopefully just a minor speed bump in the release of the new Ampere Architecture.

Plain sailing from here?

Keith Myers
Message 55372 - Posted: 29 Sep 2020 | 19:31:37 UTC

> Plain sailing from here?

No, not at all. No compatible apps yet.

rod4x4
Message 55373 - Posted: 29 Sep 2020 | 23:01:58 UTC - in response to Message 55372.
Last modified: 29 Sep 2020 | 23:42:04 UTC

> > Plain sailing from here?
>
> No, not at all. No compatible apps yet.

Past architectural changes have taken 6 months or more for a matching new app.

First Pascal GPU released May 2016, Gpugrid Pascal compatible app released November 2016.
First Turing GPU released September 2018, Gpugrid Turing compatible app released October 2019.
(Working from memory here, so please correct if the timeline is not right.)
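
The gaps implied by those dates can be computed directly. A small sketch, using only the months and years given above (which were quoted from memory); the day of the month is set to 1 as a placeholder.

```python
from datetime import date

# Months and years as given in the post (from memory there); the day of
# the month is a placeholder because only month and year were stated.
RELEASES = {
    "Pascal": (date(2016, 5, 1), date(2016, 11, 1)),
    "Turing": (date(2018, 9, 1), date(2019, 10, 1)),
}


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


for arch, (gpu_release, app_release) in RELEASES.items():
    print(f"{arch}: app followed {months_between(gpu_release, app_release)} months after the GPU")
```

On those dates the Pascal app took 6 months and the Turing app 13 months.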

However, the change to Wrapper introduced with the Turing app upgrade may shorten the app development cycle.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 55374 - Posted: 30 Sep 2020 | 6:24:27 UTC - in response to Message 55373.

We don't have access to any card yet.

bozz4science
Message 55378 - Posted: 30 Sep 2020 | 11:31:15 UTC

Don't know if this has a lot of relevance for those of you currently considering purchasing, or having already bought, an RTX 30xx series card, but over at F@H they have recently been rolling out CUDA support on their GPU cores and were able to drastically increase the average speed of NVIDIA cards.

They ran an analysis on the efficiency gain of the cards produced by the CUDA support and included an RTX 3080 in one of their analyses.
https://foldingathome.org/2020/09/28/foldingathome-gets-cuda-support/

I can't post pictures here, so just take a look at the third graph posted on this site. From what I can tell, the RTX 3080 achieves an improvement of 10-15% over an RTX 2080 Ti in this one particular GPU task they benchmarked the cards on. This might be interesting for benchmarking the performance here as well. Thought this might be of interest to some of you.

Ian&Steve C.
Message 55382 - Posted: 30 Sep 2020 | 13:53:15 UTC - in response to Message 55374.
Last modified: 30 Sep 2020 | 14:08:02 UTC

> We don't have access to any card yet.


don't think you really need a physical card to make the new app. just download the latest CUDA toolkit, and recompile the app with CUDA 11.1 instead of 10.

when I was compiling some apps for SETI, I did the compiling on a virtual machine that didn't even have a GPU. as long as you set the environment variables and point to the right CUDA libraries, you shouldn't have any trouble. you really can do it on whatever system you used before. just make sure you add the gencode variables sm_80 and sm_86 to your makefile so that the new cards will work.

so like this:
-gencode=arch=compute_80,code=sm_80
-gencode=arch=compute_80,code=compute_80
-gencode=arch=compute_86,code=sm_86
-gencode=arch=compute_86,code=compute_86
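
if the makefile builds its flag list from a script, the same four flags can be generated rather than typed by hand. a minimal sketch: the helper name is made up for illustration, and it assumes you want both native code (code=sm_XX) and embedded PTX (code=compute_XX) for every target, as in the list above.

```python
def gencode_flags(compute_capabilities):
    """Build nvcc -gencode flags for each target (hypothetical helper).

    For every compute capability this emits one flag for native machine
    code (code=sm_XX) and one that embeds PTX (code=compute_XX), which the
    driver can JIT-compile for GPUs newer than the ones listed.
    """
    flags = []
    for cc in compute_capabilities:
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
        flags.append(f"-gencode=arch=compute_{cc},code=compute_{cc}")
    return flags


# sm_80 and sm_86, as suggested in this post
print(" \\\n".join(gencode_flags(["80", "86"])))
```

embedding PTX for every target makes the binary bigger; keeping PTX only for the newest architecture is a common alternative.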


also give a look through the documentation for the CUDA 11.1 toolkit.

https://docs.nvidia.com/cuda/index.html
https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html

take note of this quote:
> 1.4.1.6. Improved FP32 throughput
>
> Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.



just make it a beta release, so those with an Ampere card can test it for you.

Ian&Steve C.
Message 55383 - Posted: 30 Sep 2020 | 14:24:39 UTC - in response to Message 55378.

> Don't know if this has a lot of relevance for those of you currently considering purchasing, or having already bought, an RTX 30xx series card, but over at F@H they have recently been rolling out CUDA support on their GPU cores and were able to drastically increase the average speed of NVIDIA cards.
>
> They ran an analysis on the efficiency gain of the cards produced by the CUDA support and included an RTX 3080 in one of their analyses.
> https://foldingathome.org/2020/09/28/foldingathome-gets-cuda-support/
>
> I can't post pictures here, so just take a look at the third graph posted on this site. From what I can tell, the RTX 3080 achieves an improvement of 10-15% over an RTX 2080 Ti in this one particular GPU task they benchmarked the cards on. This might be interesting for benchmarking the performance here as well. Thought this might be of interest to some of you.


i think we might see a bigger uplift in performance actually. it seems those tests might be seeing speedups from increased memory capacity and bandwidth, combined with the core speed. I think the 3080 is *only* 10-15% ahead of the 2080ti because of the memory config being limited to only 10GB of memory.

one thing that stands out to me is that their benchmarks show the Tesla V100 16GB as being significantly faster than a 2080ti, but empirical data from users with that GPU shows it to actually be slower than the 2080ti at GPUGRID. this leads me to theorize that GPUGRID doesn't utilize the memory system as much, and instead relies mostly on core power. this is further supported by my own testing, where not only is GPU VRAM minimally used (less than 1GB), but increasing the memory speed makes almost no change to crunching times.

if we can get a new app compiled for 11.1 to support the CC 8.6 cards, we may very well see a large improvement to the processing times since comparing the core power alone, the 3080 should be much stronger.

Retvari Zoltan
Message 55384 - Posted: 30 Sep 2020 | 15:22:44 UTC - in response to Message 55378.

> Don't know if this has a lot of relevance for those of you currently considering purchasing, or having already bought, an RTX 30xx series card, but over at F@H they have recently been rolling out CUDA support on their GPU cores and were able to drastically increase the average speed of NVIDIA cards.
That's great news!

> They ran an analysis on the efficiency gain of the cards produced by the CUDA support and included an RTX 3080 in one of their analyses.
> https://foldingathome.org/2020/09/28/foldingathome-gets-cuda-support/
That's the down to earth data of their real world performance I've been waiting for.

> I can't post pictures here, so just take a look at the third graph posted on this site.
I can, but it's a bit large: (sorry for that, I've linked the original picture, at least the important parts are on the left)

> From what I can tell, the RTX 3080 achieves an improvement of 10-15% over an RTX 2080 Ti in this one particular GPU task they benchmarked the cards on. This might be interesting for benchmarking the performance here as well. Thought this might be of interest to some of you.
That's the performance improvement I expected. This 10-15% performance improvement (3080 vs 2080Ti) confirms my expectations about the 1:2 ratio of usable vs advertised CUDA cores on the Ampere architecture.
The advertised numbers are actually a misunderstanding, as the number of CUDA cores is:

card       CUDA cores
RTX 3090   5248
RTX 3080   4352
RTX 3070   2944

but every CUDA core has two FP32 units in the Ampere architecture. Actually the INT32 part (existing in previous generations too) of the CUDA cores has been extended to be able to handle FP32 calculations. It seems that these "extra" FP32 units can't be utilized by the scientific applications I'm interested in.

Ian&Steve C.
Message 55386 - Posted: 30 Sep 2020 | 15:55:49 UTC - in response to Message 55384.

> but every CUDA core has two FP32 units in the Ampere architecture. Actually the INT32 part (existing in previous generations too) of the CUDA cores has been extended to be able to handle FP32 calculations. It seems that these "extra" FP32 units can't be utilized by the scientific applications I'm interested in.


this is something we can't know until GPUGRID recompiles the app for sm_86 with CUDA 11.1.

it will also depend on what kinds of operations GPUGRID is doing. if they are mostly INT, then maybe not much improvement, but if they are FP32 heavy, we can expect to see a big improvement.

it's just something that we need to wait for the app for. as I mentioned in my previous posts, you can't rely on that benchmark very strongly. there are performance inconsistencies already between what it shows and what you can see here at GPUGRID (specifically as it relates to the performance between the V100 and the 2080ti)

bozz4science
Message 55388 - Posted: 30 Sep 2020 | 16:28:31 UTC
Last modified: 30 Sep 2020 | 16:29:38 UTC

Just meant this as a pointer. Don't know what data they included in their analysis, in the sense of how many GPUs went into this data aggregation. Honestly, I am in way over my head with all this technical talk about GPU application porting, wrappers, code recompiling, CUDA libraries etc., even though I am trying to read up on it online as much as I can to get more proficient with those terms.

However, as F@H is currently arguably the largest GPU platform/collection of (at least consumer-grade) GPU cards, I would reckon that the numbers shown here are as representative as they can get, since those values are averaged over a bunch of various card makes for each gen. So I'd argue that this value is a solid benchmark. At least for 1) now, 2) without further optimisation for the Ampere architecture and 3) for this specific task at F@H. I'd suggest asking in their forums about the data that was used to derive those values and what the task requirements were, to infer valuable information about possible RTX 30xx series performance on GPUGRID.

Sigh.... I wish I had the problem of figuring out how much performance is set to be gained of a 3080 over a 2080 Ti. For the moment I am stuck with the 750 Ti (fourth last)...

And thanks for including this graph here Zoltán!

Good luck with your endeavour!

Retvari Zoltan
Message 55390 - Posted: 30 Sep 2020 | 17:54:22 UTC - in response to Message 55386.
Last modified: 30 Sep 2020 | 18:02:42 UTC

> this is something we can't know until GPUGRID recompiles the app for sm_86 with CUDA 11.1.
That's why I started that sentence by "It seems..."

> it will also depend on what kinds of operations GPUGRID is doing. if they are mostly INT, then maybe not much improvement, but if they are FP32 heavy, we can expect to see a big improvement.
It's the latter (FP32 heavy). However it's not the type of the data processed that decides if it can be processed simultaneously by two FP32 units within the same CUDA core, but the relation of the input-output data of that process. This relation is more complex than what I can comprehend (as of yet).

> it's just something that we need to wait for the app for. as I mentioned in my previous posts, you can't rely on that benchmark very strongly.
The Folding@home app and the GPUGrid app are very similar regarding the data they process and the way they process it. Therefore I think this benchmark is the most accurate estimation we can have for now regarding the performance of the next GPUGrid app on Ampere cards. In other words: it would be a miracle if these science apps could be optimized to utilize all of the extra FP32 units.

> there are performance inconsistencies already between what it shows and what you can see here at GPUGRID (specifically as it relates to the performance between the V100 and the 2080ti)
The largest "inconsistency" is that the RTX 3080 should have ~1500ns/day performance if the 8704 FP32 units could be utilized by the present FAH core22 v0.0.13 app. I assume that the RTX 3080 cards run nearly at their power limit already. Processing data on the extra FP32 units would take extra energy; however the data flow would be closer to optimal in this case, so it would lower the energy consumption at the same time. The ratio of these decides the actual performance gain of an optimized app, but -- based on my previous experience -- an FP32 operation is more energy hungry than loading/storing the result. So, being optimistic on that topic, I expect another 10-15% performance boost at best from an Ampere-optimized app.

Ian&Steve C.
Message 55391 - Posted: 30 Sep 2020 | 18:00:01 UTC - in response to Message 55390.

was the F@h app compiled with SM_86? I couldn't find that info. as referenced in my previous comment from the nvidia documentation, that might be needed to get the full benefit from the double FP32 throughput.

which is why I hope the app update includes the SM_86 gencode.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 55395 - Posted: 30 Sep 2020 | 20:13:53 UTC

Toni, what about that suggestion of a "blind compile" for a beta app? Maybe the gambling pays off and it works right away or after minimal debugging?

Very good point about the recompile. If the compiler does not try to use the extra FP32 units, then it's clear they won't be used. From the data they have it looks like the current F@H app was not compiled for Ampere, even though their scaling with GPU power seems way weaker than here at GPU-Grid. Maybe they have more sequential code.

Generally I'm not as pessimistic as Zoltan about using those extra FP32 units. In no review have I read anything about restrictions on when they can be used (apart from the compiler). It's nothing like the superscalar units in Kepler and smaller Fermis, which were really hard to use due to being limited to thread level parallelism.
However, there's an obvious catch: any other operation that the previous Turing SM could issue alongside FP32 (INT32, load, store, whatever) is also still going to be needed. And when these execute, the SM is already at maximum throughput and nothing can be gained from the additional FP32 units. From this alone a speedup in the range of 30 - 60% may be expected, but certainly not 100%. I'm just pulling those numbers from my gut and from random pieces of information I have seen here and there.
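
The catch can be put into a toy model. This is only a sketch under stated assumptions, not a measurement: it takes the "16 FP32 + 16 FP32-or-INT32 lanes per SM partition" description given earlier in this thread, lumps all non-FP32 work into a single INT32 share, and ignores memory, load/store, scheduling and power limits.

```python
# Toy per-partition issue model. ASSUMPTIONS (not measurements):
#   Turing: 16 FP32 lanes + 16 INT32 lanes per SM partition per clock.
#   Ampere: 16 FP32 lanes + 16 lanes that do either FP32 or INT32.
# Memory, load/store, scheduling and power limits are all ignored.
LANES = 16


def turing_time(int_fraction: float) -> float:
    """Clocks per instruction when FP32 and INT32 run on separate lanes."""
    fp_fraction = 1.0 - int_fraction
    return max(fp_fraction, int_fraction) / LANES


def ampere_time(int_fraction: float) -> float:
    """Clocks per instruction when the second datapath is shared."""
    # INT32 can only use the shared lanes; total issue is 2 * LANES per clock.
    return max(1.0 / (2 * LANES), int_fraction / LANES)


def ampere_speedup(int_fraction: float) -> float:
    """Ampere throughput relative to Turing at the same clock and SM count."""
    return turing_time(int_fraction) / ampere_time(int_fraction)


for f in (0.0, 0.25, 0.5):
    print(f"INT32 share {f:.0%}: {ampere_speedup(f):.2f}x over Turing")
```

In this model a pure FP32 stream doubles, a stream with a quarter INT32 gains 50%, and an even FP32/INT32 mix gains nothing, which brackets the 30 - 60% guess.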

MrS
____________
Scanning for our furry friends since Jan 2002

Keith Myers
Message 55396 - Posted: 30 Sep 2020 | 20:59:16 UTC

I think the existing app source code can just be recompiled with the new gencode/arch parameters to get the app working like it already does with previous generations.

To get the app working and use the extra FP32 pipeline, I think the app will need to be rewritten to add the PTX ISA JIT compiler that is in the new 11.1 CUDA toolkit. That seems to be the key component to parallelize the code to get the extra FP32 datapath working on compute.

But just with the single FP32 datapath in play, we should see some pretty good enhancement just from the architecture changes.

Keith Myers
Message 55410 - Posted: 1 Oct 2020 | 18:25:49 UTC - in response to Message 55391.

was the F@h app compiled with SM_86? I couldn't find that info. as referenced in my previous comment from the nvidia documentation, that might be needed to get the full benefit from the double FP32 throughput.

which is why I hope the app update includes the SM_86 gencode.

What I can see of the finished PPSieve work done by the 3080, the app is still the same one introduced in 2019.

So well before the release of CUDA 11.1 or the Ampere architecture.

I have no idea why it can run with the older application.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 55411 - Posted: 1 Oct 2020 | 19:45:52 UTC - in response to Message 55410.

I have no idea why it can run with the older application.

If the app is coded "high level enough" I could see this work out. If nothing hardware or generation specific is targeted, all you do is to tell the GPU "Go forth and multiply (this and that data)!".
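One mechanism that would explain this, offered as a sketch and not as a statement about how the PPSieve app was actually built: a CUDA binary can carry PTX (intermediate code) next to the precompiled machine code, and the driver can JIT-compile that PTX for a GPU that did not exist when the app was released. NVIDIA's documented rule is that machine code built for compute capability X.y only runs on devices X.z with z >= y, while PTX built for a given compute capability can be compiled for any equal or newer one. The function name and the example build contents below are hypothetical.

```python
def can_run(device_cc, cubin_targets, ptx_targets):
    """Return how (if at all) a CUDA binary could run on a device.

    device_cc     -- (major, minor) compute capability of the GPU
    cubin_targets -- compute capabilities with precompiled machine code
    ptx_targets   -- compute capabilities with embedded PTX
    """
    major, minor = device_cc
    # Machine code only runs within the same major version,
    # on the same or a newer minor version.
    if any(m == major and n <= minor for m, n in cubin_targets):
        return "native"
    # PTX can be JIT-compiled by the driver for any equal or newer capability.
    if any((m, n) <= (major, minor) for m, n in ptx_targets):
        return "jit"
    return None


RTX_3080 = (8, 6)  # Ampere GA102 is compute capability 8.6

# Hypothetical 2019 build: Turing machine code plus Turing PTX.
print(can_run(RTX_3080, [(7, 5)], [(7, 5)]))
# Hypothetical build that ships machine code only, without PTX.
print(can_run(RTX_3080, [(6, 1), (7, 5)], []))
```

Under these assumptions the first build would still start on a 3080 through the JIT path, while the second would not run at all.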

MrS
____________
Scanning for our furry friends since Jan 2002

Keith Myers
Message 55412 - Posted: 1 Oct 2020 | 23:32:14 UTC

I guess so. They must not have encoded a minimum or maximum compute level that can be used, unlike the app here, which looks for specific bounds and currently finds the Ampere cards out of range.

Aurum
Message 55420 - Posted: 4 Oct 2020 | 11:43:54 UTC

At 320 Watts the PCIe cables and connectors will melt. I make my own PSU cables with a thicker gauge wire but those chintzy plastic PCIe connectors can't handle the heat. If you smell burnt electrical insulation or plastic that's the first place to check.

Keith Myers
Message 55423 - Posted: 4 Oct 2020 | 16:40:27 UTC

A standard 8 pin PCIE cable can provide 150W. All the cards have either two or three 8 pin connectors which provide 300 or 450 watts PLUS the 75 watts from the PCIE slot. The founders edition cards have that new 12 pin connector which is just a duplicate capacity of two 8 pin connectors. Only six wires have 12V on them, the rest are grounds.
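The arithmetic above can be written out as a short sketch. The 150 W and 75 W ratings and the 320 W card power are the figures quoted in this thread; the function name is just for illustration.

```python
PCIE_SLOT_W = 75    # power available from the PCIe slot
EIGHT_PIN_W = 150   # rating of one 8 pin PCIe power connector


def board_power_budget(eight_pin_connectors, slot=True):
    """In-spec power available to a card with the given number of 8 pin plugs."""
    return eight_pin_connectors * EIGHT_PIN_W + (PCIE_SLOT_W if slot else 0)


CARD_POWER_W = 320  # the 3080 figure discussed above

for n in (2, 3):
    budget = board_power_budget(n)
    print(f"{n} x 8 pin + slot: {budget} W, "
          f"headroom over {CARD_POWER_W} W: {budget - CARD_POWER_W} W")
```

So even a two-connector card has 375 W available in spec, which leaves 55 W of headroom over a 320 W load.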

The plastic connector housing has nothing to do with current capability. It is the wires and pins that provide that.

ServicEnginIC
Message 55424 - Posted: 4 Oct 2020 | 19:32:33 UTC - in response to Message 55423.
Last modified: 4 Oct 2020 | 19:43:31 UTC

Pulling the thread from the link in this eXaPower post...
I arrived at this other interesting article about the new Nvidia 12 pin power connector.
As Keith Myers remarks, it isn't a mere evolution of the current 6/8 pin connectors, but a completely new specification.
This new specification covers the connector's shape, construction materials, wire and pin gauges, and electrical ratings.
On paper, it might handle as much as 600 Watts...
To the general public: Please, don't test it with a bent paperclip! ;-)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 55425 - Posted: 4 Oct 2020 | 21:51:00 UTC - in response to Message 55420.

At 320 Watts the PCIe cables and connectors will melt.

Have you checked the launch reviews for that? And asked Google about forum posts from affected users?
Tip: don't spend too much time searching.

MrS
____________
Scanning for our furry friends since Jan 2002

Retvari Zoltan
Message 55426 - Posted: 4 Oct 2020 | 22:25:03 UTC - in response to Message 55424.
Last modified: 4 Oct 2020 | 22:40:18 UTC

I arrived at this other interesting article about the new Nvidia 12 pin power connector.
As Keith Myers remarks, it isn't a mere evolution of the current 6/8 pin connectors, but a completely new specification.
This new specification covers the connector's shape, construction materials, wire and pin gauges, and electrical ratings.
On paper, it might handle as much as 600 Watts...
According to that paper, it has 80 µm tin plated high copper alloy terminals.
Seriously, tin plated? It's not better than the present PCIe power connectors, it's just smaller.
The SATA power connector has 15 gold plated pins:
3x +3.3V
3x GND
3x +5V
3x GND
3x +12V
It typically has to deliver under 1A on each rail (yes, it's distributed between the 3 terminals that rail has).
Now the "revolutionary" 12-pin connector has to deliver 600 Watts: 600W/12V = 50 amperes over the 6 12V terminals, that's 8.33 amperes on each terminal. If I take the typical load of 300W, it still has to deliver 4.17 amperes on each tin plated terminal.
Now, tin has 12 times higher contact resistance than gold: (note that the axes are logarithmic)

12 times higher contact resistance * (12 times higher current)^2 =
1728 times higher power converted to heat (compared to the SATA power connector), since P = I^2 * R
After a year of crunching 24/7, these connectors will burn like the good old PCIe power connectors did.
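The calculation above can be reproduced with a few lines. This is only a sketch of the ratios: the factor of 12 for tin versus gold contact resistance and the "under 1 A per rail" SATA load are taken from the post, no absolute contact resistance is known here, and the function names are made up for illustration.

```python
def amps_per_terminal(watts, volts=12.0, terminals=6):
    """Current through each 12 V terminal if the load is shared evenly."""
    return watts / volts / terminals


def relative_heating(resistance_ratio, current_ratio):
    """Ratio of contact heating between two connectors, from P = I^2 * R."""
    return resistance_ratio * current_ratio ** 2


SATA_AMPS = 1.0 / 3   # assumed: 1 A on a rail, spread over its 3 terminals
TIN_VS_GOLD = 12      # contact resistance ratio quoted in the post

for watts in (300, 600):
    amps = amps_per_terminal(watts)
    ratio = amps / SATA_AMPS
    print(f"{watts} W: {amps:.2f} A per terminal, "
          f"{ratio:.1f}x the SATA current, "
          f"{relative_heating(TIN_VS_GOLD, ratio):.0f}x the contact heating")
```

With the rounded factor of 12 for both ratios, `relative_heating(12, 12)` gives the 1728 quoted above. Note that this is a ratio against a lightly loaded SATA terminal, not an absolute temperature.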

Ian&Steve C.
Message 55427 - Posted: 5 Oct 2020 | 0:21:55 UTC - in response to Message 55426.

They carry a higher power rating because the base wiring spec calls for thicker wire (a lower AWG number) than the old cables.

The connector gets hot as a consequence of thin wiring getting hot.

I’ve been crunching a long time with many high power GPUs and never had a cable or connector melt.

The 12-pin connector (or 2x 6-pin, or 2x 8-pin) is totally sufficient if you’re using a modern PSU that’s already built over spec with 16ga wire. if you’re using some cheap PSU with barely spec wiring, then yeah you might have issues. But most PSUs from a reputable brand like EVGA or Corsair or Seasonic, you’ll never have an issue unless you do something stupid like using cheap SATA or MOLEX to PCIe adapters


____________

Keith Myers
Message 55428 - Posted: 5 Oct 2020 | 0:38:15 UTC

Yes, the new 12 pin connector is using 16 ga. wire which reduces the circuit resistance and has higher ampacity. The main thing to avoid is heating either the wire or the pins up as that can cause a runaway thermal condition which just increases the contact resistance which increases the heating and it just keeps building until failure.

Just moving from cheap 20 ga wire in typical power supply cables to the new 16 ga wire gains you an increase of 2X capacity. From 11A for 20 ga. to 22 A for 16 ga.
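As a rough cross-check of why the thicker wire helps, here is a sketch using the standard AWG diameter formula and an approximate resistivity for copper. The 11 A and 22 A ampacity figures above come from rating tables and are not derived here; the code only compares wire geometry and resistance, and the function names are hypothetical.

```python
import math

COPPER_RESISTIVITY = 1.72e-8   # ohm * m, approximate value at 20 C


def awg_diameter_mm(gauge):
    """Conductor diameter in mm from the AWG definition."""
    return 0.127 * 92 ** ((36 - gauge) / 39)


def resistance_mohm_per_m(gauge):
    """Resistance of a solid copper conductor in milliohm per metre."""
    radius_m = awg_diameter_mm(gauge) / 2 / 1000
    area_m2 = math.pi * radius_m ** 2
    return COPPER_RESISTIVITY / area_m2 * 1000


for gauge in (20, 16):
    print(f"{gauge} AWG: {awg_diameter_mm(gauge):.2f} mm, "
          f"{resistance_mohm_per_m(gauge):.1f} mOhm/m")

ratio = resistance_mohm_per_m(20) / resistance_mohm_per_m(16)
print(f"20 AWG has about {ratio:.1f}x the resistance of 16 AWG")
```

At the same current, the 16 ga wire therefore dissipates roughly 2.5 times less heat per metre than the 20 ga wire, which is consistent with its higher ampacity rating.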

I agree it would have been better to have silver, silver-cadmium or gold plating but contact mechanical grip also is a big factor. The more surface area in contact with the mating connector and the higher the gripping and insertion force there is, the lower the overall contact resistance and the lower amount of heating due to current transmission.

ServicEnginIC
Message 55429 - Posted: 5 Oct 2020 | 11:29:18 UTC

American Conductor Stranding - AWG Table

Aurum
Message 55430 - Posted: 5 Oct 2020 | 20:49:44 UTC
Last modified: 5 Oct 2020 | 20:59:17 UTC

I've gone through hundreds of GPUs in the 20 years I've been doing this and I've had a fair number of PCIe cables burn up at both the PSU end and the GPU end. The biggest problem is due to the wimpy crimps. Sometimes you can use an extraction tool and remove the terminal pin from the plastic connector and the terminal pin is just flopping around on the wire. Heats up like Keith described and the plastic melts. I found examples that I hadn't thrown in the trash yet. I'll photograph them and post later.

Also all my PSUs are Corsair AX1200s with a few AX860s and TX750s. I got the AX1200s back when I thought it was a good idea to run 4 GPUs on a single motherboard. I now use only one or two and they must be on 16x 3.0 sockets. I believe PSUs are most efficient at 80% load so I'm under loading them now but I use what I have.

When I make cables this is the 16 gauge wire I use: https://mainframecustom.com/shop/cable-sleeving/lc-custom-16awg-cable-sleeving-wire-black-25ft/

This is my crimping tool: https://mainframecustom.com/shop/cable-sleeving/cable-sleeving-tools/mc-ratchet-crimper/

My terminal extraction tool: https://mainframecustom.com/shop/cable-sleeving/mainframe-customs-terminal-extractor/

I don't use sleeving as it seems it would just trap heat. Just an inch of heat shrink tube near each connector to bunch the wires together.

Anyone remember the Sapphire Radeon HD 5970??? It was notorious for frying cables. http://www.vgamuseum.info/media/k2/items/cache/8dd1b15b07ee959fff716f219ff43fb5_XL.jpg

Thanks for the links, I'll read those articles. I'm curious if one needs a new model PSU to use the 12-pin connector.

Ian&Steve C.
Message 55433 - Posted: 6 Oct 2020 | 0:16:54 UTC - in response to Message 55430.

I got the AX1200s back when I thought it was a good idea to run 4 GPUs on a single motherboard. I now use only one or two and they must be on 16x 3.0 sockets. I believe PSUs are most efficient at 80% load so I'm under loading them now but I use what I have.


I run upwards of 10 GPUs on the same board. no problems if you know what you're doing and plan it out correctly. 4 is a cakewalk. the biggest thing to be aware of is that if you are using 3 or more high power GPUs on the same board, and they all get slot power from the motherboard, you need to make sure the motherboard has a dedicated power connection for the PCIe power. if you dont have that or dont plug it in, that's how you burn out the 24-pin 12v lines.

according to my PCIe bandwidth testing, GPUGRID (at current time) performs best on a PCIe 3.0 x8 or larger link. x16 isn't absolutely necessary. There's zero slowdown going from x16 to x8 on gen3.
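For reference, the raw link bandwidth behind these numbers can be worked out from the per-lane transfer rate and the line encoding of each PCIe generation. This is a sketch of theoretical one-direction bandwidth only; it says nothing about how much of it GPUGRID actually uses, which is what the testing above measured. The table and function name are made up for illustration.

```python
# generation -> (transfer rate in GT/s per lane, encoding efficiency)
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_gbs(gen, lanes):
    """Theoretical bandwidth in GB/s, per direction, before protocol overhead."""
    rate_gt, efficiency = PCIE_GENERATIONS[gen]
    return rate_gt * efficiency * lanes / 8   # 8 bits per byte


for gen, lanes in ((3, 16), (3, 8), (3, 4), (4, 8)):
    print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gbs(gen, lanes):.2f} GB/s")
```

A gen3 x8 link offers a little under 8 GB/s per direction, the same as a gen4 x4 link and half of gen3 x16.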

my previous 10-GPU (10x RTX 2070 pic: https://i.imgur.com/7AFwtEH.jpg?1) system ran on a board with 10x PCIe 3.0 x8 links (Supermicro X9DRX+-F). I ran this system for well over a year 24/7 mostly on SETI and moved to GPUGRID after it shut down. not a melted connector in sight. it was even getting slot power to each GPU from the motherboard (with the help of some power connections added via risers). x8 lanes to each GPU. this system ran on a Corsair 1000W PSU which powered the motherboard/CPUs, and 2x 1200W HP server PSUs providing all of the power connections to the GPUs (each PSU powering 5x GPUs). this system has since been converted to a 5x 2080ti system on an AMD epyc platform with CPU PCIe lanes to every slot (four x16 slots, three x8 slots), and every slot can be bifurcated to x4 or x8. i'll probably expand it to 7x RTX 2080ti and leave it there (until I decide to update to 3080 or 3070 cards)

I have another system (currently offline, pics: https://imgur.com/a/ZEQWSlw) with 7x RTX 2080, running at 200W each. all from a single EVGA 1600W PSU. it was running for months before I turned it off for some renovations in the room where the computer was running. again no melted connectors. even using single pigtail leads to each GPU (8-pin + 6-pin single cable). this is on an old X79 board with only 40 CPU based PCIe lanes, but the board uses several PCIe PLX switches to kinda sorta give more lanes. it works with minimal slowdown. ASUS P9X79E-WS mb.

my 3rd system is currently 7x RTX 2070 (pic: https://i.imgur.com/136DaqP.jpg), again on the same AMD Epyc platform (AsrockRack EPYCD8 mb) as the 5x 2080ti system, but here all 7 slots are filled. these particular RTX 2070 cards run a single 8-pin power plug, with the cards limited to 150W. 3x GPUs are powered by the 1200W PCP&C PSU, and 4x GPUs are powered from a 1200W HP server PSU. I'm actually going to add an 8th card to this system by breaking one of the x16 slots to x8x8 and putting two cards in one slot. I don't expect any issues. i have an x8x8 bifurcation riser on the way and the board fully supports slot bifurcation.

if you plan it right, it works. this is much more energy efficient and simpler to manage not having 10 separate systems each with their own PSU(s) and CPU/mem sucking up power. you could go the USB riser "mining" route if GPUGRID wasnt so reliant on PCIe bandwidth. but "c’est la vie", and the wiring is cleaner with the ribbon cables anyway ;)







____________

Keith Myers
Message 55434 - Posted: 6 Oct 2020 | 0:26:08 UTC

Still using Corsair AX1200 power supplies running 3 gpus on two hosts. It is a good supply.

Never ran AMD cards so far. The Nvidia cards I have owned have been moderate power consumers mainly. Never overheated a cable or connector yet.

Good for you making your own cables. You are using good tools and wire it seems.
I have all kinds of crimping tools myself. I build custom cables for telescope mounts.

I believe you that you have examples of burnt wires and connectors. I have seen many posted examples of burnt connectors myself. I have always used good cables and supplies and kept the loads within reason.

I run two hosts with 4 gpus each on EVGA 1600T2 power supplies pulling over 1100W with no issues for years now.

Lots of quality modular power supply makers are providing an upgrade 12 pin Nvidia cable, either for free or for a modest charge.

eXaPower
Message 55435 - Posted: 6 Oct 2020 | 1:16:37 UTC
Last modified: 6 Oct 2020 | 1:20:02 UTC

I never had a burnt 8 pin or 6 pin, but mechanical stuff always dies: pumps or fans, plus the GPU die itself. Like I said, I had 4 Turing cards out of 6 go cold. Overclocking the core shouldn't kill them all. Other generations didn't die so easily.

I'd rather run a motherboard with 5 GPUs or more. The mb lasted 5 years with 5 GPUs. The unexpected happens no matter how good the PSUs you run. I have two 1200W PSUs on one motherboard. Maybe the GPUs don't like Win 8.1. The Turing cards that died should have lasted, given that they are sold with a 3 year warranty. Turing GPU lifespan is a lottery. Sometimes it is better to wait for the next generation (or at least a few weeks past the first release). The Ampere release is a mess, the worst in a while. Good luck buying one today.

Keith Myers
Message 55436 - Posted: 6 Oct 2020 | 1:29:40 UTC - in response to Message 55435.

I have had only one gpu electrical death in probably 30 cards or so in ten years. But the pumps in the hybrid gpus die fairly regularly.

Nine Turing cards in use and running well so far.

Not too annoyed at the Ampere availability as they don't work here yet anyway.

Once there is an app that will run Ampere properly I might start getting a little antsy.

eXaPower
Message 55548 - Posted: 11 Oct 2020 | 22:19:07 UTC

https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx3080-compute&num=1

https://www.phoronix.com/scan.php?page=article&item=blender-290-rtx3080&num=1

The 3080 is faster than the 2080ti in various benchmarks, but at higher power.

rod4x4
Message 55549 - Posted: 11 Oct 2020 | 23:58:35 UTC - in response to Message 55548.

https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx3080-compute&num=1

https://www.phoronix.com/scan.php?page=article&item=blender-290-rtx3080&num=1

The 3080 is faster than the 2080ti in various benchmarks, but at higher power.


Plenty of good information on the above links. The graph that really stood out for me was this (courtesy of Phoronix):
https://openbenchmarking.org/embed.php?i=2010061-PTS-GPUCOMPU18&sha=d2fb7b5c9b53&p=2
It shows that if you pump plenty of power into a gpu, you get results, but at the expense of efficiency. The graph highlights that the rtx3080 card is not very efficient.
Let's hope we see more efficient cards from Ampere in the coming months.

Keith Myers
Message 55550 - Posted: 12 Oct 2020 | 0:33:57 UTC

Yes, not very efficient at all. On Windows at least with the standard gpu control utilities you can shove the power usage down and cut the boost clocks and save some power.

Retvari Zoltan
Message 55552 - Posted: 12 Oct 2020 | 13:39:46 UTC - in response to Message 55550.

Yes, not very efficient at all. On Windows at least with the standard gpu control utilities you can shove the power usage down and cut the boost clocks and save some power.
You can do it under Linux as well. You know that.

Judging by the relevant benchmarks:




For MD simulations there's no point in upgrading from the RTX 2080Ti to the RTX 3080.
You get more performance in direct proportion to the power consumption.

Keith Myers
Message 55556 - Posted: 12 Oct 2020 | 16:23:37 UTC

I do know that. I was being terse and the main advantage of Windows utilities is the ability to undervolt.

You can undervolt and save a lot of power and not have any appreciable drop in performance or increase in crunching times.

Ian&Steve C.
Message 55565 - Posted: 12 Oct 2020 | 18:33:09 UTC - in response to Message 55552.

Yes, not very efficient at all. On Windows at least with the standard gpu control utilities you can shove the power usage down and cut the boost clocks and save some power.
You can do it under Linux as well. You know that.

Judging by the relevant benchmarks:

...snip...

For MD simulations there's no point in upgrading from the RTX 2080Ti to the RTX 3080.
You get more performance in direct proportion to the power consumption.


these aren't entirely relevant since FAH bench is an OpenCL benchmark and the apps here use CUDA.

there is a bit of overhead in the conversion from OpenCL to CUDA, which is why CUDA runs faster on nvidia GPUs. FAH bench might be doing the same kinds of research and underlying calculations, but since it's not using CUDA it's not directly comparable to the CUDA apps here.

a properly made CUDA 11.1 app should see significant performance increases, leveraging the double FP32 core architecture. the OpenCL apps might not be able to do the same.

i still believe that when such an app comes, the 3080 WILL be better "per watt" than the 2080ti.

____________

Retvari Zoltan
Message 55580 - Posted: 12 Oct 2020 | 22:32:49 UTC - in response to Message 55565.

i still believe that when such an app comes, the 3080 WILL be better "per watt" than the 2080ti.
I don't doubt that you believe in it. I believe in it to some (much lower) extent as well.
However, the extra FP32 cores are aimed at rasterization and/or raytracing tasks, as their data set can easily be processed this way, while the data set of molecular dynamics simulations (most likely) cannot.
Therefore I strongly suggest that no-one should invest in RTX 2080Ti -> RTX 3080 upgrades (for crunching) before we can do a reality check on every detail (with a CUDA 11 app), as all that is revealed so far says that the costs of such an upgrade won't return in the form of lower electricity bills or higher performance at the same running costs.
(If someone is interested, I can post my arguments (again...) but it's rather TLDR, plus I don't like to repeat myself.)

Keith Myers
Message 55581 - Posted: 12 Oct 2020 | 22:50:40 UTC - in response to Message 55580.

From my reading of the tech deep-dive at AnandTech on the Ampere arch, the extra *new* FP32 datapath is just another generic FP32 datapath like the original in Turing.

It has nothing to do with rasterization or Tensor cores. The dual purpose INT32/FP32 datapath can do EITHER integer or floating point calcs. Not both at the same time.

And since I believe we don't do INT32 calcs for GPUGrid, we are never going to use the INT32 function of that datapath.

Rasterization on a gpu involves integer instructions. But we are not blitting pixels to a screen, so nothing to worry about. That statement is valid only for a card not driving a display. If you only have a single card that both drives the display and crunches, that datapath will be busy driving pixels.

However, if you are running other projects on the card, Einstein for example, I believe its tasks do a considerable amount of INT32 calcs at the beginning of the task. So trying to run both projects at the same time should hamper the performance of GPUGrid work.

Ian&Steve C.
Message 55582 - Posted: 12 Oct 2020 | 23:22:22 UTC - in response to Message 55580.
Last modified: 12 Oct 2020 | 23:49:03 UTC

FP32 is FP32. Doesn’t matter what they are used for. They can be used for that operation.

Ampere has twice the number of FP32 cores as Turing. And if all you’re doing is FP32, all can be used at the same time. That’s where the 2x performance figures come from. If the workload is primarily FP32 operations, then you can expect to see major speed improvements. This has been widely reported.

Half of the FP32 cores can be used every cycle (the same number as Turing) . Half of them are “extra” added in a separate data pipeline. These are the ones that are split between FP32/INT32. The only way you won’t be able to use them, is if your workload is using a significant percentage of INT32 operations, and/or your application isn’t compiled in a way to take advantage of the additional pipeline.
____________

Retvari Zoltan
Message 55611 - Posted: 16 Oct 2020 | 15:04:51 UTC - in response to Message 55582.
Last modified: 16 Oct 2020 | 15:06:44 UTC

FP32 is FP32. Doesn’t matter what they are used for. They can be used for that operation.
Ampere has twice the number of FP32 cores as Turing. And if all you’re doing is FP32, all can be used at the same time.
Provided that they are working on the same piece of the data.

That’s where the 2x performance figures come from. If the workload is primarily FP32 operations, then you can expect to see major speed improvements. This has been widely reported.
I would add: in the case of some special workloads.

Half of the FP32 cores can be used every cycle (the same number as Turing).
Half of them are “extra” added in a separate data pipeline.
Actually quite the opposite: the "extra" FP32 cores are exactly in the same data pipeline, as the "original" FP32 cores.
The "original" FP32 core and the "extra" FP32 core reside within the same CUDA core. Just like the FP32/INT32 in Turing: the same core can't do an FP32 and an INT32 operation simultaneously, unless they operate on the same piece of data, but it's highly unlikely that such a "combo" (FP32+INT32) operation is ever needed. But in some cases two simultaneous FP32 operations on the same piece of data could be useful.

Ian&Steve C.
Message 55612 - Posted: 16 Oct 2020 | 15:29:32 UTC - in response to Message 55611.
Last modified: 16 Oct 2020 | 15:57:51 UTC

Sorry to say, but you are misinformed or have misinterpreted the specs. The 2x FP32 is the new data path.

FP32+INT32 performance is the same as Turing

FP32+FP32 is where there is increased performance with Ampere. Pure FP32 loads will see the most benefit.



https://www.tomshardware.com/amp/features/nvidia-ampere-architecture-deep-dive

You seem to be misinformed on Turing operation also. Turing absolutely can do FP32+INT32 simultaneously. That was the big feature of the Turing architecture when it was released.

https://developer.nvidia.com/blog/nvidia-turing-architecture-in-depth/

The Turing architecture features a new SM design that incorporates many of the features introduced in our Volta GV100 SM architecture. Two SMs are included per TPC, and each SM has a total of 64 FP32 Cores and 64 INT32 Cores. In comparison, the Pascal GP10x GPUs have one SM per TPC and 128 FP32 Cores per SM. The Turing SM supports concurrent execution of FP32 and INT32 operations




Nvidia realized that having an entire path dedicated to INT32 was not necessary in most cases since INT operations were only needed about 30% of the time. So they combined FP32/INT into its own path, and added a dedicated FP32 path to get more out of it.
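A toy model of this argument, as a sketch under simplifying assumptions: every instruction takes one issue slot, Turing has one FP32 path and one INT32 path, Ampere has one FP32-only path plus one path that takes either FP32 or INT32, and loads, stores, memory bandwidth and scheduling limits are all ignored. The function names are hypothetical.

```python
def turing_time(fp, integer):
    """Separate FP32 and INT32 paths run concurrently."""
    return max(fp, integer)


def ampere_time(fp, integer):
    """Path A takes FP32 only; path B takes FP32 or INT32.

    INT32 work must go through path B, and the FP32 work is
    balanced across both paths as far as possible.
    """
    return max(integer, (fp + integer) / 2)


def speedup(int_fraction):
    """Ampere vs. Turing throughput per SM for a given INT32 share."""
    fp, integer = 1.0 - int_fraction, int_fraction
    return turing_time(fp, integer) / ampere_time(fp, integer)


for share in (0.0, 0.1, 0.2, 0.3, 0.5):
    print(f"INT32 share {share:.0%}: speedup {speedup(share):.2f}x")
```

In this model a pure FP32 workload doubles its throughput, a workload with a 30% INT32 share gains 1.4x, which falls inside the 30 - 60% range guessed earlier in the thread, and at 50% INT32 there is no gain at all.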
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 55613 - Posted: 16 Oct 2020 | 18:46:12 UTC

I think Zoltan is a bit vague, that's why I find his point hard to understand. What I think he means: let's ignore the INT32 for a moment and only focus on the additional FP32 units. Which data can they work on?

1) Do they work on the same data as the 1st set of FP32 units? If so, this would severely limit their usefulness to special code containing consecutive independent FP32 instructions.

2) Or do they work on additional data, making them universally usable?

By data I'm referring to a "warp" here, a group of 32 pixels or numbers which move together through the GPU pipeline and onto which the same instructions are executed.

If case 1) was true we'd have a superscalar architecture, like Fermi (apart from the biggest chips) and Kepler. There nVidia realized that the utilization of the additional superscalar units was so bad that it actually wasn't worth it, and fixed this in Maxwell.
In Kepler each SM had 4 "data paths" for 4 warps, each with 32 "pixels". These can make good use of 128 FP32 units. In addition there were 50% more / 64 superscalar FP32 units, which could be used in case any of the 4 warps contained such independent instructions.
Going to Maxwell nVidia changed this to just 4 warps and 128 FP32 per SM. Performance per SM dropped by ~10%, but they were able to pack more of them into the same chip area, more than offsetting this loss.

I don't think they went back to a superscalar design. Instead I think the total number of warps through an SM has doubled from Turing to Ampere. This also matches the doubled L1 bandwidth to feed this beast. Without double the throughput per SM this wouldn't make sense.

MrS
____________
Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 55643 - Posted: 27 Oct 2020 | 20:38:57 UTC

RTX3070 is here and should appear in shops in 2 days. Availability could be better than for the bigger cards, but demand will probably be very high.

MrS
____________
Scanning for our furry friends since Jan 2002

Keith Myers
Message 55644 - Posted: 28 Oct 2020 | 0:10:36 UTC

Waiting for the compute results for the cards. The RTX 3070 would be the right card to compare against the RTX 2080 Ti for compute performance, since the wattages are comparable.

Ian&Steve C.
Message 55645 - Posted: 28 Oct 2020 | 15:34:43 UTC

we need a proper CUDA 11.1+ app from GPUGRID before we can compare anything
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 55648 - Posted: 29 Oct 2020 | 9:38:50 UTC - in response to Message 55645.

Sure. I was just mentioning the new release as it may make it easier for Toni to get access to a card.

MrS
____________
Scanning for our furry friends since Jan 2002

eXaPower
Message 55649 - Posted: 29 Oct 2020 | 15:22:44 UTC

Well, I tried and failed to get my hands on Ampere cards once again.
I wanted to buy 2. No thanks, I am not going to pay $1000 or $1200 to an eBay scalper for a 3070 or 3080.

If you haven't heard: Nvidia is saying there will be no ample 3080/3090 supply until 2021.

The Rtx 3070 was sold out instantly this morning at microcenter store in Cambridge MA.
(Same place I purchased a 2080ti openbox Asus rog cod edition for under 800$.)

Amazon and newegg are sold out too. Better luck next time. I'd like to test 3080 and 3070 out against the 2080ti.

Unbelievable how much demand there is for Ampere.
I arrived 30 mins before opening. Crazy how people waited overnight.

Funny thing is, among all the creators and gamers only a few had heard of the BOINC platform.
Many knew about Folding@home and of course coin mining. Sadly, none had heard of GPUGRID.

All this hoopla reminds me of old apple phone releases.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 55650 - Posted: 29 Oct 2020 | 20:16:51 UTC - in response to Message 55649.
Last modified: 29 Oct 2020 | 20:20:06 UTC

you have to go in-person (and usually at least a day in advance, waiting in line like black friday) to get the cards from Microcenter. you can't buy 30-series online.

one of the better methods for online shopping (in the US) is the queue system at EVGA's website. that's what I did this morning. they had a launch-day specific sku with a free t-shirt on their 3070 Black model for $499. they only had a limited number from what I can tell since you could only be added to the queue for about 5 minutes from about 6:05-6:10am PST.

I'm in line for both the launch-day 3070 Black and the normal 3070 Black. I expect to get my email to buy the launch day card either today or tomorrow. You can still get in the queue for the 3070 or 3080, or 3090, but just know that you'll have thousands ahead of you at this point so you might be waiting a long time before you are able to actually buy it.

don't even bother trying to buy from Best Buy or Newegg. they don't have any bot protections like EVGA does. most of the cards sold there this morning went to the bots.

but even if you have an Ampere card in hand, you will be unable to test on GPUGRID until the admins/devs release a new application that supports the new cards. A few people have tried already and the tasks just error out immediately because the current app doesn't support them.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 55651 - Posted: 30 Oct 2020 | 0:53:38 UTC - in response to Message 55650.

well. my email came, order placed.

EVGA RTX 3070 Black on the way :)

now just need to wait for GPUGRID to release a new app!
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1341
Credit: 7,681,721,308
RAC: 13,169,240
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55652 - Posted: 30 Oct 2020 | 2:32:17 UTC - in response to Message 55651.

Congratz Ian. Now you get to be the first guinea pig of a new app.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 55654 - Posted: 31 Oct 2020 | 17:43:41 UTC - in response to Message 55651.

Ian, have you tried scoring it on FAH yet?
I guess you might be able to crunch for Greg Bowman until Toni's crew updates ACEMD. Folding@Home is more generous with points. I don't know if they are based on cobblestones or not. If so, Bowmanlab's app runs faster.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 55655 - Posted: 31 Oct 2020 | 18:27:42 UTC

Too bad that F@H is no longer on the BOINC platform so that points would apply.

I know Dr Bowman is a Stanford graduate and they competed with Berkeley on this stuff during development. I think they need to develop the F@H user interface further to match the power that BOINC gives experienced crunchers.

That said, I don't know of any other GPU based research dealing directly with a COVID-19 vaccine out there and in communicating with Greg I have found him to be quite approachable and conversational.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 55659 - Posted: 1 Nov 2020 | 3:12:47 UTC - in response to Message 55654.

I will receive the card next Thursday 11/5.

I don’t participate in FaH. And from what I understand, their application is not CUDA, but rather OpenCL. At least FAHbench is. So it likely won’t see the most benefit from the new architecture. Having a CUDA 11.1 app is crucial for this.

What I did do, in order to test CUDA performance, was recompile the SETI special CUDA application for CUDA 11.1. I have an offline benchmarking tool and some old SETI workunits, so I can check relative CUDA performance between a 2080ti and the 3070. This should give me a good baseline of relative performance to expect with Ampere when the GPUGRID app finally gets updated.

____________

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 55660 - Posted: 1 Nov 2020 | 15:43:37 UTC - in response to Message 55659.
Last modified: 1 Nov 2020 | 15:57:17 UTC

Thanks for pointing out the OpenCL vs CUDA factor, Ian. I didn't think about FAHcore using a different function.

That makes me wonder if PassMark GPU direct compute scores might be a better reference than FAH scores for predicting performance running ACEMD.
(edit) After doing some checking I see PassMark's bench is not CUDA either. Is there a CUDA 11.1 benchmark test anywhere?

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,222,865,968
RAC: 1,764,666
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55661 - Posted: 1 Nov 2020 | 15:50:43 UTC - in response to Message 55659.
Last modified: 1 Nov 2020 | 16:01:15 UTC

I don’t participate in FaH. And from what I understand, their application is not CUDA, but rather OpenCL. At least FAHbench is.
Well, I do. The present FAHcore 22 v0.0.13 is CUDA, though the exact CUDA version number used is not disclosed.

folding@home wrote:
As of today, your folding GPUs just got a big powerup! Thanks to NVIDIA engineers, our Folding@home GPU cores — based on the open source OpenMM toolkit — are now CUDA-enabled, allowing you to run GPU projects significantly faster. Typical GPUs will see 15-30% speedups on most Folding@home projects, drastically increasing both science throughput and points per day (PPD) these GPUs will generate.

This post of mine earlier in this thread discussed the performance gain of switching FAHcore from OpenCL to CUDA.
As NVidia helped them to develop this new app, and NVidia gives the largest support for the folding@home project of all the corporations, I suppose they did their best (i.e. it's using CUDA11 to get the most out of their brand new cards).
Slightly supporting this supposition is that they asked us to upgrade our drivers, and that the
... core22 0.0.13 should automatically enable CUDA support for Kepler and later NVIDIA GPU architectures ...
aligns with CUDA 11 being backwards compatible with Kepler.

Ian&Steve C. wrote:
So it likely won’t see the most benefit from the new architecture. Having a CUDA 11.1 app is crucial for this.
That's true, it's also crucial to make these cards usable for GPUGrid in any way.

What I did do in order to try to test CUDA performance is ive recompiled the SETI special CUDA application for CUDA 11.1. And I have an offline benchmarking tool and some old SETI workunits so I can check relative CUDA performance between a 2080ti and the 3070.
I'm really interested in those results!
This should give me a good baseline of relative performance to expect with Ampere when the GPUGRID app finally gets updated.
This is where our opinions diverge: SETI (and other analytical applications) use a lot of FFTs, which could benefit from the "extra" FP32 units, while MD simulations do not. So I don't consider SETI an adequate benchmark for GPUGrid.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 55662 - Posted: 1 Nov 2020 | 17:52:46 UTC

Thank you for sharing your insight, Zoltan. I much appreciate your perspective and experience in DC.

bozz4science
Send message
Joined: 22 May 20
Posts: 110
Credit: 105,618,140
RAC: 344,153
Level
Cys
Scientific publications
wat
Message 55663 - Posted: 1 Nov 2020 | 18:21:03 UTC

After the recent rollout of CUDA support on F@H, which I shared with you earlier in this very thread, there are now dedicated CUDA-enabled FAH cores that have increased the performance of NVIDIA cards tremendously on F@H. As Zoltan pointed out, there are still many unknown variables around the provided data, so take the performance charts on their website with a grain of salt.

At the risk of being off-topic in this thread... I like to diversify my contribution and occasionally run F@H for a couple of WUs before switching back to GPUGrid. What I like about their infrastructure is that you can check out a project description for all currently running WUs, and you can set a preference directly in the software to specify which cause (disease) you want to focus your computational resources on. You can choose from various targets, such as high priority projects, Parkinson's, cancer, Covid-19, Alzheimer's and Huntington's. Those projects change from time to time depending on what research is currently being conducted. In the past there have been other research projects around Dengue fever, Chagas disease, Zika virus, Hepatitis C, Ebola virus, Malaria, antibiotic resistance and Diabetes. If there is currently no work for the chosen preference, it immediately switches to the highest priority projects. In that sense, I feel like F@H offers an informational advantage over GPUGrid, but it certainly lacks the flexibility and options for crunchers that GPUGrid offers. My card (GTX750 Ti) finishes most tasks at F@H ahead of the first deadline and thus receives the early finish bonus points. It matches my RAC on GPUGrid if running exclusively 24/7, at around 100k credits. In the end I believe both projects have their own place and complement each other.

Anyways, I am also looking forward to seeing those benchmark results. Awesome that you managed to grab one in spite of the current scarcity of supply!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 55668 - Posted: 1 Nov 2020 | 19:05:31 UTC - in response to Message 55661.

if they get the CUDA changes into the FAHbench application, I’ll try it out. But last time I looked at it, the benchmarking app still only used OpenCL. When comparing something like two different cards you need to eliminate as many variables as possible. Using the standard FAH app, without the control to run the exact same work units over and over, the best I could do is run each card for a few months to get average PPD numbers from each. I’m just not willing to put that level of effort into it, when I can likely get the same results from a quicker benchmark.

From what I’m reading, to get the benefit of the new data pipeline, you need CUDA 11.1, not just cuda 11.

While SETI does different types of calculation, the relative ranking of architectures and models should be very comparable. When I moved from SETI to GPUGRID the hierarchy of GPUs remained the same. So the relative performance of 3070 to 2080ti observed with SETI should translate pretty closely to what we will probably observe on GPUGRID (provided they build an 11.1 app also).
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 55689 - Posted: 5 Nov 2020 | 21:02:55 UTC - in response to Message 55668.

I got my EVGA 3070 black today and did some testing

PLEASE KEEP IN MIND: this is very preliminary testing with an application that does different calculation types than GPUGRID. this is merely an attempt to compare against another BOINC project using a CUDA app in a controlled benchmark kind of way.

testing platform:
Asrock X570M Pro4 mATX (x16 slots on position 1 and 4)
AMD Ryzen9 3900X @ 4.20GHz
16GB GSkill DDR4-3600 CL14
Phanteks mATX case with only 4 PCI expansion slots
SETI@home special application recompiled by me for CUDA 11.1 on Linux

Had a bit of trouble getting it to work in my Linux system at first. Thought it might be the drivers: the 3070 "needs" the 455.38 driver, but previous tests and info from others have shown that slightly older drivers usually work just fine for new cards when the architecture is the same, so 455.23.04 *should* work just fine, only reporting a generic card name like "Nvidia Graphics Device". So I went through the trouble of installing the Nvidia drivers from the run file. 455.38 installed with little issue, but the system was still very flaky: randomly dropping the GPU, extreme lag navigating the desktop, very low GPU utilization, or even failing to boot.

Narrowed it down to the PCIe gen3 ribbon risers and the card constantly trying to run at PCIe gen4 no matter what settings I used in the BIOS (I set all slots to gen3 but nvidia-settings still reported running at gen4). due to the motherboard and case layout, i was unable to install the card directly into the motherboard bottom slot due to lack of enough space and i don't have any gen4 capable risers. So i carefully removed and set aside the 2080ti which is custom watercooled and hooked in-line with the CPU to be able to plug the 3070 directly into the top slot.

Finally working as it should. no random issues.
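
For anyone wanting to check the same things (driver version and negotiated PCIe generation) from a script, here is a minimal Python sketch. It assumes `nvidia-smi` is on the PATH; the query fields are standard nvidia-smi ones, while the helper names `parse_report` and `query_gpus` and the sample line are made up for illustration. Note that the current link generation may read lower while the card is idle.

```python
import subprocess

# Standard nvidia-smi query fields; the parsing helpers are illustrative only.
FIELDS = ["name", "driver_version", "pcie.link.gen.current", "pcie.link.gen.max"]


def parse_report(text: str) -> list[dict[str, str]]:
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus() -> list[dict[str, str]]:
    """Run nvidia-smi and parse its output (requires an Nvidia driver install)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_report(result.stdout)


# Made-up sample line, only to show the shape of the data.
sample = "GeForce RTX 3070, 455.38, 4, 4\n"
for gpu in parse_report(sample):
    print(gpu["name"], "driver", gpu["driver_version"],
          "PCIe gen", gpu["pcie.link.gen.current"])
```

On a real system, `query_gpus()` would replace the sample string.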

Testing results with CUDA 11.1 special app:
about 75-85% speed vs. my 2080ti on my collection of WUs
1980-1995 MHz (same as 2080ti)
14000MHz mem clock on both 2080ti and 3070
98-99% GPU Utilization reported
but only ~190-200W power as reported by nvidia-smi unrestrained, 220W PL (2080ti used 225W, at the PL)

keep in mind that Ampere went back to 128 cuda cores per SM, like Pascal, but unlike Pascal the data paths are different. Also remember that the special app seems to heavily favor SM count, likely due to the way it is coded. it is my guess that either SETI calculations use more integer math, or both FP32 data paths are not being used.

3070 has 46 SMs, the 2080ti has 68 SMs
so with 67% of the SMs (and of the CUDA cores, if Ampere's are counted at 64 per SM as on Turing), it's giving ~80% of the performance, not bad.

but the power comparison isn't so kind.
at 89% of the power, it's giving ~80% of the performance.
this might be able to be improved upon with power limiting, but I didn't test this right now
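
The ratios quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the SM counts, the ~80% relative speed and the ~200 W vs 225 W power draw are the figures from this post, and the helper name `perf_per_watt_ratio` is made up for illustration.

```python
# Figures taken from the post above; nothing here is measured by the script.
SM_3070, SM_2080TI = 46, 68              # streaming multiprocessors
REL_SPEED = 0.80                         # 3070 vs 2080 Ti, midpoint of 75-85%
POWER_3070_W, POWER_2080TI_W = 200, 225  # reported draw / power limit


def perf_per_watt_ratio(rel_speed: float, power_a: float, power_b: float) -> float:
    """Relative performance per watt of card A versus card B."""
    return rel_speed / (power_a / power_b)


sm_ratio = SM_3070 / SM_2080TI
power_ratio = POWER_3070_W / POWER_2080TI_W
speed_per_sm = REL_SPEED / sm_ratio
efficiency = perf_per_watt_ratio(REL_SPEED, POWER_3070_W, POWER_2080TI_W)

print(f"SM ratio:          {sm_ratio:.1%}")
print(f"power ratio:       {power_ratio:.1%}")
print(f"speed per SM:      {speed_per_sm:.2f}x")
print(f"perf/W vs 2080 Ti: {efficiency:.2f}x")
```

With these inputs the 3070 does about 18% more work per SM than the 2080 Ti, but lands at roughly 0.9x the performance per watt in this SETI test.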

So I haven't lost hope yet. I'm hopeful that if GPUGRID does use mostly FP32, and they recode their app to take advantage of both pipelines in a CUDA 11.1 app, that we could see better performance.

I'm waiting for the new app to be ready.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 55690 - Posted: 5 Nov 2020 | 22:30:43 UTC - in response to Message 55689.
Last modified: 5 Nov 2020 | 22:32:07 UTC

i attempted to build FAHbench with CUDA support (the information says it's possible), but I'm hitting a snag at configuring OpenMM.

I get errors when trying to configure OpenMM with CMake (as outlined in the FAHbench instructions)
I can't check the OpenMM docs since http://docs.openmm.org seems to be a dead link
I can't ask about it on the forums unless I register
I can't register unless they accept my request (says it'll take 72 hrs)
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,222,865,968
RAC: 1,764,666
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55691 - Posted: 5 Nov 2020 | 22:40:15 UTC - in response to Message 55689.
Last modified: 5 Nov 2020 | 22:41:10 UTC

I got my EVGA 3070 black today and did some testing
...
Testing results with CUDA 11.1 special app:
about 75-85% speed vs. my 2080ti on my collection of WUs
Thank you for all the effort you have put in this benchmark!
Regrettably your benchmarks confirmed my expectations.
Performance wise it's a bit better than I've expected (67.6%+10%~74.4% of the 2080 Ti).
Power consumption wise it seems, as of yet, that it's not worth investing in an upgrade from the RTX 2*** series for crunching.

bozz4science
Send message
Joined: 22 May 20
Posts: 110
Credit: 105,618,140
RAC: 344,153
Level
Cys
Scientific publications
wat
Message 55692 - Posted: 5 Nov 2020 | 22:51:43 UTC - in response to Message 55691.

Kudos to you! From what I understand so far, I have to support Zoltan's opinion. I really must stress that I am very keen on efficiency, as that is something that everyone should factor into their hardware decisions.

Anyway, it still seems to offer a valid value proposition for me. Especially at this price point. And it is yet very early and future support of RTX 30xx cards could definitely offer some potential for optimisation.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 55693 - Posted: 5 Nov 2020 | 23:04:32 UTC - in response to Message 55691.
Last modified: 5 Nov 2020 | 23:21:34 UTC

I got my EVGA 3070 black today and did some testing
...
Testing results with CUDA 11.1 special app:
about 75-85% speed vs. my 2080ti on my collection of WUs
Thank you for all the effort you have put in this benchmark!
Regrettably your benchmarks confirmed my expectations.
Performance wise it's a bit better than I've expected (67.6%+10%~74.4% of the 2080 Ti).
Power consumption wise it seems as of yet that it's not worth to invest in upgrading from the RTX 2*** series for crunching.


take it with a grain of salt so far, which is exactly why I made the disclaimer. as you yourself mentioned, the types of calculations are different, and if SETI is performing a large number of INT calculations then this result wouldn't be unexpected. Ampere should see the most benefit from a pure FP32 load, and according to your previous comments, GPUGRID should be mostly FP32. It could also be that source code changes might be necessary to take full advantage of the new architecture. the new SETI app has ZERO source code changes from the older 10.2 app; I simply compiled it with the 11.1 CUDA library instead of 10.2.

that's why I was attempting to build a FAHbench version with CUDA 11.1, but I hit a snag there and will have to wait.

I don't do FAH, but since users here have said that GPUGRID is similar in work performed and software used, the 3070 should perform on par with the 2080ti.

check this page for a comparison: https://folding.lar.systems/folding_data/gpu_ppd_overall

showing the F@h PPD of a 3070 *just* behind the 2080ti.
at $500 and 220W, that makes sense. not as power efficient as I'd like, but pushing it beyond the 2080ti nonetheless
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1341
Credit: 7,681,721,308
RAC: 13,169,240
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55694 - Posted: 6 Nov 2020 | 0:53:58 UTC - in response to Message 55690.
Last modified: 6 Nov 2020 | 0:59:46 UTC

From the issue raised on the OpenMM github repo, it seems they let their SSL certificate expire back in September.

And no one has done anything about it.

https://github.com/openmm/openmm-org/issues/38

https://github.com/openmm/openmm

This OpenMM forum?

https://simtk.org/plugins/phpBB/indexPhpbb.php?group_id=161&pluginname=phpBB

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1341
Credit: 7,681,721,308
RAC: 13,169,240
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55695 - Posted: 6 Nov 2020 | 1:11:47 UTC

The compiler optimizations for Zen 3 haven't made it into any linux kernel yet either. GCC11 and CLANG12 are supposed to get znver3 targets in the upcoming 5.10 kernel next April for the 21.04 distro release.

https://www.phoronix.com/scan.php?page=news_item&px=AMD-Zen-3-Linux-Expectations

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 55720 - Posted: 10 Nov 2020 | 22:15:32 UTC
Last modified: 10 Nov 2020 | 23:05:23 UTC

got the 3070 up and running on Einstein@home for some more testing.


As far as Einstein performance:

the 2080ti does the current batch GW tasks in about 4 minutes, using 225W (360 t/day)
the 3070 does the current batch GW tasks in about 5 minutes, using 170W (288 t/day)
the 2080ti does the FGRP tasks in about 6 minutes, using 225W (240 t/day)
the 3070 does the FGRP tasks in about 7:20 minutes, using 150W (196.4 t/day)

so for Einstein GW tasks, the 3070 is about 5% more efficient
and for the Einstein GR tasks, the 3070 is about 19% more efficient
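
For reference, the tasks-per-day and efficiency figures above can be re-derived as follows. This is a sketch that assumes "more efficient" means less energy per task at a constant power draw; the run times and wattages are the ones quoted in this post, and the function names are illustrative.

```python
# Task times (minutes) and power draw (watts) as quoted in the post above.
RUNS = {
    "GW":   {"2080 Ti": (4.0, 225), "3070": (5.0, 170)},
    "FGRP": {"2080 Ti": (6.0, 225), "3070": (7 + 20 / 60, 150)},
}
MINUTES_PER_DAY = 24 * 60


def tasks_per_day(minutes_per_task: float) -> float:
    """Tasks completed in 24 hours when running one task at a time."""
    return MINUTES_PER_DAY / minutes_per_task


def energy_per_task_wh(minutes_per_task: float, watts: float) -> float:
    """Energy for one task in watt-hours, assuming constant power draw."""
    return watts * minutes_per_task / 60


def energy_saving(task: str) -> float:
    """Fraction of energy per task the 3070 saves relative to the 2080 Ti."""
    old = energy_per_task_wh(*RUNS[task]["2080 Ti"])
    new = energy_per_task_wh(*RUNS[task]["3070"])
    return 1 - new / old


for task, cards in RUNS.items():
    for card, (minutes, watts) in cards.items():
        print(f"{task:4} {card:7} {tasks_per_day(minutes):6.1f} t/day "
              f"{energy_per_task_wh(minutes, watts):5.2f} Wh/task")
    print(f"{task:4} 3070 saves {energy_saving(task):.1%} energy per task")
```

Under these assumptions the saving comes out at about 5.6% for GW and 18.5% for FGRP, in line with the figures quoted above. Measured as tasks per day per watt instead, the same inputs give somewhat larger numbers (roughly 6% and 23%), so it matters which definition of "efficient" is used.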

now this isn't to say you should buy Ampere (or even nvidia cards) for Einstein, since some of the newer AMD cards perform much better there (faster and overall more efficient). But this will serve as an additional data point with a different processing type. you can see the efficiency improvements here, especially on the GR tasks.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,222,865,968
RAC: 1,764,666
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55731 - Posted: 12 Nov 2020 | 23:54:31 UTC - in response to Message 55720.
Last modified: 12 Nov 2020 | 23:56:42 UTC

I wonder: if you slowed down the 2080Ti by 20% (reducing GPU voltage accordingly as well) to match the speed of the 3070, would it be about as efficient? (or even better in the case of GW tasks)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 55732 - Posted: 13 Nov 2020 | 1:57:11 UTC - in response to Message 55731.

the 2080ti was already power limited to 225W. it's not possible (under Linux) to reduce the voltage. there are just no tools for it. reducing the power limit will have the indirect effect of reducing voltage, but I don't have control of the voltage directly. I could power limit further, but it will only slow the card further. you start to lose too much clock speed below 215-225W in my experience (across 6 different 2080tis). the 2080ti was also watercooled with temps never exceeding 40C so it had as much of an advantage as it could have had.

the 3070 was run at a 200W power limit, but it never even came close to that in these Einstein loads, with speed probably only limited by temp/clock boost bins. But that's really par for the course on mid-level Nvidia cards running Einstein tasks: the GR tasks are just lightweight and don't pull a lot of power, and the GW tasks are more CPU bound than anything else so you don't get full GPU utilization. further efficiency gains could probably be made here on the 3070 with more power limiting and overclocking, but I didn't bother for this test.
____________

SolidAir79
Send message
Joined: 22 Aug 19
Posts: 7
Credit: 168,393,363
RAC: 0
Level
Ile
Scientific publications
wat
Message 56525 - Posted: 15 Feb 2021 | 21:18:14 UTC

Any apps i can use yet on my 30s cards?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 56527 - Posted: 15 Feb 2021 | 22:19:59 UTC - in response to Message 56525.

Any apps i can use yet on my 30s cards?


nope.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 56590 - Posted: 17 Feb 2021 | 0:25:32 UTC

https://www.gpugrid.net/workunit.php?wuid=27026028

with no Ampere compatible app, and no mechanism to prevent work from being sent to incompatible systems, situations like this will only become more common as more and more users upgrade to these new cards.

3 out of the 6 systems that have handled this WU were using Ampere cards and failed because of it.

If we had an Ampere compatible CUDA 11.1 app, this would have been completed by the first system.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,222,865,968
RAC: 1,764,666
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56601 - Posted: 17 Feb 2021 | 12:29:28 UTC - in response to Message 56590.
Last modified: 17 Feb 2021 | 12:29:40 UTC

I've "saved" a workunit earlier (my host was the 7th crunching it, and the 1st successful one). It had been sent to two Ampere cards before.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 56604 - Posted: 17 Feb 2021 | 15:03:36 UTC - in response to Message 56601.

i have a couple like that that I similarly saved. even some _7s (8th)

this one, 50% of the users had RTX 30-series https://www.gpugrid.net/workunit.php?wuid=27025291
____________

Asghan
Send message
Joined: 30 Oct 19
Posts: 6
Credit: 405,900
RAC: 0
Level

Scientific publications
wat
Message 57019 - Posted: 25 Jun 2021 | 6:51:57 UTC

Is there any update regarding nVidia Ampere Workunits??
My Ampere cards are getting bored -.-

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 57355 - Posted: 21 Sep 2021 | 12:31:24 UTC
Last modified: 21 Sep 2021 | 12:32:54 UTC

my 3080Ti (limited to 300W mind you) did this ADRIA task in under 9.5hrs

https://www.gpugrid.net/result.php?resultid=32642251

anyone with a high power 3090 or 3080Ti run faster? my model 3080Ti will only go up to 366W, but I know some 3080Tis and 3090s can reach into the 400-500W range.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 6,853,981,809
RAC: 48,267,988
Level
Tyr
Scientific publications
wat
Message 57980 - Posted: 1 Dec 2021 | 1:42:27 UTC - in response to Message 57355.

I am not running 3090s or 3080ti cards but I do have some times/comparisons for high-end Turing and Ampere GPUs for Adria tasks.

Nvidia Quadro RTX 6000: 8.4 hours
Nvidia RTX A6000: 6.7 hours

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,222,865,968
RAC: 1,764,666
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57995 - Posted: 1 Dec 2021 | 16:59:57 UTC - in response to Message 57980.

I am not running 3090s or 3080ti cards but I do have some times/comparisons for high-end Turing and Ampere GPUs for Adria tasks.

Nvidia Quadro RTX 6000: 8.4 hours
Nvidia RTX A6000: 6.7 hours
The NVidia Quadro RTX 6000 is a "full chip" version of the RTX 2080Ti (4608 vs 4352 CUDA cores)
while the NVidia RTX A6000 is the "full chip" version of the RTX 3090 (10752 vs 10496 CUDA cores).
The rumoured RTX 3090Ti will have the "full chip" also.
The RTX 3080 Ti has 10240 CUDA cores.
(The real world GPUGrid performance of the Ampere architecture cards scales with half of the advertised number of CUDA cores).
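
A rough sketch of that rule of thumb in Python: halve the advertised core count for Ampere cards only, then compare. The core counts are the ones listed above and the run times are from the quoted post; the names `effective_cores`, `predicted_speedup` and `AMPERE` are illustrative, and clock speeds and power limits are ignored.

```python
# Advertised CUDA core counts as listed in the post above.
ADVERTISED_CORES = {
    "RTX 2080 Ti": 4352,
    "Quadro RTX 6000": 4608,
    "RTX 3080 Ti": 10240,
    "RTX 3090": 10496,
    "RTX A6000": 10752,
}
AMPERE = {"RTX 3080 Ti", "RTX 3090", "RTX A6000"}


def effective_cores(card: str) -> float:
    """Rule-of-thumb core count: Ampere's advertised number is halved."""
    cores = ADVERTISED_CORES[card]
    return cores / 2 if card in AMPERE else cores


def predicted_speedup(new: str, old: str) -> float:
    """Speedup predicted purely from the effective core counts."""
    return effective_cores(new) / effective_cores(old)


# ADRIA run times in hours from the quoted post.
measured = 8.4 / 6.7
predicted = predicted_speedup("RTX A6000", "Quadro RTX 6000")
print(f"A6000 vs Quadro RTX 6000: predicted {predicted:.2f}x, measured {measured:.2f}x")
print(f"3080 Ti vs 2080 Ti: predicted "
      f"{predicted_speedup('RTX 3080 Ti', 'RTX 2080 Ti'):.2f}x")
```

The halved count predicts about 1.17x for the A6000 against a measured 1.25x, so the rule is only approximate; using the advertised counts directly would predict about 2.33x, which is much further off.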

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 6,853,981,809
RAC: 48,267,988
Level
Tyr
Scientific publications
wat
Message 57996 - Posted: 1 Dec 2021 | 17:18:51 UTC - in response to Message 57995.


(The real world GPUGrid performance of the Ampere architecture cards scales with the half of the advertised number of CUDA cores).



Is that true for all NVidia GPUs or just Ampere? Just out of curiosity, why is it this way?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1341
Credit: 7,681,721,308
RAC: 13,169,240
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58002 - Posted: 1 Dec 2021 | 19:37:41 UTC - in response to Message 57996.

Just guessing here but since every new generation of Nvidia cards has basically doubled or at least increased the CUDA core count and since the GPUGrid app as well as a very few other project apps are really well coded for parallelization of computation, you can state that the crunch time scales with more cores.

You can tell how well optimized an application is by how much sustained utilization it produces and how close to the max TDP of the card the app runs.

The GPUGrid apps and the Minecraft apps are the only two apps that I know of that will run at 97-100% utilization through the entire computation at the full power capability of the card.

Kudos to the app developers of these projects. Job well done!

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,222,865,968
RAC: 1,764,666
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58003 - Posted: 1 Dec 2021 | 19:43:18 UTC - in response to Message 57996.
Last modified: 1 Dec 2021 | 20:29:26 UTC

(The real world GPUGrid performance of the Ampere architecture cards scales with the half of the advertised number of CUDA cores).

Is that true for all NVidia GPUs or just Ampere? Just out of curiosity, why is it this way?
It's true only for the Ampere architecture.

As you can see in the picture above, the number of FP32 units has been doubled in the Ampere architecture (the INT32 units have been "upgraded"), but they reside within (almost) the same streaming multiprocessor (SM), which cannot keep that many cores fed much better than before. From a cruncher's point of view the number of SMs should have been doubled as well (by making "smaller" SMs).
The other limiting factor is the power consumption, as the RTX 3080Ti (RTX3090 etc) easily hits the 350W power limit with this architecture.

https://www.reddit.com/r/hardware/comments/ikok1b/explaining_amperes_cuda_core_count/
https://www.tomshardware.com/features/nvidia-ampere-architecture-deep-dive
https://support.passware.com/hc/en-us/articles/1500000516221-The-new-NVIDIA-RTX-3080-has-double-the-number-of-CUDA-cores-but-is-there-a-2x-performance-gain-

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1341
Credit: 7,681,721,308
RAC: 13,169,240
Level
Tyr
Scientific publications
watwatwatwatwat
Message 58005 - Posted: 1 Dec 2021 | 21:42:05 UTC

As long as you can keep an INT32 operation out of the warp scheduler, then Ampere series can do two FP32 operations on the same clock.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Scientific publications
wat
Message 58009 - Posted: 2 Dec 2021 | 0:07:19 UTC - in response to Message 58005.
Last modified: 2 Dec 2021 | 0:25:06 UTC

and since the real performance doesn't scale with the FP core count, that leads us to the conclusion that the GPUGRID app must be coded with a fair number of INT operations which cut into the FP cores available (half of the FP cores are actually shared FP/INT cores and can only do one type of operation at a time, while the other half are dedicated FP32). This explains the massive performance boost of Turing cards over Pascal (at the same FP count) for GPUGRID since Turing introduced concurrent FP/INT processing.

https://developer.nvidia.com/blog/nvidia-turing-architecture-in-depth/

the Turing SM adds a new independent integer datapath that can execute instructions concurrently with the floating-point math datapath. In previous generations, executing these instructions would have blocked floating-point instructions from issuing


Einstein scales much better with FP core count on Ampere, but is also more reliant on memory speed/latency than GPUGRID. if you take this "only half" rule, it doesn't stack up with real world gains seen on Einstein. A 3080Ti is 70-75% faster than a 2080Ti on Einstein, while under the 1/2 rule it "only" has 17% more cores. so one obviously can't paint with such a broad brush to include all of crunching.

all depends on how the app is coded to use the hardware, and sometimes you can't make an app totally optimized for a certain GPU architecture depending on what kinds of computations you need to do or what coding methods you use.
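
To make the shared-datapath argument concrete, here is a deliberately simplified toy model in Python. It assumes two equally wide datapaths per SM (Turing: one FP32 plus one INT32; Ampere: one FP32 plus one shared FP32/INT32), perfect scheduling, and no memory, clock or power effects. The INT shares in the loop are hypothetical and are not measurements of any real app.

```python
def turing_time(int_fraction: float) -> float:
    """Relative time per instruction with one FP32 path and one INT32 path.

    Both paths run concurrently, so the busier one sets the pace.
    """
    return max(1.0 - int_fraction, int_fraction)


def ampere_time(int_fraction: float) -> float:
    """Relative time with one FP32-only path and one shared FP32/INT32 path.

    INT work must use the shared path; FP work can be split across both.
    """
    return max(int_fraction, 0.5)


def ampere_speedup(int_fraction: float) -> float:
    """Per-SM, per-clock speedup of the Ampere layout over the Turing layout."""
    return turing_time(int_fraction) / ampere_time(int_fraction)


# Hypothetical instruction mixes, for illustration only.
for f in (0.0, 0.1, 0.25, 0.5):
    print(f"INT share {f:.0%}: Ampere/Turing per-SM throughput {ampere_speedup(f):.2f}x")
```

In this model a pure FP32 load gets the full 2x from the doubled FP32 units, while the gain shrinks as the INT share grows and disappears at a 50/50 mix. A real app can of course fall short of the model for other reasons (memory bandwidth, power limits, occupancy).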

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,222,865,968
RAC: 1,764,666
Level
Trp
Message 58017 - Posted: 2 Dec 2021 | 9:46:42 UTC - in response to Message 58009.
Last modified: 2 Dec 2021 | 9:58:16 UTC

... the GPUGRID app must be coded with a fair number of INT operations which cut into the FP cores available (half of the FP cores are actually shared FP/INT cores and can only do one type of operation at a time, while the other half are dedicated FP32).
Perhaps MD simulations don't rely on that many INT operations, so it's independent from the coder and from the cruncher's wishes (demands).

Einstein scales much better with FP core count on Ampere ...
The Einstein app is not a native CUDA application (it's OpenCL) and is not good at utilizing (previous) NVidia GPUs, which makes this comparison inconsequential for the GPUGrid app's performance improvement on Ampere. It's the Ampere architecture that saved the day for the Einstein app, so if the Einstein app were coded (in CUDA) in a way that ran great on Turing as well, you would see the same (low) performance improvement on Ampere.

all depends on how the app is coded to use the hardware, and sometimes you can't make an app totally optimized for a certain GPU architecture depending on what kinds of computations you need to do or what coding methods you use.
Well, how the app is coded depends on the problem (the research area), the methodology of the given research, and the programming language, which is chosen according to the targeted range of hardware. The method is why the claim that "the GPUGRID app must be coded with a fair number of INT operations" cannot hold. (FP32 is needed to calculate trajectories.) The targeted (broader) range of hardware is the reason Einstein is coded in OpenCL, resulting in lower NVidia GPU utilization on previous GPU generations.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1069
Credit: 40,231,533,983
RAC: 432
Level
Trp
Message 58018 - Posted: 2 Dec 2021 | 12:41:09 UTC - in response to Message 58017.
Last modified: 2 Dec 2021 | 13:32:45 UTC

Nvidia cards only run CUDA. OpenCL code gets compiled into CUDA at runtime. But just using OpenCL doesn’t mean that it can’t effectively use the GPU; it’s all in the coding. Compiling to CUDA at runtime adds only a very small overhead.

A new Einstein app was released several months ago, and I was involved in its testing and development. It was mostly coded/modified by another user (petri, the same guy who wrote the SETI special CUDA app), and I then forwarded and explained the changes to the Einstein devs for integration and release. It’s now available to everyone and provides big gains in performance for all Nvidia GPUs all the way back to the Maxwell architecture, so yes, the new coding applies to Turing too, and Ampere still saw major gains over Turing. Ampere had by far the best gains: Maxwell and Pascal cards see about a 40% improvement over the old app, Turing about 60%, and Ampere over 100%. It basically puts Nvidia back on level ground with, or even ahead of, AMD for Einstein. The issue was a certain command and its parameters, which caused memory access serialization on Nvidia GPUs and effectively held them back. Petri recoded it a different way to remove the limiter, parallelize the memory access and allow the GPU to run at full speed.

So yes, it’s relevant, and it shows how different code can be used for the exact same computation to utilize the hardware better and increase performance. It also shows that you can’t count only “half” of the cores for all of crunching, when Ampere clearly has benefits that scale much closer to the FP core count with certain projects. If you look at the Ampere architecture and understand it, you’ll see that the only reason “half” of the FP cores wouldn’t be used is if INT operations are running.

Regarding GPUGRID's app, if you observe what the app is doing with the hardware, you'll see one trait that stands out from most other projects: high PCIe bus utilization. The app is sending A LOT of data over the PCIe bus, equivalent to almost a full PCIe 3.0 x4 worth of bandwidth, for the whole run. The app is coded in a way where lots of data is sent between the CPU and the GPU over the PCIe bus, but very little is stored in GPU memory. These kinds of operations usually involve a lot of integer adds for data fetching, as well as FP compares or min/max processing. So while the meat of the computation is for trajectories, there are a lot of other things that the app needs to do in INT to get and send the data and organize the results. I imagine that if the devs changed their code philosophy to store more data in GPU memory, they could cut out a lot of the excess involved in sending so much data over the PCIe bus, keep things more local to the GPU, and speed up processing overall. The caveat is that this would exclude some low-VRAM GPUs.
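
For a sense of scale on "a full PCIe 3.0 x4 worth of bandwidth", here is a small Python sketch of the theoretical per-direction link bandwidth. The per-lane transfer rates and line encodings come from the PCIe specs, not from this thread, and the function name `pcie_bandwidth_gbs` is just an illustrative choice. Real-world throughput is lower because of protocol overhead, so treat the result as an upper bound.

```python
# Per-lane raw transfer rate (GT/s) and line-encoding efficiency by generation.
# Gen 1 and 2 use 8b/10b encoding; gen 3 and 4 use 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}


def pcie_bandwidth_gbs(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    if generation not in PCIE_GENERATIONS:
        raise ValueError(f"unsupported PCIe generation: {generation}")
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    rate_gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    # One transfer carries one bit per lane; divide by 8 to get bytes.
    return rate_gt_per_s * efficiency / 8 * lanes


if __name__ == "__main__":
    for gen, lanes in ((3, 4), (3, 16), (4, 16)):
        print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gbs(gen, lanes):.2f} GB/s")
```

By this estimate a PCIe 3.0 x4 link tops out just under 4 GB/s in each direction, which is a lot of traffic for an app to sustain over a whole run.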

Keith Myers
Joined: 13 Dec 17
Posts: 1341
Credit: 7,681,721,308
RAC: 13,169,240
Level
Tyr
Message 58019 - Posted: 2 Dec 2021 | 23:27:14 UTC - in response to Message 58018.

👏👍
