
Message boards : News : New acemdshort app 846

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 38185 - Posted: 30 Sep 2014 | 8:32:04 UTC

I've promoted the CUDA65 app version 846 from beta to short.

You'll only get this if you have a Kepler or Maxwell card, and have a CUDA 6.5-capable driver, in practice rev 343 or higher.

Please post any problems or regressions here.

Matt

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38187 - Posted: 30 Sep 2014 | 10:49:34 UTC

Looking good. BOINC reports 0.90 CPU for the CUDA 6.5 app, but Task Manager shows only 1-2%. For beta tasks BOINC reported the same, and Task Manager also showed 1-2%.

biodoc
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 943,692
Level
Trp
Message 38190 - Posted: 30 Sep 2014 | 13:03:31 UTC

First NOELIA_SH2 WU on GTX980 completed & validated with beta app.

http://www.gpugrid.net/result.php?resultid=13145399

biodoc
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 943,692
Level
Trp
Message 38191 - Posted: 30 Sep 2014 | 13:12:06 UTC - in response to Message 38187.

eXaPower wrote:
Looking good. BOINC reports 0.90 CPU for the CUDA 6.5 app, but Task Manager shows only 1-2%. For beta tasks BOINC reported the same, and Task Manager also showed 1-2%.


I saw that too on Windows 8.1, so I added the environment variable swan_sync with a value of 0 and rebooted. Now I see ~100% core usage. I'm not sure if it will make a difference, but it makes me feel better.

See this thread for discussion of swan_sync:

http://www.gpugrid.net/forum_thread.php?id=2123
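
One way to confirm that the variable is visible to newly started processes after the reboot is to read it back from the environment. The sketch below only checks that it is set. The helper name `find_env_setting` is hypothetical, and the case-insensitive match is an assumption that mirrors how Windows treats variable names; nothing here reflects how the app itself reads or interprets the value.

```python
import os


def find_env_setting(name, environ=None):
    """Return the value of an environment variable, or None if it is unset.

    The name is matched case-insensitively (an assumption for illustration,
    mirroring Windows behaviour). Hypothetical helper, not part of BOINC or ACEMD.
    """
    env = os.environ if environ is None else environ
    wanted = name.lower()
    for key, value in env.items():
        if key.lower() == wanted:
            return value
    return None


if __name__ == "__main__":
    value = find_env_setting("swan_sync")
    if value is None:
        print("swan_sync is not set for this process")
    else:
        print(f"swan_sync = {value}")
```

Run it from a freshly opened command prompt; a prompt that was already open before the variable was added will not see the change.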

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 38192 - Posted: 30 Sep 2014 | 13:32:25 UTC - in response to Message 38190.
Last modified: 30 Sep 2014 | 13:32:38 UTC

biodoc wrote:
First NOELIA_SH2 WU on GTX980 completed & validated with beta app.

http://www.gpugrid.net/result.php?resultid=13145399

But biodoc, do you have run times for these WUs on non-Maxwell cards to compare?
That is what I am very interested in.
____________
Greetings from TJ

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38193 - Posted: 30 Sep 2014 | 13:35:41 UTC - in response to Message 38190.
Last modified: 30 Sep 2014 | 13:36:59 UTC

biodoc, thanks for the tip. Your Win8.1 system has a WDDM tax of ~7% compared to XP, and it is still blazing fast. Have you tested your GTX780Ti with the new short CUDA 6.5 app? I'm very curious to see how well the GTX 780Ti performs with the new refined code compared to GM204. Also, GM204 shows how Maxwell is able to carry more threads (atoms) per SMM vs. SMX.

Very impressive to see the GTX970 (1664c/104TMU/64ROP) completing tasks in similar or faster times than the GK110 GTX780 (2304c/192TMU/48ROP) on the beta app performance chart. Considering that the GTX970 has fewer TMUs and ACEMD TMU usage is high, this shows a 145 W TDP board performing at or above the level of the 225 W TDP GTX780. For anyone with higher taxes or energy rates, the GTX970 looks to be the card of choice (unless a future GTX960 loses no more than a couple of SMMs compared to the GTX 970).

Excellent code refinement by Matt.

The swan_sync variable is for higher-end cards like yours. For my two lowly GK107s, swan_sync makes no difference.

biodoc
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 943,692
Level
Trp
Message 38195 - Posted: 30 Sep 2014 | 14:14:19 UTC - in response to Message 38192.

TJ wrote:
biodoc wrote:
First NOELIA_SH2 WU on GTX980 completed & validated with beta app.

http://www.gpugrid.net/result.php?resultid=13145399

But biodoc, do you have run times for these WUs on non-Maxwell cards to compare?
That is what I am very interested in.


No, my 780Ti is on a Linux box and exclusively runs the long WUs. The beta app is for Windows only, so I think we need data from a 780Ti using the new app for a fair comparison.

biodoc
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 943,692
Level
Trp
Message 38196 - Posted: 30 Sep 2014 | 14:27:18 UTC

The NOELIA_SH2 WU I just finished is ranked #6 in the new Performance section.

2.79 hours.

http://www.gpugrid.net/performance.php#!

Windows 8.1, nvidia driver 344.16.

For the NOELIA_SH2 WUs, my GPU load is only 76%, memory controller load is 25%, and power draw is 76% of TDP. At 65% fan speed, the GPU temp is 62C. Also, swan_sync=0.

I'm anxious to test in Linux, but I can wait.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2346
Credit: 16,293,515,968
RAC: 5,831,839
Level
Trp
Message 38200 - Posted: 30 Sep 2014 | 17:22:28 UTC - in response to Message 38193.

eXaPower wrote:
Excellent code refinement by Matt.

Was there any code refinement between 8.44 and 8.46?

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38202 - Posted: 30 Sep 2014 | 17:37:58 UTC - in response to Message 38200.

In the "Maxwell now" thread he mentioned:


I found one of the many papers written by you and others, "ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale", from the golden days of GT200. A Maxwell update, if applicable, would be very informative.


I'm doing a bit of work to improve the performance of the code for Maxwell hardware - expect an update before the end of the year.

Matt


I'm assuming there was.


gianni
Joined: 8 Feb 13
Posts: 5
Credit: 6,750
RAC: 0
Message 38204 - Posted: 30 Sep 2014 | 18:11:52 UTC - in response to Message 38202.

Nope, that's just a rebuild, modulo a fix for a compiler regression.

The good stuff is still fermenting.

M

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38205 - Posted: 30 Sep 2014 | 18:21:02 UTC - in response to Message 38204.



gianni wrote:
The good stuff is still fermenting.

M


Can't wait for the recipe to be added once the grapes are wine.


eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38284 - Posted: 3 Oct 2014 | 23:10:21 UTC

While searching for run times and processing rates for the GTX980/970, I noticed an abnormal variance in the 8.46 short app's "Average processing rate". The number 653.09405051673 was taken from host 113695 with a GTX980, while my GT650m's "average processing rate" is 71.024125852776 for the same CUDA 6.5 / 8.46 app. What's the formula for the average processing rate?

How does a much more powerful GPU have the smaller number? If I'm misunderstanding the numbers, could someone explain how a GTX980 shows 11 digits after the decimal point, while a GT650m has 12? A GTX 980 finishes a NOELIA_SH2 task in 7,500-8,000 s. A GT650m completes the same task in 59,000-65,000 s. In GFLOPS terms, a GTX 980 is worth 7.75 GT650m cards.

FYI: for the 8.46 beta app, host 113695's GTX980 has a processing rate of 1627.7621166778, while my GT650m's processing rate is 193.43498955808.

The same user's GTX780Ti has a processing rate of 310.15025279058 for the CUDA 6.0 / 8.41 long app; for the same app a GT650m is at 41.592640653642, again showing more digits after the decimal point.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 38512 - Posted: 14 Oct 2014 | 18:29:21 UTC - in response to Message 38193.

eXaPower wrote:
ACEMD TMU usage is high

I don't have insight into the actual code, but TMUs are Texture Mapping Units. They are fixed-function units that map textures to geometry, and I highly doubt they can be exploited for GPU-Grid. The same applies to ROPs: these are Render Output Units, i.e. they deal with assembling the finalized images ("pushing the pixels"). We're not pixelating anything at GPU-Grid or in other GP-GPU apps. Think of GP-GPU work as endless loops of matrix and vector operations, which are all performed on the shaders.

eXaPower wrote:
could someone explain how a GTX980 shows 11 digits after the decimal point, while a GT650m has 12?

That seems to be simply caused by the number of total digits being equal to 14. BTW: consider the variance in WU completion times. You can easily round those numbers to 3 significant digits, anything else will be drowned in "experimental noise" anyway:

GTX980: 653.09405051673 -> 653
GT650m: 71.024125852776 -> 71.0
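
The digit counts can be reproduced with a short sketch. It assumes, as suggested above, that the site prints each rate with 14 significant digits (PHP's default `precision` setting of 14 would behave this way, but that is a guess about the server, not something confirmed in this thread). The helper names `decimals_shown` and `round_sig` are hypothetical.

```python
import math


def decimals_shown(text):
    """Count the digits printed after the decimal point."""
    return len(text.split(".")[1]) if "." in text else 0


def round_sig(x, sig=3):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - math.floor(math.log10(abs(x))))


rates = {"GTX980": 653.09405051673, "GT650m": 71.024125852776}

for card, rate in rates.items():
    shown = format(rate, ".14g")  # 14 significant digits in total
    print(card, shown, decimals_shown(shown), round_sig(rate))
```

With a fixed budget of 14 significant digits, a value with three digits before the decimal point has 11 left after it, and a value with two digits before the point has 12.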

This also answers your other question:
How does a much more powerful GPU have the smaller number?

It doesn't, see the numbers above.

BTW2: you also mention a factor of about 8 in performance between these cards, based on other measures. The factor between the processing rates quoted above matches this, approximately.
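
As a rough check of that comparison, the sketch below divides the two processing rates and sets the result beside the ratio implied by the run times quoted earlier in the thread. All numbers are the ones posted here; the function name `ratio_range` is hypothetical.

```python
def ratio_range(slow_range, fast_range):
    """Return (smallest, largest) slow/fast ratio for two (min, max) ranges."""
    smallest = slow_range[0] / fast_range[1]
    largest = slow_range[1] / fast_range[0]
    return smallest, largest


# Average processing rates for the 8.46 short app, as quoted in the thread.
rate_gtx980 = 653.09405051673
rate_gt650m = 71.024125852776
rate_ratio = rate_gtx980 / rate_gt650m

# NOELIA_SH2 run times in seconds, as quoted in the thread.
runtime_gtx980 = (7500, 8000)
runtime_gt650m = (59000, 65000)
runtime_low, runtime_high = ratio_range(runtime_gt650m, runtime_gtx980)

print(f"processing rate ratio: {rate_ratio:.1f}")
print(f"run time ratio: {runtime_low:.1f} to {runtime_high:.1f}")
```

The rate ratio comes out a little above the run time range, which fits "approximately" given the variance between work units.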

MrS
____________
Scanning for our furry friends since Jan 2002

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38517 - Posted: 14 Oct 2014 | 19:00:47 UTC - in response to Message 38512.
Last modified: 14 Oct 2014 | 19:10:38 UTC

ETA, thank you for explaining what the processing rate numbers mean.

The reason I mentioned texture mapping units: http://multiscalelab.org/gianni/publications?action=AttachFile&do=get&...

Texture Mapping Units are "capable of performing linear interpolation of values into multidimensional (up to 3D) arrays of floating point data." Quoted from Matt's "Accelerating Biomolecular Dynamics in the Microsecond Time Scale": "The texture units are used to assist the calculation of the electrostatic and van der Waals terms by providing linearly interpolated values for the radial components of those functions from lookup tables." Among other uses.
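
The idea in that quote, a lookup table read with linear interpolation, can be sketched on the CPU in a few lines. This is only an illustration of the technique and not ACEMD code: the 1/r function, the table range, the table size, and the names `build_table` and `lookup` are all assumptions made for the example. On a GPU the texture unit does the interpolation in fixed-function hardware.

```python
def build_table(func, r_min, r_max, n):
    """Sample func at n evenly spaced radii from r_min to r_max."""
    step = (r_max - r_min) / (n - 1)
    return [func(r_min + i * step) for i in range(n)], step


def lookup(table, r_min, step, r):
    """Linearly interpolate between the two table entries that bracket r.

    The index is clamped to the table, so radii outside the sampled range
    are extrapolated from the first or last segment.
    """
    pos = (r - r_min) / step
    i = min(max(int(pos), 0), len(table) - 2)
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])


def coulomb_radial(r):
    """Radial part of a Coulomb-like term (illustrative choice)."""
    return 1.0 / r


R_MIN, R_MAX, N = 1.0, 10.0, 1024
table, step = build_table(coulomb_radial, R_MIN, R_MAX, N)

r = 3.3
approx = lookup(table, R_MIN, step, r)
exact = coulomb_radial(r)
print(f"r={r}: table={approx:.8f} exact={exact:.8f} error={abs(approx - exact):.2e}")
```

The trade is a small interpolation error in exchange for replacing a function evaluation with a memory fetch, which is why a finer table or a smoother function gives a closer match.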

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 38523 - Posted: 14 Oct 2014 | 22:19:03 UTC - in response to Message 38517.

Thanks for pointing that out! The paper is from 2009; I suspect the code has been enhanced since then, but not radically changed.

Matt, can you briefly (or at whatever length your time allows) comment on the usage of non-shader blocks in GPUs? And regarding the current question: are you still using the TMUs for table lookup? (what a neat trick! :) And does the reduced number of TMUs in Maxwell affect performance? I suspect not, unless you're constantly hammering the TMUs with requests.

MrS
____________
Scanning for our furry friends since Jan 2002
