New acemdshort app 846

Message boards : News : New acemdshort app 846

MJH (Project administrator, Project developer, Project scientist) Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0	Message 38185 - Posted: 30 Sep 2014 | 8:32:04 UTC
	I've promoted the CUDA65 app, version 8.46, from beta to acemdshort. You'll only get it if you have a Kepler or Maxwell card and a CUDA 6.5-capable driver, which in practice means driver version 343 or higher. Please post any problems or regressions here. Matt
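The eligibility rule in the announcement can be written down as a small check. This is a minimal sketch: it assumes Kepler corresponds to CUDA compute capability 3.x and Maxwell to 5.x. The function name and inputs are hypothetical, and the real decision is made by the project's server-side scheduler, which is not shown here.

```python
# Hypothetical encoding of the eligibility rule stated in the announcement.
# Assumes Kepler = CUDA compute capability 3.x and Maxwell = 5.x; the real
# decision is made server-side by the project and is not reproduced here.
KEPLER_MAJOR = 3
MAXWELL_MAJOR = 5
MIN_DRIVER_MAJOR = 343  # "in practice rev 343 or higher"


def eligible_for_cuda65(compute_capability: tuple[int, int], driver_version: str) -> bool:
    """Return True if a GPU/driver pair meets the stated requirements."""
    major, _minor = compute_capability
    # NVIDIA driver versions look like "344.16"; only the major part matters here.
    driver_major = int(driver_version.split(".")[0])
    return major in (KEPLER_MAJOR, MAXWELL_MAJOR) and driver_major >= MIN_DRIVER_MAJOR


if __name__ == "__main__":
    print(eligible_for_cuda65((5, 2), "344.16"))  # GTX 980 (Maxwell) with driver 344.16
    print(eligible_for_cuda65((2, 0), "344.16"))  # a Fermi card is excluded
```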

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 38187 - Posted: 30 Sep 2014 | 10:49:34 UTC
	Looking good. BOINC reports 0.90 CPUs for the 6.5 app, but Task Manager shows only 1-2% CPU usage. For the beta tasks, BOINC reported the same, and Task Manager also showed 1-2%.

biodoc Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 943,692	Message 38190 - Posted: 30 Sep 2014 | 13:03:31 UTC
	First NOELIA_SH2 WU on a GTX980 completed and validated with the beta app: http://www.gpugrid.net/result.php?resultid=13145399

biodoc Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 943,692	Message 38191 - Posted: 30 Sep 2014 | 13:12:06 UTC - in response to Message 38187.
	eXaPower wrote: "Looking good. BOINC reports 0.90 CPUs for the 6.5 app, but Task Manager shows only 1-2% CPU usage. For the beta tasks, BOINC reported the same, and Task Manager also showed 1-2%." I saw that too on Windows 8.1, so I added the environment variable swan_sync with a value of 0 and rebooted. Now I see ~100% core usage. I'm not sure it will make a difference, but it makes me feel better. See this thread for a discussion of swan_sync: http://www.gpugrid.net/forum_thread.php?id=2123

TJ Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0	Message 38192 - Posted: 30 Sep 2014 | 13:32:25 UTC - in response to Message 38190. Last modified: 30 Sep 2014 | 13:32:38 UTC
	biodoc wrote: "First NOELIA_SH2 WU on a GTX980 completed and validated with the beta app: http://www.gpugrid.net/result.php?resultid=13145399" But biodoc, do you have run times of these WUs on non-Maxwell cards to compare? That is what I am most interested in. ____________ Greetings from TJ

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 38193 - Posted: 30 Sep 2014 | 13:35:41 UTC - in response to Message 38190. Last modified: 30 Sep 2014 | 13:36:59 UTC
	biodoc, thanks for the tip. Your Win8.1 system has a WDDM tax of ~7% compared to XP. Your Win8.1 system is blazing fast. Have you tested your GTX780Ti with the new short CUDA 6.5 app? I'm very curious to see how well the GTX 780Ti performs with the new refined code compared to GM204. Also, GM204 shows how Maxwell is able to carry more threads (atoms) per SMM than Kepler can per SMX. It is very impressive to see a GTX970 (1664 cores/104 TMUs/64 ROPs) completing tasks in similar or faster times than a GK110 GTX780 (2304 cores/192 TMUs/48 ROPs) on the beta app performance chart. Considering that the GTX970 has fewer TMUs, and ACEMD's TMU usage is high, this shows a 145 W TDP board performing at or above the level of the 225 W TDP GTX780. For anyone with higher taxes or energy rates, the GTX970 looks to be the card of choice (unless the future GTX960 loses no more than a couple of SMMs compared to the GTX 970). Excellent code refinement by Matt. The swan_sync variable is for higher-end cards like yours. For my two lowly GK107s, swan_sync makes no difference.
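The TDP argument above can be put into numbers: if two cards really finish a task in the same time, energy per task scales with power draw. This sketch uses the post's TDP figures as stand-ins for actual power draw, which is an assumption (real draw under ACEMD can sit well below TDP), and the 10,000 s runtime is purely hypothetical.

```python
# Energy-per-task comparison using the TDP figures quoted in the post.
# Assumes each card draws its full TDP for the whole task; real draw differs.
GTX970_TDP_W = 145.0
GTX780_TDP_W = 225.0


def energy_per_task_kwh(power_w: float, runtime_s: float) -> float:
    """Energy used by one task at constant power, in kWh (1 kWh = 3.6e6 J)."""
    return power_w * runtime_s / 3_600_000.0


def relative_efficiency(power_a_w: float, power_b_w: float) -> float:
    """Tasks per kWh of card A relative to card B, assuming equal runtimes."""
    return power_b_w / power_a_w


if __name__ == "__main__":
    runtime_s = 10_000.0  # hypothetical runtime, identical for both cards
    for name, tdp in (("GTX970", GTX970_TDP_W), ("GTX780", GTX780_TDP_W)):
        print(f"{name}: {energy_per_task_kwh(tdp, runtime_s):.3f} kWh per task")
    ratio = relative_efficiency(GTX970_TDP_W, GTX780_TDP_W)
    print(f"GTX970 tasks per kWh relative to GTX780: {ratio:.2f}x")
```

Under these assumptions the GTX970 completes roughly 1.55 times as many tasks per kWh; multiplying the energy per task by a local electricity price gives the running cost.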

biodoc Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 943,692	Message 38195 - Posted: 30 Sep 2014 | 14:14:19 UTC - in response to Message 38192.
	TJ wrote: "But biodoc, do you have run times of these WUs on non-Maxwell cards to compare? That is what I am most interested in." No, my 780Ti is on a Linux box and exclusively runs the long WUs. The beta app is Windows-only, so I think we need data from a 780Ti running the new app for a fair comparison.

biodoc Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 943,692	Message 38196 - Posted: 30 Sep 2014 | 14:27:18 UTC
	The NOELIA_SH2 WU I just finished is ranked #6 in the new Performance section at 2.79 hours: http://www.gpugrid.net/performance.php#! Windows 8.1, NVIDIA driver 344.16. For the NOELIA_SH2 WUs, my GPU load is only 76%, memory controller load is 25%, and power is at 76% of TDP. At 65% fan speed, the GPU temperature is 62°C. Also Swan_Sync=0. I'm anxious to test in Linux, but I can wait.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2346 Credit: 16,293,515,968 RAC: 5,831,839	Message 38200 - Posted: 30 Sep 2014 | 17:22:28 UTC - in response to Message 38193.
	eXaPower wrote: "Excellent code refinement by Matt." Was there any code refinement between 8.44 and 8.46?

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 38202 - Posted: 30 Sep 2014 | 17:37:58 UTC - in response to Message 38200.
	In the "Maxwell now" thread he mentioned it. I had written: "I found one of many papers written by you and others, 'ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale', from the golden days of GT200. A Maxwell update, if applicable, would be very informative." Matt replied: "I'm doing a bit of work to improve the performance of the code for Maxwell hardware - expect an update before the end of the year. Matt" So I'm assuming there was.

gianni Joined: 8 Feb 13 Posts: 5 Credit: 6,750 RAC: 0	Message 38204 - Posted: 30 Sep 2014 | 18:11:52 UTC - in response to Message 38202.
	Nope, that's just a rebuild, modulo a fix for a compiler regression. The good stuff is still fermenting. M

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 38205 - Posted: 30 Sep 2014 | 18:21:02 UTC - in response to Message 38204.
	gianni wrote: "The good stuff is still fermenting. M" Can't wait for the recipe to be added once the grapes have become wine.

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 38284 - Posted: 3 Oct 2014 | 23:10:21 UTC
	While searching for runtimes/processing rates for the GTX980/970, I noticed an abnormal variance in the "Average processing rate" of the 8.46 short app. The number 653.09405051673 was taken from host113695 (a GTX980), while my GT650m's average processing rate is 71.024125852776 for the same CUDA 6.5/8.46 app. What's the formula for the average processing rate? How does a much more powerful GPU have the smaller number? If I'm misunderstanding the numbers, could someone explain why the GTX980 shows 11 digits after the decimal point, while the GT650m shows 12? A GTX 980 finishes a NOELIA_SH2 task in 7,500-8,000 s; a GT650m completes the same task in 59,000-65,000 s. In GFLOPS terms, a GTX 980 is worth 7.75 GT650m cards. FYI: for the 8.46 beta app, host113695's GTX980 has a processing rate of 1627.7621166778, while my GT650m's is 193.43498955808. The same user's GTX780Ti has a processing rate of 310.15025279058 on the CUDA 6.0/8.41 long app; for the same app, a GT650m shows 41.592640653642, again with more digits after the decimal point.
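Putting the quoted figures side by side as ratios is a quick sanity check. This sketch uses only the numbers in the post above and makes no assumption about BOINC's actual formula for the processing rate. The GTX980's rate is the larger number for both the short and the beta app (roughly 9.2x and 8.4x the GT650m's), while the quoted runtimes imply a speedup of about 7.4x to 8.7x.

```python
# Figures quoted in the post above; nothing here is newly measured.
GTX980_RUNTIME_S = (7_500, 8_000)
GT650M_RUNTIME_S = (59_000, 65_000)
RATE_SHORT_846 = {"GTX980": 653.09405051673, "GT650m": 71.024125852776}
RATE_BETA_846 = {"GTX980": 1627.7621166778, "GT650m": 193.43498955808}


def speedup_range(fast_s: tuple[int, int], slow_s: tuple[int, int]) -> tuple[float, float]:
    """Smallest and largest slow/fast runtime ratio over both intervals."""
    return slow_s[0] / fast_s[1], slow_s[1] / fast_s[0]


def rate_ratio(rates: dict[str, float]) -> float:
    """How many times higher the GTX980's processing rate is than the GT650m's."""
    return rates["GTX980"] / rates["GT650m"]


if __name__ == "__main__":
    lo, hi = speedup_range(GTX980_RUNTIME_S, GT650M_RUNTIME_S)
    print(f"runtime ratio:         {lo:.1f} to {hi:.1f}")
    print(f"short 8.46 rate ratio: {rate_ratio(RATE_SHORT_846):.1f}")
    print(f"beta 8.46 rate ratio:  {rate_ratio(RATE_BETA_846):.1f}")
```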

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester) Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0	Message 38512 - Posted: 14 Oct 2014 | 18:29:21 UTC - in response to Message 38193.
	eXaPower wrote: "ACEMD TMU usage is high" I don't have insight into the actual code, but TMUs are Texture Mapping Units. They are fixed-function units that map textures onto geometry, and I highly doubt they can be exploited for GPU-Grid. The same applies to ROPs: these are Raster Output Units, i.e. they deal with assembling the finalized images ("pushing the pixels"). We're not pixelating anything at GPU-Grid or in other GP-GPU apps. Think of GP-GPU work as endless loops of matrix and vector operations, all of which are performed on the shaders. eXaPower wrote: "could someone explain how a GTX980 shows 11 digits after the decimal point, while a GT650m has 12?" That seems to be simply because the total number of digits is always 14. BTW: consider the variance in WU completion times. You can easily round those numbers to 3 significant digits; anything beyond that is drowned in "experimental noise" anyway: GTX980: 653.09405051673 -> 653; GT650m: 71.024125852776 -> 71.0. This also answers your other question: "How does a much more powerful GPU have the smaller number?" It doesn't; see the numbers above. BTW2: you also mention a factor of about 8 in performance between these cards, based on other measures. The ratio between the processing rates quoted above matches this approximately. MrS ____________ Scanning for our furry friends since Jan 2002
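ETA's explanation can be checked against the rates quoted in this thread: each one has 14 significant digits in total, so a value with more digits before the decimal point shows fewer after it. This sketch only inspects the quoted strings and assumes nothing about how the BOINC server actually stores or formats these values. The 3-digit rounding mirrors ETA's suggestion.

```python
# Quoted processing rates, kept as strings so the digits are exactly as shown.
QUOTED_RATES = [
    "653.09405051673",  # GTX980, short 8.46
    "71.024125852776",  # GT650m, short 8.46
    "1627.7621166778",  # GTX980, beta 8.46
    "193.43498955808",  # GT650m, beta 8.46
    "310.15025279058",  # GTX780Ti, long 8.41
    "41.592640653642",  # GT650m, long 8.41
]


def digit_split(s: str) -> tuple[int, int]:
    """Number of digits before and after the decimal point."""
    int_part, frac_part = s.split(".")
    return len(int_part), len(frac_part)


def round_sig(x: float, sig: int = 3) -> float:
    """Round x to `sig` significant digits using the 'g' format."""
    return float(f"{x:.{sig}g}")


if __name__ == "__main__":
    for s in QUOTED_RATES:
        before, after = digit_split(s)
        print(f"{s:>16}  {before}+{after}={before + after} digits  ->  {round_sig(float(s))}")
```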

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 38517 - Posted: 14 Oct 2014 | 19:00:47 UTC - in response to Message 38512. Last modified: 14 Oct 2014 | 19:10:38 UTC
	ETA, thank you for explaining what the processing rate numbers mean. The reason I mentioned Texture Mapping Units: http://multiscalelab.org/gianni/publications?action=AttachFile&do=get&... Texture Mapping Units are "capable of performing linear interpolation of values into multidimensional (up to 3D) arrays of floating point data." Quoted from Matt's "Accelerating Biomolecular Dynamics in the Microsecond Time Scale": "The texture units are used to assist the calculation of the electrostatic and van der Waals terms by providing linearly interpolated values for the radial components of those functions from lookup tables." Along with other processes.
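To illustrate the technique the quoted passage describes, here is a minimal software sketch of a linearly interpolated radial lookup table, which is the operation the paper says the texture units perform in hardware. The tabulated function (a bare 1/r Coulomb-like term), the table range and the table size are illustrative assumptions, not ACEMD's actual tables or kernel code.

```python
from typing import Callable


def build_table(f: Callable[[float], float], r_min: float, r_max: float,
                n: int) -> tuple[list[float], float, float]:
    """Sample f at n evenly spaced radii in [r_min, r_max]."""
    dr = (r_max - r_min) / (n - 1)
    return [f(r_min + i * dr) for i in range(n)], r_min, dr


def lerp_lookup(table: list[float], r_min: float, dr: float, r: float) -> float:
    """Linearly interpolate the tabulated function at r, clamping to the table range."""
    x = (r - r_min) / dr
    x = min(max(x, 0.0), len(table) - 1.0)
    i = min(int(x), len(table) - 2)  # left neighbour; keeps i + 1 in range
    t = x - i                        # fractional position between the two samples
    return (1.0 - t) * table[i] + t * table[i + 1]


def coulomb(r: float) -> float:
    """Illustrative radial term, not ACEMD's real electrostatics."""
    return 1.0 / r


if __name__ == "__main__":
    table, r0, dr = build_table(coulomb, 1.0, 10.0, 1001)
    for r in (1.0, 3.3, 7.77):
        approx = lerp_lookup(table, r0, dr, r)
        print(f"r={r}: table {approx:.8f}  exact {coulomb(r):.8f}")
```

The trade-off is the usual one for lookup tables: a finer table lowers the interpolation error at the cost of more memory traffic per lookup.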

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester) Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0	Message 38523 - Posted: 14 Oct 2014 | 22:19:03 UTC - in response to Message 38517.
	Thanks for pointing that out! The paper is from 2009; I suspect the code has been enhanced since then, but not radically changed. Matt, can you briefly (or as extensively as your time allows) comment on the use of non-shader blocks in GPUs? And regarding the current question: are you still using the TMUs for table lookups? (What a neat trick! :) And does the reduced number of TMUs in Maxwell affect performance? I suspect not, unless you're constantly hammering the TMUs with requests. MrS ____________ Scanning for our furry friends since Jan 2002
