Message boards : News : WU: NOELIA_INS1P
Author | Message |
---|---|
Hi all, | |
ID: 32663 | Rating: 0 | rate: / Reply Quote | |
These slow a GTX 460/768mb GPU to a crawl. Santis and Nathans run fine. | |
ID: 32798 | Rating: 0 | rate: / Reply Quote | |
I run these on both the machine with the Titans and the one with the 2GB GTX 650Ti. | |
ID: 32809 | Rating: 0 | rate: / Reply Quote | |
running at 95% gpu load on gtx 660 ti in windows 7! really nice! | |
ID: 32866 | Rating: 0 | rate: / Reply Quote | |
running at 95% gpu load on gtx 660 ti in windows 7! really nice! Yes, I finally got one too. Running 92% steady on my 770 at 66°C. ____________ Greetings from TJ | |
ID: 32869 | Rating: 0 | rate: / Reply Quote | |
It is running at 96% on my GTX 650 Ti (63 C with a side fan). At 60 percent complete, it looks like it will take 18 hours 15 minutes to complete. And no problems with the memory (774 MB used). | |
ID: 32872 | Rating: 0 | rate: / Reply Quote | |
Just got 80-NOELIA_INS1P-5-15-RND4120_0. It really is putting my 650Ti through its paces!
vagelis@vgserver:~$ gpuinfo
Fan Speed : 54 %
Gpu : 67 C
Memory Usage
Total : 1023 MB
Used : 938 MB
Free : 85 MB
The WU is only at the start (1.76%) and estimates to take 22:44. I expect this to drop significantly. ____________ | |
ID: 32928 | Rating: 0 | rate: / Reply Quote | |
memory (774 MB used) Undoubtedly why they won't run on the GTX 460/768 cards. They work fine on my 1GB GPUs, but I have to abort them on the 460 in favor of Santi and Nathan WUs. | |
ID: 32929 | Rating: 0 | rate: / Reply Quote | |
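For anyone wondering whether a particular card has the headroom for these ~1 GB tasks before they are picked up, the free memory can be checked directly from the driver. Below is a minimal sketch that queries nvidia-smi; the 1024 MB task footprint and 200 MB headroom figures are illustrative guesses, not project requirements.

```python
# Minimal sketch: check free GPU memory before picking up a ~1 GB task.
# Assumes nvidia-smi is installed and on the PATH; the footprint and
# headroom numbers below are illustrative, not GPUGRID requirements.
import subprocess

def free_gpu_memory_mb():
    """Return a list of free-memory values (MB), one per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line.strip()) for line in out.strip().splitlines()]

if __name__ == "__main__":
    TASK_NEED_MB = 1024   # hypothetical footprint of a NOELIA_INS1P task
    HEADROOM_MB = 200     # leave some room for the desktop/driver
    for idx, free in enumerate(free_gpu_memory_mb()):
        ok = free >= TASK_NEED_MB + HEADROOM_MB
        print(f"GPU {idx}: {free} MB free -> {'OK' if ok else 'too tight'}")
```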
Finally I had another Noelia WU. It ran steady on my 660 with a GPU load of 97-98%; Nathan's only do 88-89%. And it ran in one go, meaning no termination and restart because of the simulation becoming unstable. | |
ID: 32952 | Rating: 0 | rate: / Reply Quote | |
memory (774 MB used) I noticed a lot of failures on the 400 series cards and almost posted about it, but wasn't sure why. I think you have explained it. | |
ID: 32953 | Rating: 0 | rate: / Reply Quote | |
Just got 80-NOELIA_INS1P-5-15-RND4120_0. It really is putting my 650Ti through its paces! Finished successfully in 81,572.91 sec (22.7h) on the 650Ti (running on Linux). It didn't take 18-19 hours like previous NOELIAs did.. must be a more complex WU. 180k is sweet! :) ____________ | |
ID: 32956 | Rating: 0 | rate: / Reply Quote | |
These slow a GTX 460/768mb GPU to a crawl. Santis and Nathans run fine. Sounds like the minimum GPU memory requirement is set too low for these, otherwise BOINC would refuse to run them on such cards. MrS ____________ Scanning for our furry friends since Jan 2002 | |
ID: 32966 | Rating: 0 | rate: / Reply Quote | |
These slow a GTX 460/768mb GPU to a crawl. Santis and Nathans run fine. MrS, can the minimum memory requirement be set for specific WUs or just for the app in general? | |
ID: 32969 | Rating: 0 | rate: / Reply Quote | |
I don't know, I've never set a BOINC server up or created WUs myself. But if I had programmed BOINC this would be a setting tagged to each WU, because that's the only way it makes sense. The entire credit system was based on the idea that different WUs can contain different contents. | |
ID: 32971 | Rating: 0 | rate: / Reply Quote | |
These slow a GTX 460/768mb GPU to a crawl. Santis and Nathans run fine. I believe it might need to be set at the plan_class level, which is between app and WU. So we might need to enable something like cuda55_himem, and create _INS1P WUs for that class only. | |
ID: 32972 | Rating: 0 | rate: / Reply Quote | |
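If the project did go the plan-class route, the server-side change would (as far as I understand BOINC) amount to an entry in plan_class_spec.xml with a minimum GPU RAM requirement, with the _INS1P batches generated against that class. A rough, unverified sketch follows; the exact element names are from memory and should be checked against the BOINC server documentation.

```xml
<!-- Hypothetical plan_class_spec.xml entry: a CUDA plan class that only
     matches hosts whose GPU has at least ~1.5 GB of memory.
     Element names are assumptions and should be verified. -->
<plan_class>
    <name>cuda55_himem</name>
    <gpu_type>nvidia</gpu_type>
    <cuda/>
    <min_gpu_ram_mb>1536</min_gpu_ram_mb>
    <gpu_ram_used_mb>1200</gpu_ram_used_mb>
</plan_class>
```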
I just had a work unit fail (on an otherwise completely-stable system):
Name pnitrox118-NOELIA_INS1P-7-12-RND7320_4
Workunit 4777876
Created 16 Sep 2013 | 19:44:18 UTC
Sent 17 Sep 2013 | 4:01:45 UTC
Received 17 Sep 2013 | 10:08:54 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 153764
Report deadline 22 Sep 2013 | 4:01:45 UTC
Run time 2.31
CPU time 2.13
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.14 (cuda42)
Stderr output
<core_client_version>7.2.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203] VERSION [42]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r325_00 : 32680
# Simulation unstable. Flag 9 value 992
# Simulation unstable. Flag 10 value 909
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
</stderr_txt>
]]> | |
ID: 33050 | Rating: 0 | rate: / Reply Quote | |
It's failed on every other host too, so it's a bad workunit. | |
ID: 33055 | Rating: 0 | rate: / Reply Quote | |
Just spotted a NOELIA_INS1P at 123h into a run, and only at 43% complete! | |
ID: 33057 | Rating: 0 | rate: / Reply Quote | |
Just spotted a NOELIA_INS1P at 123h into a run, and only at 43% complete! That is what still concerns me. I can take errors, but stalling a machine may be even worse than a crash. | |
ID: 33058 | Rating: 0 | rate: / Reply Quote | |
Just spotted a NOELIA_INS1P at 123h into a run, and only at 43% complete! Oops, dskagcommunity has it now: http://www.gpugrid.net/workunit.php?wuid=4771472 | |
ID: 33060 | Rating: 0 | rate: / Reply Quote | |
I wonder if it will behave itself for dskagcommunity? He's running it with v8.14 (cuda42). | |
ID: 33064 | Rating: 0 | rate: / Reply Quote | |
Oh. | |
ID: 33067 | Rating: 0 | rate: / Reply Quote | |
Oh. While I was reading your words, this video just snapped into my mind. Sorry for being off topic. | |
ID: 33068 | Rating: 0 | rate: / Reply Quote | |
Yeah, I can see/hear why that popped into your head! | |
ID: 33070 | Rating: 0 | rate: / Reply Quote | |
The WU completed on both systems and both systems got partial credit, | |
ID: 33090 | Rating: 0 | rate: / Reply Quote | |
Just wanted to say Noelia's recent WUs are beating Nathan's by a long shot on the 780s. Avg GPU usage and mem usage is 80%/20% for Nathan vs 90%/30% for Noelia. Her tasks also get a lot fewer access violations. | |
ID: 33136 | Rating: 0 | rate: / Reply Quote | |
Just wanted to say Noelia's recent WUs are beating Nathan's by a long shot on the 780s. Avg GPU usage and mem usage is 80%/20% for Nathan vs 90%/30% for Noelia. Her tasks also get a lot fewer access violations. Yes, I see the same on my 770 as well. I even got a Noelia beta on my 660 and it performed better than the other betas, Santi's I think they were. ____________ Greetings from TJ | |
ID: 33145 | Rating: 0 | rate: / Reply Quote | |
Just wanted to say noelia recent wu are beating nathan by a long shot on the 780s. Avg gpu usage and mem usage is 80%/20% nathan v 90%/30% noelia. The other side of the coin is that Noelia WUs cause CPU processes to slow somewhat and can bring WUs on the AMD GPUs to their knees (my systems all have 1 NV and 1 AMD each). These problems are not seen with either Nathan or Santi WUs. | |
ID: 33169 | Rating: 0 | rate: / Reply Quote | |
Personally, I have no issues with GPUs stealing CPU resources if needed. I'd rather feed the roaring lion than the grasshopper. | |
ID: 33170 | Rating: 0 | rate: / Reply Quote | |
Arrgh... potx21-NOELIA_INS1P-1-14-RND1061_1. Another one with this in the stderr output: | |
ID: 33594 | Rating: 0 | rate: / Reply Quote | |
Let's see how the next guy does on it. Just fine, I see...Never mind. Move along. Nothing to see here. ____________ | |
ID: 33600 | Rating: 0 | rate: / Reply Quote | |
Arrgh... potx21-NOELIA_INS1P-1-14-RND1061_1. Another one with this in the stderr output: A large part could be due to the fact that these WU's consume a GIG of vRam and OC | |
ID: 33612 | Rating: 0 | rate: / Reply Quote | |
My failure rate on these units is getting quite bad: three of them in the last couple of days, while my wingmen are completing them fine. No problem with any other type of WU. Any advice? | |
ID: 33628 | Rating: 0 | rate: / Reply Quote | |
4 errors from 67 WU's isn't very bad, but they are all NOELIA WU's, so there is a trend. You are completing some though. | |
ID: 33634 | Rating: 0 | rate: / Reply Quote | |
I have had a long run of success (61 straight valid GPUGrid tasks over the past 2 weeks!), including 4 successful NOELIA_INS1P tasks, on my multi-GPU Windows 8.1 x64 machine. | |
ID: 33637 | Rating: 0 | rate: / Reply Quote | |
4 errors from 67 WU's isn't very bad, but they are all NOELIA WU's, so there is a trend. You are completing some though. 4 out of 67 is OK, I agree, but it's around 50% for these NOELIA_INS1P, so the trend is there as you say. No other type has failed in the last few months, including other NOELIA types. The two following units of the same type after the last failure completed fine.... maybe the Moon's influence :) | |
ID: 33642 | Rating: 0 | rate: / Reply Quote | |
Betting Slip wrote: A large part could be due to the fact that these WU's consume a GIG of vRam and OC Yep, good call. I've got another one of these running. Afterburner shows memory usage at slightly more than 1.1GB...I suppose that's stressing my GTX 570 (1280MB), isn't it? ____________ | |
ID: 33649 | Rating: 0 | rate: / Reply Quote | |
I had one fail for becoming unstable on a GTX560 TI with same amount of memory. They should be OK on that amount but they're still failing. It's this sort of problem that scares away contributors and annoys the hell out of me. | |
ID: 33651 | Rating: 0 | rate: / Reply Quote | |
Apparently they have quite a small error rate (<10%), so nothing systematic to worry about. | |
ID: 33652 | Rating: 0 | rate: / Reply Quote | |
I've only had 12 failures this month (that are still in the database), but 3 of the last 4 were NOELIA_INS1P tasks. If I include 2 recent NOELIA_FXArep failures, that's 5 out of the last 6 failures. Of course I've been running more of Noelia's work recently, as there have been more tasks available. | |
ID: 33670 | Rating: 0 | rate: / Reply Quote | |
On my GTX 660's (Win7 64-bit with 331.58 drivers): | |
ID: 33677 | Rating: 0 | rate: / Reply Quote | |
If you want to avoid the reduced power efficiency which comes along with the increased voltage you could also scale GPU clock back by 13 or 26 MHz - should have the same stabilizing effect (but be a little slower and a little more power efficient). | |
ID: 33715 | Rating: 0 | rate: / Reply Quote | |
If you want to avoid the reduced power efficiency which comes along with the increased voltage you could also scale GPU clock back by 13 or 26 MHz - should have the same stabilizing effect (but be a little slower and a little more power efficient). Actually, I do set back both cards by 10 MHz, but for a different reason. I found that the problem card still had the slowdown on a subsequent Noelia work unit. Then I remembered another old trick that sometimes works to keep the clocks going - let MSI Afterburner control them. It doesn't seem to matter whether you increase or decrease the clock rate from the default, or by what amount. My guess is that it takes control away from the Nvidia software, or whatever they use. At least it has been working for six days now, which is encouraging, if not proof.

But such a small change in clock rate (it is very close to the Nvidia default of 980 MHz anyway) does not make any discernible change in temperature or power consumption as measured by GPU-Z. I would have to make a much larger change than that, which I will do if necessary. I think the chip on that particular card was just weak; when they test them, I am sure they don't run them through anything as rigorous as what we do here.

I also had to bump up the voltage a little more - I started at 25 mV, but that wasn't quite enough, so now it is 37 mV. It has been error-free for a couple of days and three Noelias, but I need more Noelias to test it. | |
ID: 33721 | Rating: 0 | rate: / Reply Quote | |
The clock granularity of Keplers is 13 MHz, so you might want to keep to multiples of this. If you don't, it gets rounded - no problem, unless you change the clock a bit but it actually gets rounded to the same speed and nothing changes. | |
ID: 33724 | Rating: 0 | rate: / Reply Quote | |
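To make the granularity point concrete, here is a tiny sketch that snaps a requested offset to the nearest 13 MHz step, so you can see when two "different" offsets end up at the same effective clock. It assumes nearest-step rounding, which may not match the driver's exact behaviour.

```python
# Kepler clocks change in ~13 MHz steps, so an offset typed into
# Afterburner/Inspector is effectively rounded to a multiple of 13.
STEP_MHZ = 13

def effective_offset(requested_mhz: int) -> int:
    """Round a requested clock offset to the nearest 13 MHz step (assumption)."""
    return round(requested_mhz / STEP_MHZ) * STEP_MHZ

for req in (-5, -10, -13, -20, -26, 10):
    print(f"requested {req:+d} MHz -> effective {effective_offset(req):+d} MHz")
# -5 MHz rounds back to 0 (no change at all); -10 and -13 both land on -13;
# -20 lands on -26; +10 lands on +13.
```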
And you're right, +/-10 MHz has a negligible effect on power consumption. What I was referring to was the increased power consumption from the voltage increase. It's not dramatic either (larger than what the frequency change causes), but it's something you might not want.

There is a small effect from the voltage increase thus far, but not that much. The problem card (0) is in the top slot, and runs a couple of degrees hotter than the bottom card (1) even without the boost; typically 68 and 66 degrees C, probably due to air flow from the side fans (I have one of the few motherboards that puts the top card in the very top slot, which then raises the lower card up also). When I raise the voltage, it adds a degree (or less) to that on average. I probably should reverse their slot positions, but it is not that important yet. But the bottom card has done quite well - no errors in over a week; only the top card has had the errors.
http://www.gpugrid.net/results.php?hostid=159002&offset=0&show_names=1&state=0&appid=

I normally would buy Asus cards for better cooling (the non-overclocked versions), but needed the space-saving of these Zotac cards at the time. Now they are in a larger case, and I can replace them with anything if need be. The main point for me is that the big problems of a few months ago are past, for the moment. | |
ID: 33729 | Rating: 0 | rate: / Reply Quote | |
The projects tasks have been quite stable of late. The only recent exception being a small batch of WU's that failed quickly. So it's a good time to know if you have a stable system or not. | |
ID: 33743 | Rating: 0 | rate: / Reply Quote | |
If you have exhaust cooling GPU's then the side fans would be better blowing into the case. If not then these fans might be better blowing out (but it depends on the case and other fans).

I have two 120 mm side fans blowing in, a 120 mm rear fan blowing out, and a top 140 mm fan blowing out (the power supply is bottom-mounted). I think that establishes the airflow over the GPUs pretty well, but you never know until you try it another way. As you point out, it can do strange things. However, the top temperature for the top card is about 70 C, which is reasonable enough. The real limitation on temperature now is probably just the heatsink/fans on the GPUs themselves.

But my theory of why that card had errors has more to do with the power limit rather than temperature per se. It would bump up against the power limit (as shown by GPU-Z), and so the voltage (and/or current) to the GPU core could not increase any more when the Noelias needed it. By increasing the power limit to 105% and raising the base voltage, it can supply the current when it needs it. That particular chip just fell on the wrong end of the speed/power yield curve for number crunching use, though it would be fine for other purposes. And I can re-purpose it for other use if need be; it just needs to last until the Maxwells come out. | |
ID: 33744 | Rating: 0 | rate: / Reply Quote | |
The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature". | |
ID: 33746 | Rating: 0 | rate: / Reply Quote | |
The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature". All semiconductor manufacturers create yield curves for their production lots. They show how much voltage/current it takes to achieve a given speed. In general, the more power you supply to the chip, the faster it can be clocked. Of course, it also gets hotter, which can eventually destroy the chip. That is why a power limit is also specified (e.g., 95 watts for some Intel CPUs, etc.). But the chips vary, with some being able to run fast at lower power, and some requiring higher power to achieve the same speeds. You can get errors due to a variety of reasons, with temperature being just one. But I have seen errors even below 70 C, so some other limitation may get you first. | |
ID: 33748 | Rating: 0 | rate: / Reply Quote | |
'New' (old?) 94x4-NOELIA_1MG_RUN4 very long running (over 24hrs). | |
ID: 33786 | Rating: 0 | rate: / Reply Quote | |
'New' (old?) 94x4-NOELIA_1MG_RUN4 very long running (over 24hrs). You will struggle with this type of WU because one of your cards only has 1 GIG of memory and this Noelia unit uses 1.3 GIG but doesn't use much CPU. It will probably make any computer with a 1 GIG card unresponsive. I agree that the project is shooting itself in the foot by just dumping these WU's on machines that can't chew them: http://www.gpugrid.net/forum_thread.php?id=3523 | |
ID: 33787 | Rating: 0 | rate: / Reply Quote | |
'New' (old?) 94x4-NOELIA_1MG_RUN4 very long running (over 24hrs). Same here, with a GTX 680 or GTX 580. Very long crunch!!! | |
ID: 33795 | Rating: 0 | rate: / Reply Quote | |
Oh, it's not only me again.. 32 hours... 560 Ti 448-core, 1.28 GB -_- | |
ID: 33796 | Rating: 0 | rate: / Reply Quote | |
My 35x5-NOELIA_1MG_RUN4-2-4-RND8673_0 running on a GTX660Ti is at 34% and took 4h22min. So it should complete in about 13h (Win7x64). | |
ID: 33797 | Rating: 0 | rate: / Reply Quote | |
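For anyone wondering how these completion estimates are worked out, it's just a linear extrapolation from the fraction done; a quick sketch using the numbers quoted above:

```python
# Linear ETA estimate from fraction completed, as used informally above.
def estimate_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Total run-time estimate assuming a constant progress rate."""
    return elapsed_hours / fraction_done

elapsed = 4 + 22 / 60          # 4 h 22 min so far
done = 0.34                    # 34% complete
total = estimate_total_hours(elapsed, done)
print(f"estimated total: {total:.1f} h, remaining: {total - elapsed:.1f} h")
# -> roughly 12.8 h total, in line with the ~13 h quoted above
```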
These NOELIA_1MG are about | |
ID: 33799 | Rating: 0 | rate: / Reply Quote | |
My 35x5-NOELIA_1MG_RUN4-2-4-RND8673_0 running on a GTX660Ti is at 34% and took 4h22min. So it should complete in about 13h (Win7x64). On a GTX660Ti with 2GB memory there's NO PROBLEM, but this post is all about those cards with less than 2GB. | |
ID: 33801 | Rating: 0 | rate: / Reply Quote | |
The NOELIA_1MG WU I'm presently running is using 1.2GB GDDR5, so it wouldn't do well on a 1GB card. | |
ID: 33806 | Rating: 0 | rate: / Reply Quote | |
Would be interesting to know how much GDDR was being used on the different operating systems (XP, Linux, Vista, W7, W8). I'm not sure if we're talking about the same error here, but I had potx234-NOELIA_INS1P-12-14-RND6963_0 fail on my 660Ti with 3 gig mem on Linux, driver 331.17, more details here. That task also failed on this host (1 gig, Linux, 560Ti, driver unknown) but succeeded on this host (2 gig, Win7, 2 x 680). I've had 4 other Noelias run successfully on my 660Ti on Linux. ____________ BOINC <<--- credit whores, pedants, alien hunters | |
ID: 33812 | Rating: 0 | rate: / Reply Quote | |
The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature". Hi Jacob.. I suppose you wouldn't mind going a bit deeper?

To make a transistor switch (at a very high level) you apply a voltage which in turn pulls electrons through the channel (or "missing electrons", aka holes, in the other direction). This physical movement of charge carriers is needed to make it switch. And it takes some time, which ultimately limits the clock speeds a chip can reach. This is where temperature and voltage must be considered.

The voltage is a measure of how hard the electrons are pulled, or how quickly they're accelerated. That's why the maximum achievable (error-free) frequency scales approximately linearly with voltage. Temperature is a measure of the vibrations of the atomic lattice. Without any vibrations the electrons wouldn't "see" the lattice at all. The atoms (in a single crystal) form a perfectly periodic potential landscape, through which the electrons move as waves. If this periodic structure is disturbed (e.g. by random fluctuations caused by temperature > 0 K), the electrons scatter off these perturbations. This slows their movement down and heats the lattice up (like in a regular resistor).

In a real chip there are chains of transistors, which all have to switch within each clock cycle. In CPUs each stage of the pipeline is such a domain. If individual transistors switch too slowly, the computation result will not have reached the output stage of that domain yet when the next clock cycle is triggered. The old result (or something in between, depending on how the result is composed) will be used as the input for the next stage and a computation error will have occurred. That's why timing analysis is so important when designing a chip - the slowest path limits the overall clock speed the chip can achieve. Putting it all together, it should be clearer now how increased temperature and too low a voltage can lead to errors.

And to get a bit closer to reality: the real switching speed of each transistor is affected by many more factors, including fabrication tolerances, non-fatal defects (which also scatter electrons and hence slow them down as well), and defects developed from operating the chip under prolonged load (at high temperature and voltage). At this point I can hand over to Jim: the manufacturer profiles their chips and determines proper working points (clock speed & voltage at maximum allowed temperature). Depending on how carefully they do this (e.g. Intel usually allows for plenty of headroom, whereas factory-OC'ed GPUs have occasionally been set too aggressively) things work out just normally.. or the end user could see calculation errors. Mostly these will only appear under unusual work loads (which weren't tested for) or after significant chip degradation. Or just due to bad luck that wasn't caught by the initial IC error testing (which is rare, luckily).

Hope this helps :) MrS ____________ Scanning for our furry friends since Jan 2002 | |
ID: 33821 | Rating: 0 | rate: / Reply Quote | |
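The "slowest path limits the clock" point boils down to one line of arithmetic: the maximum clock is roughly the reciprocal of the worst-case path delay. A toy sketch follows; the delay values are made-up numbers purely for illustration.

```python
# Toy timing-analysis example: the slowest path between registers
# sets the highest clock the chip can run without errors.
path_delays_ns = [0.62, 0.71, 0.88, 0.95]   # made-up stage delays (ns)

worst_ns = max(path_delays_ns)
f_max_ghz = 1.0 / worst_ns                   # 1 / 0.95 ns ~ 1.05 GHz
print(f"critical path {worst_ns} ns -> max clock ~{f_max_ghz:.2f} GHz")

# If voltage droops or the die heats up, every delay grows a little;
# a 10% slowdown pushes the same chip below 1 GHz:
f_max_hot = 1.0 / (worst_ns * 1.10)
print(f"with 10% slower transistors -> ~{f_max_hot:.2f} GHz")
```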
It does help, thank you very much for the detailed explanations. I've read through it once, and I'll have to read through it again a few more times for it to sink in. I actually studied Computer Engineering for a few years before switching over to a Bachelor's degree in Computer Information Systems. | |
ID: 33823 | Rating: 0 | rate: / Reply Quote | |
Windows can't catch these calculation errors because, frankly, it doesn't see them. The GPU-Grid app sends some commands to the GPU, the GPU processes something and returns results to the app. Unless the GPU behaves in some different way (doesn't respond any more etc.), there's no way for the OS to tell whether the data returned is correct or garbage. Not even GPU-Grid can know this, unless they already know the result.. but they can check their results for sanity and, luckily for us, errors often have either no effect (on the long-term simulation result) or catastrophic effects. | |
ID: 33847 | Rating: 0 | rate: / Reply Quote | |
The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature". Something I've read that seems relevant to this explanation: Today's CPU chips are approaching the lower limit of the voltages at which the transistors work properly. Therefore, the power used by each CPU core can't get much lower. Instead, the companies are increasing the total speed by putting more CPU cores in each CPU package. Intel is also using a different method - hyperthreading. This method gives each CPU core two sets of registers, so that while the core is waiting on memory operations for the program using one set, it can use the other set to run a second program. This makes the CPU act as if it had twice as many cores as it actually does. If a programmer wants to use more than one of these CPU cores at the same time for the same program, that programmer must study parallel programming first, in order to handle the communications between the different CPU cores properly. I used to be an electronic engineer, specializing in logic simulation, often including timing analysis. | |
ID: 33975 | Rating: 0 | rate: / Reply Quote | |
Just crunched my first one on a GTX 770 at 76°C in 31,145.32 seconds. Nice, 153,150.00 points ;-) | |
ID: 36876 | Rating: 0 | rate: / Reply Quote | |
Just crunched my first one on a GTX 770 at 76°C in 31,145.32 seconds. Nice, 153,150.00 points ;-) You did indeed finish a Noelia WU, but not this type - the new one: NOELIA_BI. But more important, your 770 can do better; mine finishes these new Noelias in about 27000 seconds, but the temperature is only 66-67°C. And the colder a GPU runs, the faster (and more error-free) it runs. So perhaps you can experiment with some settings to get the temperature a few degrees lower. ____________ Greetings from TJ | |
ID: 36877 | Rating: 0 | rate: / Reply Quote | |
your 770 can do better; mine finishes these new Noelias in about 27000 seconds, but the temperature is only 66-67°C. Thanks for your advice. To lower the temperature, I'm using NVIDIA Inspector with the following settings: I unchecked Auto-Fan and set it to 60%, which speeds the fan up from 1300 to 1770 RPM, which is still ear-friendly. But that reduces the temperature by only 3 degrees. So I have to check the Prioritize Temperature box and put the slider to 68°, which slows down the GPU clock a little bit. Is there a better approach? ____________ Regards, Josef | |
ID: 36880 | Rating: 0 | rate: / Reply Quote | |
But more important, your 770 can do better; mine finishes these new Noelias in about 27000 seconds, but the temperature is only 66-67°C. And the colder a GPU runs, the faster (and more error-free) it runs. So perhaps you can experiment with some settings to get the temperature a few degrees lower. You might have accidentally been looking at your 780 Ti. Here are your 3 Noelia results from the 770 so far:
# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29643.715 s
# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29572.861 s
# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29676.489 s | |
ID: 36883 | Rating: 0 | rate: / Reply Quote | |
You are absolutely correct Beyond, my mistake. | |
ID: 36886 | Rating: 0 | rate: / Reply Quote | |
Now I have tested MSI Afterburner. There you can set a custom fan curve. However, I have a problem with that: in order to lower the temperature by 3-4 °C, the fan speed increases to 3300 RPM. This is unpleasant. With my GTX 680 I was able to reduce the temperature by 8 degrees when I dismounted the cooler and renewed the thermal paste ;-) Unfortunately, the same procedure on the GTX 770 achieved nothing, since its thermal paste was not dried out. Too new ;-) So I will reduce the GPU clock a little bit to remain below 70 degrees. Reducing from 1150 MHz to 1080-1100 MHz reduces the temperature by 5 degrees. | |
ID: 36888 | Rating: 0 | rate: / Reply Quote | |
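If you'd rather watch temperature, fan speed and core clock from a script while experimenting with fan curves and offsets, something like the following could work. It's a read-only sketch assuming the third-party pynvml package is installed (pip install pynvml); it doesn't change any settings.

```python
# Read-only GPU monitoring sketch using NVIDIA's NVML bindings (pynvml).
# Prints temperature, fan speed and graphics clock for each GPU so the
# effect of fan-curve or clock-offset changes can be logged over time.
import time
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    for _ in range(10):                      # ten samples, 30 s apart
        for i in range(count):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            fan = pynvml.nvmlDeviceGetFanSpeed(h)
            clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_GRAPHICS)
            print(f"GPU {i}: {temp} C, fan {fan}%, core {clock} MHz")
        time.sleep(30)
finally:
    pynvml.nvmlShutdown()
```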
Hi, | |
ID: 37368 | Rating: 0 | rate: / Reply Quote | |
Hi, I had the same error in 4 units so far. Here is an example of one:
potx1x492-NOELIA_INSP-3-13-RND4560_6
Workunit 9908013
Created 22 Jul 2014 | 19:40:28 UTC
Sent 22 Jul 2014 | 21:46:12 UTC
Received 22 Jul 2014 | 23:03:18 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -98 (0xffffffffffffff9e) Unknown error number
Computer ID 127986
Report deadline 27 Jul 2014 | 21:46:12 UTC
Run time 4.05
CPU time 2.06
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.41 (cuda60)
Stderr output
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 690] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r337_00 : 33788
ERROR: file mdioload.cpp line 81: Unable to read bincoordfile
19:03:38 (5576): called boinc_finish
</stderr_txt>
]]>
http://www.gpugrid.net/result.php?resultid=12864314 | |
ID: 37369 | Rating: 0 | rate: / Reply Quote | |
Same error here: ERROR: file mdioload.cpp line 81: Unable to read bincoordfile
potx1x284-NOELIA_INSP-2-13-RND0923 : WU 9908067
potx1x225-NOELIA_INSP-5-13-RND8250 : WU 9907982
Bye, Grubix. | |
ID: 37370 | Rating: 0 | rate: / Reply Quote | |
This error does not affect NOELIAs only, I had a SANTI_p53final fail on me the other day with the exact same error: ERROR: file mdioload.cpp line 81: Unable to read bincoordfile ____________ | |
ID: 37372 | Rating: 0 | rate: / Reply Quote | |