suddenly too many errors

Message boards : Graphics cards (GPUs) : suddenly too many errors

Author	Message
Viktor Svantner Joined: 13 Feb 11 Posts: 25 Credit: 7,512,512,523 RAC: 76,361	Message 42037 - Posted: 27 Oct 2015 | 21:26:41 UTC Last modified: 27 Oct 2015 | 21:28:48 UTC
	I have recently had too many errors on my 24/7 rig with three GTX 970s: https://www.gpugrid.net/results.php?userid=73475 Links: https://www.gpugrid.net/result.php?resultid=14642662 https://www.gpugrid.net/result.php?resultid=14642713 My 2nd 24/7 rig, with two GTX 980Ti cards, sometimes fails like this: https://www.gpugrid.net/result.php?resultid=14641885 ANY HELP NEEDED, as my 1st rig is switched off at the moment because of these problems. PS: No changes have been made to the hardware. Previously there was no problem, then suddenly this.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147	Message 42038 - Posted: 27 Oct 2015 | 23:43:24 UTC - in response to Message 42037. Last modified: 27 Oct 2015 | 23:44:05 UTC
	Quote: I have had recently too many errors with my 24/7 rig containing triple GTX 970: https://www.gpugrid.net/results.php?userid=73475 Links: https://www.gpugrid.net/result.php?resultid=14642662
	Excerpt from this task's stderr.txt:
	# GPU 1 : 78C
	# GPU 0 : 89C
	# GPU 1 : 80C
	# GPU 0 : 91C
	# GPU 0 : 93C
	# GPU 0 : 95C
	# GPU 1 : 82C
	These temperatures are too high. You'll fry your cards. But the error message at the end of the output file says:
	SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)
	It usually happens when you stop a task too soon after it has started.
	Quote: https://www.gpugrid.net/result.php?resultid=14642713
	Another excerpt from this task's stderr.txt:
	# GPU 2 : 63C
	# GPU 0 : 93C
	# GPU 1 : 83C
	# GPU 0 : 94C
	# GPU 1 : 84C
	# GPU 0 : 95C
	# GPU 1 : 85C
	95°C is way too high! I suspect that these cards have non-standard cooling with axial fans and dump their heat inside the case, heating each other. You should use only one such card in this computer, or at least reposition one of the cards so there is at least one slot of space between the two cards for proper airflow, and install some fans to remove the hot air from the case.
	Quote: On my 2nd rig 24/7 with double GTX 980Ti goes sometimes like: https://www.gpugrid.net/result.php?resultid=14641885
	Perhaps you should lower its GPU clock a little to increase its stability.
	Quote: ANY HELP NEEDED as my 1st rig is switched off at the moment due to these troubles. PS: No changes made to the hardware so far. Previously no problem, suddenly this.
	You will damage your cards permanently (if you haven't already) if you run them above 80°C. Every 10°C rise in temperature halves the lifetime of the card, but above 80°C every 5°C rise does the same. Above 90°C there's a high chance of an immediate fatal failure of the GPU chip.
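The temperature check described in this post can be automated. The following is a minimal sketch, assuming the stderr.txt temperature lines always look like the excerpts above (`# GPU 0 : 93C`); the function names are made up for illustration, and the 80°C limit is the figure from this post, not a vendor specification.

```python
import re
from collections import defaultdict

# Matches temperature entries as they appear in the excerpts above, e.g. "# GPU 0 : 93C".
TEMP_ENTRY = re.compile(r"#\s*GPU\s+(\d+)\s*:\s*(\d+)C")

def peak_temps(stderr_text):
    """Return {gpu_index: highest temperature found in stderr_text}."""
    peaks = defaultdict(int)
    for gpu, temp in TEMP_ENTRY.findall(stderr_text):
        peaks[int(gpu)] = max(peaks[int(gpu)], int(temp))
    return dict(peaks)

def too_hot(peaks, limit_c=80):
    """Return the sorted GPU indices whose peak temperature exceeds limit_c."""
    return sorted(gpu for gpu, temp in peaks.items() if temp > limit_c)

# Second excerpt quoted in this post.
sample = ("# GPU 2 : 63C # GPU 0 : 93C # GPU 1 : 83C # GPU 0 : 94C "
          "# GPU 1 : 84C # GPU 0 : 95C # GPU 1 : 85C")
peaks = peak_temps(sample)
print(peaks)
print(too_hot(peaks))
```

For the second excerpt this flags GPU 0 and GPU 1, while GPU 2 at 63C stays under the limit.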

Viktor Svantner Joined: 13 Feb 11 Posts: 25 Credit: 7,512,512,523 RAC: 76,361	Message 42040 - Posted: 28 Oct 2015 | 6:18:18 UTC - in response to Message 42038. Last modified: 28 Oct 2015 | 6:46:26 UTC
	So I guess I have to bring the temperatures down and it should be fine. I just don't understand why this happened after, say, 8 months of normal operation under the same conditions. BTW, this one had fine temperatures but also ended in an error: https://www.gpugrid.net/result.php?resultid=14644966 or https://www.gpugrid.net/result.php?resultid=14641885 Any clues here? Or does this just happen sometimes?

Viktor Svantner Joined: 13 Feb 11 Posts: 25 Credit: 7,512,512,523 RAC: 76,361	Message 42042 - Posted: 28 Oct 2015 | 11:01:39 UTC - in response to Message 42038.
	Or this one, just recently: https://www.gpugrid.net/result.php?resultid=14645743

Retvari Zoltan Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147	Message 42043 - Posted: 28 Oct 2015 | 11:18:10 UTC - in response to Message 42042. Last modified: 28 Oct 2015 | 11:19:07 UTC
	Quote: or this one, just recently: https://www.gpugrid.net/result.php?resultid=14645743
	This card is overclocked too far. Try reducing the memory clock to 3505MHz; if that doesn't help, lower the GPU clock in 20MHz steps until it becomes stable.
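The step-down procedure for the GPU clock can be written out as a small loop. This is only a sketch of the search logic: it does not touch the hardware, `is_stable` is a hypothetical callback standing in for "crunched for a while at this clock without errors", and the clock values in the example are invented.

```python
def step_down_until_stable(start_mhz, is_stable, step_mhz=20, floor_mhz=0):
    """Lower the clock in fixed steps and return the first value is_stable accepts.

    is_stable is supplied by the caller (hypothetical here); in practice it
    would mean running workunits at that clock and watching for errors.
    Returns None if no clock at or above floor_mhz turns out to be stable.
    """
    clock = start_mhz
    while clock >= floor_mhz:
        if is_stable(clock):
            return clock
        clock -= step_mhz
    return None

# Illustrative numbers only: pretend the card errors out above 1300 MHz.
print(step_down_until_stable(1350, lambda mhz: mhz <= 1300))
```

In practice each "test" takes hours of crunching, so the 20MHz step is a trade-off between the number of trials and how much clock speed is given up.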

Betting Slip Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0	Message 42044 - Posted: 28 Oct 2015 | 11:20:26 UTC - in response to Message 42042. Last modified: 28 Oct 2015 | 11:26:40 UTC
	Quote: or this one, just recently: https://www.gpugrid.net/result.php?resultid=14645743
	Bear with me on this one, Viktor. Install your graphics driver again, over the top of the existing one. I had a situation like this a few years ago with dual cards, and a new driver always needed to be installed twice. It could also be what Retvari said about the OC. Hey, it's worth a shot. Don't forget to suspend any running GPUGrid WUs first.

Viktor Svantner Joined: 13 Feb 11 Posts: 25 Credit: 7,512,512,523 RAC: 76,361	Message 42046 - Posted: 28 Oct 2015 | 14:19:39 UTC - in response to Message 42044. Last modified: 28 Oct 2015 | 14:20:01 UTC
	O.K., I'll try. It is so annoying. As I said before, there was no problem with the same OC for months, and now this. BTW, it's factory overclocked.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147	Message 42047 - Posted: 28 Oct 2015 | 16:46:07 UTC - in response to Message 42046.
	Quote: BTW, it's factory overclocked.
	Everyone who uses factory-overclocked cards (including me) should remember: the factory made these cards for playing games 4-5 hours per day, not for crunching 24 hours a day, 7 days a week. Nothing severe happens when there's a glitch in a frame while you are playing, but when such a glitch occurs while crunching a workunit, it results in an error, and you lose the current workunit along with the time and electricity spent on it. If this happens too often, the time lost to failed workunits can easily exceed the time gained by faster processing, making the overclock counter-productive. So when it comes to overclocking, less is more.
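The break-even point in this argument can be estimated with a rough model. The sketch below assumes that a failed workunit wastes its full run time and that the card never fails at stock clocks; both are simplifying assumptions, and the 5% speedup and 10% failure rate are made-up example figures, not measurements from this thread.

```python
def relative_throughput(speedup, failure_rate):
    """Valid results per unit time, relative to an error-free stock-clocked card.

    speedup: fractional gain from the overclock (0.05 means 5% faster).
    failure_rate: fraction of workunits that end in an error when overclocked.
    Assumes a failed workunit costs its full run time (simplification).
    """
    return (1.0 + speedup) * (1.0 - failure_rate)

def break_even_failure_rate(speedup):
    """Failure rate at which the overclock stops paying off under this model."""
    return speedup / (1.0 + speedup)

# Example figures only: a 5% overclock that loses 1 in 10 workunits.
print(relative_throughput(0.05, 0.10))
print(break_even_failure_rate(0.05))
```

Under these assumptions a 5% overclock already becomes counter-productive once slightly less than 5% of workunits fail, which is why even occasional errors can wipe out the gain.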

fzb Joined: 27 Mar 09 Posts: 1 Credit: 103,615,743 RAC: 0	Message 42050 - Posted: 28 Oct 2015 | 18:32:41 UTC
	This might sound trivial, but if it worked for months and the temps are higher now, you might try cleaning off the dust that has gathered on the cards/coolers and checking the temps after that.

Dayle Diamond Joined: 5 Dec 12 Posts: 84 Credit: 1,663,883,415 RAC: 0	Message 42055 - Posted: 29 Oct 2015 | 5:52:24 UTC - in response to Message 42050.
	I second FZB on this. I too recently had a work unit become unstable and fail on me after months of quiet. I just cleaned out my two 970s tonight, after they had been left alone in my case for at least six months. I used a 'Datavac electric duster' instead of an air can. In high-dust environments like mine, the cans just don't last very long. Disgusting amounts of dust flew everywhere. I'm getting much cooler temps!

Viktor Svantner Joined: 13 Feb 11 Posts: 25 Credit: 7,512,512,523 RAC: 76,361	Message 42056 - Posted: 29 Oct 2015 | 6:03:01 UTC - in response to Message 42047.
	I buy OC cards not for the OC but for the build quality. The temperatures were high because of the 3-way SLI setup (no room between the cards); I have already made changes. Thanks heaps to all of you for the help.
