ritterm
My normally rock solid GTX570 host kicked out a couple of compute errors on short tasks:
glumetx5-NOELIA_SH2-13-50-RND5814 (Others seemed to have problems with this one)
prolysx8-NOELIA_SH2-13-50-RND2399_0
Both stderr outputs include:
SWAN : FATAL Unable to load module .mshake_kernel.cu. (702)
Both occurrences resulted in a driver crash and system reboot.
Possibly related question/issue... Are those GPU temps in the stderr output? Could that be part of the problem? I checked other successful tasks and have seen higher values than those in the recently crashed tasks.
____________
|
ritterm
My normally rock solid GTX570 host kicked out a couple of compute errors on short tasks...
...Probably because the GPU suffered a partial failure. Since this happened, the host would run fine under little or no load. As soon as the GPU got stressed running any BOINC tasks I threw at it, the machine would eventually crash and reboot.
The fan was getting a little noisy and there were signs of some kind of oily liquid on the enclosure. Fortunately, it was still under warranty and EVGA sent me a refurbished GTX 570 under RMA. A virtually painless process -- thanks, EVGA.
Maybe I should wait until the replacement GPU runs a few tasks successfully, but everything looks good so far.
____________
|
Well, I'm getting it on long tasks too.
This was from a brand new GTX 970 SSC from EVGA.
Name 20mgx1069-NOELIA_20MG2-14-50-RND0261_0
Workunit 10253503
Created 4 Nov 2014 | 3:29:32 UTC
Sent 4 Nov 2014 | 4:29:42 UTC
Received 4 Nov 2014 | 14:45:46 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -52 (0xffffffffffffffcc) Unknown error number
Computer ID 140554
Report deadline 9 Nov 2014 | 4:29:42 UTC
Run time 21,616.47
CPU time 4,273.91
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.47 (cuda65)
Stderr output
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)
</message>
<stderr_txt>
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r344_32 : 34448
# GPU 0 : 56C
# GPU 0 : 57C
# GPU 0 : 58C
# GPU 0 : 59C
# GPU 0 : 60C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r344_32 : 34448
# GPU 0 : 46C
# GPU 0 : 47C
# GPU 0 : 50C
# GPU 0 : 51C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r344_32 : 34448
# GPU 0 : 45C
# GPU 0 : 46C
# GPU 0 : 48C
# GPU 0 : 50C
# GPU 0 : 52C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# GPU 0 : 58C
# GPU 0 : 59C
# GPU 0 : 60C
# GPU 0 : 61C
# GPU 0 : 62C
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r344_32 : 34448
SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)
</stderr_txt>
]]> |
|
Now I've gotten five "Unable to load module" crashes in a day.
Some crashed a few seconds into their run; others crashed after many hours of computation.
The last ones caused a blue screen and restart.
I replaced my old GTX 460 with a 1200-watt PSU and two 970s to make a big impact on BOINC GPU projects, but the frequent crashes are erasing much of my gains.
|
Dayle, I was getting roughly similar problems: sometimes "the simulation has become unstable", sometimes "failed to load *.cu". Also system crashes and blue screens.
I had bought a new GTX 970 and it was running fine for a week. I then added my previous GTX 660 Ti back into the case, running 2 big GPUs for the 1st time. I've got both cards "eco-tuned", running power-limited. Yet it seems the increased case temperature pushed my OC'ed CPU over the stability boundary. Since I lowered the CPU clock speed a notch there have been no more failures. Well, that's only been 1.5 days by now, but it's still a record.
Moral: maybe the heat output from those GPUs is also stressing some other component of your system too much.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
That was a very interesting idea, so I went ahead and looked at the work unit logs for all five crashes.
The ambient temperature in the room fluctuates depending on the time of day, but here is EACH GPU's temp whenever one OR the other failed.
All numbers in C
1. 64 & 58
2. 71 & 77
3. 58 & 63
4. 58 & 46
5. 71 & 77
77 degrees is much hotter than I'm hoping they'd run at, and I wonder if you're right. If so, it's time for a new case. I've got both the right and left panels of my tower off, plus a HEPA filter in the room to keep dust from getting in. But maybe my airflow isn't directed enough? Still, that doesn't seem to be the whole problem, because they're crashing at much lower temperatures too. |
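For anyone who wants to check their own failed tasks the same way, here is a rough Python sketch for pulling those temperature readings out of a saved stderr output. It only assumes the "# GPU n : XXC" format shown in the logs quoted in this thread, so treat it as illustrative rather than anything official:

import re
import sys

# Illustrative only: collect the periodic "# GPU <n> : <temp>C" readings from
# a saved GPUGRID stderr.txt so min/max temps can be compared across tasks.
TEMP_LINE = re.compile(r"#\s*GPU\s+(\d+)\s*:\s*(\d+)C")

def gpu_temps(text):
    temps = {}
    for match in TEMP_LINE.finditer(text):
        gpu, temp = int(match.group(1)), int(match.group(2))
        temps.setdefault(gpu, []).append(temp)
    return temps

if __name__ == "__main__":
    with open(sys.argv[1]) as f:   # path to a saved stderr.txt
        samples = gpu_temps(f.read())
    for gpu, values in sorted(samples.items()):
        print("GPU %d: min %dC, max %dC, %d samples"
              % (gpu, min(values), max(values), len(values)))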
|
That was a very interesting idea, so I went ahead and looked at the work unit logs for all five crashes.
...
Those are your GPU temps, which are just within the range for your GPU. IIRC your GPU thermal-throttles at 80C. It may be worth either reducing your clocks or employing a more aggressive fan profile.
What Apes was referring to was CPU temps. If your GPUs are dumping enough hot air into your case, they could be making your CPU unstable. Check those temps and adjust accordingly.
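If it helps, here is a rough monitoring sketch in Python. It assumes the driver's nvidia-smi tool is on the PATH, and it only covers the GPUs; CPU temperatures still need a separate tool such as HWMonitor or your motherboard utility:

import subprocess
import time

# Rough sketch: poll GPU index, temperature, SM clock and power draw every
# 30 seconds so throttling or temperature creep shows up while a task runs.
QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,clocks.sm,power.draw",
         "--format=csv,noheader"]

while True:
    print(subprocess.check_output(QUERY, text=True).strip())
    time.sleep(30)

Watching the clocks alongside the temperature is useful, because a sudden drop in clock speed around 80C is the throttling mentioned above.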
|
Okay, done. I've just recovered from another crash.
CPU is about 53 Celsius.
I'm mystified. I've let it run a little longer, and we're down to 52 C.
I did manage to see the blue screen for a split second, but it went away too quickly to take a photo. Something like "IRQL not less or equal".
The internet says that's usually a driver issue.
As I have the latest GPU drivers, latest motherboard drivers, etc., I am running "WhoCrashed" on my system and waiting for another crash.
Hopefully this is related to the Unable to Load Module issue.
|
Hopefully this is related to the Unable to Load Module issue.
Very probably. Your temperatures seem fine. That blue screen message you got can be a software (driver) problem. Have you already tried a clean install of the current 344.75 driver?
I think that message can also just mean a general hardware failure. The PSU could be another candidate, but a new 1.2 kW unit sounds good. Is it, by any chance, a cheap Chinese model?
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
It is NOT a driver problem.
The previous NVIDIA driver crashed as well.
See also
http://www.gpugrid.net/forum_thread.php?id=3932
Regards |
|
Hi Mr. S.
I don't know if it's a cheap Chinese model; it's the Platimax 1200W, which is discontinued. I picked up the last one Fry's had in their system, then special-ordered the cables because some tosser returned theirs and kept all the cords.
I've attached my GTX 970s to a new motherboard that I was able to afford during a Black Friday sale. I'll post elsewhere about that, because I'm still not getting the speed I'm expecting. Anyway: same drivers, same GPUs, same PSU, but better fans and motherboard. No more errors.
If anyone is still getting this error, I hope that helps narrow down your issue.
|
Huh. Well, after a few years this error is back, and it swallowed 21 hours' worth of Maxwell crunching.
Two years ago it was happening on an older motherboard, with different drivers, running different tasks, and on a different OS.
https://www.gpugrid.net/result.php?resultid=15094951
Various PC temps still look fine.
Name 2d9wR8-SDOERR_opm996-0-1-RND7930_0
Workunit 11595346
Created 9 May 2016 | 9:59:05 UTC
Sent 10 May 2016 | 7:14:49 UTC
Received 11 May 2016 | 7:15:14 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -52 (0xffffffffffffffcc) Unknown error number
Computer ID 191317
Report deadline 15 May 2016 | 7:14:49 UTC
Run time 78,815.07
CPU time 28,671.50
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)
</message>
<stderr_txt>
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r364_69 : 36510
# GPU 0 : 66C
# GPU 1 : 72C
# GPU 0 : 67C
# GPU 1 : 73C
# GPU 0 : 68C
# GPU 1 : 74C
# GPU 0 : 69C
# GPU 0 : 70C
# GPU 0 : 71C
# GPU 0 : 72C
# GPU 1 : 75C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 1 : 76C
# GPU 0 : 76C
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r364_69 : 36510
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
</stderr_txt>
]]>
|
skgiven (volunteer moderator)
# GPU 1 : 76C
# GPU 0 : 76C
76C is too hot!
Use NVIDIA Inspector to Prioritize Temperature and set it to 69C.
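For those who prefer the command line, a similar effect can be approximated by capping the card's power draw with nvidia-smi. This is only a sketch of an alternative, not the NVIDIA Inspector setting itself; the 150 W figure is just an example value, and the command needs an administrator prompt:

import subprocess

# Alternative sketch (not NVIDIA Inspector): lower the board power limit so
# the card stays cooler. 150 W is only an example value; check the card's
# allowed range first with "nvidia-smi -q -d POWER".
EXAMPLE_LIMIT_WATTS = 150
subprocess.run(["nvidia-smi", "-i", "0", "-pl", str(EXAMPLE_LIMIT_WATTS)],
               check=True)

# Show the enforced limit and the current draw afterwards.
print(subprocess.check_output(["nvidia-smi", "-q", "-d", "POWER"], text=True))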
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
In my experience, GPUs can run at 70-85°C with no problem, so long as the clocks are stable. See whether removing any GPU overclock entirely fixes the issue or not. |
|
skgiven (volunteer moderator)
The issue isn't with the GPU core temperature; it's with the heat generated by it, which increases the ambient temperature around the GPU and inside the computer chassis in general. Sometimes that causes failures when the GDDR heats up too much, for example; sometimes system memory becomes too hot; sometimes other components such as the disk drives do. Generally, when those temps are over 50C they can cause problems.
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
I just had this error on a fairly new system. It is not new by technological standards, but it was bought brand new from the store less than 6 months ago. I find it interesting that the heat on the GPU core topped out at 58C and it still had this issue. The card has gone to 66C recently with no issue, and when it was doing long tasks it would flatten out at around 59-61C. Being a GT 730 2GB card, I have it running only short tasks, like my laptop is doing now. (I set my fast cards to do only long tasks as well, as I think that is the polite thing to do, so weaker cards can get short ones to run.)
AFAIK, this PC is not in an area that is hot or cold, but maintains a steady(ish) air temperature, although it is next to a door and can get bursts of cooler air as people come in and out the front door during this fall/winter weather. It certainly hasn't been hot here recently.
This is the only errored task in the PC's history, and it has been a very stable system for its total uptime. I'll keep an eye on it and see if this is a pattern. I'll also have to check the CPU temps to see if they remain steady or spike. I don't think heat is an issue, though, unless the card is just faulty. It has done 2 tasks successfully since this error. I also see an extra task in the In Progress list that is not on the system, so I know there will be another error task on the list that will read Timed Out after the 12th.
https://www.gpugrid.net/result.php?resultid=15586143
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org |
|
I just had this error on a fairly new system.
...
https://www.gpugrid.net/result.php?resultid=15586143
Here's an excerpt from the task's stderr.txt:
# GPU 0 : 58C
# GPU [GeForce GT 730] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GT 730
Note the missing "# BOINC suspending at user request (exit)" (or similar) message explaining the reason for the task shutdown between line 1 and line 2. This is the sign of a dirty task shutdown. Its cause is unknown, but it could be a dirty system shutdown, or an unattended (automatic) driver update by Windows Update or NVIDIA Update. |
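If you want to check a task's stderr.txt for this yourself, here is a rough Python sketch (it assumes the log format quoted in this thread): count restart headers that are not preceded by a clean suspend message.

import sys

# Rough sketch: a new "# GPU [..." header that is not preceded by a
# "# BOINC suspending at user request" line indicates a dirty task shutdown,
# as described above.
def dirty_restarts(text):
    lines = text.splitlines()
    dirty = 0
    for i, line in enumerate(lines):
        if i > 0 and line.startswith("# GPU ["):
            if "BOINC suspending at user request" not in lines[i - 1]:
                dirty += 1
    return dirty

if __name__ == "__main__":
    with open(sys.argv[1]) as f:   # path to a saved stderr.txt
        print("%d restart(s) without a clean suspend message"
              % dirty_restarts(f.read()))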
|
Perhaps this was a power loss. We have had 2 in the past few weeks. I think this is the first time I have seen this particular error, and when I looked it up, it brought me to this thread. |
|
In case this hasn't been resolved: I've also run into GPUGRID tasks erroring out, and my solution was to increase the virtual memory size. See here. |
|
In case this hasn't been resolved: I've also run into GPUGRID tasks erroring out, and my solution was to increase the virtual memory size. See here.
I don't think that increasing the virtual memory size could fix such problems (except perhaps indirectly, by accident). Your PC has 32GB of RAM. I can't imagine that even if you run 12 CPU tasks simultaneously it will run out of 32GB (+1GB virtual). If it does, then some of the apps you run have a memory leak, and it will run out even if you set +4GB or +8GB of virtual memory.
These
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1965.
and
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
errors are the result of the GPUGrid task being suspended too frequently and/or too many times (or of a failing GPU).
EDIT: SLI is not recommended for GPU crunching. You should try turning it off for a test period (even remove the SLI bridge). |
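A quick way to see how often a given task was suspended is to count the suspend messages in its stderr output. A rough sketch, assuming the same log format quoted above:

import sys

# Rough sketch: count how many times a task was suspended and restarted,
# based on the messages in a saved stderr.txt.
with open(sys.argv[1]) as f:
    text = f.read()
suspends = text.count("# BOINC suspending at user request")
restarts = text.count("# GPU [") - 1   # the first header is the initial start
print("%d suspend message(s), %d restart(s)" % (suspends, max(restarts, 0)))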
|
You seem to be correct. I've since had a few more instances of the display driver crashing during crunching and trying to reinitialize the displays. Increasing the virtual memory paging file seemed to alleviate the issue, but not completely fix it.
I will try disabling SLI and then removing the bridge, to see if that helps fix the issue once and for all.
Thanks for the input! |
|
Zalster
I just got this error on one of my machines.
I had just done a clean install of the drivers on Friday. The system had been running for 2 days before this happened.
It can't be the temps; they are all in the low to mid 40s C.
Virtual memory was 451.74, physical memory is 173.16.
Going to see if the next person is able to process it without any problem.
I read somewhere that it might be a kernel panic, caught in a never-ending loop.
Guess I'll have to wait and see. For now, I've removed the host and am running other projects to test the stability of the system.
____________
|
I just got this error on one of my machines.
...
Is there an errored out WU that you can show us? |
|
mmonnin
I just got this error on one of my machines.
...
Is there an errored out WU that you can show us?
Looks like this one
http://www.gpugrid.net/result.php?resultid=16773843 |
|
Zalster
Yes, that is the one.
Looks like the next person finished it without a problem. Don't know what happened with mine.
The computer is in a room with controlled temperature that never gets over 74F. Not sure why it happened. Only 1 work unit per card, with a full physical CPU core assigned. |
|