Neil A
Joined: 9 Oct 08 Posts: 50 Credit: 12,676,739 RAC: 0
|
Hello All,
I have been struggling with quite a few GPU WU failures over the past weeks and am not sure what they are. There are a number of failure scenarios, but one common one is included below.
I run a Q9550 quad core on an EVGA 790i Ultra mobo with 2x GTX 260 Core 216's with some overclocking applied. I have been successful for quite a while with the overclock, running around 650 MHz with the shader clock linked. What also hasn't worked for a while is the EVGA GPU Voltage Tuner, which they broke with the 182 series drivers and up. I have used it in the past, while it was working, to help stabilize the card and GPU WUs. I am currently running a 185.68 driver. Any thoughts on what I can do or check would be appreciated.
The C: drive file mentioned below is NOT on my hard drive, so it must have been compiled in with the GPU WU or Nvidia driver?
<core_client_version>6.6.23</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using CUDA device 1
# Device 0: "GeForce GTX 260"
# Clock rate: 1458000 kilohertz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Device 1: "GeForce GTX 260"
# Clock rate: 799200 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO ERROR: cannot open file "restart.coor"
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 104: cufftExecC2R (gridcalc3)
called boinc_finish
</stderr_txt>
]]>
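A log like the one above can be summarised with a short script. This is only a minimal sketch, not part of BOINC or the GPU application: the names `parse_devices` and `selected_device` are hypothetical, and the regular expressions assume nothing beyond the line format shown in this post.

```python
import re

# The device listing copied from the stderr output in this post.
STDERR = """\
# Using CUDA device 1
# Device 0: "GeForce GTX 260"
# Clock rate: 1458000 kilohertz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Device 1: "GeForce GTX 260"
# Clock rate: 799200 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 27
# Number of cores: 216
"""


def parse_devices(text):
    """Return {device_index: {"name": str, "clock_mhz": float}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        m = re.match(r'# Device (\d+): "(.+)"', line)
        if m:
            current = int(m.group(1))
            devices[current] = {"name": m.group(2)}
            continue
        m = re.match(r"# Clock rate: (\d+) kilohertz", line)
        if m and current is not None:
            # The log reports kilohertz; convert to MHz for readability.
            devices[current]["clock_mhz"] = int(m.group(1)) / 1000
    return devices


def selected_device(text):
    """Return the index from the 'Using CUDA device N' line, or None."""
    m = re.search(r"# Using CUDA device (\d+)", text)
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    print("selected device:", selected_device(STDERR))
    for index, info in sorted(parse_devices(STDERR).items()):
        print(index, info["name"], info["clock_mhz"], "MHz")
```

On this log the sketch shows that device 1 was selected and that the two cards report different clock rates (1458000 vs. 799200 kilohertz), which may be worth a look when chasing stability problems.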
Thanks.
Neil
____________
Crunching for the benefit of humanity and in memory of my dad and other family members.
|
|
|
Do I understand you correctly: you used the EVGA tool to increase your GPU voltage to stabilize your OC? In that case you'll probably lose stability without the voltage bump. The maximum stable clock frequency of chips is approximately proportional to the voltage over small voltage ranges.
Another possible factor is temperature: summer's coming here, and I guess in Canada too. The higher the temperature, the lower the maximum stable frequency will be.
So I suggest backing off your OC by a substantial margin and seeing if you're stable again. ~50 MHz on the core should do the trick. You could also run some stability tests, maybe a 1 h loop of 3DMark06 and/or FurMark.
MrS
____________
Scanning for our furry friends since Jan 2002
|
|
Neil A
|
Thanks ET. I have backed off to the 181.22 driver, started GPU Voltage Tuner and bumped the voltage up about 50 mV from the default. I am waiting for GPU WUs to download; I'll track my progress and report back.
What I am still interested in is what the heck the c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu part of the message is.
Neil
|
|
|
What I am still interested in is what the heck is the c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu? part of the message.
Neil
That's just for debugging purposes. When the app was compiled, some debug info was generated by the compiler to make it easier for the developer to see where exactly in the source code it crashed.
It simply says that the crash occurred in "CPME_cufft.cu" and that this file is located in "c:\cygwin\home\speechserver\gpumd2\src\pme\" on the developer's machine, NOT yours.
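To make the point concrete, the error line can be split into its parts. The snippet below is only an illustrative sketch: `parse_error` and the pattern are hypothetical, written against the one message format quoted in this thread, and say nothing about how the application itself builds the string. In C, C++ or CUDA code such a path is commonly baked in at compile time, for example through the standard `__FILE__` and `__LINE__` macros, but that is an assumption about this app.

```python
import ntpath  # Windows path rules, usable on any OS
import re

ERROR_LINE = (
    r"ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, "
    r"line 104: cufftExecC2R (gridcalc3)"
)

# Assumed layout: "ERROR: <path>, line <number>: <failing call>"
PATTERN = re.compile(r"ERROR: (?P<path>.+?), line (?P<line>\d+): (?P<what>.+)")


def parse_error(line):
    """Split an error line into build directory, source file, line and call."""
    m = PATTERN.match(line)
    if m is None:
        return None
    path = m.group("path")
    return {
        "build_dir": ntpath.dirname(path),      # directory on the build machine
        "source_file": ntpath.basename(path),   # file where the check failed
        "line": int(m.group("line")),
        "what": m.group("what"),
    }


if __name__ == "__main__":
    print(parse_error(ERROR_LINE))
```

The `build_dir` part describes the machine the app was compiled on, so there is no reason to expect that directory on a cruncher's own disk.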
Anyway, I'd be interested to hear why they use cygwin/gcc instead of VS to compile the app...
|
|
|
|
Anyway, I'd be interested to hear why they use cygwin/gcc instead of VS to compile the app...
Cheaper license ...
|
|
Neil A
|
I've backed off to a 181.xx driver and EVGA GPU Voltage Tuner works. I've tweaked the voltage up a little and things are looking promising. The last 5 or so work units completed successfully between my 2 GTX 260's.... I'll report again by the weekend. Looks like it was probably a GPU voltage issue.
|
|
|
What's the temperature of your cards?
MrS
|
|
MarkJ (Volunteer moderator, Volunteer tester)
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0
|
I got this error in one WU:
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 104: cufftExecC2R (gridcalc3)
And another one got this error:
Cuda error: Kernel [fft_data_swizzle_in] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 44 : the launch timed out and was terminated.
And a third got this error:
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 104: cufftExecC2R (gridcalc3)
They were run on a GTX260+, which would have been running 185.81 drivers. The cards aren't OC'ed. No idea about temperatures, as they seem to have gotten rid of the fan control option from vtune. The cards seem happy crunching SETI CUDA work.
Probably stuffed (beta) drivers, so I'll go back to 182.50 drivers.
____________
BOINC blog |
|
|
Neil A
|
I've been very successful since backing off to 181.xx drivers and running EVGA Voltage Tuner. I got the same errors MarkJ reports above before I downgraded my driver and upped my GPU voltage slightly (about 50 mV). Now I'm running very well on that box. Mark, you might try the same thing as an experiment.
I'll check card temperatures next time (2x GTX 260 Core 216 Superclocked running at around 665 MHz), but they typically run in the high 60's to high 70's depending on the temperature in the room, which can vary quite a bit. I have the fans set on auto using Precision 1.7.1.
|
|
|
Neil,
with temperatures in the high 70's I wouldn't want to increase GPU voltage. In the mid 60's I could justify it, though I wouldn't do so myself. However, there is no hard limit in this range: the lower the better. And my "threshold temperatures" are purely subjective, so your mileage can and likely will vary ;)
Mark,
recently you had a great many errors with 0 s CPU time, i.e. the WU did not even start. This points to a software problem. Since yesterday you seem to be running fine; did you change anything, e.g. downgrade the driver?
MrS
|
|