Message boards : Graphics cards (GPUs) : Errors
Hello,
ID: 41538
Here too..
ID: 41540
Can you report which WUs you had the problems with? Yesterday I released a whole new batch of simulations and, although they should be fine, a possibility of corruption exists. Thanks a lot, guys!
ID: 41541
All my WUs since yesterday have failed (both long and short queues):
ID: 41545
https://www.gpugrid.net/workunit.php?wuid=11087902 as an example
ID: 41547
Errors. All.
ID: 41548
They seem to fall into two groups:
The cards are giving an "unstable" error, probably just overclocking/overheating.
ID: 41549
After a load of "error units" on my main system I updated drivers and started crunching again OK.
ID: 41550
I've received a couple of reissued tasks which had previously failed on other hosts with "(unknown error) - exit code -44 (0xffffffd4)".
ID: 41552
They seem to fall into two groups:
Is it still recommended to use driver 334.xx or newer, or is this info outdated? 335.28 was a very stable driver version.
But OK, the info that I'm not the only one was important for me too. It is not because of our power outage (just unlucky that it started on the same day); the other machine is failing on shorts too. I'm retiring GPUGrid for the moment and switching fully to Einstein, because it seems the 570/580 are getting too old to finish all long units within 24 hours, missing by some minutes, including dl/ul over HSDPA, which is not a very stable line sometimes (outdoor modem crashing, router crashing, bad USB wire connection, etc.). The extremely long units needed more than 2 days, and sometimes something hangs because of this long duration, which is bad on unattended machines.
"Extra Long Queue" +1
I think I'll come back at the end of November with a single 970 or 980 at home over the winter and attack the 1B mark again ^^ Bye-bye worldwide GPUGrid place 43 ^^ In Austria I still have double the points of the nearly inactive place 2, so that should be a secure enough place #1 until the end of November. :)
____________
DSKAG Austria Research Team: http://www.research.dskag.at
ID: 41553
Is it still recommended to use driver 334.xx or newer, or is this info outdated? 335.28 was a very stable driver version.
These drivers are a bit old, as your hosts are using the CUDA 6.0 client; however, they should work fine. From my experience the latest driver (353.30) is stable, but I don't have GTX 5xx cards.
But OK, the info that I'm not the only one was important for me too. It is not because of our power outage (just unlucky that it started on the same day); the other machine is failing on shorts too...
I've looked into the stderr output of your tasks, and I came to the conclusion that your tasks on host 150780 are failing because its GPU can't take such high clock frequencies (you've probably reduced the memory clock already, but this or the GPU clock still has to be reduced):
GPU: GeForce GTX 570
Device clock : 1500MHz (default: 1464MHz)
Memory clock : 1700MHz (default: 1900MHz)
Task 14375512:
# Simulation unstable. Flag 9 value 129
# Simulation unstable. Flag 10 value 129
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 1875000)
...
# Simulation unstable. Flag 10 value 129
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 1885000)
Task 14373443:
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
17:24:33 (3980): called boinc_finish
In your other host (117426), the GTX 580 and the GTX 560 Ti are definitely overheating (sometimes reaching 90°C), so it is a miracle that the tasks on this card don't have "simulation became unstable" messages.
Task 14367948:
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 580] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 580
# ECC : Disabled
# Global mem : 3071MB
# Capability : 2.0
# PCI ID : 0000:01:00.0
# Device clock : 1520MHz
# Memory clock : 1700MHz
# Memory width : 384bit
# Driver version : r334_00 : 33528
# GPU 0 : 69C
# GPU 1 : 89C
# GPU 0 : 71C
# GPU 1 : 90C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# GPU 0 : 79C
# GPU 0 : 80C
# GPU 0 : 81C
# GPU 0 : 82C
# Time per step (avg over 3125000 steps): 8.338 ms
# Approximate elapsed time for entire WU: 26054.703 s
12:38:20 (4052): called boinc_finish
</stderr_txt>
]]>
... and I'm retiring GPUGrid for the moment and switching fully to Einstein, because it seems the 570/580 are getting too old to finish all long units within 24 hours...
You are right about the GTX 5xx series getting old, as two newer GPU generations have been developed in the meantime. However, they should still work here, and since Einstein@home is running fine on them, this suggests that the power outage corrupted some files of the GPUGrid project or the driver on your host. You can eliminate these factors by resetting (or removing and re-attaching) the GPUGrid project on your host, and reinstalling or upgrading your drivers.
ID: 41554
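A side note for anyone chasing these symptoms: it can help to log what the card is actually doing while ACEMD runs, the same temperature and clock values that show up in the stderr output above. Below is a minimal monitoring sketch, assuming Python 3 and that nvidia-smi is installed and on the PATH; the 85°C warning threshold and the one-minute interval are arbitrary example values, not project recommendations.

#!/usr/bin/env python3
"""Poll GPU temperature and clocks while a task runs (illustrative sketch).

Assumes nvidia-smi is on the PATH; the threshold below is an arbitrary
example value, not a project recommendation.
"""
import subprocess
import time

QUERY = "index,name,temperature.gpu,clocks.sm,clocks.mem"
TEMP_WARN_C = 85  # example threshold; pick what suits your card and cooling

def poll_once():
    # Query current temperature and clocks for every GPU in the system.
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, name, temp, sm_clk, mem_clk = [f.strip() for f in line.split(",")]
        flag = "  <-- hot" if int(temp) >= TEMP_WARN_C else ""
        print(f"GPU {idx} ({name}): {temp} C, core {sm_clk} MHz, mem {mem_clk} MHz{flag}")

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)  # one sample per minute, similar to the stderr log above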
My host finished a previously failed task again.
ID: 41555
All work units failed with a computation error...
ID: 41558
I have had 176 tasks fail between the 22nd and the 24th. I have removed these cards for now.
ID: 41559
Got some WUs with many attempts. My host can't take these at all; they failed at an early stage.
ID: 41560
Update:
ID: 41561
Manage to complete in first task, same settings and drivers as before. Temp limit at 73°C.
(User hardware error) unstable simulation (-97 message): GERALDs have been an issue for me in a hot, soaking-humid environment during the last fortnight. I have 7 more days of forecast 95F heat and 75F+ dewpoint (humidity) to contend with. The GPU runs at ~50C. Yesterday a WU flipped 30k/sec in at 1503MHz; the WU before failed two hours in. With a -1MHz core offset the following WU completed without error, then after 7hr a WU failed just now. Offset -1 again. I will try one more long, and will switch to short NOELIAs if another GERALD fails. Are there any other GM204 owners at 1.5GHz in hot/humid conditions? DMM reading = 1.212V. The dewpoint is currently 78F, tropical-rainforest humidity levels; even if a sea breeze happens, the air is so saturated it makes no difference.
This should be in the same batch I think, but no error even at a bit higher clock and suspended a few times.
GERALD_FXCXCL12_LIG tolerates a bin or two (13/26MHz) less than NOELIA and GIANNI on my 970. Each GPU is independent of the next: a 100-straight valid WU streak in <80F ambient can become a failed WU every 5 with the same overclock in 90F+ ambient. NOELIA_467x shorts or ETQ have yet to fail in similar conditions on my GPU(s). Expect unstable sims (-97 error) or CUDA errors with overclocking in hot-ambient and/or very humid (dewpoint 70F+) conditions, even with <50C core temperature readings. The ACEMD app is extremely demanding even at 70% core usage (WDDM bottleneck). Crunching with out-of-the-box clocks or the GPU's reference boost offers a lesser chance of CUDA errors and unstable sims that error out with a -97 message. When the ACEMD app is doing its job, overclocked water- or air-cooled systems without summer air conditioning are more prone to errors; hot and/or humid environments are a nemesis to ACEMD stability when the GPU is even mildly overclocked.
ID: 41581
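Since several of these reports hinge on how often a task went unstable before it finally failed or recovered, a quick way to compare runs is to count the instability markers in a saved copy of a task's stderr output. The sketch below is only illustrative: the marker strings are copied from the logs quoted earlier in this thread, and the script name and file path are placeholders, not anything provided by BOINC or ACEMD.

#!/usr/bin/env python3
"""Count instability markers in a saved BOINC stderr output (illustrative sketch)."""
import sys
from collections import Counter

# Marker strings as they appear in the task logs quoted above.
MARKERS = (
    "Simulation unstable",
    "The simulation has become unstable",
    "Attempting restart",
)

def summarize(path):
    counts = Counter()
    with open(path, errors="replace") as fh:
        for line in fh:
            for marker in MARKERS:
                if marker in line:
                    counts[marker] += 1
    return counts

if __name__ == "__main__":
    # Usage: python count_unstable.py stderr.txt
    for marker, n in summarize(sys.argv[1]).items():
        print(f"{n:4d}  {marker}")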
My host saved a workunit again:
ID: 41614