Message boards : Graphics cards (GPUs) : Errors

Author Message
Profile PhilTheNet
Joined: 24 Sep 14
Posts: 1
Credit: 55,601,016
RAC: 0
Message 41538 - Posted: 21 Jul 2015 | 5:50:00 UTC

Hello,
All the WUs errored this morning (the project has been running smoothly on this computer for 6 months):

Stderr output
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)
</message>
]]>

Does anyone have the same problem?

Phil

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Message 41540 - Posted: 21 Jul 2015 | 11:18:10 UTC
Last modified: 21 Jul 2015 | 11:25:39 UTC

Here too..

<core_client_version>7.4.36</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)
</message>

I thought this was because we had a power outage and one machine recovered with the wrong system time (+8 days O.o). I have now corrected that and still ran into these errors, on both graphics cards, but only on one long-running 24/7 crunching machine. Voltage and clocks are the same as before. I even changed them to a more conservative setting (I was already running safe clocks and voltages), but it didn't help.

I still think it is a fault on my side on this machine, but I don't know where. Einstein runs fine, but that doesn't say much. On the other hand, the crunching partner (wingman) errored out too, so perhaps an erroneous batch is on its way as well. I will try again in one week.
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Gerard
Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41541 - Posted: 21 Jul 2015 | 12:51:23 UTC
Last modified: 21 Jul 2015 | 12:51:45 UTC

Can you report which WUs you had the problems with? Yesterday I released a whole new batch of simulations and, although they should be fine, a possibility of corruption exists. Thanks a lot, guys!

lukeu
Joined: 14 Oct 11
Posts: 31
Credit: 81,420,504
RAC: 34,415
Message 41545 - Posted: 22 Jul 2015 | 3:58:07 UTC - in response to Message 41541.

All my WUs since yesterday have failed (both long and short queues):

http://www.gpugrid.net/results.php?userid=81842

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Message 41547 - Posted: 22 Jul 2015 | 10:06:10 UTC

https://www.gpugrid.net/workunit.php?wuid=11087902 as example
____________
DSKAG Austria Research Team: http://www.research.dskag.at



hawker
Joined: 28 Jun 10
Posts: 1
Credit: 31,454,680
RAC: 0
Message 41548 - Posted: 22 Jul 2015 | 15:07:55 UTC

Errors. All.

https://www.gpugrid.net/results.php?userid=62470&offset=0&show_names=0&state=5&appid=

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 41549 - Posted: 22 Jul 2015 | 15:39:46 UTC
Last modified: 22 Jul 2015 | 15:58:49 UTC

They seem to fall into two groups:

    The drivers are too old (e.g., 335.28), which gives the exit code -44 error.
    The cards are giving an "unstable" error, probably just overclocking/overheating.


Those problems should be easy to fix, though I don't know whether XP has recent enough drivers to work.
https://www.gpugrid.net/results.php?hostid=223541&offset=0&show_names=1&state=0&appid=
https://www.gpugrid.net/results.php?hostid=194224&offset=0&show_names=1&state=0&appid=
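
For illustration, a minimal Python sketch of sorting failed tasks into these two groups, assuming the stderr text has already been copied from the task pages (the sample strings below are stand-ins, not real task output):

def classify_failure(stderr_text):
    """Sort a stderr snippet into one of the two failure groups described above."""
    text = stderr_text.lower()
    if "exit code -44" in text or "0xffffffd4" in text:
        return "driver too old (-44)"
    if "simulation has become unstable" in text or "simulation unstable" in text:
        return "unstable simulation (overclock/overheat)"
    return "other/unknown"

samples = [
    "(unknown error) - exit code -44 (0xffffffd4)",
    "# The simulation has become unstable. Terminating to avoid lock-up",
]
for s in samples:
    print(classify_failure(s), "<-", s)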

TheFiend
Joined: 26 Aug 11
Posts: 99
Credit: 2,500,112,138
RAC: 0
Message 41550 - Posted: 22 Jul 2015 | 18:17:21 UTC

After a load of "error units" on my main system, I updated the drivers and started crunching again OK.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Message 41552 - Posted: 22 Jul 2015 | 22:46:24 UTC

I've received a couple of reissued tasks which had previously failed on other hosts with "(unknown error) - exit code -44 (0xffffffd4)".
All of them completed successfully on my host.
For example:
http://www.gpugrid.net/workunit.php?wuid=11081985
http://www.gpugrid.net/workunit.php?wuid=11087547
http://www.gpugrid.net/workunit.php?wuid=11088391
http://www.gpugrid.net/workunit.php?wuid=11088528

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Message 41553 - Posted: 23 Jul 2015 | 9:20:29 UTC - in response to Message 41549.
Last modified: 23 Jul 2015 | 9:37:20 UTC

They seem to fall into two groups:
The drivers are too old (e.g., 335.28), which gives the exit code -44 error.


Is it still recommended to use driver 334.xx or newer, or is this info outdated? 335.28 was a very stable driver version.

But OK, the info that I'm not the only one was important for me too; it is not because of our power outage (just unlucky that it started on the same day), since the other machine is failing on shorts too. I will retire from GPUGrid for the moment and switch fully to Einstein, because it seems the 570/580 are getting too old to finish all the long units within 24 hours, missing by a few minutes including download/upload over HSDPA, which is not a very stable line at times (outdoor modem crashing, router crashing, bad USB wire connection, etc.). The extremely long units needed more than 2 days, and sometimes something hung because of that long duration, which is bad on unattended machines. "Extra Long Queue" +1. I think I'll come back at the end of November with a single 970 or 980 at home over the winter and attack the 1B mark again ^^

Bye-bye, worldwide GPUGrid place 43 ^^ In Austria I still have double the points of the nearly inactive place 2, so that should be a secure enough place #1 until the end of November. :)
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Message 41554 - Posted: 23 Jul 2015 | 12:23:28 UTC - in response to Message 41553.
Last modified: 23 Jul 2015 | 12:27:23 UTC

They seem to fall into two groups:
The drivers are too old (e.g., 335.28), which gives the exit code -44 error.

Is it still recommended to use driver 334.xx or newer, or is this info outdated? 335.28 was a very stable driver version.

These drivers are a bit old, as your hosts are using the CUDA 6.0 client; however, they should work fine.
In my experience the latest driver (353.30) is stable; however, I don't have GTX 5xx cards.

But OK, the info that I'm not the only one was important for me too; it is not because of our power outage (just unlucky that it started on the same day), since the other machine is failing on shorts too...

I've looked into the stderr output of your tasks and came to the conclusion that your tasks on Host 150780 are failing because its GPU can't take such high clock frequencies. (You have probably reduced the memory clock already, but it or the GPU clock still has to be reduced further.)
GPU: GeForce GTX 570 Device clock : 1500MHz (default: 1464MHz) Memory clock : 1700MHz (default: 1900MHz)

Task 14375512:
# Simulation unstable. Flag 9 value 129
# Simulation unstable. Flag 10 value 129
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 1875000)
...
# Simulation unstable. Flag 10 value 129
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 1885000)

Task 14373443:
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
17:24:33 (3980): called boinc_finish

On your other host (117426), the GTX 580 and the GTX 560 Ti are definitely overheating (sometimes reaching 90°C), so it is a miracle that the tasks on these cards don't have "simulation became unstable" messages.
Task 14367948:
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 580] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 580
# ECC : Disabled
# Global mem : 3071MB
# Capability : 2.0
# PCI ID : 0000:01:00.0
# Device clock : 1520MHz
# Memory clock : 1700MHz
# Memory width : 384bit
# Driver version : r334_00 : 33528
# GPU 0 : 69C
# GPU 1 : 89C
# GPU 0 : 71C
# GPU 1 : 90C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# GPU 0 : 79C
# GPU 0 : 80C
# GPU 0 : 81C
# GPU 0 : 82C
# Time per step (avg over 3125000 steps): 8.338 ms
# Approximate elapsed time for entire WU: 26054.703 s
12:38:20 (4052): called boinc_finish
</stderr_txt>
]]>


... I will retire from GPUGrid for the moment and switch fully to Einstein, because it seems the 570/580 are getting too old to finish all the long units within 24 hours...

You are right about the GTX 5xx series getting old, as two newer GPU generations have been developed in the meantime. However, they should still work here, and since Einstein@home works on them, this suggests that the power outage corrupted some files of the GPUGrid project or the driver on your host. You can eliminate these factors by resetting (or removing and re-attaching) the GPUGrid project on your host and reinstalling or upgrading your drivers.
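
As a rough way to watch for the kind of clock and temperature problems described above, here is a minimal Python sketch that polls nvidia-smi for per-GPU temperature and SM/memory clocks. It assumes nvidia-smi is on the PATH; the 85°C warning threshold is an arbitrary example, not a project recommendation.

import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,clocks.sm,clocks.mem",
         "--format=csv,noheader,nounits"]

def poll_once():
    out = subprocess.check_output(QUERY, text=True)
    for line in out.strip().splitlines():
        idx, temp, sm, mem = [v.strip() for v in line.split(",")]
        flag = "  <-- running hot" if int(temp) >= 85 else ""
        print(f"GPU {idx}: {temp}C, SM {sm}MHz, MEM {mem}MHz{flag}")

if __name__ == "__main__":
    while True:          # stop with Ctrl+C
        poll_once()
        time.sleep(60)   # one sample per minute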

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Message 41555 - Posted: 23 Jul 2015 | 12:43:28 UTC

My host finished another previously failed task.
It had failed three times on other hosts before:
Host 204947: Task 14385448 "# The simulation has become unstable. Terminating to avoid lock-up (1)" Tasklist (all failed)
Host 194523: Task 14399453 "(unknown error) - exit code -44 (0xffffffd4)" Tasklist (all failed)
Host 163989: Task 14399778 "process exited with code 212 (0xd4, -44)" Tasklist (all failed)

bormolino
Joined: 16 May 13
Posts: 41
Credit: 88,126,864
RAC: 4,951
Message 41558 - Posted: 25 Jul 2015 | 21:49:44 UTC
Last modified: 25 Jul 2015 | 21:50:09 UTC

All work units failed with a computation error...

https://www.gpugrid.net/results.php?hostid=182555

rod4x4
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Message 41559 - Posted: 26 Jul 2015 | 2:07:06 UTC

I have had 176 tasks fail between the 22nd and the 24th, and have removed those cards for now.
The failures are on 4 identical cards I have successfully used on GPUgrid for almost a year.
I have 1 remaining card (that is different) still able to run tasks.

It appears that something has changed in the last "batch".

The four cards are:
https://www.gpugrid.net/results.php?hostid=181299
https://www.gpugrid.net/results.php?hostid=181300
https://www.gpugrid.net/results.php?hostid=180572
https://www.gpugrid.net/results.php?hostid=180015

Greger
Joined: 6 Jan 15
Posts: 76
Credit: 24,000,702,249
RAC: 12,961,970
Message 41560 - Posted: 26 Jul 2015 | 19:18:39 UTC
Last modified: 26 Jul 2015 | 19:19:21 UTC

Got some WUs with many attempts. My host can't take these at all; they failed at an early stage.

https://www.gpugrid.net/workunit.php?wuid=11092476
created 20 Jul 2015 | 18:29:55 UTC
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 40000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

11092127
created 20 Jul 2015 | 18:17:40 UTC
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 760000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

11091745
created 20 Jul 2015 | 18:04:57 UTC
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 600000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

11098488
created 23 Jul 2015 | 18:13:11 UTC
Exit status -98 (0xffffffffffffff9e) Unknown error number
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
16:05:42 (7636): called boinc_finish

11091450
created 20 Jul 2015 | 17:55:40 UTC
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 560000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

These last ones completed and validated, but some hosts had problems with them:

11091018
created 20 Jul 2015 | 17:42:09 UTC

11090161
created 20 Jul 2015 | 17:15:09 UTC

11090182
created 20 Jul 2015 | 17:15:53 UTC

11089720
created 20 Jul 2015 | 17:00:31 UTC

11089778
created 20 Jul 2015 | 17:02:31 UTC

Greger
Joined: 6 Jan 15
Posts: 76
Credit: 24,000,702,249
RAC: 12,961,970
Message 41561 - Posted: 27 Jul 2015 | 2:00:02 UTC
Last modified: 27 Jul 2015 | 2:04:20 UTC

Update:
https://www.gpugrid.net/workunit.php?wuid=11092831
created 20 Jul 2015 | 18:43:49 UTC

Managed to complete this one on the first attempt, same settings and drivers as before. Temp limit at 73°C.
This should be in the same batch, I think, but no error, even with a slightly higher clock and after being suspended a few times.

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 41581 - Posted: 28 Jul 2015 | 15:56:50 UTC

Managed to complete this one on the first attempt, same settings and drivers as before. Temp limit at 73°C.

(User hardware error) unstable simulation (-97 message): GERARD WUs have been an issue for me in a hot, soaking-humid environment during the last fortnight. I have 7 more days of forecast 95°F heat and a +75°F dewpoint (humidity) to contend with. The GPU is at ~50°C.

Yesterday a WU flipped 30k seconds in at 1503MHz. The WU before that failed two hours in. With a -1MHz core offset the following WU completed without error; then, after 7 hours, a WU failed just now. Offset -1 again. I will try one more long WU and will switch to short NOELIAs if another GERARD fails. Are there any other GM204 owners running at 1.5GHz in hot/humid conditions? DMM reading = 1.212V. The dewpoint is currently at 78°F, tropical-rainforest humidity levels. Even if a sea breeze comes through, the air is so saturated it makes no difference.

This should be in the same batch, I think, but no error, even with a slightly higher clock and after being suspended a few times.

GERARD_FXCXCL12_LIG tolerates a bin or two (13/26MHz) less than NOELIA and GIANNI on my 970. Each GPU is independent of the next. A streak of 100 straight valid WUs at <80°F ambient can become a failed WU every 5 with the same overclock at +90°F ambient. NOELIA_467x short and ETQ WUs have yet to fail in similar conditions on my GPU(s).

Expect an unstable sim (-97 error) or a CUDA error() when overclocking in hot ambient and/or very humid (dewpoint = +70°F) conditions, even with core temperature readings below 50°C. The ACEMD app is extremely demanding even at 70% core usage (WDDM bottleneck). Crunching at out-of-the-box clocks or the GPU's reference boost gives a lower chance of the CUDA errors and unstable sims that end in a -97 message. When the ACEMD app is doing its job, overclocked water- or air-cooled systems without summer air conditioning are more prone to errors. Hot and/or humid environments are a nemesis to ACEMD stability even when the GPU is only mildly overclocked.
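
Purely as a sketch of this back-off-by-one-bin approach (not an actual tool): drop the core offset by one 13MHz bin after a failed WU and only creep back up after a long valid streak. The apply_core_offset_mhz helper below is hypothetical; in practice the offset would be set with whatever overclocking utility is in use.

BIN_MHZ = 13  # one Maxwell clock bin, as described above

def apply_core_offset_mhz(offset):
    # Hypothetical placeholder: a real setup would push this value
    # to the overclocking tool being used.
    print(f"(would set core offset to {offset:+d} MHz)")

def adjust_offset(offset, wu_failed, valid_streak):
    """Drop one bin on a failure; creep back up after 100 straight valid WUs."""
    if wu_failed:
        offset -= BIN_MHZ
        valid_streak = 0
    else:
        valid_streak += 1
        if valid_streak >= 100:
            offset += BIN_MHZ
            valid_streak = 0
    apply_core_offset_mhz(offset)
    return offset, valid_streak

# Example: start at a +91 MHz offset, hit one failure in hot ambient,
# then creep back up after a long valid streak.
offset, streak = 91, 0
offset, streak = adjust_offset(offset, wu_failed=True, valid_streak=streak)
offset, streak = adjust_offset(offset, wu_failed=False, valid_streak=99)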

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Message 41614 - Posted: 3 Aug 2015 | 9:13:31 UTC

My host saved a workunit again:
e1s17_8-GERARD_FXCXCL12_LIG_6121521-0-1-RND4507
It was the last (7th) attempt to crunch it.

To avoid errors, please update your NVidia drivers to the latest one (v353.62).

http://www.geforce.com
http://www.nvidia.com

Could the staff please check whether there is a correlation between failing tasks and the assigned application version? I think the CUDA 6.0 application has been more prone to errors lately.
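
On the driver advice above, a minimal sketch for checking the installed driver version programmatically, assuming nvidia-smi is available (353.62 is simply the version named in this post):

import subprocess

REQUIRED = (353, 62)  # version recommended above

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True).strip().splitlines()[0]

installed = tuple(int(p) for p in out.split(".")[:2])
if installed < REQUIRED:
    print(f"Driver {out} is older than {REQUIRED[0]}.{REQUIRED[1]} - consider updating.")
else:
    print(f"Driver {out} looks recent enough.")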
