Message boards : Graphics cards (GPUs) : 6 Errors Today [Problems with "KASHIF_HIVPR" and "IBUCH_KID"-WUs]

Profile dataman
Joined: 18 Sep 08
Posts: 36
Credit: 100,352,867
RAC: 0
Level
Cys
Message 9384 - Posted: 6 May 2009 | 19:07:45 UTC

Everything has been running well, but I had 6 errors today across 3 different cards (9800GTs).

1 of these:

ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 84: cufftExecC2C (gridCalc2.2)

1 of these:

Cuda error: Kernel [shake_step_2] failed in file 'shake.cu' in line 128 : unknown error.

4 of these:

Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error.

What's going on?
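
The error lines quoted in this thread come in two shapes, so they can be tallied by failing kernel or library call. A minimal sketch, assuming only the two message formats seen in the posts above; the regular expressions and the helper names `classify` and `tally` are illustrative and are not part of the GPUGrid application:

```python
import re
from collections import Counter

# Patterns inferred from the messages quoted in this thread (an assumption,
# not an official description of the application's log format).
KERNEL_RE = re.compile(
    r"Cuda error: Kernel \[(\w+)\] failed in file '([^']+)' in line (\d+)"
)
CALL_RE = re.compile(r"ERROR: (\S+), line (\d+): (\w+)")


def classify(line):
    """Return the failing kernel or library call named in a line, or None."""
    match = KERNEL_RE.search(line)
    if match:
        return match.group(1)
    match = CALL_RE.search(line)
    if match:
        return match.group(3)
    return None


def tally(lines):
    """Count how often each kernel or call appears in the error lines."""
    names = (classify(line) for line in lines)
    return Counter(name for name in names if name is not None)


# The six errors reported in the post above.
SAMPLE = [
    r"ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 84: cufftExecC2C (gridCalc2.2)",
    "Cuda error: Kernel [shake_step_2] failed in file 'shake.cu' in line 128 : unknown error.",
] + [
    "Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error."
] * 4

if __name__ == "__main__":
    for name, count in tally(SAMPLE).most_common():
        print(count, name)
```

A tally like this only shows which code path reports the failure; it does not say whether the cause is the workunit, the driver, or the card.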
____________

palmss
Joined: 28 Aug 08
Posts: 7
Credit: 60,897,550
RAC: 0
Level
Thr
Message 9385 - Posted: 6 May 2009 | 19:13:44 UTC

I have a "PmeRealSpace" error too, with an 8800GT: http://www.gpugrid.net/result.php?resultid=631932

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 387,028,788
RAC: 1,197,795
Level
Asp
Message 9391 - Posted: 6 May 2009 | 19:55:41 UTC

Same here: PmeRealSpace error, running an 8800GT, on "IBUCH_KID" WUs. Do I see a pattern forming, or is it just a coincidence?

Error WU 634715

[boinc.at] Nowi
Joined: 4 Sep 08
Posts: 44
Credit: 3,685,033
RAC: 0
Level
Ala
Message 9400 - Posted: 6 May 2009 | 21:25:38 UTC - in response to Message 9391.

I have the same error on three WUs. The GPU is an 8800GT.

Profile dataman
Joined: 18 Sep 08
Posts: 36
Credit: 100,352,867
RAC: 0
Level
Cys
Message 9404 - Posted: 6 May 2009 | 22:49:01 UTC

Cuda error: Kernel [fft_data_swizzle_in] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 44 : unknown error.

More errors ... :(
____________

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9405 - Posted: 6 May 2009 | 22:54:55 UTC - in response to Message 9400.

I had three fail in quick succession within a 40-minute period today on a 9800GTX+. The errors were similar to the above:

Two were the same:
Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79

The third was:
Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error.

I have had a replacement running for about three hours with no problems so far. We shall see in the morning :)

Regards
Zy

schizo1988
Joined: 16 Dec 08
Posts: 16
Credit: 10,644,256
RAC: 0
Level
Pro
Message 9414 - Posted: 7 May 2009 | 1:59:15 UTC - in response to Message 9405.

I have a thread about failed jobs as well. One machine lost 5 jobs and I thought it was machine-specific, but then one of my other machines got the same error. It also had some results that were valid but listed warning messages that seem related to the actual errors. We only see all of this after a task has finished, but a real-time system would be impossible, not to mention useless unless you could sit and monitor your apps 24/7. They have come out with quite a few new software updates and problems can always arise, and not making it mandatory to use the new version would not work either. If we post the errors, the people who actually understand the software become aware of them. I have found this site to be about the best for getting help when you do encounter any type of problem.

loki126
Joined: 18 Nov 08
Posts: 14
Credit: 30,687,791
RAC: 0
Level
Val
Message 9415 - Posted: 7 May 2009 | 4:11:56 UTC

Same here. It's the new 7000-credit WUs, IBUCH_KID_shao.
Here are the failed tasks: 1 and 2

I guess they don't get along well with OC.

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 387,028,788
RAC: 1,197,795
Level
Asp
Message 9416 - Posted: 7 May 2009 | 4:11:59 UTC
Last modified: 7 May 2009 | 4:19:42 UTC

I really think there is some issue related to "IBUCH_KID" and "KASHIF_HIVPR" WU's. I have had 4 errors today and those have also errored out for other users.

My Tasks


Error tasks:

KASHIF_HIVPR

IBUCH_KID

IBUCH_KID

IBUCH_KID



<edit>

I've turned the clocks back to stock to see if that matters. I've had them OC'd for 8 months, but we'll see if the new WUs are more sensitive.

</edit>

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Level
Ser
Message 9418 - Posted: 7 May 2009 | 6:47:32 UTC - in response to Message 9416.

I really think there is some issue related to "IBUCH_KID" and "KASHIF_HIVPR" WU's. I have had 4 errors today and those have also errored out for other users.

My Tasks


Error tasks:

KASHIF_HIVPR

IBUCH_KID

IBUCH_KID

IBUCH_KID



<edit>

I've turned the clocks back to stock to see if that matters. I've had them OC'd for 8 months, but we'll see if the new WUs are more sensitive.

</edit>



I have had errors with this series [IBUCH_KID] of work units also. My cards run at stock. The same cards seem to run the HIV ones OK.
____________
mike

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9419 - Posted: 7 May 2009 | 7:32:41 UTC - in response to Message 9418.

Another one last night

ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 84: cufftExecC2C (gridCalc2.2)

There is an issue lurking somewhere with these WUs.

For me it started when the new ones with the Amber facility came out; shortly after that the failures started.

I am trying one more. If that fails, I will stop until this is resolved.

Zy

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9420 - Posted: 7 May 2009 | 7:50:32 UTC

There can be bad "batches" or tasks within a batch that are just plain bad. The good news such as it is, is that here at GPU Grid the tasks tend to die fairly quickly. I will note that they have just changed and are using some new tool and this may be part of the problem.

I have seen similar issues in other projects, where a change in direction can lead to significant issues with tasks failing. When Rosetta started up the Mini-Rosetta effort, so many tasks failed that I stopped giving the project major support for a long time. Now they have most of the bugs out and I am back again.

Keep reporting the bad tasks and I am sure they will figure it out ...

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 9421 - Posted: 7 May 2009 | 8:21:49 UTC - in response to Message 9415.

Same here. It's the new 7000-credit WUs, IBUCH_KID_shao.
Here are the failed tasks: 1 and 2

I guess they don't get along well with OC.


I had a similar issue. It went away when I went back to 182.50 drivers. You seem to be running beta drivers.
____________
BOINC blog

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 9422 - Posted: 7 May 2009 | 8:22:28 UTC - in response to Message 9420.

I got a bunch of errors also and was wondering: if we add system specs (including driver version), would it help narrow down where the real issue is?

i7-920 HT, 4 GHz on P6T
Corsair Dominator 1600 2Gx3
EVGA GTX 295 (626/1496/1036) 185.81
Corsair TX750W, WD Caviar Black 1TB
Cool Master HAF 932
Xigmatek Dark Knight-S1283V
BOINC 6.6.20 for WCG + GPUGrid 24/7/365

Steve

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 9423 - Posted: 7 May 2009 | 8:23:54 UTC - in response to Message 9404.

Cuda error: Kernel [fft_data_swizzle_in] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 44 : unknown error.

More errors ... :(


If you have beta drivers installed (your computers are hidden so I can't look) try the 182.50 drivers.
____________
BOINC blog

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Message 9424 - Posted: 7 May 2009 | 9:04:14 UTC - in response to Message 9423.

On the new IBUCH_KID batch errors...
They don't fail completely, but the error rate is apparently higher.

We are stopping them for safety at the moment.

thanks for your patience,
ignasi

Profile Bender10
Joined: 3 Dec 07
Posts: 167
Credit: 8,368,897
RAC: 0
Level
Ser
Message 9426 - Posted: 7 May 2009 | 10:27:54 UTC - in response to Message 9422.

Yes Steve WCG,

Posting the specs (driver version, BOINC version, GPU, GPU overclock, OS) helps to narrow down where your issue may be.

But 'un-hiding' your computers so the MODS can look at your output files also helps when you have a problem (they may ask for this sometimes). That, and enabling 'debugging' if you have a pesky problem...
____________


Consciousness: That annoying time between naps......

Experience is a wonderful thing: it enables you to recognize a mistake every time you repeat it.

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 9431 - Posted: 7 May 2009 | 13:25:14 UTC - in response to Message 9426.

Specs including versions are in my sig. I will also try to provide more specifics when I post about errors, but it sounds like this round is semi-global, so I doubt they need any more info at this time. If mods want details of my logs, all they need to do is ask and I will "unhide". Interesting way to phrase that ... I prefer to think of it as "Public" or "Private", and in general I like to keep "Private" as much as possible.
____________
Thanks - Steve

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Level
Ser
Message 9432 - Posted: 7 May 2009 | 13:33:09 UTC - in response to Message 9431.

Specs including versions are in my sig. I will also try to provide more specifics when I post about errors, but it sounds like this round is semi-global, so I doubt they need any more info at this time. If mods want details of my logs, all they need to do is ask and I will "unhide". Interesting way to phrase that ... I prefer to think of it as "Public" or "Private", and in general I like to keep "Private" as much as possible.



I'll show mine if you'll show me yours:D
____________
mike

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9433 - Posted: 7 May 2009 | 13:43:26 UTC - in response to Message 9420.

Keep reporting the bad tasks and I am sure they will figure it out ...

Absolutely - I am totally behind them in trying to find out what's wrong. It could be at my end, I don't know. It's no good just pumping out errored ones though; there are only so many they need to track an issue. Meanwhile, by stopping for a while I can put the hardware through proper testing, just to eliminate that side of the equation.

Having said all that, the one I started this morning is still running fine at present, 63% done, which, given the others that failed on my machine, is illogical on the face of it.

Regards
Zy

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 9435 - Posted: 7 May 2009 | 14:33:09 UTC

My first 2 errors ever AFAIK, the 1st a 76-KASHIF_HIVPR WU and the 2nd one of the infamous 76-IBUCH_KID WUs.
Two different cards, both 9600 GSO. Notice a similarity in the error messages?:

<core_client_version>6.6.24</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9600 GSO"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 402325504 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 50: cufftExecC2C (gridcalc2.1)
called boinc_finish

</stderr_txt>
]]>



<core_client_version>6.6.20</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9600 GSO"
# Clock rate: 1458000 kilohertz
# Total amount of global memory: 804978688 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
ERROR: c:\cy
</stderr_txt>
]]>
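
To compare reports like the two above side by side, the `stderr_txt` header can be pulled into a small dictionary. A minimal sketch, assuming the layout shown in this post ("# key: value" lines, then WARNING and ERROR lines); the function `parse_stderr` and its field handling are illustrative, not a BOINC or GPUGrid API:

```python
def parse_stderr(text):
    """Collect '# key: value' header fields, a warning count and error lines."""
    info = {"warnings": 0, "errors": []}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line[2:].partition(":")
            info[key.strip()] = value.strip().strip('"')
        elif line.startswith("WARNING:"):
            info["warnings"] += 1
        elif line.startswith(("ERROR:", "MDIO ERROR:")):
            info["errors"].append(line)
    return info


# First report from the post above, copied as-is.
SAMPLE = r"""
# Using CUDA device 0
# Device 0: "GeForce 9600 GSO"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 402325504 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 50: cufftExecC2C (gridcalc2.1)
called boinc_finish
"""

if __name__ == "__main__":
    report = parse_stderr(SAMPLE)
    memory_bytes = int(report["Total amount of global memory"].split()[0])
    print(report["Device 0"], memory_bytes // 2**20, "MiB")
    print(report["warnings"], "warnings,", len(report["errors"]), "error lines")
```

Run over both reports, this makes the shared lines (the two H-bond warnings and the missing "restart.coor") and the differing ones (clock rate, memory size) easy to line up.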


Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9444 - Posted: 7 May 2009 | 18:14:11 UTC - in response to Message 9435.

Got one through ok, then the next went bang after 30 mins.

Successful one was:
http://www.gpugrid.net/result.php?resultid=636960 A GIANNI

The one that failed this time - a KASHIF_HIVPR
http://www.gpugrid.net/result.php?resultid=639025
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 104: cufftExecC2R (gridcalc3)

I was at the PC when this one went. There was a system warning popup. I didn't get it word for word, as I only saw a flash before it disappeared: "something something could not be contacted, video driver restarted". Don't hang your hat on that wording, but essentially it looks as though the video driver lost connection and the system automatically restarted it. When it did that: instant computation error.

I will ferret in the log files, I have the PC logged to death, hopefully I can dig something up about it.

Two more downloaded, A GIANNI and a KASHIF, I suspended the GIANNI, and will try another KASHIF, see what happens.

Regards
Zy

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9446 - Posted: 7 May 2009 | 19:22:20 UTC - in response to Message 9444.
Last modified: 7 May 2009 | 19:31:33 UTC

The KASHIF lasted 37 mins and went bang. A GIANNI is now running
The failed KASHIF: http://www.gpugrid.net/result.php?resultid=640997
Error was: Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 94 : unknown error.

(Not seen a "swizzle_out" error before)

Started this one - a GIANNI - and on past performance it will probably go through ok:
http://www.gpugrid.net/result.php?resultid=641393

[Edit] If there is any debugging switch or log file, or whatever, that I can enable at this end that will help, please let me know and I will. If you want me to run a series of suspect ones (etc.), let me know how and I will. [/Edit]

Regards
Zy

[boinc.at] Nowi
Joined: 4 Sep 08
Posts: 44
Credit: 3,685,033
RAC: 0
Level
Ala
Message 9451 - Posted: 7 May 2009 | 20:59:37 UTC

I have gotten another error on a 2-KASHIF_HIVPR WU (result). The error appeared after more than 16 hours of computation on an 8800GT. Now I have three errors in a row. In my opinion this is unacceptable!

Profile (_KoDAk_)
Joined: 18 Oct 08
Posts: 43
Credit: 6,924,807
RAC: 0
Level
Ser
Message 9459 - Posted: 8 May 2009 | 7:34:55 UTC
Last modified: 8 May 2009 | 7:35:47 UTC

boinc 6.6.24 x64

By KoDAkthebest
and some ERRORS (
http://www.gpugrid.net/results.php?hostid=31714
____________

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Message 9461 - Posted: 8 May 2009 | 8:53:00 UTC - in response to Message 9459.

We are digging into these problems.

thanks,
ignasi

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9462 - Posted: 8 May 2009 | 9:12:01 UTC - in response to Message 9461.
Last modified: 8 May 2009 | 9:15:20 UTC

Hi Ignasi

I had a look at all my computation-error tasks this morning, now that most have finally gone through. All the KASHIF ones go bang when crunched by a 9800GTX+ or below. If the wingman is a 260 or above, they go through. I am aware this is a crude deduction on my part, as I have a very limited overview of the problems; however, it now seems pretty solid that KASHIFs don't go through on cards rated 9800GTX+ and below.

If that's starting to be the case, do you still want cards of 9800GTX+ and below to run the KASHIFs? If you do, fine. I just hate running ones that will go bang, as it only delays their crunching by cards that can do it.

If you don't, I can just abort a KASHIF if I spot one coming through.

Regards
Zy

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 9464 - Posted: 8 May 2009 | 9:33:49 UTC - in response to Message 9462.
Last modified: 8 May 2009 | 9:53:02 UTC

Am I right to say that all the problems are related to older cards, like the 8800, 9800 and so on?
Did anyone experience repeated failures on those workunits with a 260, 275, 295 or 285?

gdf

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9465 - Posted: 8 May 2009 | 9:47:32 UTC - in response to Message 9464.
Last modified: 8 May 2009 | 9:51:01 UTC

Additional to my post at 9444 above.

Just remembered, and it's only a part of it (it's really annoying that I only got a flash of it as it went away): the error message referred to a file "nv???????". It may be a DLL reference, I can't remember. NV is probably no stunning revelation, but there it is for what it's worth. Whatever the final full name, the error message claimed it had "stopped" and that the system had restarted it. Instantly the WU went bang. All the CPU-based models for other projects I run have been unaffected by all this, whether during normal running or when the KASHIFs go bang.

I seem to remember another post about a week ago where a suspicion was voiced that the memory size might be too small for these, i.e. at present maybe they need 1 GB cards and go bang on 512 MB cards?

Regards
Zy

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9467 - Posted: 8 May 2009 | 10:05:25 UTC - in response to Message 9464.

Just had another KASHIF go bang, it lasted 57 mins

http://www.gpugrid.net/result.php?resultid=643475

Error message:
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 94 : unknown error.

swizzle_out is starting to be a common one for me.

Got to go out now and meet a client; I won't be back until around 4pm UTC.

Regards
Zy

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Level
Ser
Message 9468 - Posted: 8 May 2009 | 10:41:34 UTC

I have had random failures on all my cards[8800gt/9600gso/9800gt/gts250] except the gtx260-192/216.

Some fail in a short period others linger much longer.
____________
mike

SkyeHunter
Joined: 7 Mar 09
Posts: 12
Credit: 1,254,285
RAC: 0
Level
Ala
Message 9469 - Posted: 8 May 2009 | 11:02:27 UTC
Last modified: 8 May 2009 | 11:06:56 UTC

Yup, similar issue here.

Yesterday I got a WU that got stuck at 18% on my 8800GT. There were no error messages, though: the BOINC manager thought the process was still running, but it stayed at the same progress for at least 12 hours...

I cancelled the WU manually and started another one 18 hours ago. WUs usually take a little less than 13 hours, and the current one hasn't reported yet (nor has a new WU arrived; I keep my queue very short...). Probably this evening I will see a similar issue.

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 9470 - Posted: 8 May 2009 | 12:09:45 UTC - in response to Message 9465.

Additional to my post at 9444 above.

Just remembered, and it's only a part of it (it's really annoying that I only got a flash of it as it went away): the error message referred to a file "nv???????". It may be a DLL reference, I can't remember. NV is probably no stunning revelation, but there it is for what it's worth. Whatever the final full name, the error message claimed it had "stopped" and that the system had restarted it. Instantly the WU went bang. All the CPU-based models for other projects I run have been unaffected by all this, whether during normal running or when the KASHIFs go bang.

I seem to remember another post about a week ago where a suspicion was voiced that the memory size might be too small for these, i.e. at present maybe they need 1 GB cards and go bang on 512 MB cards?

Regards
Zy


My GTS250s have only 512 MB and they seem to work with KASHIF WUs. I did suggest the driver version as a culprit. I was having problems last week on my GTX260s, and uninstalling the driver (a 185 variant) and going back to 182.50 seemed to cure them.
____________
BOINC blog

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 9471 - Posted: 8 May 2009 | 12:14:16 UTC - in response to Message 9469.

Yup, similar issue here.

Yesterday I got a WU that got stuck at 18% on my 8800GT. There were no error messages, though: the BOINC manager thought the process was still running, but it stayed at the same progress for at least 12 hours...

I cancelled the WU manually and started another one 18 hours ago. WUs usually take a little less than 13 hours, and the current one hasn't reported yet (nor has a new WU arrived; I keep my queue very short...). Probably this evening I will see a similar issue.


Ahh the "never ending wu" bug. What version of BOINC are you running? It seems to have been fixed in 6.6.23 onwards.
____________
BOINC blog

dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Level
Lys
Message 9472 - Posted: 8 May 2009 | 12:24:45 UTC - in response to Message 9471.

See this thread also. I had hanging WUs using 6.6.17, and installing 6.6.23 didn't help. Installing Nvidia driver 185.85 fixed the hanging problem, but I haven't had a WU process successfully since (though it may not be a driver issue: I am currently running a GIANNI WU, which is at 67% and looking OK).

Profile dataman
Joined: 18 Sep 08
Posts: 36
Credit: 100,352,867
RAC: 0
Level
Cys
Message 9473 - Posted: 8 May 2009 | 13:14:10 UTC - in response to Message 9464.

Am I right to say that all the problems are related to older cards, like the 8800, 9800 and so on?
Did anyone experience repeated failures on those workunits with a 260,275,295 or 285?

gdf

I have 7 9800GT's and one 8800GT. All have experienced failures. I'm on 6.6.20 and 185.85. I'm shutting them down until this problem is fixed. Good Luck!
____________

[boinc.at] Nowi
Joined: 4 Sep 08
Posts: 44
Credit: 3,685,033
RAC: 0
Level
Ala
Message 9474 - Posted: 8 May 2009 | 14:09:50 UTC - in response to Message 9468.

I have had random failures on all my cards[8800gt/9600gso/9800gt/gts250] except the gtx260-192/216.


All of these GPUs are below G200. Maybe this is a clue.
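
One way to check a hunch like this is to group the failure reports by CUDA compute capability instead of by model name. A minimal sketch, assuming the compute capability values NVIDIA publishes for these cards (1.1 for the 8800 GT, 9600 GSO, 9800 GT and GTS 250; 1.3 for the GTX 260) and using only the report quoted above as data; the table and the function names are illustrative:

```python
from collections import defaultdict

# Compute capability per card, as recalled from NVIDIA's published lists
# (an assumption worth re-checking against the CUDA documentation).
COMPUTE_CAPABILITY = {
    "8800GT": "1.1",
    "9600GSO": "1.1",
    "9800GT": "1.1",
    "GTS250": "1.1",
    "GTX260": "1.3",
}


def normalize(card):
    """Reduce spellings such as 'GeForce 9600 GSO' or '9600gso' to '9600GSO'."""
    return card.upper().replace(" ", "").replace("GEFORCE", "")


def failures_by_capability(reports):
    """Count failed and clean cards per compute capability."""
    table = defaultdict(lambda: {"failed": 0, "ok": 0})
    for card, failed in reports:
        capability = COMPUTE_CAPABILITY.get(normalize(card), "unknown")
        table[capability]["failed" if failed else "ok"] += 1
    return dict(table)


# The single report quoted above: failures on every card except the GTX 260.
REPORTS = [
    ("8800gt", True),
    ("9600gso", True),
    ("9800gt", True),
    ("gts250", True),
    ("gtx260", False),
]

if __name__ == "__main__":
    for capability, counts in sorted(failures_by_capability(REPORTS).items()):
        print(capability, counts)
```

Five cards from one user are far too few to settle anything, and other posts in this thread point at driver versions instead, so a table like this is only useful once many more reports are added.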

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9476 - Posted: 8 May 2009 | 14:48:13 UTC

I hate to be a wet blanket.

But my 9800GT has five (5) successful runs on just page one of my task list, so it is NOT the card, unless it is related to memory, as this card has 1 GB of VRAM ...

I am using driver 182.50, so it may be THAT ... Win XP Pro, 32-bit, is the other variable that may be an issue. BOINC version 6.5.0 ...

The 6.6.x versions did have some scheduler problems, from somewhere in the teens up to at least 6.6.22 ... 6.6.23 and later seem to have cured that issue.
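
For anyone sorting hosts by BOINC client version, the range described here can be written as a small check. A minimal sketch, assuming the bug is limited to the 6.6 series before 6.6.23; the lower bound ("something in the teens") is not known exactly, so this deliberately flags every 6.6 release below 23, and the function names are illustrative:

```python
def parse_version(text):
    """Turn a version string such as '6.6.17' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def may_have_scheduler_bug(version):
    """True for 6.6.x clients older than 6.6.23 (an over-approximation)."""
    major, minor, patch = parse_version(version)
    return (major, minor) == (6, 6) and patch < 23


if __name__ == "__main__":
    # Client versions mentioned in this thread.
    for version in ["6.5.0", "6.6.17", "6.6.20", "6.6.23", "6.6.24", "6.6.28"]:
        print(version, may_have_scheduler_bug(version))
```

Comparing integer tuples avoids the trap of string comparison, where "6.6.9" would sort after "6.6.23". A clean result here does not rule out the workunit errors discussed in this thread, which were also reported on newer clients.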

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9478 - Posted: 8 May 2009 | 15:53:18 UTC - in response to Message 9476.

Above I mentioned a file that was "stopped" and restarted at the same moment the WU went bang. I found the error message for it. I have no idea whether it has anything to do with the current problem, or what it means in itself ... however, I am posting it for completeness, as it did happen at the exact moment the WU went bang. "nvlddmkm" was the name I was struggling to remember from the system error message.

The error message reads:

"The description for Event ID 4101 from source Display cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

nvlddmkm "


It was located in:
Event Viewer/Custom Views/Administrative Events
Source: display.

At the time it said it was "restarted" presumably referring to nvlddmkm - whatever that is :)

Regards
Zy

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Level
Ser
Message 9480 - Posted: 8 May 2009 | 16:24:22 UTC - in response to Message 9473.

Am I right to say that all the problems are related to older cards, like the 8800, 9800 and so on?
Did anyone experience repeated failures on those workunits with a 260,275,295 or 285?

gdf

I have 7 9800GT's and one 8800GT. All have experienced failures. I'm on 6.6.20 and 185.85. I'm shutting them down until this problem is fixed. Good Luck!


I'll give it one more day, maybe two and I will do likewise.

I am very surprised at the admin/developers this time. Usually there is a little more input/concern shown.

Have I missed a thread from the project that explains what is happening and their concern??
____________
mike

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 9481 - Posted: 8 May 2009 | 16:31:50 UTC
Last modified: 8 May 2009 | 16:38:38 UTC

I have had my fair share of those also, and installed all the latest drivers (Win7 185.85, which include CUDA 2.2) on this machine, along with BOINC 6.6.28.
To my surprise I now see in BOINC that my 9600 GT seems to be able to do only CUDA 1.0 instructions.
So maybe the errors created by these workunits are related to instructions which can only be performed by the newest 2x5 models,
since none of them seem to have many errors on these units.
But somehow I have had fewer problems with my machine since the latest drivers were installed; it runs kind of rock solid (only BF2 and GameGuard games are an issue).
BUT I'll remind you guys that everything I run is BETA, so problems can occur.
That it runs almost without a problem on my machine is no guarantee that it will on yours.

I guess if you have a 2X5 card you will probably see a gain in processing speed if some of the CUDA 2.2 instructions can be and/or are implemented.

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9482 - Posted: 8 May 2009 | 17:12:29 UTC - in response to Message 9481.

Some positives for comparison as the KASHIFs are going bang with me, I've left the hardware/software setup alone so there is fair comparison.

GIANNIs seem to run fine. I am 7hrs into a TONI_HIVPR, so touch wood that seems like it will go through, will finish in about 5/6 hours. I have a IBUCH_HIVPR lined up as the next to go.

Regards
Zy

naja002
Joined: 25 Sep 08
Posts: 111
Credit: 10,352,599
RAC: 0
Level
Pro
Message 9484 - Posted: 8 May 2009 | 19:47:18 UTC
Last modified: 8 May 2009 | 20:00:00 UTC

I have aborted all:

KASHIF_HIVPR
and
IBUCH_KID

and will now continue to do so.

I have 5x 8800GS and 1x 8800GT--those WUs do not complete on my rigs and most of them hang. Yesterday I completed ONE WU instead of 9-11. 5K ppd instead of 50Kppd.

Was on 6.6.17, 3 rigs 185.26, 1 rig 182.50

As of last night all rigs are: 6.6.28 and 3x 185.26, 1x 185.85--seems to have helped some.

This is an "across the farm" thing for me now. Problems initially started on the dual gpu rigs, but now it's across the board....

My rigs are not hidden. The Phunam-PC is a new setup--the initial errors are from setup, OCing, etc. I understand those. The new ones are part of this mess.

Hoping it gets sorted out soon....

EDIT: I have kept 1 KASHIF_HIVPR that appears to be running ok on a single Gpu rig. However, 1st sign of trouble and it's history.....

Profile Aardvark
Joined: 27 Nov 08
Posts: 28
Credit: 82,362,324
RAC: 0
Level
Thr
Message 9488 - Posted: 8 May 2009 | 21:46:09 UTC
Last modified: 8 May 2009 | 21:54:19 UTC

Likewise here, failures on

KASHIF_HIVPR
and
IBUCH_KID

Two different machines. One with 32 bit Vista, 8800GT (O/C), client 6.6.20 & 185.86 driver. The other with 64 bit Vista, 9800 GX2 (Not O/C), client 6.6.20 & 182.50 driver. I have now updated both drivers to 185.85, which is latest release.

SkyeHunter
Joined: 7 Mar 09
Posts: 12
Credit: 1,254,285
RAC: 0
Level
Ala
Message 9492 - Posted: 8 May 2009 | 22:58:30 UTC - in response to Message 9471.



Ahh the "never ending wu" bug. What version of BOINC are you running? It seems to have been fixed in 6.6.23 onwards.


Indeed, nice description of what happened here. I installed BOINC 6.5.0 and the WU picked up nicely from where it had stalled ...

Although it was a KASHIF WU, the scheduler was apparently to blame ....

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 9493 - Posted: 8 May 2009 | 23:31:56 UTC

Well, again I had a unit error out after 13 hours of work, and it looks like the big gun machines run them all fine.
I can't go on like this; I have lost hundreds of hours of time and money for nothing.
For the time being I am also shutting down GPUGRID till this issue is solved.

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9495 - Posted: 8 May 2009 | 23:50:11 UTC - in response to Message 9461.

I am aware that there is hard work going on to find the cause/fix. If it's possible for someone to take a timeout for 2 mins to advise us all whether you still want the KASHIFs run by lower-end cards, I suspect it would help enormously, as we could then abort them and leave them to the big guns, knowing it's not going to cause issues in the bug-finding, and carry on with the other WUs.

At present it seems lots are shutting down from doing anything in the absence of any advice, understandably, but the other WUs seem ok.

Just a gentle suggestion ...

Regards
Zy

naja002
Joined: 25 Sep 08
Posts: 111
Credit: 10,352,599
RAC: 0
Level
Pro
Message 9499 - Posted: 9 May 2009 | 4:27:01 UTC
Last modified: 9 May 2009 | 4:51:52 UTC

The last KASHIF_HIVPR did in fact error out.....No more for me. I'm just going to have to check my rigs 1-2x/day and send them back....


I am aware that there is hard work going on to find the cause/fix. If it's possible for someone to take a timeout for 2 mins to advise us all whether you still want the KASHIFs run by lower-end cards, I suspect it would help enormously, as we could then abort them and leave them to the big guns, knowing it's not going to cause issues in the bug-finding, and carry on with the other WUs.

At present it seems lots are shutting down from doing anything in the absence of any advice, understandably, but the other WUs seem ok.

Just a gentle suggestion ...

Regards
Zy



My guess would be that they are still releasing them because they run on the higher end cards. They can still get the work completed. However, if that's the case, then I think the server needs to be set up to issue specific WUs to specific cards. The server gets plenty of info from our rigs---so I don't see why that can't be done....
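
As an illustration of the idea above, here is a minimal Python sketch of such a filter. It is not BOINC or GPUGRID scheduler code: the function names and the matching on card names are assumptions made up for this example, and the G200 list only covers the cards named in this thread.

```python
# Hypothetical sketch only: this is NOT the real BOINC/GPUGRID scheduler.
# Series reported as problematic in this thread.
SUSPECT_SERIES = ("KASHIF_HIVPR", "IBUCH_KID")

# G200-based cards named in this thread (assumed to be reported with these
# substrings in the GPU name the client sends to the server).
G200_MARKERS = ("GTX 260", "GTX 275", "GTX 285", "GTX 295")


def is_g200(gpu_name):
    """Return True if the reported GPU name looks like a G200-based card."""
    return any(marker in gpu_name for marker in G200_MARKERS)


def may_send(wu_name, gpu_name):
    """Send suspect series only to G200 cards; everything else goes anywhere."""
    suspect = any(series in wu_name for series in SUSPECT_SERIES)
    return is_g200(gpu_name) or not suspect


print(may_send("93-IBUCH_KID_shao_ba2-0-100-RND9546_1", "GeForce 8800 GT"))
print(may_send("93-IBUCH_KID_shao_ba2-0-100-RND9546_1", "GeForce GTX 260"))
```

A real server would need a better signal than the card name, since it is not yet clear in this thread which property of the cards actually triggers the failures.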

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Level
Ser
Message 9504 - Posted: 9 May 2009 | 7:23:36 UTC

Nothing will likely be done until sometime Monday. I am also at No New Work until the problem is resolved.
____________
mike

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 9506 - Posted: 9 May 2009 | 8:34:26 UTC - in response to Message 9504.

The real problem is that we do not understand why these WUs crash. There are several Kashif_XXX workunits and only a subset of them crashes on some machines.

We will stop the crashing WUs as more testing did not really help.

gdf

Profile Bymark
Joined: 23 Feb 09
Posts: 30
Credit: 5,897,921
RAC: 0
Level
Ser
Message 9515 - Posted: 9 May 2009 | 10:34:36 UTC
Last modified: 9 May 2009 | 10:40:31 UTC

I have a big problem with my new asus 260:

hostid=35303

I downgraded all drivers, and am now waiting to get more tasks.
"reached daily quota of 4 results" heh ;),
Any suggestions? Seti GPU tasks are working fine.......
____________
"Silakka"
Hello from Turku > Åbo.

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 9521 - Posted: 9 May 2009 | 10:57:09 UTC
Last modified: 9 May 2009 | 11:06:44 UTC

Sadly yes, the famous units which we are discussing all over the forum.

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9525 - Posted: 9 May 2009 | 11:28:04 UTC - in response to Message 9515.
Last modified: 9 May 2009 | 11:30:34 UTC

I have a big problem with my new asus 260:

hostid=35303

I downgraded all drivers, and am now waiting to get more tasks.
"reached daily quota of 4 results" heh ;),
Any suggestions? Seti GPU tasks are working fine.......


The ones crashing on that machine are not the suspect WUs that they have now stopped issuing; those crashing on that machine usually run fine. He also has a 260, which is outside the problems: it's the lower cards that did have issues in the past. Something else lurketh. No idea what personally, over to the Gurus for that.

Regards
Zy

Profile Sandro
Joined: 19 Aug 08
Posts: 22
Credit: 3,660,304
RAC: 0
Level
Ala
Message 9528 - Posted: 9 May 2009 | 11:59:27 UTC - in response to Message 9464.

Am I right to say that all the problems are related to older cards, like 8800, 9800 and so on?
Did anyone experience repeated failures on those workunits with a 260, 275, 295 or 285?

gdf

Yes. My GTX 260 running under 64bit Ubuntu also crashes WUs

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Clock rate: 1242000 kilohertz
# Total amount of global memory: 938803200 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"

</stderr_txt>
]]>


exit status: 11 (0xb)
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Clock rate: 1242000 kilohertz
# Total amount of global memory: 938803200 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO ERROR: cannot open file "restart.coor"

</stderr_txt>
]]>

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9530 - Posted: 9 May 2009 | 12:22:08 UTC
Last modified: 9 May 2009 | 12:33:49 UTC

Let's gather some of that information:

- all failures reported here affect G92 and G9x-class chips
- G200 usually runs them just fine
- there are some errors with G200 as well, but this could just be the normal error rate
- Paul's G92 runs fine (and hopefully others)

-> it's a bug which is triggered by a special client configuration

- BOINC 6.6.x, 6.5.0 and 6.4.7 are definitely affected -> the version likely doesn't matter
- driver 185.8x, 185.6x and 182.50 are reported to be affected, but 182.50 for XP32 works for Paul

-> did anyone try older drivers? E.g. 182.08, which has a very solid track record

- Paul's card has 1 GB of memory, whereas most G92 cards have 512 MB or less

Do we have any other reports of G9x cards, which run these tasks fine? Could anyone check the memory consumption of these WUs with RivaTuner?

EDIT: only certain WUs of the "IBUCH_KID" and "KASHIF_HIVPR" series are affected. Do we know which ones? Are the ones which work for Paul's card, by pure coincidence, all of the type which works?

For example my 9800GTX+ 512MB on Vista 64, 185.66 and 6.5.0 finished:


  • 88-KASHIF_HIVPR_dim_ba2-2-100-RND8763_0
  • 7-KASHIF_HIVPR_mon_ba5-6-100-RND3602_1
  • 57-KASHIF_HIVPR_mon_ba4-4-100-RND1833_1


and failed


  • 79-KASHIF_HIVPR_n1_for_ba1-4-100-RND9984_0
  • 175-IBUCH_KID_shao_ba1-1-100-RND4198_2
  • 93-IBUCH_KID_shao_ba2-0-100-RND9546_1
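
One way to look for a pattern in lists like the two above is to group the task names by sub-series. The sketch below is only an illustration in Python; it assumes that the second dash-separated field names the sub-series and that a trailing "_baN" is a batch label, which nobody in this thread has confirmed.

```python
import re
from collections import Counter

# Task names copied from the two lists above.
FINISHED = [
    "88-KASHIF_HIVPR_dim_ba2-2-100-RND8763_0",
    "7-KASHIF_HIVPR_mon_ba5-6-100-RND3602_1",
    "57-KASHIF_HIVPR_mon_ba4-4-100-RND1833_1",
]
FAILED = [
    "79-KASHIF_HIVPR_n1_for_ba1-4-100-RND9984_0",
    "175-IBUCH_KID_shao_ba1-1-100-RND4198_2",
    "93-IBUCH_KID_shao_ba2-0-100-RND9546_1",
]


def subtype(wu_name):
    """Return the sub-series, e.g. 'KASHIF_HIVPR_mon' (assumed naming scheme)."""
    group = wu_name.split("-")[1]          # second dash-separated field
    return re.sub(r"_ba\d+$", "", group)   # drop the assumed batch label


def tally(names):
    """Count how many task names fall into each sub-series."""
    return Counter(subtype(name) for name in names)


print("finished:", dict(tally(FINISHED)))
print("failed:  ", dict(tally(FAILED)))
```

For these six tasks the finished and failed sub-series do not overlap, but six results are far too few to draw a conclusion.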



MrS
____________
Scanning for our furry friends since Jan 2002

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Level
Ser
Message 9534 - Posted: 9 May 2009 | 13:09:50 UTC

I am on 6.4.5 and use either 177.82 or 180.22 on Ubuntu 64.

I have had many failures on all cards except my 260s [192/216].
____________
mike

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 9538 - Posted: 9 May 2009 | 13:39:04 UTC - in response to Message 9530.

Let's gather some of that information:

- all failures reported here affect G92 and G9x-class chips
- G200 usually runs them just fine
- there are some errors with G200 as well, but this could just be the normal error rate
- Paul's G92 runs fine (and hopefully others)

-> it's a bug which is triggered by a special client configuration

- BOINC 6.6.x, 6.5.0 and 6.4.7 are definitely affected -> the version likely doesn't matter
- driver 185.8x, 185.6x and 182.50 are reported to be affected, but 182.50 for XP32 works for Paul

-> did anyone try older drivers? E.g. 182.08, which has a very solid track record

- Paul's card has 1 GB of memory, whereas most G92 cards have 512 MB or less

Do we have any other reports of G9x cards, which run these tasks fine? Could anyone check the memory consumption of these WUs with RivaTuner?

EDIT: only certain WUs of the "IBUCH_KID" and "KASHIF_HIVPR" series are affected. Do we know which ones? Are the ones which work for Paul's card, by pure coincidence, all of the type which works?

For example my 9800GTX+ 512MB on Vista 64, 185.66 and 6.5.0 finished:

  • 88-KASHIF_HIVPR_dim_ba2-2-100-RND8763_0
  • 7-KASHIF_HIVPR_mon_ba5-6-100-RND3602_1
  • 57-KASHIF_HIVPR_mon_ba4-4-100-RND1833_1


and failed


  • 79-KASHIF_HIVPR_n1_for_ba1-4-100-RND9984_0
  • 175-IBUCH_KID_shao_ba1-1-100-RND4198_2
  • 93-IBUCH_KID_shao_ba2-0-100-RND9546_1



MrS



I have 4 machines with GTS250s (512 MB). They are running under XP32 with 182.50 drivers and seem fine.

I have an i7 with dual GTX260s. It is running under XP32 with 182.50 drivers and also seems fine. I had problems a week ago with 185.xx (beta) drivers and uninstalled them before reinstalling 182.50 drivers. Problems seemed to go away after that.

All machines currently running BOINC 6.6.28.

I had one IBUCH_KID WU, which I aborted after seeing the post from GDF regarding them being in error. KASHIF_HIVPR seem fine.
____________
BOINC blog

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9540 - Posted: 9 May 2009 | 13:45:59 UTC - in response to Message 9538.

Oh, so it also affects Linux. Maybe there's not much point in searching through Windows and driver versions then.

I had one IBUCH_KID WU, which I aborted after seeing the post from GDF regarding them being in error. KASHIF_HIVPR seem fine.


Some WUs of both series are affected, but not on G200 based cards (GTX 2xx).

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9545 - Posted: 9 May 2009 | 14:15:43 UTC

Well, I just had a crash on the i7 with 67-KASHIF_HIVPR_n1_for_ba3-2-100-RND8737; this is a task that died at least twice before.

The thing is, I was playing a game at the time, a low-intensity turn-based strategy game. But I cannot say if that had any effect. The game seemed to die and the graphics driver crashed. That said, the other tasks in progress seemed to stay OK ...

More interesting is that there were three different errors ...

Of course, the task was run on three different class cards.

And I am running BOINC 6.6.28 on that machine ... still 182.50 drivers though.

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9549 - Posted: 9 May 2009 | 14:32:07 UTC - in response to Message 9545.
Last modified: 9 May 2009 | 14:34:33 UTC

I have been having a closer look at my errors, and a few from others. This bears some checking, but it appears on the face of it that the crashed ones do have a common element: "signal 11". The "h-bond" message is a red herring here, as it refers to the "Amber" processes (is that right?). No matter the detail, it was cleared up in another thread as a non-issue: just a text message about the internal processes in the WU, not its validity as a successful WU.

"Signal 11" does appear virtually every time in the ones I looked at. I am aware signal 11 is an issue way down in the Communication Layer, which in itself rings a bell considering the way the current problems affect some cards and not others, and some operating systems and not others. But I have no idea where to take that logic further, or even whether it has validity; I don't have that level of knowledge. I am aware that signal 11 can appear for many, many reasons, and it can be difficult to work out what the reason is, but if that's the case this time, at least it's the start down the right road.
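
For sorting results by failure type, the error text can be checked mechanically. The following Python sketch is an illustration only: the two patterns are taken from messages quoted in this thread, the function name is made up, and it has not been run against real GPUGRID result pages. On Linux, signal 11 is SIGSEGV, a segmentation fault in the application.

```python
import re
import signal

# Patterns copied from error messages quoted in this thread.
SIGNAL_LINE = re.compile(r"process got signal (\d+)")
KERNEL_ERROR = re.compile(
    r"Cuda error: Kernel \[(\w+)\] failed in file '([^']+)' in line (\d+)"
)


def classify(stderr_text):
    """Return a tuple describing the first recognised failure in the text."""
    match = SIGNAL_LINE.search(stderr_text)
    if match:
        number = int(match.group(1))
        try:
            # Name as known on the machine running this script,
            # which may differ from the host that produced the log.
            name = signal.Signals(number).name
        except ValueError:
            name = "unknown"
        return ("signal", number, name)
    match = KERNEL_ERROR.search(stderr_text)
    if match:
        kernel, source_file, line = match.groups()
        return ("cuda_kernel", kernel, source_file, int(line))
    return ("other",)


print(classify("process got signal 11"))
print(classify("Cuda error: Kernel [shake_step_2] failed in file "
               "'shake.cu' in line 128 : unknown error."))
```

Splitting results this way would show whether "signal 11" and the CUDA kernel errors come from the same hosts and operating systems or from different ones.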

Regards
Zy

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9553 - Posted: 9 May 2009 | 15:20:44 UTC

@Zydor: I don't see "signal 11", neither in my nor in your latest results.

@Paul: that's number 3 of these tasks which have failed on a G200 card. But the circumstances were slightly unusual.. not sure if it means anything.

@all: ouch, 2 more errors for me:

- "30-KASHIF_HIVPR_dim_ba3-4-100-RND0655_0" - seems "normal"
- "p2690000-IBUCH_pYIpYVkp01_0705-2-10-RND1281_1" - not normal

The second task registered only 3 s of CPU time, so it may have happened while the driver was still restarting.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Bymark
Joined: 23 Feb 09
Posts: 30
Credit: 5,897,921
RAC: 0
Level
Ser
Message 9559 - Posted: 9 May 2009 | 16:27:50 UTC - in response to Message 9525.
Last modified: 9 May 2009 | 16:50:40 UTC

I have a big problem with my new asus 260:

hostid=35303

I downgraded all drivers, and am now waiting to get more tasks.
"reached daily quota of 4 results" heh ;),
Any suggestions? Seti GPU tasks are working fine.......


The ones crashing on that machine are not the suspect WUs that they have now stopped issuing; those crashing on that machine usually run fine. He also has a 260, which is outside the problems: it's the lower cards that did have issues in the past. Something else lurketh. No idea what personally, over to the Gurus for that.

Regards
Zy


Now I have exactly the same drivers, BOINC, etc. as my fine working ati 260. Still waiting for new WUs; seti is working fine. Same power (550 W), so all should be identical. Maybe it's a hardware problem, but then I don't understand why seti GPU tasks work without failure. Running one seti GPU task:

Seti account for same computer

Hardware monitor
-----------------------------------------------------

AMD Athlon 64 X2 5600+ hardware monitor

Temperature sensor 0 33°C (91°F) [0x149] (Core #0)
Temperature sensor 1 38°C (99°F) [0x15A] (Core #1)

Dump hardware monitor

Hardware monitor
-----------------------------------------------------

GeForce GTX 260 hardware monitor

Temperature sensor 0 71°C (159°F) [0x47] (GPU Core)
____________
"Silakka"
Hello from Turku > Åbo.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9560 - Posted: 9 May 2009 | 18:08:31 UTC - in response to Message 9559.

Well, you also got >6 errors a day, but your problem is totally unrelated to what is being discussed in this thread. It might help to ask in a separate thread if you need further assistance. Do 3D Mark and/or Furmark run on your card? Seti stresses the hardware less than GPU-Grid.

MrS
____________
Scanning for our furry friends since Jan 2002

[AF] Profanateur
Joined: 25 Oct 08
Posts: 42
Credit: 42,812,268
RAC: 0
Level
Val
Message 9562 - Posted: 9 May 2009 | 19:03:31 UTC - in response to Message 9560.

And for my pbs ? with driver other than 182.5.

Profile Aardvark
Joined: 27 Nov 08
Posts: 28
Credit: 82,362,324
RAC: 0
Level
Thr
Message 9567 - Posted: 10 May 2009 | 0:34:22 UTC

Success on 52-KASHIF_HIVPR_mon_ba3-7-100-RND3244_0. 64 bit Vista, 9800 GX2 (Not O/C), client 6.6.20 & 182.85 driver.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9578 - Posted: 10 May 2009 | 8:17:23 UTC

Aardvark, so far the "KASHIF_HIVPR_mon" have also been fine for my machine. Thanks for the info.. seems like these are indeed not the trouble makers.

Profanateur, if I remember correctly you have a separate thread regarding your problem elsewhere. And since all WUs error on your machine, you are facing a different problem than what is discussed here. I think I wrote some suggestions in that other thread.. well, I hope. At least I wanted to write something ;)
What do you mean by pbs?

MrS
____________
Scanning for our furry friends since Jan 2002

[AF] Profanateur
Joined: 25 Oct 08
Posts: 42
Credit: 42,812,268
RAC: 0
Level
Val
Message 9581 - Posted: 10 May 2009 | 8:49:47 UTC

pbs = problems = failures.
Sorry, but I'm French.

boincwoman
Joined: 9 May 09
Posts: 1
Credit: 2,096,817
RAC: 0
Level
Ala
Message 9585 - Posted: 10 May 2009 | 11:31:08 UTC

I'm new here.

I have errors with these:

75-IBUCH_HIVPR_mon_ba8-4-100-RND5234 id: 451357

100-KASHIF_HIVPR_n1_for_ba4-4-100-RND3172 id: 448737

Shuttle XPC
Vista Enterprise 64 bit, 2 GB RAM
AMD Opteron 2.4 GHz model 180
GeForce 9400GT, 1 GB RAM, newly bought
BOINC 6.6.20

ComputerID: 35365

The Boincwoman

refla
Joined: 12 Feb 09
Posts: 9
Credit: 385,357
RAC: 0
Level

Message 9586 - Posted: 10 May 2009 | 12:12:50 UTC - in response to Message 9530.
Last modified: 10 May 2009 | 12:15:06 UTC

xp/32 + [email protected] + BOINC6.4.5 cannot survive!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9588 - Posted: 10 May 2009 | 12:25:02 UTC - in response to Message 9586.

Refla, not sure what you mean. You only have successful WUs and others which are listed as "aborted by user". Sure, they can't survive if you abort them ;)

Boincwoman, your machine has not completed any WUs so far, so I'm not sure if we can attribute your failure of the "IBUCH_HIVPR" to the error discussed here. If your card is passively cooled it may be overheating (check with GPU-Z and report temperatures). Otherwise your setup should be fine.
However, the card is very slow: it has 16 shaders ("stream processors"), whereas at least 50 are officially recommended (FAQ). You'll have trouble meeting the GPU-Grid deadlines, and you may want to take a look at seti for your GPU.

MrS
____________
Scanning for our furry friends since Jan 2002

[AF] Profanateur
Joined: 25 Oct 08
Posts: 42
Credit: 42,812,268
RAC: 0
Level
Val
Message 9605 - Posted: 10 May 2009 | 20:18:45 UTC

Errors today:

10/05/2009 10:53:19 GPUGRID Output file p1760000-IBUCH_pYIpYVkp01_0705-4-10-RND5135_0_1 for task p1760000-IBUCH_pYIpYVkp01_0705-4-10-RND5135_0 absent
10/05/2009 16:56:28 GPUGRID Output file p2750000-IBUCH_pYIpYVkp01_0705-4-10-RND5064_1_1 for task p2750000-IBUCH_pYIpYVkp01_0705-4-10-RND5064_1 absent

refla
Joined: 12 Feb 09
Posts: 9
Credit: 385,357
RAC: 0
Level

Message 9608 - Posted: 10 May 2009 | 21:11:16 UTC - in response to Message 9588.

ETA:

I aborted them because the WUs' progress had not advanced for a long time (at least more than 1 hour). The situation did not change even after I rebooted my computer.

After 2 WUs, I suspect that if the last number in the task name is greater than zero, it is likely a bad WU.
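
The rule of thumb above can be written down as a small check. This Python sketch is only an illustration and assumes that the number after the final underscore in a task name is the sequence number of the copy that was sent out; the function names are made up for this example.

```python
def copy_number(task_name):
    """Return the number after the last underscore, e.g. 2 for '..._2'."""
    return int(task_name.rsplit("_", 1)[1])


def is_reissue(task_name):
    """True if the task is not the first copy (suffix greater than zero)."""
    return copy_number(task_name) > 0


for name in (
    "175-IBUCH_KID_shao_ba1-1-100-RND4198_2",
    "7-KASHIF_HIVPR_mon_ba5-6-100-RND3602_1",
    "88-KASHIF_HIVPR_dim_ba2-2-100-RND8763_0",
):
    print(name, copy_number(name), is_reissue(name))
```

The suffix alone is not a reliable sign of a bad WU: elsewhere in this thread, 7-KASHIF_HIVPR_mon_ba5-6-100-RND3602_1 is reported as finished, while 79-KASHIF_HIVPR_n1_for_ba1-4-100-RND9984_0 is reported as failed.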

Details in http://www.gpugrid.net/forum_thread.php?id=1041

My English is not good enough, I hope you can understand what I mean.

:)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9613 - Posted: 10 May 2009 | 21:48:37 UTC

Profanateur,

your problem is not related to what is being discussed here. Very many of your WUs error, this is different from the "KASHIF_HIVPR" and "IBUCH_KID" issue. You actually completed some, so your software should be fine.

However, you are running a very new driver and two overclocked cards, which are very different. All of these or their combination could lead to problems. I suggest you start a new thread (instead of posting a little in different threads), write down your current config (software versions, clocks, GPU temperatures) and then change some parameters, document the changes and see if it helps. By that I mean

- run only 1 of the cards to see if one is broken
- reduce all clocks to standard values
- run other stability tests
- try well-tested drivers like 182.50 or 182.08
- maybe more

If you do that we (or you yourself ;) should be able to get you going.

Regards,
MrS
____________
Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9615 - Posted: 10 May 2009 | 21:57:31 UTC - in response to Message 9608.

refla,

that's strange. You're running 6.4.5, so you shouldn't be affected by the slow-6.6.20-bug. Also most of your canceled WUs may belong to the critical "KASHIF_HIVPR" and "IBUCH_KID" series, but some were also "IBUCH_pYIpYVkp01", which have not been reported to fail massively.
Furthermore your WUs are crunched just fine on G200-based cards, whereas no G9x returned any of them. Sorry, don't know what this means..

MrS
____________
Scanning for our furry friends since Jan 2002

refla
Joined: 12 Feb 09
Posts: 9
Credit: 385,357
RAC: 0
Level

Message 9627 - Posted: 11 May 2009 | 3:55:09 UTC - in response to Message 9615.

ETA,

please tell me how to avoid or recover from the case where a WU's progress freezes.

You can see I'm not the only one who has met this case. Before I aborted them, other GPUGriders had done the same.

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 9630 - Posted: 11 May 2009 | 10:12:05 UTC - in response to Message 9627.

ETA,

please tell me how to avoid or recover from the case where a WU's progress freezes.

You can see I'm not the only one who has met this case. Before I aborted them, other GPUGriders had done the same.


@refla:

I would suggest you switch to BOINC 6.6.23.

Your driver version is not shown, but as ETA has said above I would suggest 182.50 drivers as they seem to be reliable.
____________
BOINC blog

palmss
Joined: 28 Aug 08
Posts: 7
Credit: 60,897,550
RAC: 0
Level
Thr
Message 9631 - Posted: 11 May 2009 | 10:41:28 UTC

Hi
I have another error (Kernel [nb_k] failed in file 'nb.cu' in line 202 : unknown error) on a new type of WU: http://www.gpugrid.net/result.php?resultid=645509

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 9632 - Posted: 11 May 2009 | 11:08:39 UTC - in response to Message 9631.

Hi
I have another error (Kernel [nb_k] failed in file 'nb.cu' in line 202 : unknown error) on a new type of WU: http://www.gpugrid.net/result.php?resultid=645509


What driver version are you using?
____________
BOINC blog

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Level
Ser
Message 9633 - Posted: 11 May 2009 | 11:23:09 UTC
Last modified: 11 May 2009 | 11:23:33 UTC

Have the "EVIL" work units been disabled or deleted?

I have stopped work on 8 of my cards [250s and below]. The two 260s are doing OK.
____________
mike

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9635 - Posted: 11 May 2009 | 12:25:39 UTC - in response to Message 9633.
Last modified: 11 May 2009 | 12:27:45 UTC

Yes, they stopped issuing the suspect ones on Saturday. It's not all KASHIFs that are suspect: there are several types of KASHIF WUs, and it was only one particular type that was giving grief.

See http://www.gpugrid.net/forum_thread.php?id=1034&nowrap=true#9506

Regards
Zy

[AF] Profanateur
Joined: 25 Oct 08
Posts: 42
Credit: 42,812,268
RAC: 0
Level
Val
Message 9644 - Posted: 11 May 2009 | 17:20:39 UTC - in response to Message 9613.
Last modified: 11 May 2009 | 17:21:29 UTC

Profanateur,

your problem is not related to what is being discussed here. Very many of your WUs error, this is different from the "KASHIF_HIVPR" and "IBUCH_KID" issue. You actually completed some, so your software should be fine.

However, you are running a very new driver and two overclocked cards, which are very different. All of these or their combination could lead to problems. I suggest you start a new thread (instead of posting a little in different threads), write down your current config (software versions, clocks, GPU temperatures) and then change some parameters, document the changes and see if it helps. By that I mean

- run only 1 of the cards to see if one is broken
- reduce all clocks to standard values
- run other stability tests
- try well-tested drivers like 182.50 or 182.08
- maybe more

If you do that we (or you yourself ;) should be able to get you going.

Regards,
MrS

I have no errors with 182.50.
I said that from the beginning.

Profile Bymark
Joined: 23 Feb 09
Posts: 30
Credit: 5,897,921
RAC: 0
Level
Ser
Message 9645 - Posted: 11 May 2009 | 18:11:52 UTC - in response to Message 9560.
Last modified: 11 May 2009 | 18:56:22 UTC

My solution on the 260 was Boinc 6.4.7 and driver 178.28, now working as a train.
Slow but getting faster, like a first mosquito this summer today in Turku Finland.

Thomas Bymark
____________
"Silakka"
Hello from Turku > Åbo.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9649 - Posted: 11 May 2009 | 20:48:02 UTC - in response to Message 9644.

Profanateur wrote:

I have no errors with 182.50.
I said that from the beginning.


Actually you said "And for my pbs ? with driver other than 182.5." Which I understand as "I'm not interested in my problems with 182.50, only in the problems with other drivers".

Well, no. Actually when I read that post I thought something like "Isn't that the guy with many errors and the exotic setup? What does he want to say?" Now that I know I understand you.

So if you know 182.50 works, why don't you use it?

MrS
____________
Scanning for our furry friends since Jan 2002

[AF] Profanateur
Joined: 25 Oct 08
Posts: 42
Credit: 42,812,268
RAC: 0
Level
Val
Message 9653 - Posted: 11 May 2009 | 21:11:30 UTC

'cause I want the latest release to have ambient occlusion in games.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9654 - Posted: 11 May 2009 | 21:15:55 UTC - in response to Message 9653.

Then you'll be glad to hear about this ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Andrew
Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Level
Pro
Message 9655 - Posted: 11 May 2009 | 21:43:53 UTC
Last modified: 11 May 2009 | 21:47:27 UTC

I had 2 fail on my 8800GT, one on 5th May, and one right now. My screen actually went black for a few seconds and I briefly saw Windows Error Reporting in Process Explorer! Driver version 182.50, I believe. The card was at stock clocks at the time (fine).

5th May one was 159-IBUCH_KID_shao_ba1-0-100-RND5509_1:

and the one just now was 53-KASHIF_HIVPR_n1_for_ba1-2-100-RND0722_1:
which had the swizzle error others have described.

palmss
Send message
Joined: 28 Aug 08
Posts: 7
Credit: 60,897,550
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 9658 - Posted: 11 May 2009 | 22:46:02 UTC - in response to Message 9632.

Hi
I have another error (Kernel [nb_k] failed in file 'nb.cu' in line 202 : unknown error) on a new type of WU: http://www.gpugrid.net/result.php?resultid=645509


What driver version are you using?

I have the version 181.22 driver

refla
Joined: 12 Feb 09
Posts: 9
Credit: 385,357
RAC: 0
Level

Message 9663 - Posted: 12 May 2009 | 4:37:40 UTC - in response to Message 9630.

ETA,

please tell me how to avoid or recover from the case where a WU's progress freezes.

You can see I'm not the only one who has met this case. Before I aborted them, other GPUGriders had done the same.


@refla:

I would suggest you switch to BOINC 6.6.23.

Your driver version is not shown, but as ETA has said above I would suggest 182.50 drivers as they seem to be reliable.


MarkJ:

Thanks, I will test it. :)

naja002
Joined: 25 Sep 08
Posts: 111
Credit: 10,352,599
RAC: 0
Level
Pro
Message 9697 - Posted: 13 May 2009 | 1:32:41 UTC - in response to Message 9649.
Last modified: 13 May 2009 | 1:40:24 UTC




Well, no. Actually when I read that post I thought something like "Isn't that the guy with many errors and the exotic setup? What does he want to say?"


MrS



That may be me. If so, the many errors are from 2 sources: my fault and not my fault ;) Some of these WUs are a nightmare and I don't accept responsibility for that. However, I have had an issue or 3 on my end...those things I understand and accept responsibility for...;) The i7 upgrade produced a lot of initial errors, because of driver compatibility. I've produced 1 successful WU after another for long periods of time. When I start to develop issues--I try to sort it out and get it straight, but when the issues are really not on my end...there's not much that I can do except ride it out.


But I can say that I've used the 185.26 driver on 3 rigs (initially 4) for a month before all these issues arose. So the issue is the WUs being incompatible with the driver, not the driver being incompatible with the WUs. In other words, whatever changed to cause the incompatibility is in the WUs, not the driver. I cannot speak for any other version of 185.xx though...


Also, IBUCH_pYIpYVkp01_0705-4-10 seems to be another WU with issues, but I think that is already known....

HTH

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Message 9745 - Posted: 14 May 2009 | 9:50:14 UTC - in response to Message 9697.



Also, IBUCH_pYIpYVkp01_0705-4-10 seems to be another WU with issues, but I think that is already known....

HTH


This runs fine actually. Are you referring to any error in particular?

ignasi

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9748 - Posted: 14 May 2009 | 11:23:34 UTC
Last modified: 14 May 2009 | 11:24:13 UTC

Just had a KASHIF go bang. It appears to be the old hassles on the face of it - just highlighting it for the record due to recent hassles with some KASHIFs.

http://www.gpugrid.net/result.php?resultid=667592

It had been running for 11 hrs 15 min, so it was different from the others I had go: they were early, this was late in processing, almost finished when it went. "One of those things" I suspect.

The network connection was down at the time: a major BT Network fault that had been extant for nearly 24 hrs. That should have had no effect; I just mention it for completeness, as it was down when the WU went bang.

Regards
Zy

naja002
Joined: 25 Sep 08
Posts: 111
Credit: 10,352,599
RAC: 0
Level
Pro
Message 9750 - Posted: 14 May 2009 | 12:53:20 UTC - in response to Message 9745.



Also, IBUCH_pYIpYVkp01_0705-4-10 seems to be another WU with issues, but I think that is already known....

HTH


This runs fine actually. Are you referring to any error in particular?

ignasi



p3400000-IBUCH_pYIpYVkp01_0705-4-10-RND9113

p1390000-IBUCH_pYIpYVkp01_0705-3-10-RND2928

p2200000-IBUCH_pYIpYVk52804-9-10-RND5157


I'm not sure what the Quadro cards are equivalent to... 8 series, 9, 200...

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 9754 - Posted: 14 May 2009 | 14:23:29 UTC - in response to Message 9750.

Please look at the driver thread.
gdf
