Message boards : Number crunching : Bad batch of TONI-AGGd tasks

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,892,442,778
RAC: 19,949,798
Level
Tyr
Message 27194 - Posted: 2 Nov 2012 | 9:50:33 UTC

Both my active hosts - GTX470 and GTX670 - are showing "Energies have become nan" errors for all recent tasks in this batch (short queue).

http://www.gpugrid.net/results.php?hostid=43404&offset=0&show_names=1&state=5&appid=18
http://www.gpugrid.net/results.php?hostid=132158&offset=0&show_names=1&state=5&appid=18

Replication numbers are up to _5, _6, _7 - all wingmates are affected too.

[AF>Belgique] bill1170
Joined: 4 Jan 09
Posts: 13
Credit: 1,292,573,895
RAC: 3,498,181
Level
Met
Message 27195 - Posted: 2 Nov 2012 | 10:00:32 UTC - in response to Message 27194.

Same for me. The problem started on November 1st around 22:00 UTC. I had to suspend GPUGRID temporarily, as I am receiving only WUs from this faulty "Toni" batch.

http://www.gpugrid.net/workunit.php?wuid=3796015

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level
Arg
Message 27199 - Posted: 2 Nov 2012 | 16:56:35 UTC

I just had to download 5 before I got a NOELIA that would run. That brings my total to 8 failed TONI WUs. It sucks, because GPUGRID doesn't know that a task has failed, and one of my video cards sits idle for some time.

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 27201 - Posted: 3 Nov 2012 | 0:48:59 UTC - in response to Message 27199.

Yes, these are wrong. I have just notified Toni.

gdf

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 27202 - Posted: 3 Nov 2012 | 10:39:31 UTC - in response to Message 27201.
Last modified: 3 Nov 2012 | 10:40:18 UTC

Sorry guys, I cancelled them. I was fooled because some of them went OK. Those that fail appear to do so at the start.

Thanks for the patience.

T

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level
Arg
Message 27237 - Posted: 7 Nov 2012 | 0:59:37 UTC - in response to Message 27202.

Thanks, Toni, for your quick action. They seem to be running fine now on all my GPUs.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 27251 - Posted: 8 Nov 2012 | 9:10:55 UTC - in response to Message 27237.

Thanks to you. It took a week to debug, but it was very instructive in the end. Fixed workunits are called "AGGd2" and should have high GPU usage.

Area 51
Joined: 11 Feb 12
Posts: 1
Credit: 4,090,110
RAC: 0
Level
Ala
Message 27315 - Posted: 15 Nov 2012 | 20:04:56 UTC

Doesn't appear to be working for me. I just noticed my industrial quantity of errors. Two slightly different water-cooled 580s running at stock speeds.

Two examples:

http://www.gpugrid.net/result.php?resultid=6049716

and

http://www.gpugrid.net/result.php?resultid=6049638

Thoughts or suggestions? GPUGrid suspended pending advice!!!!

microchip
Joined: 4 Sep 11
Posts: 110
Credit: 326,102,587
RAC: 0
Level
Asp
Message 27401 - Posted: 24 Nov 2012 | 16:45:59 UTC

I also get TONI WUs that error out, either with "Energies have become nan" errors or, after a short period of crunching, with "output file absent" errors. This is really starting to annoy me, all the more so because I also get NOELIA WUs that run to the end only to report "output file absent".

Bruce Kennedy
Joined: 15 Jan 09
Posts: 3
Credit: 171,242,754
RAC: 0
Level
Ile
Message 27413 - Posted: 25 Nov 2012 | 19:00:44 UTC
Last modified: 25 Nov 2012 | 19:01:08 UTC

I'm also getting TONI and NOELIA WUs failing after several hours. I'm setting the project to no new work for a few days to see how this shakes out.
