2 Jobs fail consecutively with an error I have never encountered before

Message boards : Graphics cards (GPUs) : 2 Jobs fail consecutively with an error I have never encountered before

schizo1988 Joined: 16 Dec 08 Posts: 16 Credit: 10,644,256 RAC: 0	Message 9381 - Posted: 6 May 2009 \| 16:33:17 UTC
	I had my last 2 jobs fail and don't want that to continue, but I don't understand the error message or what, if anything, I can do to prevent it. Fortunately one failed fairly early on, but the other was almost complete. Any advice would be welcome. Below is one of the messages; both were the same. Thanks.

<core_client_version>6.6.20</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Clock rate: 1548610 kilohertz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Clock rate: 1548610 kilohertz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79 : unknown error.
</stderr_txt>
]]>

X1900AIW Joined: 12 Sep 08 Posts: 74 Credit: 23,566,124 RAC: 0	Message 9386 - Posted: 6 May 2009 \| 19:14:31 UTC - in response to Message 9381.
	# Clock rate: 1548610 kilohertz

Your OC settings have changed: your last successful workunits ran at a lower shader clock rate (# Clock rate: 1537627 kilohertz). Perhaps you changed only the VRAM clock or other settings, which could have caused the instability on its own.
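The comparison above relies on reading the "# Clock rate" lines out of the stderr output of a failed and a successful task. A minimal Python sketch of that check follows; the function name `clock_rates_khz` and the two sample strings are assumptions made for illustration (the values are the ones quoted in this thread), not part of any BOINC or GPUGRID tool.

```python
import re
from typing import List

# Matches lines such as "# Clock rate: 1548610 kilohertz" from the task's stderr.
CLOCK_RE = re.compile(r"#\s*Clock rate:\s*(\d+)\s*kilohertz")


def clock_rates_khz(stderr_text: str) -> List[int]:
    """Return every reported shader clock rate, in kHz, in order of appearance."""
    return [int(value) for value in CLOCK_RE.findall(stderr_text)]


# Hypothetical excerpts; the numbers are taken from the posts above.
failed_run = '# Device 0: "GeForce GTX 260" # Clock rate: 1548610 kilohertz'
earlier_run = "# Clock rate: 1537627 kilohertz"

difference = clock_rates_khz(failed_run)[0] - clock_rates_khz(earlier_run)[0]
print(f"Shader clock differs by {difference} kHz between the two runs")
```

A non-zero difference only shows that the reported clock changed between runs; it does not by itself prove that the overclock caused the failure.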

schizo1988 Joined: 16 Dec 08 Posts: 16 Credit: 10,644,256 RAC: 0	Message 9387 - Posted: 6 May 2009 \| 19:14:38 UTC - in response to Message 9381. Last modified: 6 May 2009 \| 19:22:23 UTC
	It has now become 5 jobs in a row that have failed. I will try removing the overclock, but this card has been running for months. Of course, I did recently upgrade from a beta of Windows 7 to the Release Candidate. Now I am more confused, as the last job on my i7 machine with dual GTX 295s failed as well.

Dieter Matuschek Joined: 28 Dec 08 Posts: 58 Credit: 231,884,297 RAC: 0	Message 9389 - Posted: 6 May 2009 \| 19:40:21 UTC - in response to Message 9381. Last modified: 6 May 2009 \| 20:18:00 UTC
	# Encounter 10-12 H-bond term

Doesn't this mean that the WU runs into a time limit?

I too have a problem with the current WUs on one of my computers, which has a 9800 GTX+. They are running way too slowly. Today I aborted WU 631873 after 16 hours at 17% progress. Now, on the same card, WU 636672 is at 0.479% after 1 hour 21 minutes! I wonder whether this video card is damaged. But perhaps it is a problem with these WUs ...

EDIT: Problem identified. These WUs sometimes 'hang'. With a restart, those 'hangs' can be overcome.

K1atOdessa Joined: 25 Feb 08 Posts: 249 Credit: 387,028,788 RAC: 1,197,795	Message 9393 - Posted: 6 May 2009 \| 19:57:39 UTC
	I've had an "IBUCH_KID" WU error out recently, as have several others. Anyone able to process these without issue?

schizo1988 Joined: 16 Dec 08 Posts: 16 Credit: 10,644,256 RAC: 0	Message 9395 - Posted: 6 May 2009 \| 20:20:30 UTC - in response to Message 9393.
	This one was valid, but it lists warnings about the same things that caused the others to fail before. It is unfortunate that you only see the warnings after the job finishes, so you can never use them to avoid an invalid result.

<core_client_version>6.6.20</core_client_version>
<![CDATA[
<stderr_txt>
# Using CUDA device 2
# Device 0: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 2: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 3: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
# Using CUDA device 2
# Device 0: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 2: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 3: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
# Time per step: 35.888 ms
# Approximate elapsed time for entire WU: 17943.750 s
called boinc_finish
</stderr_txt>
]]>
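Both logs in this thread contain the same H-bond warnings and the same "restart.coor" message; what differs is how they end. The Python sketch below summarises a stderr text along those lines. The function name `summarize_stderr`, the dictionary keys, and the shortened sample strings are assumptions for illustration only, and the marker strings are simply the ones visible in the two logs posted here.

```python
def summarize_stderr(stderr_text: str) -> dict:
    """Summarise a task's stderr using markers seen in the logs in this thread."""
    return {
        # Each (re)start of the application prints a new device header.
        "starts": stderr_text.count("# Using CUDA device"),
        # Present in the failed task's log.
        "cuda_error": "Cuda error:" in stderr_text,
        # Present at the end of the valid task's log.
        "finished": "called boinc_finish" in stderr_text,
    }


# Shortened, hypothetical excerpts modelled on the two logs above.
failed_log = (
    "# Using CUDA device 0\n"
    'MDIO ERROR: cannot open file "restart.coor"\n'
    "# Using CUDA device 0\n"
    "Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79 : unknown error.\n"
)
valid_log = (
    "# Using CUDA device 2\n"
    'MDIO ERROR: cannot open file "restart.coor"\n'
    "# Using CUDA device 2\n"
    "called boinc_finish\n"
)

print(summarize_stderr(failed_log))
print(summarize_stderr(valid_log))
```

In both samples the application started twice, so the warnings and the restart alone do not distinguish a failed task from a valid one.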

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0	Message 9539 - Posted: 9 May 2009 \| 13:44:13 UTC
	- These warnings about "H-bond terms" needn't concern you; they relate to the science (hydrogen-something bonds, possibly to carbon atoms, which are now calculated by a new method).
- schizo, your OC is quite high. Also remember that the shader clock actually changes in discrete steps of 54 MHz: you get either 1512 or 1566 MHz. Setting 1548 instead of 1537 likely pushed you over the threshold for a real clock of 1566 MHz, hence the failures.
- You have now completed 2 at a setting of 1470 MHz just fine.
- Klat, this is not related to the "IBUCH_KID" issue (as you surely already noticed, since you also posted in the corresponding thread).
- Dieter, you're another victim of 6.6.20. Upgrade to 6.6.23 or downgrade to 6.5.0 or 6.4.7 to fix this problem.

MrS
____________
Scanning for our furry friends since Jan 2002
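The point about 54 MHz steps can be checked with a little arithmetic. The Python sketch below assumes the requested shader clock is rounded to the nearest multiple of 54 MHz; that rounding rule and the name `effective_shader_clock_mhz` are assumptions for illustration, since the post only states the step size and the two resulting clocks. Under that assumption the threshold between 1512 and 1566 MHz sits at 1539 MHz, which is consistent with 1537 staying at 1512 and 1548 jumping to 1566.

```python
STEP_MHZ = 54  # step size stated in the post above


def effective_shader_clock_mhz(requested_mhz: int, step: int = STEP_MHZ) -> int:
    """Round a requested clock to the nearest multiple of `step` (assumed rule)."""
    # Integer arithmetic avoids floating-point rounding surprises at the midpoint.
    return (requested_mhz + step // 2) // step * step


for requested in (1470, 1537, 1548):
    print(requested, "->", effective_shader_clock_mhz(requested))
# 1470 -> 1458
# 1537 -> 1512
# 1548 -> 1566
```

If the driver actually rounds differently, the exact threshold would move, but the 54 MHz jump between neighbouring settings would remain.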
