Author |
Message |
|
51-KASHIF_HIVPR_dim_ba1-2-100-RND4878_1
53+ hours of continuous computing, computer finishes the workunit, only to report a "COMPUTATIONAL ERROR" and the big fat "0" points awarded. Looks like I'm going to have to abort these longer workunits in the future. It's not worth the frustration of a computational error.
Name 51-KASHIF_HIVPR_dim_ba1-2-100-RND4878_1
Workunit 424829
Created 1 May 2009 13:02:02 UTC
Sent 1 May 2009 13:15:56 UTC
Received 3 May 2009 18:51:26 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -185 (0xffffffffffffff47)
Computer ID 27410
Report deadline 6 May 2009 13:15:56 UTC
CPU time 1208.969
stderr out <core_client_version>6.4.7</core_client_version>
<![CDATA[
<message>
Can't write init file: -108
</message>
]]>
Validate state Invalid
Claimed credit 8076.97800925926
Granted credit 0
application version 6.64
I know you got some use out of this because it sent in a 51.94 MB completion file. |
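For anyone puzzling over the two forms of the exit status in the report above: the hex value is just the same number shown as an unsigned 64-bit two's-complement pattern. A minimal sketch of the conversion follows; the function names are made up for illustration and are not part of BOINC.

```python
def to_unsigned_hex(status: int, bits: int = 64) -> str:
    """Render a signed exit status as its unsigned two's-complement hex string."""
    mask = (1 << bits) - 1
    return hex(status & mask)


def to_signed(value: int, bits: int = 64) -> int:
    """Inverse direction: read an unsigned bit pattern as a signed integer."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


# The exit status from the task report above, in both directions.
print(to_unsigned_hex(-185))
print(to_signed(0xffffffffffffff47))
```

This only explains the notation; it says nothing about what the status code means to the client.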
|
|
uBronan
Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0 |
Looks to me like a programming error:
<core_client_version>6.6.26</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9600 GT"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 8
# Number of cores: 64
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 61 : unknown error.
</stderr_txt>
]]>
Third in a row that has failed. |
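When comparing failures across hosts (card model, memory size, which kernel failed), it can help to pull the key fields out of the stderr text rather than eyeballing it. A rough sketch, assuming the log layout shown in the post above; the `summarize` helper and its field names are hypothetical and not part of any GPUGRID tool.

```python
import re

# Excerpt of the stderr text quoted above.
SAMPLE = r"""# Using CUDA device 0
# Device 0: "GeForce 9600 GT"
# Total amount of global memory: 536870912 bytes
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 61 : unknown error.
"""


def summarize(stderr_text: str) -> dict:
    """Extract device name, global memory (MiB) and error lines from stderr text."""
    device = re.search(r'# Device \d+: "([^"]+)"', stderr_text)
    memory = re.search(r"global memory: (\d+) bytes", stderr_text)
    errors = [
        line
        for line in stderr_text.splitlines()
        if "ERROR" in line or line.startswith("Cuda error")
    ]
    return {
        "device": device.group(1) if device else None,
        "memory_mib": int(memory.group(1)) // 2**20 if memory else None,
        "errors": errors,
    }


summary = summarize(SAMPLE)
print(summary["device"], summary["memory_mib"], "MiB")
for line in summary["errors"]:
    print(line)
```

The warnings are deliberately left out of the error list, since they also appear on tasks that complete.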
|
|
Zydor
Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 |
I just had one dump out on me a couple of minutes ago at the start of processing
http://www.gpugrid.net/workunit.php?wuid=431994
Looking at other threads, others have had this type go bang in the last 24 hours; maybe there is a bad one out there? Rare, I know, but it's an inescapable thought: some traditionally "reliable" high-volume crunchers have had one go bang (e.g. Paul). It would be worth digging a little, as it seems a bit strange ...
Regards
Zy |
|
|
Zydor
Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 |
Paul
The one I posted above is coming your way - you just downloaded it ..... :)
Regards
Zy |
|
|
|
I'd rather have a workunit dump at the beginning than after it has completed its processing and been reported back to the Grid servers. It was a waste of computing power and time. |
|
|
Zydor
Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 |
That's for sure. I don't know about this one; it could have been my error. It will be interesting to see if Paul gets through it.
Regards
Zy |
|
|
|
I just had one dump out on me a couple of minutes ago at the start of processing
http://www.gpugrid.net/workunit.php?wuid=431994
Um, you are not going to like this ... I am two hours in (2:18) and 18.3% done.
Running just fine on my GTX295 card ... 9:22 hours to go ...
For a small batch run I sure am getting a lot of them ...
Hmmm, I wonder if there is a memory issue?
However I do have this crash on 13-KASHIF_HIVPR_mon_ba3-6-100-RND2474_0
Though there is no real specific error, I got the Incorrect function. (0x1) - exit code 1 (0x1) error. It has already crashed for another person too ...
I don't have as many as I first thought, only about 5 completed, plus the one error and the two in work. |
|
|
Zydor
Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 |
Interesting, nicely done :)
Hmmmm wonder why it went bang for me ? First one for a while, all seems ok, one of those things at present. Thanks for the heads up, I'll keep my eye open more than usual in case something lurketh.
Regards
Zy |
|
|
|
I still wonder if it is not something to do with GPU memory size, mine is nearly twice yours ... |
|
|
MarkJ Volunteer moderator Volunteer tester
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 |
Looks to me like a programming error:
<core_client_version>6.6.26</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9600 GT"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 8
# Number of cores: 64
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 61 : unknown error.
</stderr_txt>
]]>
Third in a row that has failed.
You are showing as running the (beta) 185.81 driver. I had problems with it too (and the swizzle_out error on one wu). I'm now running 182.50 which seems to work.
____________
BOINC blog |
|
|
Zydor
Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 |
I still wonder if it is not something to do with GPU memory size, mine is nearly twice yours ...
Had a quick look at past ones, I have done two other KASHIF_HIVPR WUs.
http://www.gpugrid.net/workunit.php?wuid=414191
http://www.gpugrid.net/workunit.php?wuid=421636
They went through OK. No idea whether they were the "same" kind as the one that went bang. The latter may well have been down to something I did at the time: the CUDA card runs on my Home Office main beastie. Normally my activities on it have not been an issue, but they may have been this time.
Just posted the above for completeness in case it throws up anything of interest.
Regards
Zy |
|
|
|
I usually get these messages and then WU errors:
05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_1 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_2 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_3 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
on driver 185.81, BOINC 6.6.20 and Vista 64. :/
and my host : http://www.gpugrid.net/results.php?hostid=31684
with 1 gtx 260 and 1 8800 GT |
|
|
Beyond
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 |
I had my only error ever a couple days ago on a KASHIF WU. They also take too long on slower cards.
Is there a way to set the client not to DL these or do we just have to watch for them? |
|
|
|
I usually get these messages and then WU errors:
05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_1 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_2 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_3 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
on driver 185.81, BOINC 6.6.20 and Vista 64. :/
and my host : http://www.gpugrid.net/results.php?hostid=31684
with 1 gtx 260 and 1 8800 GT
You have it backwards: you actually get the error first, then the WU is marked as finished, and then BOINC complains about the missing files. I suppose they are missing because the WU was terminated abnormally instead of writing its result files and shutting down gracefully.
Also, my Vista 64 machine is running 185.66 and 6.5.0 without problems. You might want to try this driver.
MrS
____________
Scanning for our furry friends since Jan 2002 |
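Since the "absent" lines are only a symptom, one way to use them is to list which tasks lost their output files and then go look at the stderr of each one for the real error. A small sketch, assuming the message format quoted in the post above; `missing_outputs` is a hypothetical helper, not a BOINC function.

```python
import re
from collections import defaultdict

# Message log lines as quoted above.
LOG = """05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_1 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_2 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_3 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
"""

ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")


def missing_outputs(log_text: str) -> dict:
    """Map each task name to the output files the client reported as absent."""
    missing = defaultdict(list)
    for line in log_text.splitlines():
        match = ABSENT.search(line)
        if match:
            output_file, task = match.groups()
            missing[task].append(output_file)
    return dict(missing)


result = missing_outputs(LOG)
for task, files in result.items():
    print(task, "is missing", len(files), "output file(s)")
```

A task that shows up here needs its stderr checked; the message log alone will not say why it failed.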
|
|
|
So is it a problem with BOINC, or with my drivers?
Thanks. |
|
|
|
Neither: this BOINC message tells us nothing except that there was an error.
Apart from this: since 8 May all your WUs have errored out. What did you change? You clocked your 8800GT down, which shouldn't cause these errors. I suspect an upgrade to a new beta driver, which somehow messes things up. You might want to try a proven version like 182.50 or 182.08 and remove the newer one with a driver cleaner. You could also upgrade to BOINC 6.6.23, since it fixed at least one major bug in 6.6.20. You could also try with only one card installed to reduce the number of variables in your config.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|