Author |
Message |
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
The newly submitted workunits named KASHIF_??? should now work even on G90 cards. The large KASHIF_dim workunits have been cut to half their length, and the data upload has been reduced by a factor of 4.
There may still be old workunits around with the same name; you can check the creation date on the web site.
Changes have been applied now:
20 May 16:44 CEST
Hope it works.
gdf |
|
|
Zydor
Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
For the Devs
Regards
Zy
|
|
|
|
That's music to my ears :)
Can you tell us a little about the problem and its fix?
EDIT:
GDF wrote: So, it seems that there is a bug in the compiler/hardware which appears only on pre-G200 cards.
We found a way to avoid it for now, but it limits what we can do, so it is not a solution.
Seems like the ball is in NVIDIA's court now.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
GDF
|
A bug in a CUDA FFT routine.
gdf |
|
|
|
I just had a couple of these:
"Cuda error in file '..\cuda/cutil.h' in line 968 : out of memory.
Memory usage: host: bytes device: bytes
Assertion failed: 0, file ..\cuda/cutil.h, line 968
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
"
Is that similar to the error we're talking about?
WUs: 482275 and 482302 (IBUCHs). I notice these are all on GPU1, which may indicate a local problem. |
|
|
GDF
|
Some time ago this was an Nvidia driver problem, which was sorted out with the latest drivers.
gdf |
|
|
GDF
|
Nvidia is looking into the bug.
gdf |
|
|
MarkJ (Volunteer moderator, Volunteer tester)
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0
|
I had this WU die. It was run on a GTS250 (G92 chipset, I believe) using 185.85 drivers. It will get reported later tonight, so I won't know what the actual error is until then.
____________
BOINC blog |
|
|
|
I've had a few tasks crash with a similar error message. The latest was a KASHIF one.
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
The card has been running overclocked but fairly cool. This should be safe, but then again, overclocking never is. Hot spring weather may play a role (hot attic)... |
|
|
|
Actually your error message is "Incorrect function. (0x1) - exit code 1 (0x1)", which is quite a generic one. It's not "the nasty bug" and might be related to OC and temperature.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
MarkJ
|
I had this WU die. It was run on a GTS250 (G92 chipset, I believe) using 185.85 drivers. It will get reported later tonight, so I won't know what the actual error is until then.
Turns out it's the CUDA fft_data_swizzle_in error. So they don't appear to work on GTS250s with the 185.85 drivers.
____________
BOINC blog |
|
|
|
It's not "the nasty bug" and might be related to OC and temperature.
MrS
OK, I will throttle the GPU back to half the OC. The core ran at about 65°C on hot days (high 50s during the night). I suspect it will be the memory chips, but to be safe I'll throttle down the CPU as well.
|
|
|
Zydor
|
Had a really strange one, and with the new WUs.
I've "lost" one (!)
The sequence is below, copying the key parts of the BOINC Manager messages:
25/05/2009 22:37:15 GPUGRID Computation for task p730000-IBUCH_pYEpYVk1_2105-3-10-RND7622_0 finished
25/05/2009 22:37:15 GPUGRID Starting 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0
25/05/2009 22:37:16 GPUGRID Starting task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 using acemd version 664
So far so good .... the RND7622 message correlates with my task list.
Here comes the next one, downloaded automatically (cache set to 0.1) in readiness for the completion of RND3111:
26/05/2009 10:09:30 GPUGRID Sending scheduler request: To fetch work.
26/05/2009 10:09:30 GPUGRID Requesting new tasks
26/05/2009 10:09:35 GPUGRID Scheduler request completed: got 1 new tasks
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-LICENSE
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-COPYRIGHT
26/05/2009 10:09:37 GPUGRID Started download of p1480000-IBUCH_pYEpYIk1_2105-3-p1480000-IBUCH_pYEpYIk1_2105-2-10-RND5345_1
Doing good, this correlates with the Task list... bear with me...
26/05/2009 12:13:53 climateprediction.net Scheduler request completed: got 0 new tasks
26/05/2009 12:14:43 GPUGRID Computation for task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 finished
26/05/2009 12:14:48 GPUGRID Starting p1480000-IBUCH_pYEpYIk1_2105-3-10-RND5345_0
26/05/2009 12:14:48 GPUGRID Starting task p1480000-IBUCH_pYEpYIk1_2105-3-10-RND5345_0 using acemd version 664
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_1
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_2
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_3
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4
26/05/2009 12:14:57 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:15:32 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_3
26/05/2009 12:15:44 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_1
26/05/2009 12:15:44 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_2
26/05/2009 12:28:49 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4
Kashif RND3111 has now finished and uploaded, RND5345 has now started. All good, the latter still crunching away.
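The upload sequence quoted above can also be checked mechanically. What follows is only a minimal Python sketch, assuming the message layout seen in these lines ("DD/MM/YYYY HH:MM:SS PROJECT text"); the function name `upload_durations` and both regular expressions are made up for illustration and are not part of BOINC. It pairs each "Started upload" with its "Finished upload" and lists any file that never finished.

```python
import re
from datetime import datetime

# Assumed layout, taken from the messages quoted in this post:
# "DD/MM/YYYY HH:MM:SS PROJECT message text"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (\S+) (.*)$")
UPLOAD_RE = re.compile(r"^(Started|Finished) upload of (\S+)$")


def upload_durations(lines):
    """Return ({file: seconds taken}, [files started but never finished])."""
    started = {}
    durations = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a message line in the assumed layout
        stamp = datetime.strptime(m.group(1), "%d/%m/%Y %H:%M:%S")
        u = UPLOAD_RE.match(m.group(3))
        if not u:
            continue  # some other kind of message
        state, name = u.groups()
        if state == "Started":
            started[name] = stamp
        elif name in started:
            durations[name] = (stamp - started.pop(name)).total_seconds()
    return durations, sorted(started)


# A few of the lines quoted above, unchanged.
log = """\
26/05/2009 12:14:43 GPUGRID Computation for task 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0 finished
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:14:50 GPUGRID Started upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4
26/05/2009 12:14:57 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_0
26/05/2009 12:28:49 GPUGRID Finished upload of 4-KASHIF_HIVPR_mon_ba1-14-100-RND3111_0_4
"""

durations, pending = upload_durations(log.splitlines())
for name, seconds in durations.items():
    print(f"{name}: {seconds:.0f} s")
print("never finished:", pending)
```

On these lines every started upload has a matching finish, which fits the observation that the client side looked normal; it says nothing about what the server recorded.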
Problem is, the Kashif has disappeared from sight. There is no record of it either being downloaded as a task in the first place or uploaded when it was finished; there is nothing in my Task list at all. If I hadn't seen it coming through at this end, and had "blinked", it would have come and gone without me knowing. Something has stopped it being recorded as issued, and something stopped it being recorded in the Task list as complete. I suspect the credit side went wonky as well, but the key issue is that the WU which was "never issued" was crunched and returned, yet according to the Task list it never existed and was never returned.
I have no doubt it lurks on the server somewhere right now, and server side all is probably normal, but it's not normal at this end. It was crunched and uploaded, and the thought occurred that since these are "new" WUs, maybe an unknown bug lurks. I don't know, but it's weird enough to report.
If that makes sense rofl:)
Looks like Hollywood released Gremlins 5 and we were a secret Alpha for the pesky critters, and they ate my WU :)
Regards
Zy |
|
|
|
I suspect it will be the memory chips
Memory almost never fails due to higher temperatures, unless pushed really hard. (That's because, in contrast to the CPU and GPU, the memory frequency is not limited by temperature to begin with.)
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|