GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
We are monitoring some crashes and their causes, in some cases with the help of volunteers who give us access to their machines for testing.
NORMAL BEHAVIOUR IS THAT YOU EXPERIENCE NO CRASH AT ALL (let's say <1%).
These are the most common cases of errors:
1) OVERCLOCKING.
SYMPTOMS: the application usually succeeds, but from time to time it crashes with errors appearing at random in several different GPU kernels (shake, langevin, pme, whatever).
SOLUTION: Reduce the clocks to the recommended clocks for your board (note that some manufacturers raise the clocks at the factory, so the GPU may be overclocked even though you did nothing). See Wikipedia for the correct frequencies.
2) POOR COOLING.
SYMPTOMS: Same as before: random errors in different kernels.
SOLUTIONS: If your board is not overclocked according to the numbers given by Wikipedia, then the cause could be cooling. Open your case or buy extra fans. Air has to come in from the front of the GPU and leave from the rear.
3) NVIDIA bugs
SYMPTOMS: you change the driver and the application stops working, or the error is always in the same kernel (PME, FFT; right now, for instance, we have the infamous FFT bug).
SOLUTIONS: If the driver works, do not update unless you need to for some game. If it stops working, then try updating the driver.
The FFT bug that we reported to Nvidia was fixed in the 190 drivers for G80 chips. It is still there for some GTX216 cards (it is unclear whether these 216 cards work with the 182 drivers. Try.)
4) BOINC bug
SYMPTOMS: Various
SOLUTIONS: Stick to a client that works for you; only change if we require you to do so or you are willing to experiment.
5) POOR DRIVER INSTALLATION
SYMPTOMS: You can't run any workunits at all and the application crashes immediately. This is often a problem for Windows users.
SOLUTIONS: Reinstall the drivers properly. Try this: http://www.guru3d.com/category/driversweeper/
This is rather common on Windows machines.
In general, new drivers and new BOINC versions add features and fix old bugs, but they also introduce new ones. This is normal; find your best equilibrium.
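As a rough illustration, the symptoms listed above can be turned into a first-guess triage. This is only a sketch: the function name `triage`, its parameters and the returned labels are hypothetical and not part of GPUGRID or BOINC; the rules simply restate the list above.

```python
def triage(failed_kernels, crashes_immediately=False, driver_changed=False):
    """Return a first guess at the cause, following the list above.

    failed_kernels: name of the GPU kernel that failed, one entry per
    failed task (hypothetical input, collected by hand from task output).
    """
    if crashes_immediately:
        # No workunit runs at all: case 5.
        return "poor driver installation"
    distinct = {name.lower() for name in failed_kernels}
    if driver_changed or (len(failed_kernels) >= 2 and len(distinct) == 1):
        # Failures began with a new driver, or always hit one kernel: case 3.
        return "driver bug"
    if len(distinct) > 1:
        # Random errors spread over several kernels: case 1 or 2.
        return "overclocking or poor cooling"
    return "unknown (possibly a BOINC client issue)"


print(triage(["shake", "pme", "langevin"]))
print(triage(["fft", "fft", "fft"]))
```

A guess like this only narrows down where to look first; it cannot tell overclocking from poor cooling, which share the same symptoms.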
Happy crunching.
GDF |
|
|
123bob
Joined: 21 Dec 08 Posts: 7 Credit: 251,750,735 RAC: 0
|
The FFT bug that we reported to Nvidia was fixed in the 190 drivers for G80 chips. It is still there for some GTX216 cards (it is unclear whether these 216 cards work with the 182 drivers. Try.)
GDF
GDF, it is very clear to me that the one eVGA 260-216 that I'm having issues with works just fine on driver 182.50. It will not work on anything 185.xx or higher. It just shoots out errors on the higher drivers. (Machine #20013) The card part number is 896-P3-1267-FR. It's their "superclocked" edition.
My other 260-216s seem to be working fine on a mix from 185.85 to the newest driver, 190.62.
Hope this helps others.
Bob
|
|
|
zpm
Joined: 2 Mar 09 Posts: 159 Credit: 13,639,818 RAC: 0
|
TBH, my card is already factory overclocked (FOC), but just to try some things I overclocked my 216-core 260 card from 630 MHz to 675 MHz. This didn't produce any errors, and not much of a speed-up either.
I keep my fan at a constant 70% fan speed whether I'm gaming or running CUDA.
I've gone through the 6 series, with no real errors to speak of.
Running 6.10.0 right now. |
|
|
|
I have a GTX 260 192 that will not run any game or GPUgrid on any driver above 182.50. With 182.50 it ran GPUGrid fine and F@H fine with never an error. |
|
|
zpm
Joined: 2 Mar 09 Posts: 159 Credit: 13,639,818 RAC: 0
|
I have a GTX 260 192 that will not run any game or GPUgrid on any driver above 182.50. With 182.50 it ran GPUGrid fine and F@H fine with never an error.
If it won't run any game either, it sounds like you now have a bad card! |
|
|
|
Just as a general comment across different operating systems using WindowsXP x64, Suse Linux, Sabayon and Ubuntu, I have found the following to be true with my GTX 260 and AMD X2.
1. In Linux, I cannot start the BOINC manager and Gpugrid with the video card overclocked. I have to start the program, run it for 5 minutes and then suspend calculations to start the overclock.
2. To overclock successfully, I do much better if I use a light window manager like IceWM in Linux or change the Windows video settings to maximum performance versus highest quality.
3. I have been able to run Gpugrid with any Nvidia driver that was supported except for the period when the Linux 185 drivers would error out all the time. I am now using Ubuntu 64bit with the Nvidia 190 driver with no issues.
4. I can run any video operation with the current drivers and run Gpugrid. I prefer to suspend Gpugrid if I am transcoding video or doing heavy file copying operations.
5. Linux has been the most stable setup for running Gpugrid day in and day out without issues. Windows XP x64 was a close second. Windows Vista 32 bit was less stable with me for some reason.
6. If the system crashes for any reason in any operating system, I am better off either deleting the affected work units or resetting the operation as soon as I start BOINC.
7. The first time that I start a new install of BOINC and Gpugrid, it will always freeze. I then reboot, do a reset of the project and proceed.
8. Being more demanding on the video card means that Gpugrid is less stable than Folding@home, Seti, Aqua or Einstein cuda applications. The Gpugrid program does more useful work and demands less from my CPU.
|
|
|
Zydor
Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
Too many people trust the graphics driver deinstall routine run through Windows (that used to include me), despite being told many times over the years: clean out the drivers once you have deinstalled them and before installing the new graphics drivers. I learnt recently, the hard way, not to be idle about this and to go religiously through the clean-out routine. There has been yet another example of this over on the number crunching forum.
It boils down to a simple fact: the Windows deinstall routine will not delete a file if it's flagged as "in use". On top of that, the file layout is not the same for all graphics drivers; it varies from version to version. End result: bits of old driver installs are left behind that will cause issues with the new driver.
NVidia used to make great play of proper deinstall; they don't now. I suspect some PR guru clown has poked his nose into the real world ... and this is not only NVidia, it applies to ATI drivers as well. The Guru3d Driver Cleaner includes sweeping ATI drivers for that reason. Similar issues occur with sound drivers.
Some make great play of switching drivers left, right and centre, trying to squeeze out the last drop of performance. Whether or not it achieves that is for another day; what it will achieve is a rapid build-up of undeleted garbage that is not always overwritten by a new install, no matter how the deinstall was done (if they did a deinstall at all; many just install over the top of existing installations).
Deinstall old drivers via Windows, then boot into safe mode, run Guru3d Driver Sweeper, reboot, and install the new drivers. It's only a few minutes of extra work, but it will save days if not weeks of grief.
It's not the be-all and end-all of graphics issues, that's for sure, but I'll bet my pension it's behind the majority .....
Regards
Zy |
|
|
|
Has anybody done any analysis of when tasks fail? I mean, what time of day? Mine show a distinct tendency to fail in the early hours of the morning (between 3am and 6am, local time).
It's not so farfetched that there could be a correlation. I've noticed when working on server installations with UPSs that the electricity supply voltage can vary over 24 hours - lower when local demand is high, higher when everyone is asleep in bed and most appliances are switched off.
So a cooling solution which is adequate when you're around to measure it might be inadequate under higher power draw (likely if the input voltage is higher). |
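One way to check for such a correlation is to bucket failure timestamps by hour of day. A minimal sketch, assuming the timestamps are in the `DD/MM/YYYY HH:MM:SS` form used by the BOINC messages quoted elsewhere in this thread; the function name and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(timestamps):
    """Count failures per hour of day (0-23) from 'DD/MM/YYYY HH:MM:SS' strings."""
    hours = Counter()
    for stamp in timestamps:
        hours[datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S").hour] += 1
    return hours


# Made-up sample data, not real results.
sample = [
    "13/09/2009 03:12:40",
    "14/09/2009 04:55:01",
    "14/09/2009 16:32:48",
    "15/09/2009 03:47:19",
]
counts = failures_by_hour(sample)
for hour in sorted(counts):
    print(f"{hour:02d}:00  {'#' * counts[hour]}")
```

With only a handful of failures a cluster can easily be chance, so a few weeks of data would be needed before reading much into the histogram.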
|
|
=Lupus=
Joined: 10 Nov 07 Posts: 10 Credit: 12,777,491 RAC: 0
|
Palit GeForce GTX 260 Sonic 216 SP - Vista64 - 190.62 x64 drivers, no issues, everything working fine. No failures so far.
This card is slightly overclocked by the vendor: 625 MHz instead of 585 MHz. But it has two fans, so it stays at 55°C even after several days of continuous work. |
|
|
Victor
Joined: 16 Aug 09 Posts: 1 Credit: 542,905 RAC: 0
|
I've got an MSI GTX 260 OCv3, Windows Media Center (32-bit) and an AMD64 3400+ (such an old processor for this video card) in an eMachines, and I've never had a problem with errors in GPUGRID. The only error I got was from cancelling the first task, because I had a GeForce 8400GS and tried GPUGRID with it (until I read the list of GPUs supported by this project; and really, the speed at which it was processing was ridiculous, like .04% in an hour or so).
Keep in mind that this graphics card runs at 655/1408/2100, compared to the stock 576/1240/1998.
I have to admit, though, that in Seti@Home it produced a lot of errors compared to when I was using the 8400GS for that project, where I had no errors, or maybe only one or two. Although I had a lot of tasks turned in fast, I sometimes got a lot of errors, like 2 to 5 in a row. |
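For reference, the factory overclock quoted in this post can be worked out as percentages. A small sketch; the helper name `overclock_percent` is made up, and reading the three numbers as core/shader/memory clocks in MHz is an assumption.

```python
def overclock_percent(actual, stock):
    """Percentage by which each clock exceeds its stock value."""
    return [round(100.0 * (a - s) / s, 1) for a, s in zip(actual, stock)]


card = (655, 1408, 2100)   # MSI GTX 260 OCv3, as quoted in the post
stock = (576, 1240, 1998)  # stock clocks, as quoted in the post
print(overclock_percent(card, stock))
```

That is roughly 13.7%, 13.5% and 5.1% over stock, so by the first rule of the troubleshooting note this card counts as overclocked even though its owner changed nothing.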
|
|
|
Hello everyone. First, I apologize for my English, as it is translated with Google.
I am posting again because I still cannot complete any GPUGRID unit. I have noted this post http://www.gpugrid.net/forum_thread.php?id=1172&nowrap=true#10792 and still have the same problem.
I have tried all these drivers: 181.20, 185.85, 186.18, 190.38, 190.6, and all these versions of BOINC: 6.6.20, 6.6.28 and 6.6.36.
The graphics card is a Zotac GTX 260 (216) at the manufacturer's clocks of 576/1242/999. The computer is an i7 920 with 6 GB of RAM on a Gigabyte board. The OS is Win7 64. The temperature of the card does not exceed 70 °C.
I know people are having more trouble with the 260 than with any other card, but this is not normal: I have been running Folding and Collatz tasks for more than a month and not one has given me an error. I have stressed the card for over 4 hours with the most demanding games and had not a single error.
Something is wrong and it is not being resolved. The error always comes from the same place, the famous nvlddmkm that you see in the Event Viewer.
I hope I can help. Thanks and best regards. |
|
|
Jet
Joined: 14 Jun 09 Posts: 25 Credit: 5,835,455 RAC: 0
|
I did some analysis of the failures; basically, there is no pattern.
I should say that the key problem is overclocking: the consequence is slight overheating (close to the edge of stability) and then, probably, a power surge at peak load. A small spike is enough to cause a failure if your cards are running very close to the maximum power rating of the power supply. In short, the facts:
1. PowerLux power supply, rated 750 watt.
2. 3 x GTX 260 Matrix by ASUS, with two fans & heat pipe system.
3. Intel Quad Core Q 9550, on an ASUS WS Evolution board, overclocked by 10%. Running MW on all 4 cores, so 100+% of the power load.
4. Manually OC'ed from 576 MHz GPU / 999 MHz mem to 756 MHz GPU / 1111 MHz mem.
5. The cards are sitting very close to each other. The card in the middle, due to the lack of incoming air, normally runs at 57-60C. This is too high, which is the reason for an additional big external fan mounted over it, plus a constantly running room air conditioning system set to 23C.
6. Every day I get one typical error: "redundant result" or "computation error"; sometimes these errors come together, sometimes not.
7. Additionally, I should state that all the GTX 260s are not mechanically fixed in the slots; only their weight holds them in place on the motherboard. So there could be some purely mechanical / poor-contact causes as well.
So, in general, taking the above facts into consideration, the described system should be an error generator, but in fact it is not.
Regards,
Jet |
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
I have added cause 5 to the starting message.
gdf
5) POOR DRIVER INSTALLATION
SYMPTOMS: You can't run any workunits at all and the application crashes immediately. This is often a problem for Windows users.
SOLUTIONS: Reinstall the drivers properly. Try this: http://www.gpugrid.net/forum_thread.php?id=1293
|
|
|
|
I have always followed these steps to uninstall and install the drivers.
I restart in safe mode, uninstall the driver, then run Driver Sweeper, cleaning out everything that is NVIDIA. Then I run CCleaner to clean up the debris.
I reboot into safe mode again and install the new driver. Then I reboot once more into normal mode.
Regards.
|
|
|
_hiVe*
Joined: 18 Feb 09 Posts: 12 Credit: 13,624,069 RAC: 0
|
I have added cause 5 to the starting message.
gdf
5) POOR DRIVER INSTALLATION
SYMPTOMS: You can't run any workunits at all and the application crashes immediately. This is often a problem for Windows users.
SOLUTIONS: Reinstall the drivers properly. Try this: http://www.gpugrid.net/forum_thread.php?id=1293
I suggest linking directly to http://www.guru3d.com/category/driversweeper/
It will save a lot of people the trouble of reading through loads of text; erm, it would be more efficient.
The thread itself doesn't include any practical info for the inexperienced, other than the link. |
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
Done.
gdf |
|
|
BarryAZ
Joined: 16 Apr 09 Posts: 163 Credit: 920,927,307 RAC: 2,890
|
Nice troubleshooting note. As a follow up, I've one workstation that has started erroring out on GPUGrid (but not Collatz) in the last couple of days.
It is the only workstation I have with a GTS 250. Running Windows XP, on an AMD 945. Some comments to the troubleshooting note for this workstation.
These are the most common cases of errors:
1) OVERCLOCKING.
SYMPTOMS: the application usually succeeds, but from time to time it crashes with errors appearing at random in several different GPU kernels (shake, langevin, pme, whatever).
SOLUTION: Reduce the clocks to the recommended clocks for your board (note that some manufacturers raise the clocks at the factory, so the GPU may be overclocked even though you did nothing). See Wikipedia for the correct frequencies.
I thought that might be the case but I checked to make sure that I'm not overclocking this system.
2) POOR COOLING.
SYMPTOMS: Same as before: random errors in different kernels.
SOLUTIONS: If your board is not overclocked according to the numbers given by Wikipedia, then the cause could be cooling. Open your case or buy extra fans. Air has to come in from the front of the GPU and leave from the rear.
Not likely -- I have very good air flow on the workstation and as noted, another BOINC GPU application (Collatz) is quite happy.
3) NVIDIA bugs
SYMPTOMS: you change the driver and the application stops working, or the error is always in the same kernel (PME, FFT; right now, for instance, we have the infamous FFT bug).
SOLUTIONS: If the driver works, do not update unless you need to for some game. If it stops working, then try updating the driver.
The FFT bug that we reported to Nvidia was fixed in the 190 drivers for G80 chips. It is still there for some GTX216 cards (it is unclear whether these 216 cards work with the 182 drivers. Try.)
This was a recent build and the only driver installed was the 190.38. Again, the problem didn't show up until a couple of days ago -- it was working just fine with GPUGrid earlier in the week.
I will agree that Nvidia driver bugs are possible, though one would think this might show up more generally (I've not seen this with the same driver running on 9800GT cards).
4) BOINC bug
SYMPTOMS: Various
SOLUTIONS: Stick to a client that works for you; only change if we require you to do so or you are willing to experiment.
Agreed big time -- for non GPU BOINC workstations, I run 5.4.5. For most GPU BOINC workstations (including the one having the *new* problem), I run 6.4.5. I have one workstation running 6.5 and another running the new 6.10 beta (for ATI GPU support).
5) POOR DRIVER INSTALLATION
SYMPTOMS: You can't run any workunits at all and the application crashes immediately. This is often a problem for Windows users.
SOLUTIONS: Reinstall the drivers properly. Try this: http://www.guru3d.com/category/driversweeper/
This is rather common on Windows machines.
In general, new drivers and new BOINC versions add features and fix old bugs, but they also introduce new ones. This is normal; find your best equilibrium.
Not a problem here -- but this point is a very good one generally.
Happy crunching.
GDF
|
|
|
|
Always the same problem at home, with my GTX260 (216) and one 8800 gt on Intel 8400. Vista 32, driver 182.5 and boinc 6.6.36
13/09/2009 16:32:48 GPUGRID Computation for task 225-GIANNI_BIND001-24-100-RND7793_0 finished
13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_1 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent
13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_2 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent
13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_3 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent
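Messages like these can be collected per task with a short script. This is a sketch only: the regular expression is written against the lines quoted in this post, and other BOINC versions may word the message differently.

```python
import re
from collections import defaultdict

# Pattern based on the message lines quoted above.
ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")


def missing_outputs(log_lines):
    """Map each task name to the output files reported as absent."""
    missing = defaultdict(list)
    for line in log_lines:
        match = ABSENT.search(line)
        if match:
            output_file, task = match.groups()
            missing[task].append(output_file)
    return dict(missing)


log = [
    "13/09/2009 16:32:48 GPUGRID Computation for task 225-GIANNI_BIND001-24-100-RND7793_0 finished",
    "13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_1 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent",
    "13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_2 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent",
    "13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_3 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent",
]
result = missing_outputs(log)
for task, files in result.items():
    print(f"{task}: {len(files)} output file(s) absent")
```

A task that finishes with all of its output files absent has crashed before writing results, so the list of affected tasks is a reasonable starting point for the checks in the first post.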
|
|
|
BarryAZ
Joined: 16 Apr 09 Posts: 163 Credit: 920,927,307 RAC: 2,890
|
One of the other components of troubleshooting for root causes probably should include a look at the work units being sent.
When computation errors go from say one in 25 (still too high) to 1 in 4 or 1 in 3, with no change on the end user hardware or software configuration, it strikes me that another variable should be considered.
So far, it seems that possible problem source is not being considered, and frankly, from the end user point of view, there is nothing the end user can do to address it. |
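The size of the jump described in this post is easy to put into numbers. A minimal sketch using the 1-in-25 and 1-in-4 figures from the post; the function name is made up, and real outcomes would have to be taken from the task list by hand.

```python
def failure_rate(outcomes):
    """Fraction of failed tasks; outcomes is a list of booleans, True = failed."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


earlier = [True] + [False] * 24  # 1 failure in 25 tasks
recent = [True] + [False] * 3    # 1 failure in 4 tasks
ratio = failure_rate(recent) / failure_rate(earlier)
print(f"{failure_rate(earlier):.0%} -> {failure_rate(recent):.0%} ({ratio:.2f}x)")
```

A rise from about 4% to 25% with no change on the host is large enough that comparing the failing work units with each other, and not only the hosts, seems worthwhile.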
|
|
|
I have always been in medical projects and I would love to continue with GPUGRID, but since the last update GPUGRID made to its CUDA application I have not been able to complete a single unit of this project.
I have my card processing Collatz, which is not a project I especially like, but doing that is better than leaving the card idle.
I would like you to come up with a solution, because this problem is happening to many of us and it does not occur in other projects.
Regards. |
|
|
CTAPbIi
Joined: 29 Aug 09 Posts: 175 Credit: 259,509,919 RAC: 0
|
Ubuntu 9.04, kernel 2.6.26.15-generic, GTX275 OC'ed (702/1584/1260), BOINC 6.10.4 and 6.10.6, 190.18 and 90.32 nvidia driver.
While BOINC is running there is no problem: the whole week it is 100% stable without any issue. But on weekends I need to switch my rig to Windows in order to "work" on Crysis, COD WaW, etc.
I tried suspending the WUs and waiting for the processes to disappear from GNOME System Monitor. But when I'm done gaming and switch back to Linux, the GPU WU immediately dies with "Computation error".
So, how should I shut down BOINC so that the WUs keep working afterwards? Rosetta works fine.
And the second question is about "do not use GPU when in use". It still uses the GPU, which causes periodic freezes and makes me angry :-) |
|
|
|
I have found that the best way to keep a work unit during a reboot is to
1) Suspend all Gpugrid work units
2) Close the Boinc manager - not by File - Exit but by closing the window.
3) Open a console, change to the BOINC subdirectory and stop BOINC via the command line:
./boinccmd --quit
In Linux, this will usually allow me to return to the work unit later.
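The command-line part of these steps can be scripted. A sketch under stated assumptions: it suspends the whole project with `boinccmd --project URL suspend`, which stands in for suspending the work units by hand in the manager, and the function names and the `boinc_dir` argument are hypothetical. Only `--quit` comes from the post itself.

```python
import subprocess


def shutdown_commands(project_url, boinccmd="./boinccmd"):
    """Build the boinccmd calls: suspend the project, then quit the client."""
    return [
        [boinccmd, "--project", project_url, "suspend"],
        [boinccmd, "--quit"],
    ]


def stop_boinc(project_url, boinc_dir):
    """Run the commands from the BOINC directory; raises if one of them fails."""
    for cmd in shutdown_commands(project_url):
        subprocess.run(cmd, cwd=boinc_dir, check=True)


# Example only: print the commands instead of running them.
for cmd in shutdown_commands("http://www.gpugrid.net/"):
    print(" ".join(cmd))
```

After booting back into Linux, the project would need to be resumed again (the same call with `resume` in place of `suspend`) before the work unit continues.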
|
|
|
CTAPbIi
Joined: 29 Aug 09 Posts: 175 Credit: 259,509,919 RAC: 0
|
I have found that the best way to keep a work unit during a reboot is to
1) Suspend all Gpugrid work units
2) Close the Boinc manager - not by File - Exit but by closing the window.
3) Open a console, change to the BOINC subdirectory and stop BOINC via the command line:
./boinccmd --quit
In Linux, this will usually allow me to return to the work unit later.
OK, thanx bro:-)
I do not know why, but today while I was playing with i2c I rebooted the rig 2 or 3 times and had no dead WUs. Strange... |
|
|
|
Too many people trust ... Windows.
Regards
Zy
I shortened the above post for ya... ;-)
____________
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it." |
|
|
|
I have found that the best way to keep a work unit during a reboot is to
1) Suspend all Gpugrid work units
2) Close the Boinc manager - not by File - Exit but by closing the window.
3) Open a console, change to the BOINC subdirectory and stop BOINC via the command line:
./boinccmd --quit
In Linux, this will usually allow me to return to the work unit later.
For Ubuntu I just stop BOINC as below. It works fine for me, and it picks right back up when I reboot into Ubuntu:
sudo /etc/init.d/boinc-client stop
____________
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it." |
|
|
CTAPbIi
Joined: 29 Aug 09 Posts: 175 Credit: 259,509,919 RAC: 0
|
I shifted to 6.10.6 and it works fine for me even when I turn it off using "quit" |
|
|
BarryAZ
Joined: 16 Apr 09 Posts: 163 Credit: 920,927,307 RAC: 2,890
|
You might look to 6.10.7 -- it is reported to address the problem that 6.10.6 has with multiple work units -- moving from one to another within a project rather than completing one and moving to the next.
I shifted to 6.10.6 and it works fine for me even when I turn it off using "quit"
|
|
|
|
Hello, I'm having big trouble with my GF GTX 260, C2D 8500 and GPUGrid.
It crashes very often with the following error:
"Output file...... for Task ..... absent."
What is going wrong?
Thank you...
|
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
When you start getting errors you should make some observations, temps of GPU, CPU, board, fan speeds, task failure types, system usage at time of failure.
There are several generic things you can do:
Restart the system (stops system-related runaway errors),
Increase fan speed / improve ventilation (reduces temps),
Free up a CPU core/thread (stops some heartbeat issues),
Reduce CPU clocks if the CPU is overclocked (reduces system temperature and motherboard/component overheating issues, especially of the chipset),
Reduce GPU clocks (start by reducing the memory speed, then move on to the GPU core if need be; you shouldn't have to reduce either by more than 10%),
Roll back, reinstall or upgrade drivers,
Increase GPU voltage very slightly,
Clean the GPU and system,
Reset the BIOS,
Re-seat the GPU (take it out, reboot, power down, re-seat the GPU),
Restore or reinstall the operating system.
Also see original post.
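Some of the observations mentioned at the top of this post can be logged automatically. A sketch, assuming a driver whose `nvidia-smi` supports the `--query-gpu` option (older drivers such as those discussed earlier in this thread may not); the field names in the dictionaries, the 80 °C limit and the sample line are hypothetical.

```python
import csv
import io

# Command whose output parse_readings() expects, e.g. captured with
# subprocess.run(QUERY, capture_output=True, text=True).stdout
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,fan.speed,clocks.sm,clocks.mem",
    "--format=csv,noheader,nounits",
]
FIELDS = ("temp_c", "fan_percent", "sm_mhz", "mem_mhz")


def _to_int(text):
    try:
        return int(text)
    except ValueError:
        return None  # the card does not report this value


def parse_readings(csv_text):
    """Parse the CSV output into one dict per GPU."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(FIELDS, map(_to_int, row))) for row in rows if row]


def too_hot(readings, limit_c=80):
    """Indices of GPUs at or above the (assumed) temperature limit."""
    return [i for i, r in enumerate(readings)
            if r["temp_c"] is not None and r["temp_c"] >= limit_c]


sample = "65, 70, 1242, 999\n83, [N/A], 1242, 999\n"  # made-up readings
readings = parse_readings(sample)
print(readings, too_hot(readings))
```

Logging these values once a minute alongside the BOINC messages would show whether failures line up with temperature peaks or clock changes.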
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
|
I am using drivers 275.33, BOINC 6.12.33(x64) and can cause the downclock on demand. Is there any resolution other than rebooting the system?
Thank You |
|
|
|
I am suddenly having a bunch of acemd2 tasks fail. This is the first time this has happened since I started running GPUGRID.
http://www.gpugrid.net/results.php?userid=21556&offset=0&show_names=0&state=5&appid=
____________
What you do today you will have to live with tonight |
|
|