Message boards : Graphics cards (GPUs) : restart.coor....
This one's killing me.....
ID: 7788
Next one errored out = restart.coor. Expecting at least 2 more from different rigs....
ID: 7789
First of all, the real error message is "incorrect function - exit code 1". Disregard the restart.coor, it appears in all WUs, including the ones that complete successfully and validate.
ID: 7792
First of all, the real error message is "incorrect function - exit code 1". Disregard the restart.coor, it appears in all WUs, including the ones that complete successfully and validate.

Ok, Thanx Alain for the info. I wasn't aware that restart.coor appears in successful WUs. I rarely check them. I understand that the actual error is pretty meaningless, and thought that the constant restart.coor might have some value.... guess not. I have one other possible issue that may be causing this. I've already seriously downclocked one card.... I'll get to the other 2 later today. See how that goes and then check out the other possibility....

I may end up having to make a choice, but if so.... GpuGrid will win out. It's good to know, though, that the problem(s) should be on my end. This has been driving me nutz.... ;) Many Thanx!
ID: 7797
I may end up having to make a choice, but if so.... GpuGrid will win out. It's good to know, though, that the problem(s) should be on my end. This has been driving me nutz.... ;)

If you pick one system, describe it, list its essential features and what you have tried so far, and link to the failed tasks, we can all shout suggestions from the sidelines. I suggest one system at a time because making too many changes in too many places at once causes confusion... and that is the last thing needed...
ID: 7803
Hmm, I don't agree that it can't be a software problem.
ID: 7804
Hmm, I don't agree that it can't be a software problem.

I agree with some of what you said, but disagree with some, too. I agree that the symptoms do not rule out software problems. Just because it sometimes fails and sometimes succeeds doesn't mean it's not software. It just means the bug isn't quite so obvious, or it would have been caught easily in testing. I've been writing software for 30+ years, and it's those bugs that only occur rarely that drive you nuts... but they're still software problems.

As for errors due to the GPUs being designed for speed, not accuracy, that's only partially correct. Yes, a few insignificant bits getting flipped won't make a noticeable difference in a 3D game, but they could in a scientific calculation. But those flipped bits can also cause crashes in games, so it's not as if occasional errors are an acceptable design feature. The GPUs *SHOULD* be running error free. They're designed to operate without errors. When overclocked, especially by users, the criterion for acceptable clock speeds (from the user's perspective) might be "my game doesn't crash", which may indeed allow errors to creep into the calculations. But as shipped from the factory, the GPU shouldn't be making errors. If it is, it's defective. I have a factory-overclocked EVGA GTX 280. It has yet to error out on any task.

One theory that I have, based on what I've read here and on other projects' boards, is that the fan control for some GPUs may not be operating properly, or the cooling may simply be insufficient. That may allow the GPU to overheat and become unstable. My GPU-equipped computer is currently in a fairly cold basement room, and it has no thermal problems at the moment. The highest temps I ever see are about 78 degrees C, and 75 degrees is the base temperature, below which the fan runs at idle speed (40%). I've never seen the fan running above 50%, and the maximum allowable operating temperature is over 100 degrees, so there's a lot of cooling capacity in reserve.

There are lots of things that can cause problems with the WUs. Two of them that are under our control are OC and temperature. Besides trying lower clock speeds, make sure that the GPU temperature is not too high.

Mike
ID: 7807
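For anyone who would rather watch GPU temperature and fan speed programmatically than through a vendor GUI: NVIDIA's NVML library exposes both. Below is a minimal sketch, not production code, assuming a driver/toolkit that ships NVML and that the card of interest is device 0; error handling is mostly left out.

    // gpumon.cpp -- poll GPU temperature and fan speed via NVML.
    // Build with e.g.:  g++ gpumon.cpp -lnvidia-ml -o gpumon   (nvml.h ships with the CUDA toolkit)
    #include <cstdio>
    #include <unistd.h>
    #include <nvml.h>

    int main()
    {
        nvmlDevice_t dev;
        unsigned int tempC = 0, fanPct = 0;

        if (nvmlInit() != NVML_SUCCESS) {
            std::fprintf(stderr, "NVML init failed\n");
            return 1;
        }
        nvmlDeviceGetHandleByIndex(0, &dev);           // first GPU in the system

        for (int i = 0; i < 60; ++i) {                 // poll once per second for a minute
            nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
            nvmlDeviceGetFanSpeed(dev, &fanPct);       // fan duty cycle in percent
            std::printf("GPU: %u C, fan: %u%%\n", tempC, fanPct);
            sleep(1);                                  // POSIX; use Sleep(1000) on Windows
        }

        nvmlShutdown();
        return 0;
    }

Run something like this alongside a crunching WU and you can see immediately whether the fan actually ramps up when the temperature climbs.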
... Which indicates that software is not the problem...

This was indeed not accurate enough. It should have been "software setup", i.e. the software versions are the correct ones. And indeed this still means that occasionally a WU can crash; it has happened to all of us. Also, newer software versions (drivers) might do better. But when so many crash in a short time there are, in my view, really only three possibilities: a bad batch of WUs, wrong software versions, or hardware trouble. Here I do not believe the first two are applicable. However, always happy to learn something, of course.

Kind regards,
Alain
ID: 7808
Michael, my GPU temps are much below average.
ID: 7843
Webbie,
ID: 7847
Hihi, yea lol, I figured the same, but then again, pure speculation ;)
ID: 7848
Another potential source of errors is a power supply that is almost but not quite powerful enough to consistently drive the GPU. Since the GPU's power draw depends on how hard it's working, a marginal power supply might work most of the time, but cause errors once in a while.
ID: 7852
Hey Everybody,
ID: 7863
Just keep in mind that when you're running an 8800 GS / GT at around 1.9 GHz you're asking for a lot. Cards can take it, but you're definitely borderline here. Do you use FurMark to check stability?
ID: 7887
Hi ETA,
ID: 7930
Well, that makes quite some sense. Actually I ran my stability tests in a rather similar way: I used 3D Mark to find the point where things broke, backed off a bit and ran for a while, then backed off a bit more and used those settings successfully for GPU-Grid. Back to 1 successful WU after another, so it seems as though heat was the main issue....

Glad it works again! And although I may not be telling you anything new here: it was not "heat", it was the "combination of clock speed and heat" ;) The higher the temperature, the smaller the maximum stable frequency becomes. So at 50 or 100 MHz less you would likely still have been fine at those higher temps.

BTW: I'm running an ATI 4870 at Milkyway now and there's this nice tool "ATI Tray Tools", which also has a built-in artefact tester. It's much more convenient to use than e.g. 3D Mark, and although it doesn't generate as much stress as FurMark (that thing could kill my card..) it's almost in line with the temperatures I get when running MW, so it *may* be a good test. Don't know if the tool also runs on nVidia, though. RivaTuner surely does run on ATI :D

MrS
____________
Scanning for our furry friends since Jan 2002
ID: 7933
Because of the likelihood of GPUs being overclocked to the point of producing non-crash errors, I would think that most projects would be well served by running with at least a quorum of 2, so that some flipped bits don't end up distorting the science results.
ID: 7935
Definitely understand and agree, but in my world (lol) the OC is a given.... so the only change was the reduction in cooling. I certainly understand your point though....! That was the purpose of the "+" in the +55C above.... it's a conservative figure... these clocks will run fine above that... they'll run at +60C, but I'm not quite sure where the breaking point is. However, they are clocked high enough that they won't perform at the temps a lot of folks get on air... but that's one of the benefits of water... higher clocks and cooler temps.... hopefully leading to greater longevity, cuz atm I plan to run these suckers until the wheels fall off!

Mike, you are definitely right, but as you already know... it's near impossible for an average joe like me to accomplish that... and the same holds true even at factory clocks, as you posted. So, until I have reason to think/feel/believe that what I am doing is somehow skewing the results.... I'm just going to keep moving forward. Hopefully, as this and other GPU projects advance, they'll be able to tell and let us know what's what. Right now I think we all have to rely upon them to determine the overall quality of the data... and if something's wrong, find out what it is and let us all know. This GPU stuff is still relatively new, so there's little doubt that things will be learned about it in the future!

Probably the main reason that I have stuck with GpuGrid vs. say, F@H, is because it was and still is a new advancement in the DC GPU world. I like that. Even though the original WUs were not for "science" directly.... they were for the overall greater good of DC and therefore "science" over the long haul! Somebody needs to get this tech up, running and sorted out.... that is what GpuGrid has been doing and I like being a part of that. Now we have SETI, MW, etc. and others will follow.... DC will grow exponentially.
ID: 7937
Frankly, given the current interest in GPU computing, I'm surprised that Nvidia and ATI don't include tools that let you test the accuracy of your calculations so that consumers can make wise choices when overclocking. Right now, we're flying blind.

I'm not that surprised. Do you think it's really in NV's or ATI's interest to give us a tool to prove that their hardware is defective? There have been cases where new games (e.g. Doom 3) appeared and some factory-OC'ed cards failed them. If even games occasionally reveal such faults, what would the failure rates (and associated RMA costs) be if there was a proper test tool? Which of the big two could afford to be first and take the sales hit due to bad press?

We as a community, however, would certainly want better test tools and more of a general consciousness towards this problem. And I don't think we're flying totally blind right now.. it's not that bad ;) We do have 3D Mark to find upper limits of stability. We have FurMark to fry our cards. We have the artefact tester of ATI Tray Tools and maybe others. Sure, these don't execute exactly the same code as the projects. But taking this logic a bit further, a specialized test program supplied by NV or ATI would also not be sufficient, as it wouldn't run the same code either.

In another thread GDF said he's quite confident in the error detection mechanisms of GPU-Grid. The part I understand and remember is that small errors will be corrected in future iterations due to the way atomic forces work, whereas large errors lead to detectable failures.

MrS
____________
Scanning for our furry friends since Jan 2002
ID: 7939
Do you think it's really in NV's or ATI's interest to give us a tool to prove that their hardware is defective?

That depends on how seriously they want to push their graphics cards as general-purpose supercomputing engines. You have an excellent point, however. I suppose it would be downright idiotic for ATI or NV to provide such a tool.

For that matter, if you read that GPU power use report (referenced in another thread here), it's clear the boards are built and marketed as 3D game boards (duh!) which won't run full out in actual usage. ATI even had to go so far as to have the Catalyst driver detect whether a specific benchmark was running, because when run full out, some of their boards' voltage regulators were overheating -- something that doesn't happen in real life in games or DC computing.

Fortunately, with the CUDA libraries, it would be really easy to write a burn-in program that would run the GPU at 100%.
ID: 7946
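To make that last point concrete, here is a minimal sketch of such a burn-in written against the plain CUDA runtime. The kernel and all names are made up for illustration; this is not GPUGRID code. The idea is to run a deterministic, compute-heavy kernel over and over and flag any pass whose output differs from the first, which on a healthy card should never happen.

    // gpuburn.cu -- hammer the shader ALUs and check that results stay bit-identical.
    // Build with e.g.:  nvcc gpuburn.cu -o gpuburn
    #include <cstdio>
    #include <cstring>
    #include <vector>
    #include <cuda_runtime.h>

    // Chaotic logistic map: heavy on multiplies and extremely sensitive to any
    // flipped bit, so a single compute error changes the final value completely.
    __global__ void burn(float *out, int iters)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float a = (idx % 1023 + 1) / 1024.0f;        // deterministic start in (0,1)
        for (int i = 0; i < iters; ++i)
            a = 3.9f * a * (1.0f - a);
        out[idx] = a;                                // store so the loop isn't optimised away
    }

    int main()
    {
        const int blocks = 240, threads = 256, iters = 1 << 20;
        const size_t n = (size_t)blocks * threads, bytes = n * sizeof(float);

        float *d_out;
        cudaMalloc(&d_out, bytes);
        std::vector<float> reference(n), current(n);

        // First pass establishes the reference output for this deterministic workload.
        burn<<<blocks, threads>>>(d_out, iters);
        cudaMemcpy(reference.data(), d_out, bytes, cudaMemcpyDeviceToHost);

        // Every later pass must reproduce it bit for bit; any difference means the
        // card is silently miscalculating (clocked too high, overheating, ...).
        for (int pass = 1; pass <= 1000; ++pass) {
            burn<<<blocks, threads>>>(d_out, iters);
            cudaMemcpy(current.data(), d_out, bytes, cudaMemcpyDeviceToHost);
            if (std::memcmp(current.data(), reference.data(), bytes) != 0) {
                std::printf("pass %d: output differs from reference -- compute errors\n", pass);
                return 1;
            }
        }
        std::printf("1000 passes, all results identical\n");
        cudaFree(d_out);
        return 0;
    }

Compile it with nvcc, watch the temperatures while it runs, and if mismatches show up, back the clocks off until they stop.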
I don't understand this error! See example below, it's the same WU!
ID: 7953
I don't understand this error!

There isn't much to understand here anyway. It's just a very generic message to tell you some error happened during the computation.

I'll try to update BOINC, NV ForceWare and try again.

Good choice. But also consider lowering your clock speed and/or speeding up your fan. Although OnTheNet runs his cards at a higher clock speed, he may just have gotten a very good one, or he may have superb cooling (water). Your card is likely factory overclocked: 1.25 GHz is the standard clock for a GTX 260, whereas yours runs at 1.40 GHz. This may cause problems.

MrS
____________
Scanning for our furry friends since Jan 2002
ID: 7967
ATI even had to go so far as to have the Catalyst driver detect whether a specific benchmark was running, because when run full out, some of their boards' voltage regulators were overheating -- something that doesn't happen in real life in games or DC computing.

That's the (in)famous FurMark. I tried it last Friday on my new 4870 with an Accelero S1 and 2 x 120 mm fans.

Idle: GPU ~40°C (without downclocking or clock gating), PSU fan 1100 rpm (only BOINC load on the CPU).
Milkyway@GPU: ~55°C, PSU fan 1300 rpm.

So MW is working the system quite hard already. GPU fans were running ~1000 rpm, as were the case and CPU fans. Now when I launched FurMark it stayed around 80°C for some time, but then temperatures started to rise and wouldn't stop. I hit 98°C on the GPU, PSU fan at 1500 rpm, and had set all other fans to 1500 rpm, opened the front panel etc. And the temperature would have continued rising, so I aborted the test. Incredible! I'm sure I could kill my card with FurMark..

MrS
____________
Scanning for our furry friends since Jan 2002
ID: 7968