Message boards : Graphics cards (GPUs) : Video Card Longevity
Author | Message |
---|---|
Does anyone have any first-hand experience with failures related to 24/7 crunching? Overclocked or stock? | |
ID: 5638 | Rating: 0 | rate:
Of course I got failures with overclocking, immediately after the stress-test run (which should be performed in any case before working on WUs). In my opinion this is not a matter of 24/7 operation, but of stability and the time you invest in testing and adjusting your clock rates and fans (don't forget the case fans!). | |
ID: 5642 | Rating: 0 | rate:
Thanks for the response. I guess I should have worded my query differently. | |
ID: 5643 | Rating: 0 | rate:
You mean irreparable failure? Or temporary dysfunction? Just crunching (shader usage), or up to collapse of the 2D function? | |
ID: 5645 | Rating: 0 | rate:
Ruin of the card to the point of being unusable. | |
ID: 5646 | Rating: 0 | rate:
The trouble is the lack of context. | |
ID: 5647 | Rating: 0 | rate:
> The trouble is the lack of context.

Can I assume that you have no failures to discuss? ____________ mike | |
ID: 5650 | Rating: 0 | rate:
If it helps, I have several cards here that have run 24/7 on GPUGrid since July last year with no failures. | |
ID: 5651 | Rating: 0 | rate:
> Can I assume that you have no failures to discuss?

No, I do not. Do you? What I was saying is that some people report failures and blame GPUGrid, BOINC, etc., when the problem is that these programs run the system at full speed for long periods of time, which will, in fact, stress the system on which they run. If there is a problem, or a weakness in the system, the use of a program such as BOINC is probably going to push the system over the brink ... is that the fault of BOINC? Not really ... Just as race cars lose engines through explosions and other catastrophic events because they are pushed to the edge, where any minor flaw or event will cause failure, so it is with BOINC ... ____________ | |
ID: 5653 | Rating: 0 | rate:
> If it helps, I have several cards here that have run 24/7 on GPUGrid since July last year with no failures.

This is what I find everywhere that I have asked. I had assumed that there would be no big issues and had said so ... but was told (without foundation) that the longevity would be severely shortened by crunching. Thanks for your input. ____________ mike | |
ID: 5655 | Rating: 0 | rate:
> Can I assume that you have no failures to discuss?

Thank you for your input. ____________ mike | |
ID: 5656 | Rating: 0 | rate:
Guys, this is a serious topic. I could talk a lot about this, but will try to stay focussed. Feel free to ask further questions! | |
ID: 5657 | Rating: 0 | rate:
Good explanation ... | |
ID: 5663 | Rating: 0 | rate:
Thanks ETA ... that was very interesting. | |
ID: 5673 | Rating: 0 | rate:
Hi Paul, | |
ID: 5690 | Rating: 0 | rate:
Hi Paul, I have the opposite problem: I can't include things without mentioning them ... which is why lots of my posts tend to run long. Hardening can be a combination of technologies, from the design of the structures (so that an impinging ray cannot create enough of a charge change to cause an internal change) to coatings that absorb or negate the ray. But what I was trying to get at is that not only can a ray cause a soft error, it can also create a local voltage "spike" that causes a catastrophic failure due to the presence of a latent defect ... which, absent the event, would have caused the failure in the future anyway through the normal wear and tear we had been discussing. But you are correct that I was not attempting to claim that flipping a bit via a soft error from a cosmic/gamma ray will cause a failure ... There are several aphorisms, the most common being "Turning lemons into lemonade" ... or "If life hands you lemons, make lemonade" ... Thinking about that, life usually hands me onions, and I am not sure that learning how to cry really makes it as an aphorism ... but that is just me ... ____________ | |
ID: 5692 | Rating: 0 | rate:
> ...life usually hands me onions and I am not sure that learning how to cry really makes it as an aphorism ... but that is just me ...

Hopefully, at least sometimes they are sweet Vidalia onions. :) And thanks to both you and MrS for the excellent discussions on this topic. It gives me something to think about with my 9600GSO (it tends to run constantly in the low 70s Celsius) ... | |
ID: 5693 | Rating: 0 | rate:
> ...life usually hands me onions and I am not sure that learning how to cry really makes it as an aphorism ... but that is just me ...

Except I hate onions ... all kinds of onions ... And your temperature, as I recall, is in the nominal zone as we figure these things ... mine is at 78, though of course I let the room get warm, so I am sure that drove it up some ... Making me even happier, Virtual Prairie has just issued some new work!!! :) And I am on track to have Cosmology at goal on the 25th ... and my Mac Pro is raising ABC on its own (while still doing other projects) nicely, so it looks like I should easily be able to make that goal by mid to late Feb, even with the detour to SIMAP at the end of the month ... which I am going to make a real focus for that one week ... and with new applications promised here ... things are really looking up ... ____________ | |
ID: 5694 | Rating: 0 | rate:
One main thing is to watch how hot it gets; overheating the parts can kill a system. But on the actual topic: no, it won't happen. | |
ID: 6086 | Rating: 0 | rate:
May I kindly redirect your attention to this post? What you're talking about is failure type number (2), which is indeed not our main concern. | |
ID: 6136 | Rating: 0 | rate:
Actually, number 2 might be more of a problem than you might think. The GT200 is a very hardy chip and can take some serious heat. However, I'll be receiving my THIRD GTX 260 (192) tomorrow if UPS cooperates. Two have failed me so far under warranty. | |
ID: 9286 | Rating: 0 | rate:
Heat is a common problem for the lifespan of any part, and failures are not, as many people think, seldom or rare; they are actually common. | |
ID: 9293 | Rating: 0 | rate:
> Another discussion is about hard drives and their temps. We think the lower the better, but I have read documents from Google suggesting it actually seems to be the moderate temps (between 45 and 65) where they do better and live longer. I also found out myself, when I still worked as an IT person at medium/large companies, that spinning up a drive frequently does more damage than letting it run 24/7.

You are seeing the effects of two factors: thermal cycling, which leads to expansion and contraction effects that can induce failures, and inrush currents, which cause other failure modes (I talked about this in my part of the failure-mode discussion in the other referenced thread). | |
ID: 9304 | Rating: 0 | rate:
Yes, and it is nice to see such handy info, because we are always concerned about our hardware. In fact sometimes a little bit too much :) | |
ID: 9379 | Rating: 0 | rate:
Two of my GTX 280s died crunching 24/7 (GPUGrid/Folding/SETI) with a 16% OC (core and shaders only, not memory). They lasted approx. two months each. Since then, I do not OC my GPUs and only crunch about 10 hours/day. | |
ID: 9383 | Rating: 0 | rate:
I lost an ATI HD4850x2 at stock due to excessive temperatures. | |
ID: 9508 | Rating: 0 | rate:
Well, you said it right: over longer periods excessive temps are a disaster, and some cards need better cooling than provided by the manufacturers. | |
ID: 9511 | Rating: 0 | rate:
Jeremy wrote:

> Actually, number 2 might be more of a problem than you might think.

Yes and no.. well, it depends. The "normal" failure rate of graphics cards seems to be higher than I expected, even without BOINC. Your first card seems to be one of them. What I mean by type 2 is a sudden failure with no apparent reason (overclocking at stock voltage is not a proper reason, as I explained above). What you describe for your 2nd card (fan failed) could well be attributed to this type, but I'd tend to assign the mechanism to type 3: it's heat damage, greatly accelerated compared to the normal decay. Admittedly, when I talk about type 3 I have normal temperatures in mind, i.e. 50-80 °C.

uBronan, a heat sink which is not mounted properly is rather similar to a fan suddenly failing. It's obviously bad and has to be avoided, but what I'm talking about is what happens in the absence of such failures, in a system where the cooling works as expected. Regarding the hot spots on mainboards: not all of these components are silicon chips. For example, a dumb inductor coil can tolerate a much higher temperature. And power electronics can actually be manufactured with much cruder structures, which are therefore much less prone to damage than the very fine structures of current CPUs and GPUs. So the fact that a component runs at 125 °C does not necessarily tell you that something is wrong.

Regarding HDDs: sorry, but temperatures up to 65 °C are likely going to kill the disk! Take a look: most are specified up to 60 °C. The German c't magazine once showed a photo of an old 10k rpm SCSI disk after its fan failed.. some plastic had melted and the entire HDD had turned into an unshapely something [slightly exaggerated]. I also read about this Google study, and while their conclusion "we see lower failure rates at mid 40 °C than at mid 30 °C" is right, it is not so clear what this means. The advantage of their study is that they average over many different systems, so they can gather lots of data. The drawback, however, is that they average over many different systems. There's at least one question which cannot be answered easily: are the drives running at lower temperatures mounted in server chassis with strong cooling.. in critical systems, which put a much higher load on their HDDs? There could be more such factors which influence the result and persuade us to draw wrong conclusions if we ignore the heterogeneous landscape of Google's server farm. What I think is happening: the *old* rule of "lower temp is better" still applies, but in the range of mid-40 °C we are relatively safe from thermally induced HDD failures. Thus other factors start to dominate the failure rates, which may coincidentally seem linked to HDD temperature, but which may actually be linked to the HDD type / class / usage patterns.

> But i allways tell people to leave the machine on when they know they need it again in a few hours.

Don't forget that nowadays all HDDs have fluid-dynamic bearings (I imagine it is quite difficult to do permanent damage to a fluid), and that PC component costs went down, whereas power costs went up, as did PC power consumption. However, thermal cycling is of course still a factor.

> Like my previous card which is a nvidia 6600 GT which is a notorious hothead (up to 160 C) i tweaked it with a watercooler and got it stressed at 68 C and believe me not many are able to get it that low.

Well, mine ran at ~50 °C idle and ~70 °C load with a silent "NV Silencer" style cooler. And the emergency shutdown is set by the NV driver somewhere around 120-130 °C. Maybe you saw 160 °F mentioned somewhere? MrS ____________ Scanning for our furry friends since Jan 2002 | |
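The idea that heat-driven decay accelerates sharply with temperature can be made concrete with a standard Arrhenius-style estimate. The sketch below is purely illustrative: the 0.7 eV activation energy is an assumed, typical value for thermally activated wear mechanisms, not a measured figure for any particular card.

```python
# Illustrative Arrhenius model of temperature-accelerated chip wear.
# The activation energy (0.7 eV) is an assumed ballpark value for
# thermally activated failure mechanisms, not data for a specific GPU.
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K

def acceleration_factor(t_ref_c, t_hot_c, ea_ev=0.7):
    """Relative speed-up of thermally activated wear when the chip
    runs at t_hot_c instead of t_ref_c (both in degrees Celsius)."""
    t_ref = t_ref_c + 273.15
    t_hot = t_hot_c + 273.15
    return math.exp(ea_ev / K_B * (1.0 / t_ref - 1.0 / t_hot))

# A card held at 90 °C ages several times faster than one at 70 °C:
print(f"{acceleration_factor(70, 90):.1f}x faster wear at 90 °C vs 70 °C")
```

Under these assumptions, a 20 °C rise in operating temperature roughly triples to quadruples the rate of "type 3" decay, which is why cooling quality matters so much more than raw uptime.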
ID: 9587 | Rating: 0 | rate:
The card, which is a real hothead, was running at temps exceeding 110 °C. | |
ID: 9591 | Rating: 0 | rate:
@ JockMacMad TSBT | |
ID: 9596 | Rating: 0 | rate:
Andrew,

> Overall our experiments confirm previously reported temperature effects only for the high end of our temperature range(*) and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates.

I think that's quite what I've been saying :) (*) Note that the high end of the temperature spectrum starts at 45 °C for them and ends at 50 °C. There the error rate rises, but the data quickly becomes noisy due to low statistics (large error bars).

Regarding that 6600GT.. well, I can't accuse them of lying without further knowledge. They may very well have had some reason to state that you could have seen even higher temps without immediate chip failure. I think those chips were produced on the 110 nm node, which means much larger and more robust structures, i.e. if you move one atom it causes less of an effect. Here's some nice information: most 6600GTs run in the 60-80 °C range under load, and there is a statement that 127 °C is the limit where the NV driver does the emergency shutdown. Do you know what? "You could have seen higher temps" means "emergency shutdown happens later". Which is not lying, but totally different from "110 °C is fine" :D MrS ____________ Scanning for our furry friends since Jan 2002 | |
ID: 9601 | Rating: 0 | rate:
Well, lol, yeah, I have been reading it several times as well. | |
ID: 9660 | Rating: 0 | rate:
Sorry ET, after I started up the old beast I saw that I misinformed you about the old card: it's a 6800 GT nvidia from Gigabyte. | |
ID: 9704 | Rating: 0 | rate:
> Does anyone have any first hand experience with failures related to 24/7 crunching? Overclocked or stock?

I bought a GTX 280 in August 2008 and burned it out in March 2009. So that's 6 months of stock 24/7 crunching on GPUGrid. The fan was set to automatic; I don't know if it would have made a difference if I'd set it to a higher manual fan speed. I remember the temperatures were below the tolerable levels. The relevant thread is here: http://www.gpugrid.net/forum_thread.php?id=829&nowrap=true#7338 So yes, GPU failures are real; I've seen enough posts from other GPU owners. However, it's very rare to hear of CPU failures. I guess the safeguards on CPUs are more advanced than on GPUs. | |
ID: 10318 | Rating: 0 | rate:
So far, I've had 1 GTX 260 out of 10 fail after 4 months of crunching. The fan bearings were totally worn out. It took 5 weeks to get its replacement. As always, my stuff runs at stock speeds. | |
ID: 10319 | Rating: 0 | rate:
I use XFX brand cards since they give a lifetime warranty if you register them. If mine burns out, I just do a replacement with them, though I've yet to have one die anyway. Keeping any video card cool is the other major factor. Just like with your CPU, the cooler you keep your GPU, the less likely you are to see failures or errors. | |
ID: 10381 | Rating: 0 | rate:
Looks like my 9600 GT is showing signs of breakdown as well. | |
ID: 10433 | Rating: 0 | rate:
> So i guess vc die from dpc projects

Saying that will scare others away. Cards don't die from doing DC; they would have failed anyway. If there is a problem with a card, it will show up faster if the card is stressed harder, yes, but to claim that projects like GPUGrid kill cards is wrong. The best safeguard is to buy from good companies who will help solve problems. I've only dealt with EVGA and they've been good to me, but I can't comment on other places. Bob | |
ID: 10436 | Rating: 0 | rate:
| |
ID: 10520 | Rating: 0 | rate:
> ...solution the manufacturers would have to offer special 24/7 versions of their cards: slightly lower clocks, slightly lower voltages, maybe a better cooling solution and the fan setting biased towards cooling rather than noise. Such cards could be used for 24/7 crunching.. but who would buy them? More expensive, slower and likely louder!

Isn't that called the nVidia Tesla? | |
ID: 10521 | Rating: 0 | rate:
Almost. The Teslas cost $1000 to $2000 more, whereas I'm talking about $10 to $20 more. I suppose what makes the Teslas really expensive is the extensive testing and "guaranteed" functionality (if there is such a thing for chips at all). They wouldn't necessarily need that for "heavy-duty GP-GPUs". | |
ID: 10522 | Rating: 0 | rate:
Radiation | |
ID: 10759 | Rating: 0 | rate:
High-energy particles cause transient errors by impact ionization. They don't generally cause permanent errors: the mass/energy difference between these particles and the atoms of your chip is too large for significant momentum transfer. Therefore they can (temporarily) kick electrons out of their bindings, but they can hardly move atoms. | |
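A rough kinematic estimate makes this point concrete. In a non-relativistic elastic collision, the largest fraction of its kinetic energy a projectile of mass m can hand to a target of mass M is 4mM/(m+M)². This is a back-of-envelope sketch, not a full radiation-damage model, but it shows why a light particle like an electron barely nudges a silicon atom, while a neutron (mentioned later in the thread) can transfer a sizeable fraction:

```python
# Back-of-envelope: maximum fraction of a projectile's kinetic energy
# transferable to a target in one elastic, non-relativistic collision:
# f = 4*m*M / (m + M)^2.  Illustrates why ionization (moving electrons)
# is easy while displacing whole atoms is hard for light particles.
def max_energy_fraction(m, M):
    return 4.0 * m * M / (m + M) ** 2

M_E = 0.000549   # electron mass, atomic mass units
M_N = 1.00866    # neutron mass, atomic mass units
M_SI = 28.0855   # silicon atom mass, atomic mass units

print(f"electron -> Si atom: {max_energy_fraction(M_E, M_SI):.2e}")
print(f"neutron  -> Si atom: {max_energy_fraction(M_N, M_SI):.3f}")
```

Under this estimate an electron can pass on only about 0.008% of its energy to a silicon atom per collision, while a neutron can hand over roughly 13%, which matches the thread's point that neutrons, unlike most cosmic-ray secondaries at the surface, can displace atoms.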
ID: 10765 | Rating: 0 | rate:
I agree that most (radiation) impacts do not cause permanent damage - a restart and you are up and running again. | |
ID: 10804 | Rating: 0 | rate:
Well get to work then. | |
ID: 10811 | Rating: 0 | rate:
> Theoretically speaking however (and only in very rare circumstances) Neutrons can cause Permanent damage (and yes they can move atoms, and sometimes not just one). Unfortunately however, radiation does not have to move atoms to cause permanent damage, just the odd atomic bond; causing material degradation.

You're right, neutrons can move atoms. Good that there aren't too many of them in the cosmic-ray mix :) And you wouldn't have to reboot on every transient error: the fault might not lead to any disturbing consequences. Furthermore, I heard a talk about 3 years ago where the professor said Intel's core logic is entirely "radiation hardened" to the point where they can detect 2-bit errors and correct 1-bit errors. Don't quote me on these numbers, though.. it's been quite some time.

Interesting that you mention breaking bonds. This is actually what causes the slow degradation of chips over time; it's just not mainly caused by cosmic radiation. The defects at the Si-SiO2 interface (or now Si-HfOx) are passivated by hydrogen atoms. Over time, the occasional highly energetic electrons (from the Boltzmann tail, or from the substrate) kick these light hydrogen atoms out and a dangling bond is created. This is a "trap state" for charge carriers. Once such a trap contains charge, the transistor operation is influenced (the threshold voltage shifts), which can only be bad in either direction. MrS ____________ Scanning for our furry friends since Jan 2002 | |
ID: 10833 | Rating: 0 | rate:
I was not implying that neutrons were part of the cosmic radiation! Most sources are quite terrestrial. They are also very rare, and are usually produced by other rare particle bombardments. The biggest single radiation concern for humans is Radon, as it is in the stone of many buildings, work benches, ornaments and the rocks beneath us. I’m sure the ionising radiation that directly or indirectly results from Radon causes computer problems too. | |
ID: 10844 | Rating: 0 | rate:
My bad, I read too much into that! Regarding computer problems due to radioactivity: fortunately we don't need to worry about the alphas here, as they can not even penetrate paper. Betas and Gammas, on the other hand, can cause transient errors by ionization if the design is not radiation hardened or there are too many of them (but you likely wouldn't care much as you'd sit in the middle of a fission reactor :D). But at a few MeV they can probably create dangling bonds and thus lead to component decay. | |
ID: 10861 | Rating: 0 | rate:
The "shrinking with age" regarding telomerase probably has little effect on how long we live currently given upper population life expectancies of around 85 years for Japanese women. Essentially, this repair process and shrinking is related to the "Hayflick Limit" in cell division, which places a finite limit on natural human life span at around 250 years (when "shrinking" results in lengths too short for the division to occur properly). Research on telomerase (and related issues) has been ongoing for three or more decades, including some work on cell division in some cancers. | |
ID: 10989 | Rating: 0 | rate:
Telomerase repairs damaged DNA, but the Telomerase gene resides close to the ends of chromosomes (the end region is called a telomere). So my point was that when chromosomes shrink overall Telomerase production is reduced in the body, and non-existent in some cells. Without Telomerase DNA stays damaged, so there is a greater risk of Cancer and other illnesses. | |
ID: 11049 | Rating: 0 | rate:
...Without Telomerase DNA stays damaged, so there is a greater risk of Cancer and other illnesses. Just an FYI...see here and here for example.
Dust is definitely not the only problem. I saw a system about 5 years ago that had the same "pop/bang" noise problem...opened the case only to find a nice colony of ants (some rather toasted)! I doubt even the fine mesh would have kept them out. | |
ID: 11204 | Rating: 0 | rate:
Both these research teams took the inverted-smart approach: if something is essential for life, they want to kill it. They know they will take out a few cancer cells on the way, be able to publish in a few obscure journals, and further their careers. If they get really lucky, a drug company will develop some sort of anti-telomerase to slowly kill people with, and they will get a bit of money out of it. Drug companies don't do cures! Unfortunately this sort of research undermines science and interferes with the work of decent scientists who are really trying to do something positive. | |
ID: 11242 | Rating: 0 | rate:
Hear, Hear. | |
ID: 11244 | Rating: 0 | rate:
> Both these research teams took the inverted smart approach. If something is essential for life, they want to kill it.

Neither of these were research teams doing work on telomerase and cancer. Both were review articles (from 1996 and 2001) demonstrating that therapeutic research in this area has been going on for quite some time; I provided them to counter your statement that "It's such a pity nobody is studying this Cure for All Cancers Solution".

> ...publish in a few obscure journals...

Though I probably wouldn't really call "Scientific American" a journal (the first review piece), it is hardly obscure. "Human Molecular Genetics" (the second review piece) is a prominent journal in the area.

> Unfortunately this sort of research undermines science and interferes with the work of descent scientist that are really trying to do something positive.

I am really at a loss with this kind of statement. Are you really suggesting that the U.S. and Japanese researchers from the second article are not decent scientists? Anyway, this has gone way off topic, so I apologize for hijacking the thread. | |
ID: 11254 | Rating: 0 | rate:
For Scott Brown only, everyone else skip to the last paragraph! | |
ID: 11265 | Rating: 0 | rate:
> Is anyone working on a processor (CPU or GPU) that can perform a self diagnostic test and do an instruction set work around, like a Bios patch? I'm sure space agencies and aircraft manufacturers would be very interested.

There are transient and permanent errors. Transient ones happen for whatever reason (e.g. ionizing radiation) and disappear shortly afterwards. As I stated before, I believe Intel's designs are radiation hardened to the point where they can detect 2-bit transient errors and correct 1-bit errors, in both the core and the cache. For regular use this is quite good already.

The permanent errors are more challenging. If they appear in the cache, you can disable the affected cache line. The fat CPUs (IBM Power, Itanium, Sparc) can certainly do this, whereas for desktop chips I think it's a one-time action, done before the chip leaves the factory. Permanent errors within the logic parts of the chip are currently unrepairable. One could think about disabling certain blocks after failures, but there's not much redundancy in CPUs, so you can't take much away and still have them work. It's different for GPUs: disabling individual shader clusters should be possible via software / BIOS, maybe requiring small tweaks.

Another option is to use redundant hardware from the beginning. This is fine for safety-critical markets (space, military, airplanes, cars etc.), but wouldn't work in the consumer sector. Who'd buy a dual core for the price of a quad, just to still have 2 working cores even if 2 of them fail? We'd want to go 4-3-2 instead.

An interesting option are FPGAs: reconfigurable logic. With this stuff you could build chips which adapt to the situation and which could repair themselves. The problem is that you need about 10 times the transistors and can only run the design at about 1/10th the frequency. To put this into perspective: with 130 nm tech you could build a regular Athlon XP at 2 GHz. Or, on an FPGA, you could build a Pentium 1 at 200 MHz, something already available at the 350 nm node. It's a very interesting research area, but no option for the consumer market. Otherwise.. IBM is researching such stuff, but I don't know how far they've got by now. And you can be sure Intel is in the boat as well ;) MrS ____________ Scanning for our furry friends since Jan 2002 | |
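The FPGA trade-off quoted above (roughly 10x the transistors at roughly 1/10th the clock) can be collapsed into a single number. The factors below are the rule-of-thumb values from the post, not measurements of any real design:

```python
# Rule-of-thumb cost of self-repairing reconfigurable (FPGA) logic
# versus a fixed-function design: ~10x the transistors, ~1/10th the
# clock.  Both factors are the rough estimates quoted in the post.
def fpga_throughput_ratio(area_overhead=10.0, clock_penalty=10.0):
    """Work done per transistor per second, relative to fixed logic."""
    return 1.0 / (area_overhead * clock_penalty)

# Self-repair capability costs ~99% of per-transistor throughput
# under these assumptions:
print(f"{fpga_throughput_ratio():.0%} of fixed-logic throughput per transistor")
```

That factor-of-100 penalty is exactly why the Athlon-XP-vs-Pentium-1 comparison above comes out the way it does, and why reconfigurable self-repair stays a research topic rather than a consumer product.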
ID: 11394 | Rating: 0 | rate:
A question to you | |
ID: 11461 | Rating: 0 | rate:
Hi Ross | |
ID: 11465 | Rating: 0 | rate:
Hi | |
ID: 11466 | Rating: 0 | rate:
OK, let us try to get this right now. | |
ID: 11470 | Rating: 0 | rate:
Let's try not to turn this sticky thread on "Video Card Longevity" into a "I need help with WU xyz" thread. | |
ID: 11479 | Rating: 0 | rate:
Thanks for the on-topic reply!
I would not like to lose 2 cores either (in some sort of mirror-failover solution), but I think it might be possible for the consumer market to have a dead-core workaround. AMD do this in the factory, turning their quad cores into triple cores or dual cores when they are not quite up to scratch. We know that their approach is not quite permanent; people have been able to re-enable the cores on some motherboards. So whatever AMD did could, in theory, be used after shipping when a core fails. For business this could be a great advantage. From experience, replacing a failed system can be a logistical nightmare, particularly for small businesses, and lost hours usually mean lost income. Losses would be reduced if a CPU replacement could be planned and scheduled. When 6 and 8 cores become more commonplace, the need to replace the CPU might not actually be so urgent, and the CPU would still hold some value; a CPU with 5 working cores is better than a similar quad-core CPU with all 4 cores working! I was also thinking that if you could set/reduce the clock speeds of cores independently it could offer some sort of fallback advantage. For example, if one of my Phenom II 940 cores struggled for reliability at its native 3 GHz, and I could reduce it to 1800 MHz, or even 800 MHz, just by setting its multiplier separately, it would be better than having to underclock all 4 cores or immediately having to replace the CPU. I like the idea of a software workaround for erroneous shaders. NVidia would do us all a big favour if they developed a proper diagnostic utility, never mind the workaround! | |
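The per-core underclocking idea pays off more than linearly, because dynamic power scales roughly as C·V²·f, and a lower frequency usually permits a lower voltage too. A toy sketch of that scaling; the voltage/frequency pairs are invented illustrative values, not Phenom II specifications:

```python
# Sketch of the dynamic-power argument for per-core clock scaling:
# dynamic power ~ C * V^2 * f.  Lowering frequency AND voltage saves
# power quadratically on top of the linear frequency term.
# The voltage/frequency pairs below are made-up illustrative values.
def dynamic_power(freq_mhz, volts, c=1.0):
    """Relative dynamic power for a core at the given clock/voltage."""
    return c * volts ** 2 * freq_mhz

full = dynamic_power(3000, 1.35)  # core at 3 GHz, 1.35 V
slow = dynamic_power(800, 1.00)   # same core throttled to 800 MHz, 1.0 V
print(f"throttled core draws {slow / full:.0%} of full dynamic power")
```

Under these numbers, dropping one core from 3 GHz to 800 MHz cuts its dynamic draw to roughly 15% of full power, even though the clock only fell to about 27%; the extra saving comes entirely from the V² term.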
ID: 11567 | Rating: 0 | rate:
Hi,

> I would not like to lose 2 cores either (in some sort of mirror-failover solution), but I think it might be possible for the consumer market to have a dead-core workaround. AMD do this in the factory, turning their quad cores into triple cores or dual cores when they are not quite up to scratch. We know that their approach is not quite permanent; people have been able to re-enable the cores on some motherboards. So whatever AMD did could, in theory, be used after shipping when a core fails.

Testing at the factory is done with external probe stations prior to packaging (so as not to waste money on defective chips). This cannot be repeated at home ;) (By the way, these tests are expensive, although each one only takes a few seconds. I suppose the cost is mainly due to the time it ties up that expensive device.. which wouldn't matter for us.) Therefore such a test would have to be software based. I see at least 2 major problems with that:

1. Whatever you put into the chip, you have to test it. Such software could reveal the chip architecture completely, just through the way it does the tests. Software can be hacked and/or reverse engineered, and that's something no chip maker would want to risk. It would open the door for all sorts of things: full or partial copies, bad press due to discovered design errors, software deliberately targeted to be slow on your hardware (hint: compiler).

2. You'd be executing code on your CPU to test your CPU. How could you know the results are reliable? It would be a shame to get the message "3 of 4 cores defective" due to a minor fault somewhere else. Possible solution: dedicate some specialized logic with self-diagnostic functions and error checking to such tests.

> For business this could be a great advantage.

That's why the "big iron" servers have RAS features, hot swap of almost everything and such :)

> I like the idea of a software workaround / solution for erroneous shaders.

Yes, that would be very nice. However, seeing how their software struggles with driver bugs, I'm not very confident anything like that is going to happen anytime soon. The problem of "revealing the architecture" would likely be less severe in this case, as communication with the GPU goes through the driver anyway. If such a tool were released, I'd imagine them to be careful, i.e. "If you get errors there's a problem [not necessarily caused by defective hardware] and you may get wrong results under CUDA. But we don't know your exact code, and therefore we cannot guarantee that there is no hardware error just because we didn't find any."

> I was also thinking that if you could set/reduce the clock speeds of cores independently it could offer some sort of fallback advantage. For example, if one of my Phenom II 940 cores struggled for reliability at its native 3 GHz, and I could reduce it to 1800 MHz, or even 800 MHz, just by setting its multiplier separately, it would be better than having to underclock all 4 cores or immediately having to replace the CPU.

Let's take this one step further: the clock speed of chips is limited by the slowest parts, or more exactly by the paths which signals must travel within one clock cycle. If they arrive too late, an error is likely produced. It's really tough to guess what the slowest paths through all your 100 million transistors will be, given the vast number of possible instruction combinations, states, error handling, interrupts etc. But the manufacturers do have some idea. So why not design a chip with some test circuitry with deliberately long signal run times and sophisticated error detection, somewhere near the known hot spots? Now you could lower the operating voltage just to the point where you start to see errors (and increase it again just above the threshold). That would reduce average power consumption a lot and would help to choose proper turbo modes for i7-like designs. It wouldn't help against permanent errors, but in the case of your 940 the BIOS could have raised the voltage of that core a little (within the safety margin). MrS ____________ Scanning for our furry friends since Jan 2002 | |
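The "lower the voltage until the test circuitry reports errors, then step back" idea above can be sketched as a simple control loop. Everything here is invented for illustration: real hardware would consult an on-die canary path, and the 1050 mV failure point is an arbitrary stand-in, not a real chip parameter.

```python
# Toy sketch of adaptive voltage scaling: step the core voltage down
# until the (simulated) on-die canary circuit would report timing
# errors, then settle on the last safe step.  The 1050 mV threshold
# is an invented stand-in for a real chip's critical voltage.
def canary_reports_error(mv, mv_min=1050):
    """Pretend on-die test path: produces timing errors below mv_min."""
    return mv < mv_min

def find_safe_voltage(mv_start=1300, step=25):
    mv = mv_start
    # Keep stepping down while the next step would still be error-free.
    while not canary_reports_error(mv - step):
        mv -= step
    return mv  # lowest error-free setting at this step size

print(f"settled at {find_safe_voltage()} mV")
```

A real implementation would of course re-run the loop continuously (temperature and load shift the critical voltage), which is essentially what the post means by "increase it again just above the threshold".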
ID: 11638 | Rating: 0 | rate:
![]() ![]() ![]() | |
Therefore such a test would have to be software based. As you said later, "So why not design a chip with some test circuitry" - perhaps an on-die instruction set for testing, and if required automatically modifying voltages and frequencies, or even disabling cache banks or a core? A small program could receive reports, analyse them and calculate ideal frequencies automatically. These could be saved to the system drive or BIOS and reloaded on restart. A sort of built-in CPU optimization kit.

I still like the idea of independent frequencies and voltages for CPU cores. Most of the time people don't actually use all 4 cores of a quad, so if the CPU could raise and lower the frequencies independently, or even turn one or more cores off altogether, it would save energy, and therefore the overall cost of the system during its life. Unless you are crunching, playing games or using some serious software, there are few times when you would notice the difference between a quad core at 3.3 GHz and one at 800 MHz (8 MB cache). I often forget and have to check what my clock is set at; if the system gets loud, I turn it down. If the cores could independently rise to the occasion, even when you are using intensive CPU applications, you would be saving on electricity (temperatures would be lower, as would the noise)! I'm not sure Intel would go for this, as their cores are paired and it might reveal some underlying limitation (until 8 or more cores are mainstream, then it would be less obvious and less of an issue).

If these ideas were applied to graphics cards, it would save a small fortune in electricity. Even GPUGrid does not always use all the processing power of the graphics cards. I think Folding@home probably comes a lot closer, but some GPU crunching clients such as Aqua often use substantially less (it seems to vary with different tasks, similar to a computer game). GPUs are far from green! | |
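The self-tuning scheme sketched above - collect error reports per core, derive a safe frequency for each, save the result and reload it on restart - might look something like this. The backoff rule, file name and frequencies are all invented for illustration:

```python
import json

STATE_FILE = "core_freqs.json"  # stand-in for BIOS/NVRAM storage

def derive_freq(error_reports, f_max=3000, f_min=800, backoff=200):
    """Back a core off by 200 MHz per observed error, never below f_min."""
    return max(f_min, f_max - backoff * error_reports)

def tune(reports_per_core):
    """Compute per-core frequencies and persist them for the next boot."""
    freqs = {core: derive_freq(n) for core, n in reports_per_core.items()}
    with open(STATE_FILE, "w") as f:
        json.dump(freqs, f)
    return freqs

# Core 2 produced three error reports, so only it gets clocked down.
print(tune({0: 0, 1: 0, 2: 3, 3: 0}))  # {0: 3000, 1: 3000, 2: 2400, 3: 3000}
```

A real implementation would live in firmware rather than a user-space script, but the shape - measure, derive, persist, reload - is the same.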
ID: 11650 | Rating: 0 | rate:
![]() ![]() ![]() | |
I'm a bit confused by your post. That's actually just what "Power Now!", Cool & Quiet, Speed Step etc. are doing. They're not perfect yet, but they do adjust clock speeds and voltages on the fly according to demand, and in the newest incarnations also independently for individual cores. Intel heavily uses the thermal headroom under single/low-threaded load for their turbo mode. So it's not perfect yet, but we're getting there. And now that almost all high performance chips (CPUs, GPUs) are power limited, these power management features are quickly becoming ever more important. | |
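For what it's worth, Linux already exposes these per-core knobs through the cpufreq interface in sysfs (`/sys/devices/system/cpu/cpuN/cpufreq/`). Here is a small sketch that reads each core's active governor; it is demonstrated against a fake directory tree so it runs anywhere, but on a real Linux box you would pass the actual sysfs path:

```python
import os, tempfile

def read_governors(base):
    """Collect each core's active cpufreq governor from a sysfs-style tree."""
    governors = {}
    for entry in sorted(os.listdir(base)):
        gov_file = os.path.join(base, entry, "cpufreq", "scaling_governor")
        if entry.startswith("cpu") and os.path.isfile(gov_file):
            with open(gov_file) as f:
                governors[entry] = f.read().strip()
    return governors

# Build a miniature fake tree for demonstration (a real run would pass
# "/sys/devices/system/cpu" instead).
root = tempfile.mkdtemp()
for cpu, gov in [("cpu0", "ondemand"), ("cpu1", "powersave")]:
    os.makedirs(os.path.join(root, cpu, "cpufreq"))
    with open(os.path.join(root, cpu, "cpufreq", "scaling_governor"), "w") as f:
        f.write(gov + "\n")

print(read_governors(root))  # {'cpu0': 'ondemand', 'cpu1': 'powersave'}
```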
ID: 11693 | Rating: 0 | rate:
![]() ![]() ![]() | |
Well Ladies and Gents, | |
ID: 11772 | Rating: 0 | rate:
![]() ![]() ![]() | |
In less than a year's time I've RMA'ed 3 GTX 260's already, and right now I'm looking at RMA'ing 5 more GTX 260's (4 BFG's & 1 EVGA) plus 1, possibly 2, GTX 295's. Oh, and for good measure throw in a Sapphire 4850 X2 & Sapphire 4870 that are going to need to be RMA'ed. | |
ID: 11773 | Rating: 0 | rate:
![]() ![]() ![]() | |
Hey Nognlite, | |
ID: 11790 | Rating: 0 | rate:
![]() ![]() ![]() | |
I'm a bit confused by your post. That's actually just what "Power Now!", Cool & Quiet, Speed Step etc. are doing. MrS

Well yes, up to a point, but as you went on to say, they are not perfect! I was trying to be general, as there are so many energy-saving variations used in different CPUs, but very few are combined sufficiently. Perhaps Intel's Enhanced SpeedStep is the closest to what I am suggesting, but in itself it does not offer everything. Many CPUs only have 2 speeds. Why not 10 or 30? If motherboards can be clocked in 1 MHz steps, why not CPUs? Why develop so many different technologies separately, rather than combining, centralising, streamlining and reducing manufacturing costs? If the technology does not significantly increase production costs and is worthwhile having, have it in all the CPUs rather than hundreds of slightly different CPU types. In many areas Intel's biggest rival is Intel; they make so many chips that many are competing directly against each other. Flooding the market with every possible combination of technology is just plain thick.

Why only reduce the multiplier and voltage? Why not the FSB as well? If the CPU is built to support it, the motherboard designs will follow, as there is decent competition there. Why send power to the cache when it's doing nothing? Why send power to all the CPU cores when only one is in use? Why charge a small fortune for a slightly more energy-efficient CPU (SLARP, L5420 vs SLANV, E5420), especially when manufacturing costs are the same? Why use one energy-saving feature in one CPU but a different feature in another CPU when both could be used? In many ways it's not so much about being clever, just not being so stupid.

To be fair to both Intel and AMD, there have been excellent improvements over the last 5 years: my Phenom II 940 offers three steps (3 GHz, 1800 MHz and 800 MHz), which is one of the main reasons I purchased it. This was a big improvement over my previous Phenom 9750 (2.4 GHz and 1.8 GHz). 
The E2160 (and similar chips) only use 8 Watts when idle, and many of the systems they inhabit typically operate at about 50 Watts, much less than top GPU cards! Mind you, these are exceptions rather than the rule. Many speed steps were none too special; stepping down from 2.13 GHz to 1.8 GHz was a bit of a lame gesture by Intel! My opinion is that if it's not in use, it does not need power. So if it is using power that it does not need, it has been poorly designed.

they do adjust clock speeds and voltages on the fly according to demand, and in the newest incarnations also independently for individual cores.

OK, I was not aware the latest server cores could be independently stepped down in speed. I hope the motherboard manufacturers keep up; I recently worked on several desktop systems that boasted energy-efficient CPUs such as the E2160 (with C1E & EIST), only to see that the motherboard did not support speed stepping! Again this just smells of mismatched hardware/a stupid design flaw, but I do think the motherboard manufacturers need to make more of an effort - perhaps they are more to blame than AMD and Intel.

And now that almost all high performance chips (CPUs, GPUs) are power limited these power management features are quickly becoming ever more important.

I agree; server farms are using more and more of the grid's energy each year, so they must look towards energy efficiency. Hopefully many of these server advancements will become readily available to the general consumer in the near future. Some of these advances come at a shocking price though, and the new CPU designs often seem to drop existing energy-efficiency systems to incorporate the new ones, rather than adding the new energy-efficient technology on top. Presumably so they can compete against each other! Reminds me of the second wave of Intel quad cores - clocked faster, but with less cache, so there was only a slight improvement with some chips and it was difficult to choose which one was actually faster! 
Ditto for Hyper-Threading, which competed against faster-clocked non-HT cores.

Talking about power management for GPUs: I've been complaining about this wasted power for a decade. Why can the same chips used in laptops be power efficient, downclocked and everything, whereas as soon as they're used in desktops they have to waste 10 - 60 W even if they're doing nothing?! The answer is simple: because people don't care (as long as it doesn't hurt them too much) and because added hardware or driver features would cost more - and that's something people do care about.

The general public probably don't think about the running costs as much as IT pros do, but they really should. The lack of 'green' desktop GPUs is a serious problem. Neither ATI nor NVIDIA has bothered to produce a really green desktop GPU. It's as though there is some sort of unspoken agreement not to compete on this front! Sooner or later ATI or NVIDIA will realise that people like me would rather go on a two-week holiday with a new netbook than pay for two power-greedy cards that cost almost as much to run as they do to buy! | |
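To put rough numbers on that running-cost complaint: assuming (hypothetically) electricity at $0.12 per kWh, a card that idles at 40 W or crunches at 200 W around the clock adds up like this:

```python
def annual_cost(watts, price_per_kwh=0.12, hours=24 * 365):
    """Yearly electricity cost of a constant load, in the tariff's currency."""
    return watts / 1000 * hours * price_per_kwh

print(round(annual_cost(40), 2))   # 42.05 -> per year just idling
print(round(annual_cost(200), 2))  # 210.24 -> per year crunching flat out
```

At those assumed figures, two high-end cards crunching 24/7 really can approach their own purchase price in electricity over a year or two.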
ID: 11793 | Rating: 0 | rate:
![]() ![]() ![]() | |
You are in fact correct. I had to look at my records. My bad! | |
ID: 11804 | Rating: 0 | rate:
![]() ![]() ![]() | |
You may wish to set the fan speed manually either with the Evga utility or Ntune in Windows or Nvclock-Gtk in Linux. This would cut down on the heat issues. | |
ID: 11819 | Rating: 0 | rate:
![]() ![]() ![]() | |
Looks like I'll be RMA'ing 5 GTX 260's with the Clock Down Bug either today or tomorrow, I have a GTX 295 that will do the same thing off and on but hasn't for a few days so I'll keep it for now and see if the Proposed Fix GDF mentioned later this month fixes it permanently or not. As long as it doesn't get any worse I can live with it for a few days more ... :) | |
ID: 11821 | Rating: 0 | rate:
![]() ![]() ![]() | |
Hey SKGiven,

Why develop so many different technologies separately

It's building upon each other. Each new generation of power-saving technologies generally incorporates and supersedes the previous one. It does not suddenly replace it with something different. And some companies are licensing this stuff, but the big players are basically all developing the same stuff on their own - adapted to their special needs, of course.

Why not the FSB as well?

That's being done on notebooks. You wouldn't notice the difference on a desktop.

Why send power to the Cache when it's doing nothing?

That's been done for some time; minor savings.

Why send power to all the CPU cores when only one is in use?

The i7 is the first to really shut them off.

Why charge a small fortune for a slightly more energy efficient CPU (SLARP, L5420 vs SLANV, E5420)? Especially when manufacturing costs are the same.

Because costs are not the same. Energy-efficient CPUs run at lower voltages, which not all CPUs can do. To a first approximation you can decide to sell a CPU as a normal 3 GHz chip or as a 2.5 GHz EE chip. The regular 2.5 GHz chip might not reach 3 GHz at all.

Why use one energy saving feature in one CPU but a different feature in another CPU when both could be used?

I don't think this is being done. The features mainly build upon each other. Exceptions are mobile Celerons, where Intel just removed power-saving features (but doesn't include others), which I really dislike. And mobile chips generally get more refined power management. I think this is mainly due to cost. MrS ____________ Scanning for our furry friends since Jan 2002 | |
ID: 11858 | Rating: 0 | rate:
![]() ![]() ![]() | |
After roughly 8 months of life my GTX280 card (EVGA) died. The good news is that I am within the 1-year warranty; the bad news is I missed the fine print. If you have EVGA cards you have to register them ON THEIR SITE to get the long-term warranty conversion (within 90 days of purchase; save the receipt, you also need that for an RMA). | |
ID: 12525 | Rating: 0 | rate:
![]() ![]() ![]() | |
The thing is, manufacturers don't want broken anything back - they want to keep their money! So they set up so many obstacles for you and everyone else to negotiate that the majority of people will give up, or spend more time than it is worth trying to get some sort of partial refund or refurbished item (say after 3 months).

Basically, the law says you can return an item that malfunctions for up to one year. Unfortunately, dubious politicians with unclear financial interests have sought to undermine this with grey legislation. So you are left wading through all sorts of dodgy terms and conditions - many of which are just meant to deter you; they have no legal ground, but serve to hold up the proceedings long enough for them to get away with it. By the time you (or say 20 percent of people like you) get through their many hoops, there is a fair chance they will have been bought out, merged, renamed, re-launched or have gone under, and you will have another layer of it to go through.

If you buy an item in a shop, hang onto the receipt and the packaging. If it breaks within a year, take it back and get a replacement or refund. If you buy online, you may have to deal with their terms and conditions, RMAs and of course the outfit perhaps not being around for long. To me it is worth the extra 5 or 10 percent to buy an expensive item in a local store with a good reputation. | |
ID: 12986 | Rating: 0 | rate:
![]() ![]() ![]() | |
I've RMA'ed probably 10 GTX 200 Series cards & 2 ATI 48xx cards this year alone & haven't had a bit of a problem getting the manufacturers to back their cards & send me a replacement ASAP ... Of course, as Paul said, you have to read the fine print and register them as soon as you get them, or you may be SOL & have to eat the costs. Most of my GTX cards are BFG's, which have a Lifetime Warranty, so those cards are good to go for a long time if I choose to continue to run them. | |
ID: 13001 | Rating: 0 | rate:
![]() ![]() ![]() | |
Most of my GTX are BFG's which have a Lifetime Warranty so those Cards are good to go for a long time if I choose to continue to run them. That's why I'll be getting BFG from now on... Another tip to cool the beast: if you live in a temperature-sensitive climate, say like the south of the US, fall and spring are perfect times to bring in the cool air at night. I've seen 10C temp drops just by letting 52F air into my room... which is normally 80F. ____________ ![]() I recommend Secunia PSI: http://secunia.com/vulnerability_scanning/personal/ | |
ID: 13006 | Rating: 0 | rate:
![]() ![]() ![]() | |
You need to watch that trick. | |
ID: 13147 | Rating: 0 | rate:
![]() ![]() ![]() | |
I have 2 Leadtek Winfast GTX280's that died on 14 September '09, precisely after 6 months of crunching GPUGrid more or less 24/7 (less because my rigs are gaming rigs, so whenever me and my son were playing we would disable GPUGrid). | |
ID: 13160 | Rating: 0 | rate:
![]() ![]() ![]() | |
I have 2 Leadtek Winfast GTX280's that died on 14 September '09, precisely after 6 months of crunching GPUGrid more or less 24/7 (less because my rigs are gaming rigs, so whenever me and my son were playing we would disable GPUGrid). Could it be that the manufacturer's cards don't pass the 24/7 full-throttle test?! My BFG GTX260 is still going like the Energizer bunny. | |
ID: 13168 | Rating: 0 | rate:
![]() ![]() ![]() | |
Yeah, pretty sure that's actually the issue.
Hmmm, probably the reason I bought 3 BFG GTX285's a week ago :))) ____________ Semper ubi sub ubi. | |
ID: 13170 | Rating: 0 | rate:
![]() ![]() ![]() | |
I know that when my BFG GTX 260 FOC 216 SP gets to 75C it shuts down the computer because of overheating... maybe a BFG safety feature or a faulty sensor, but I actually like that about my card; it means that it's safe until about 70C... | |
ID: 13191 | Rating: 0 | rate:
![]() ![]() ![]() | |
Hi, | |
ID: 13235 | Rating: 0 | rate:
![]() ![]() ![]() | |
I have a Palit GTX260 and managed to OC it quite well. I found the sweet spot and kept it there for a while, but I now just run at stock because the noise was a little too high for me to work at the system. | |
ID: 13271 | Rating: 0 | rate:
![]() ![]() ![]() | |
For those that keep their Fan at 100% - don't! It will drastically reduce the life expectancy of the fan, and could take out the GPU.

... but it helps the GPU and the rest of the card ;)

I will NEVER OC anything in my computer myself. For a good lifetime of your computer, buy good equipment and UNDERclock it (or leave it on stock ratings).

OC'ed or underclocked doesn't mean much if you leave the fan on auto. OC'ed at stock voltage and 70°C will be much better for the card than underclocked at stock voltage and 90°C. MrS ____________ Scanning for our furry friends since Jan 2002 | |
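MrS's point that temperature matters more than clocks can be made concrete with the common rule of thumb that electronics lifetime roughly halves for every 10 °C rise (an Arrhenius-style approximation, not a guarantee for any particular card):

```python
def relative_life(temp_c, ref_temp_c=70.0):
    """Rule of thumb: expected lifetime halves per 10 C above the reference."""
    return 2 ** ((ref_temp_c - temp_c) / 10.0)

# Overclocked but well cooled at 70 C vs underclocked but hot at 90 C:
print(relative_life(70))  # 1.0  (baseline)
print(relative_life(90))  # 0.25 (roughly a quarter of the expected life)
```

By this rough model, an overclocked card held at 70 °C really can be expected to outlast an underclocked one cooking at 90 °C.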
ID: 13277 | Rating: 0 | rate:
![]() ![]() ![]() | |
i run stock settings for my FOC 216 260 from BFG... | |
ID: 13278 | Rating: 0 | rate:
![]() ![]() ![]() | |
You are quite right to say that it reduces the heat and increases the life expectancy of the card's other parts. Right up to the time when the fan fails; then it is a bit hit and miss how things turn out. | |
ID: 13279 | Rating: 0 | rate:
![]() ![]() ![]() | |
Message boards : Graphics cards (GPUs) : Video Card Longevity